I have a batch of roughly 100 CSV files, each containing 20,000 data entries (all in the first column). My objective is to extract the first column from all of these CSV files and consolidate the data into a single text document, removing any duplicate content along the way: a batch merge followed by de-duplication.
Method Enumeration
- Utilize text editing software or Excel to individually open each CSV file and copy the first column’s content into a text document;
- Employ dedicated data processing tools such as Tableau or Alteryx;
- Use batch scripts or shell scripts (PowerShell);
- Implement Python scripting.
I have ruled out methods 1 and 2. Although I attempted batch processing and shell scripting, I abandoned them due to slow performance. Consequently, I have opted to use Python for bulk merging.
Step-by-Step Analysis
- Read each CSV file: Open each CSV file with Python and read the data from the first column.
- Merge data: Combine the first column data from each file into a single list or array.
- Write to a text file: Transfer the merged data into a new text document.
Python Script
import csv
import glob

# Directory where the CSV files are located
csv_folder = 'path/to/your/csv/files'
# Name of the output file
output_file = 'merged_data.txt'

# Find all CSV files in the directory
csv_files = glob.glob(csv_folder + '/*.csv')

# Open the output file and append the first column of each CSV file
with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    for file in csv_files:
        with open(file, 'r', newline='', encoding='utf-8') as infile:
            reader = csv.reader(infile)
            # Write the first column of every row to the output file
            for row in reader:
                if row:  # Skip empty rows
                    outfile.write(row[0] + '\n')

print("The first columns of all files have been merged into", output_file)
Code Remarks
- CSV File Path: Ensure the csv_folder variable points to the directory that actually contains your CSV files.
- CSV File Format: Confirm that all CSV files share a consistent format, particularly in the first column.
- Python Environment: Make sure Python 3 is installed on your system.
- Running the Script: Execute it from the terminal with python3 script.py (a variant that takes the folder as a command-line argument is sketched after this list).
- Modifying File Permissions: If you run into permission issues, use chmod +x script.py to grant execute permission; note that running the script directly this way also requires a shebang line such as #!/usr/bin/env python3 at the top of the file.

That completes the batch merge.
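If you would rather not edit csv_folder by hand each time, a small variation (my own suggestion, not part of the original script) reads the folder path from the command line:

import csv
import glob
import sys

# Usage: python3 script.py path/to/your/csv/files
csv_folder = sys.argv[1] if len(sys.argv) > 1 else '.'
output_file = 'merged_data.txt'

with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    for file in glob.glob(csv_folder + '/*.csv'):
        with open(file, 'r', newline='', encoding='utf-8') as infile:
            for row in csv.reader(infile):
                if row:
                    outfile.write(row[0] + '\n')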
Eliminating Duplicate Content
After executing the above Python 3 script, a file named "merged_data.txt" is generated. With roughly 2 million entries (100 files × 20,000 rows), I cannot tell by inspection whether this text file contains duplicates, so I will continue with Python to remove duplicate lines.
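Before de-duplicating, a quick check (a sketch of my own, not part of the original workflow) reveals whether duplicates exist at all, by comparing the total line count with the number of distinct lines:

with open('merged_data.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# If the set is smaller than the list, duplicate lines exist
print(f"Total lines: {len(lines)}, distinct lines: {len(set(lines))}")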
Step-by-Step Analysis
- Read the text file.
- Create a set (or convert a list with the set function) to remove duplicates.
- Write the de-duplicated content to a new text file.
Python Script
# Names of the input and output files
input_file = 'merged_data.txt'
output_file = 'merged_data_unique.txt'

# Read the file and remove duplicate lines by loading them into a set
with open(input_file, 'r', encoding='utf-8') as infile:
    unique_lines = set(infile.readlines())

# Write the de-duplicated content to the new file
with open(output_file, 'w', encoding='utf-8') as outfile:
    for line in sorted(unique_lines):
        outfile.write(line)

print(f"Duplicate lines have been removed from {input_file} and saved to {output_file}.")
Notes
- Read all lines from the merged_data.txt file, using a set to automatically eliminate duplicate lines.
- Write the de-duplicated content into a new file, merged_data_unique.txt.
- A set does not preserve the order in which lines were read, so I used the sorted function to make the output deterministic (lexicographically sorted, not in the original file order). If ordering does not matter, you may omit sorted; if you want to keep the original first-occurrence order instead, see the sketch below.
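For completeness, here is a minimal sketch (my own variant, not from the original scripts) that removes duplicates while preserving the order in which lines first appear, streaming the file line by line instead of calling readlines:

input_file = 'merged_data.txt'
output_file = 'merged_data_unique.txt'

seen = set()
with open(input_file, 'r', encoding='utf-8') as infile, \
     open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Write each line only the first time it is encountered
        if line not in seen:
            seen.add(line)
            outfile.write(line)

print(f"{len(seen)} unique lines written to {output_file}.")

With 2 million short lines, the set of seen lines fits comfortably in memory; the same seen-set idea could also be folded into the merge script so that merging and de-duplication happen in a single pass.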