I have a batch of roughly 100 CSV files, each containing 20,000 data entries (all in the first column). My objective is to extract the first column from all of these CSV files and consolidate the data into a single text document, removing any duplicate content along the way: a batch merge followed by de-duplication.
Method Enumeration
- Utilize text editing software or Excel to individually open each CSV file and copy the first column’s content into a text document;
- Employ dedicated data processing tools such as Tableau or Alteryx;
- Use batch scripts or shell scripts (PowerShell);
- Implement Python scripting.
I have ruled out methods 1 and 2. Although I attempted batch processing and shell scripting, I abandoned them due to slow performance. Consequently, I have opted to use Python for bulk merging.
Step-by-Step Analysis
- Read each CSV file: Open each CSV file with Python and read the data from the first column.
- Merge data: Combine the first column data from each file into a single list or array.
- Write to a text file: Transfer the merged data into a new text document.
Python Script
import csv
import glob

# Directory where the CSV files are located
csv_folder = 'path/to/your/csv/files'
# Name of the output file
output_file = 'merged_data.txt'

# Find all CSV files in the directory
csv_files = glob.glob(csv_folder + '/*.csv')

# Open the output file and append the first column of each CSV file
with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    for file in csv_files:
        with open(file, 'r', newline='', encoding='utf-8') as infile:
            reader = csv.reader(infile)
            # Write the first column of every row to the output file
            for row in reader:
                if row:  # Skip empty rows
                    outfile.write(row[0] + '\n')

print("The first columns of all files have been merged into", output_file)
Code Remarks
- CSV File Path: Ensure the csv_folder variable points to the directory that actually contains your CSV files.
- CSV File Format: Confirm that all CSV files share a consistent format, particularly in the first column.
- Python Environment: Make sure Python 3 is installed on your system.
- Running the Script: Execute it from the terminal with python3 script.py (a variant that takes the folder as a command-line argument is sketched after this list).
- Modifying File Permissions: If you run into permission issues, use chmod +x script.py to grant execute permission; note that running the script directly this way also requires a shebang line such as #!/usr/bin/env python3 at the top of the file.

That completes the batch merge.
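If you would rather not edit csv_folder by hand each time, a small variation (my own suggestion, not part of the original script) reads the folder path from the command line:

import csv
import glob
import sys

# Usage: python3 script.py path/to/your/csv/files
csv_folder = sys.argv[1] if len(sys.argv) > 1 else '.'
output_file = 'merged_data.txt'

with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    for file in glob.glob(csv_folder + '/*.csv'):
        with open(file, 'r', newline='', encoding='utf-8') as infile:
            for row in csv.reader(infile):
                if row:
                    outfile.write(row[0] + '\n')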
Eliminating Duplicate Content
After executing the above Python 3 script, a file named "merged_data.txt" is generated. With roughly 2 million entries (100 files × 20,000 rows), I cannot tell by inspection whether this text file contains duplicates, so I will continue with Python to remove duplicate lines.
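Before de-duplicating, a quick check (a sketch of my own, not part of the original workflow) reveals whether duplicates exist at all, by comparing the total line count with the number of distinct lines:

with open('merged_data.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# If the set is smaller than the list, duplicate lines exist
print(f"Total lines: {len(lines)}, distinct lines: {len(set(lines))}")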
Step-by-Step Analysis
- Read the text file.
- Create a set (or convert a list with the set function) to remove duplicates.
- Write the de-duplicated content to a new text file.
Python Script
# Names of the input and output files
input_file = 'merged_data.txt'
output_file = 'merged_data_unique.txt'

# Read the file and remove duplicate lines by loading them into a set
with open(input_file, 'r', encoding='utf-8') as infile:
    unique_lines = set(infile.readlines())

# Write the de-duplicated content to the new file
with open(output_file, 'w', encoding='utf-8') as outfile:
    for line in sorted(unique_lines):
        outfile.write(line)

print(f"Duplicate lines have been removed from {input_file} and saved to {output_file}.")
Notes
- Read all lines from the merged_data.txt file, using a set to automatically eliminate duplicate lines.
- Write the de-duplicated content into a new file, merged_data_unique.txt.
- A set does not preserve the order in which lines were read, so I used the sorted function to make the output deterministic (lexicographically sorted, not in the original file order). If ordering does not matter, you may omit sorted; if you want to keep the original first-occurrence order instead, see the sketch below.
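For completeness, here is a minimal sketch (my own variant, not from the original scripts) that removes duplicates while preserving the order in which lines first appear, streaming the file line by line instead of calling readlines:

input_file = 'merged_data.txt'
output_file = 'merged_data_unique.txt'

seen = set()
with open(input_file, 'r', encoding='utf-8') as infile, \
     open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Write each line only the first time it is encountered
        if line not in seen:
            seen.add(line)
            outfile.write(line)

print(f"{len(seen)} unique lines written to {output_file}.")

With 2 million short lines, the set of seen lines fits comfortably in memory; the same seen-set idea could also be folded into the merge script so that merging and de-duplication happen in a single pass.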