Merge CSV file headers and fill missing columns with 0

Problem description (votes: 0, answers: 1)

Hello,

I would like to merge several CSV files.

I have a reference.csv file that contains all the headers.

I have several CSV files, which can have different headers.

I want to merge the reference header into all the other CSV files, and fill in the value 0 in every following row for each column that was added.


Here is an example:

reference.csv (my headers)

a;b;c;d

file1.csv

a;c 
45;68

Expected result:


file1.csv

a;b;c;d
45;0;68;0

I tried this:

(I also tried pandas, but it did not work the way I wanted; see further below.)

import os
import csv

# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)

# Define the folder containing the CSV files
folder_path = script_directory

# Define the path to the reference text file
reference_file_path = os.path.join(folder_path, "reference.txt")

# Read the reference columns from the text file
with open(reference_file_path, "r") as reference_file:
    reference_columns = reference_file.read().splitlines()

# Read the header row from the reference file
with open(reference_file_path, "r", newline="") as file:
    header_to_copy = next(csv.reader(file, delimiter=";"))

# Iterate through each CSV file in the folder and copy the header
csv_files = [f for f in os.listdir(folder_path) if f.endswith(".csv")]
for file_name in csv_files:
    file_path = os.path.join(folder_path, file_name)
    data = []

    with open(file_path, "r", newline="") as file:
        reader = csv.reader(file, delimiter=";")
        for row in reader:
            data.append(row)

    # Update the header in the current file
    original_header = data[0]
    data[0] = header_to_copy

    # Print the differences between the original header and the copied header
    differences = set(original_header) ^ set(header_to_copy)
    print(f"Processed: {file_name}")
    print("Differences:")
    print("+----------------+----------------+")
    print("| Original Header | Copied Header  |")
    print("+----------------+----------------+")

    for item in differences:
        original_present = "Yes" if item in original_header else "No"
        copied_present = "Yes" if item in header_to_copy else "No"
        print(f"| {item:<16}| {original_present:<16}| {copied_present:<16}|")

    print("+----------------+----------------+")

    # Print the position of each element in the header if it was not in the original header
    position_dict = {element: position for position, element in enumerate(header_to_copy, start=1)}
    for element in differences:
        if element in header_to_copy:
            position = position_dict[element]
            print(f"Element '{element}' is at position {position} in the header.")

    # Add a new column with the value "0" from the second row onwards at the specified position
    for i in range(1, len(data)):
        for element in differences:
            if element in header_to_copy:
                position = position_dict[element]
                data[i].insert(position - 1, "0")

    # Write the modified data back to the CSV file
    with open(file_path, "w", newline="") as file:
        writer = csv.writer(file, delimiter=";")
        writer.writerows(data)

I was thinking of using the position or the number of ';' separators to decide where to insert a new '0' value, but it does not work in certain specific cases.
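One way to sidestep counting separators (a minimal sketch, assuming the example file names reference.csv and file1.csv, and a hypothetical output file file1_merged.csv) is to let csv.DictReader key each row by the file's own header and csv.DictWriter, given the reference field names and restval="0", fill in every missing column:

import csv

# Read the reference header (e.g. "a;b;c;d") from reference.csv.
with open("reference.csv", newline="") as ref_file:
    reference_header = next(csv.reader(ref_file, delimiter=";"))

# Key every row of file1.csv by its own header names, stripped of
# whitespace so that a header like 'c ' still matches 'c'.
with open("file1.csv", newline="") as in_file:
    rows = [{key.strip(): value for key, value in row.items()}
            for row in csv.DictReader(in_file, delimiter=";")]

# Write the rows out under the reference header; restval="0" fills any
# column that was missing from file1.csv with the value 0.
# (file1_merged.csv is just a placeholder output name.)
with open("file1_merged.csv", "w", newline="") as out_file:
    writer = csv.DictWriter(out_file, fieldnames=reference_header,
                            delimiter=";", restval="0")
    writer.writeheader()
    writer.writerows(rows)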

The pandas attempt:

import pandas as pd
import os

# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)

# Define the folder containing the CSV files (same location as the script)
folder_path = script_directory

# Path to the reference.csv file
reference_csv = os.path.join(script_directory, "ref", "reference.csv")

# Read the reference.csv file
reference_df = pd.read_csv(reference_csv)

# List all CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith(".csv")]

# Initialize an empty list to store DataFrames
dfs = []

# Iterate through the CSV files in the folder
for csv_file in csv_files:
    # Construct the full path to the current CSV file
    csv_file_path = os.path.join(folder_path, csv_file)
    
    # Read the current CSV file
    current_df = pd.read_csv(csv_file_path)
    
    # Copy the header row from the reference DataFrame
    reference_header = reference_df.iloc[0].copy()
    
    # Find the positions where the headers differ
    diff_positions = [i for i, (ref_col, cur_col) in enumerate(zip(reference_header, current_df.columns)) if ref_col != cur_col]
    
    # Print the positions if there are differences
    if diff_positions:
        print(f"Differences in headers for {csv_file}:")
        print(diff_positions)
    
    # Concatenate the reference header row with the current DataFrame and fill missing columns with 0
    current_df = pd.concat([reference_header, current_df], ignore_index=True, axis=0).fillna(0)
    
    # Append the merged DataFrame to the list
    dfs.append(current_df)

# Concatenate all DataFrames in the list along columns
result_df = pd.concat(dfs, axis=1, ignore_index=True)

# Save the merged DataFrame to a new CSV file
result_df.to_csv(os.path.join(script_directory, "merged_data.csv"), index=False)

But I get this error:

  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 51 fields in line 18153, saw 52

Thanks

python csv merge datatable
1 Answer

0 votes
Once we make sure that the column names are the same, pandas' concat is the simplest option. In the second file, the header row has trailing whitespace.

>>> from io import StringIO
>>> file1="""a;b;c;d"""
>>> file2="""
... a;c
... 45;68"""
>>>
>>> import pandas as pd
>>> df_ref=pd.read_csv(StringIO(file1),sep=';')
>>> df_data=pd.read_csv(StringIO(file2),sep=';')

The c column in the second file contains trailing whitespace:

>>> df_data.columns
Index(['a', 'c '], dtype='object')

We need to trim it so that the columns match:

>>> # Trim whitespace in the column names
>>> df_data.columns=[c.strip() for c in df_data.columns]

After that, a single call to pd.concat merges the two dataframes:

>>> df_combined=pd.concat([df_ref,df_data])
>>> df_combined
    a    b   c    d
0  45  NaN  68  NaN

The missing values can then be replaced with 0 using fillna:

df_combined=df_combined.fillna(0)

We can write the combined data to a file with to_csv:

df_combined.to_csv(target_path,sep=';',index=False)
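One caveat: depending on the dtypes pandas infers after the concat, the columns that were filled in from NaN may come out as floats, so the file would contain 0.0 instead of 0. A minimal sketch of one way to force integer output, assuming every column holds integer data:

# Assumption: all columns are numeric, so a plain integer cast is safe.
df_combined = df_combined.fillna(0).astype(int)
df_combined.to_csv(target_path, sep=';', index=False)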

Concatenating many files

pd.concat works on any sequence of dataframes. We can use Path.rglob to load all the files in turn and concatenate them:

import pandas as pd
from pathlib import Path

root=Path('/path/to/folder')

df_all=pd.concat([pd.read_csv(csv_path,sep=';') for csv_path in root.rglob("*.csv")])

To clean up the data, we can use a separate function:

def load_with_trim(csv_path):
    df=pd.read_csv(csv_path,sep=';')
    df.columns=[c.strip() for c in df.columns]
    return df


all_files=[load_with_trim(csv_path) for csv_path in root.rglob("*.csv")]
df_all=pd.concat(all_files)
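Putting the pieces together, one possible end-to-end sketch (the reference.csv location, the skip list, and the merged_data.csv output name are placeholders, not anything prescribed above):

import pandas as pd
from pathlib import Path

root = Path('/path/to/folder')

# Reference frame: only the full set of column names, no data rows.
df_ref = pd.read_csv(root / 'reference.csv', sep=';')

# Load every data file with trimmed headers, skipping the reference file
# and any previously written output.
skip = {'reference.csv', 'merged_data.csv'}
data_frames = [load_with_trim(p) for p in root.rglob('*.csv') if p.name not in skip]

# Concatenating against the reference frame adds the missing columns as NaN,
# which fillna(0) then turns into 0.
df_all = pd.concat([df_ref] + data_frames).fillna(0)
df_all.to_csv(root / 'merged_data.csv', sep=';', index=False)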