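I have a reference .csv file that contains all the headers.
I have several CSV files, which can have different headers.
I want to merge the reference headers into all the other CSV files. In addition, I want every column that gets added to be filled with the value 0 in the rows below the header.
Here is an example:
reference.csv (my headers)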
a;b;c;d
file1.csv
a;c
45;68
Expected result:
file1.csv
a;b;c;d
45;0;68;0
I tried this:
(I also tried pandas, but it did not work the way I wanted)
import os
import csv
# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)
# Define the folder containing the CSV files
folder_path = script_directory
# Define the path to the reference text file
reference_file_path = os.path.join(folder_path, "reference.txt")
# Read the reference columns from the text file
with open(reference_file_path, "r") as reference_file:
    reference_columns = reference_file.read().splitlines()
# Read the header from the file with the most columns
with open(os.path.join(folder_path, reference_file_path ), "r", newline="") as file:
    header_to_copy = next(csv.reader(file, delimiter=";"))
# Iterate through each CSV file and copy the header
for file_name in reference_file_path:
    file_path = os.path.join(folder_path, file_name)
    data = []
    with open(file_path, "r", newline="") as file:
        reader = csv.reader(file, delimiter=";")
        for row in reader:
            data.append(row)
    # Update the header in the current file
    original_header = data[0]
    data[0] = header_to_copy
    # Print the differences between the original header and the copied header
    differences = set(original_header) ^ set(header_to_copy)
    print(f"Processed: {file_name}")
    print("Differences:")
    print("+----------------+----------------+")
    print("| Original Header | Copied Header |")
    print("+----------------+----------------+")
    for item in differences:
        original_present = "Yes" if item in original_header else "No"
        copied_present = "Yes" if item in header_to_copy else "No"
        print(f"| {item:<16}| {original_present:<16}| {copied_present:<16}|")
    print("+----------------+----------------+")
    # Print the position of each element in the header if it was not in the original header
    position_dict = {element: position for position, element in enumerate(header_to_copy, start=1)}
    for element in differences:
        if element in header_to_copy:
            position = position_dict[element]
            print(f"Element '{element}' is at position {position} in the header.")
    # Add a new column with the value "0" from the second row onwards at the specified position
    for i in range(1, len(data)):
        for element in differences:
            if element in header_to_copy:
                position = position_dict[element]
                data[i].insert(position - 1, "0")
    # Write the modified data back to the CSV file
    with open(file_path, "w", newline="") as file:
        writer = csv.writer(file, delimiter=";")
        writer.writerows(data)
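I was thinking of using the position or the count of the ";" characters to insert a new "0" value, but that does not work in certain cases.
I also tried pandas: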
import pandas as pd
import os
# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)
# Define the folder containing the CSV files (same location as the script)
folder_path = script_directory
# Path to the reference.csv file
reference_csv = os.path.join(script_directory, "ref", "reference.csv")
# Read the reference.csv file
reference_df = pd.read_csv(reference_csv)
# List all CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith(".csv")]
# Initialize an empty list to store DataFrames
dfs = []
# Iterate through the CSV files in the folder
for csv_file in csv_files:
    # Construct the full path to the current CSV file
    csv_file_path = os.path.join(folder_path, csv_file)
    # Read the current CSV file
    current_df = pd.read_csv(csv_file_path)
    # Copy the header row from the reference DataFrame
    reference_header = reference_df.iloc[0].copy()
    # Find the positions where the headers differ
    diff_positions = [i for i, (ref_col, cur_col) in enumerate(zip(reference_header, current_df.columns)) if ref_col != cur_col]
    # Print the positions if there are differences
    if diff_positions:
        print(f"Differences in headers for {csv_file}:")
        print(diff_positions)
    # Concatenate the reference header row with the current DataFrame and fill missing columns with 0
    current_df = pd.concat([reference_header, current_df], ignore_index=True, axis=0).fillna(0)
    # Append the merged DataFrame to the list
    dfs.append(current_df)
# Concatenate all DataFrames in the list along columns
result_df = pd.concat(dfs, axis=1, ignore_index=True)
# Save the merged DataFrame to a new CSV file
result_df.to_csv(os.path.join(script_directory, "merged_data.csv"), index=False)
But I get this error:
File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 51 fields in line 18153, saw 52
Thanks
If the column names are the same, pandas' concat is the simplest option. In the second file, however, the header row has trailing whitespace.
>>> import pandas as pd
>>> from io import StringIO
>>> file1 = """a;b;c;d"""
>>> # Note the trailing space after "c" in the header line
>>> file2 = "a;c \n45;68"
>>>
>>> df_ref = pd.read_csv(StringIO(file1), sep=';')
>>> df_data = pd.read_csv(StringIO(file2), sep=';')
The name of the c column in the second file contains a trailing space.
>>> df_data.columns
Index(['a', 'c '], dtype='object')
We need to trim it so that the columns match:
>>> # Trim whitespace in the column names
>>> df_data.columns=[c.strip() for c in df_data.columns]
After that, a single call to pd.concat combines the two dataframes.
>>> df_combined=pd.concat([df_ref,df_data])
>>> df_combined
    a    b   c    d
0  45  NaN  68  NaN
Missing values can be replaced with 0 using fillna:
df_combined=df_combined.fillna(0)
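We can use to_csv to write the combined data to a file, passing index=False so that the row index is not written as an extra column:
df_combined.to_csv(target_path, sep=';', index=False)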
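If the goal is to rewrite each file with the reference header rather than to build one combined dataframe, DataFrame.reindex may be a more direct fit than concat. The following is only a minimal sketch under that assumption: the helper name align_to_reference is hypothetical, and it assumes that columns absent from the reference header can be dropped.

```python
from io import StringIO

import pandas as pd


def align_to_reference(df, reference_columns):
    """Return df with exactly the reference columns, in reference order.

    Hypothetical helper: columns missing from df are created and filled
    with 0; columns not listed in reference_columns are dropped.
    """
    df = df.copy()
    # Trim stray whitespace so that "c " matches "c"
    df.columns = [str(c).strip() for c in df.columns]
    # reindex adds the missing columns and fills only those with 0
    return df.reindex(columns=reference_columns, fill_value=0)


reference_columns = ["a", "b", "c", "d"]
df_data = pd.read_csv(StringIO("a;c \n45;68"), sep=";")
df_aligned = align_to_reference(df_data, reference_columns)

# Serialize with the same ";" separator and without the index column
csv_text = df_aligned.to_csv(sep=";", index=False)
print(csv_text)
```

Note that fill_value only applies to the newly created columns. Values that are missing inside columns the file already had stay as NaN unless fillna is called as well.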
Concatenating many files
pd.concat works with any sequence of dataframes. We can use Path.rglob to load all the files one after another and concatenate them.
from pathlib import Path
root=Path('/path/to/folder')
df_all=pd.concat([pd.read_csv(csv_path,sep=';') for csv_path in root.rglob("*.csv")])
To clean up the data, we can use a separate function:
def load_with_trim(csv_path):
    df = pd.read_csv(csv_path, sep=';')
    df.columns = [c.strip() for c in df.columns]
    return df
all_files=[load_with_trim(csv_path) for csv_path in root.rglob("*.csv")]
df_all=pd.concat(all_files)
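The question asks for each file to keep its own rows and only gain the missing columns, so a per-file variant of the same idea may be closer to what is wanted. This is a hedged sketch, not a tested solution for the real data: the function names read_header and align_folder are hypothetical, the reference file is assumed to hold a single ";"-separated header line, and every other *.csv file directly inside the folder is overwritten in place.

```python
from pathlib import Path

import pandas as pd


def read_header(reference_path, sep=";"):
    # nrows=0 parses only the header line of the reference file
    columns = pd.read_csv(reference_path, sep=sep, nrows=0).columns
    return [str(c).strip() for c in columns]


def align_folder(folder, reference_path, sep=";"):
    """Rewrite every CSV in folder so it has the reference header.

    Hypothetical helper: added columns are filled with 0, and columns
    that are not part of the reference header are dropped.
    """
    reference_path = Path(reference_path)
    reference_columns = read_header(reference_path, sep)
    for csv_path in sorted(Path(folder).glob("*.csv")):
        # Do not rewrite the reference file itself
        if csv_path.resolve() == reference_path.resolve():
            continue
        df = pd.read_csv(csv_path, sep=sep)
        df.columns = [str(c).strip() for c in df.columns]
        df = df.reindex(columns=reference_columns, fill_value=0)
        # Overwrites the original file; work on a copy of the folder first
        df.to_csv(csv_path, sep=sep, index=False)
```

A call such as align_folder(root, root / "reference.csv") would then process the folder. Passing sep=";" to read_csv matters here: the pandas attempt in the question reads the files with the default "," separator, which is a plausible (though unconfirmed) cause of the tokenizing error.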