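I am trying to merge multiple Excel files (.xlsx) into a single .xlsx file. Each file has about 100 sheets. When merging, I want to keep those sheets separate.
For example, "excel_1.xlsx" has sheets named "1" to "100", and "excel_2.xlsx" has sheets named "101" to "200". When I merge the two files, the new file "excel_merged.xlsx" should have 200 sheets, named "1" to "200".
Below is the code I wrote, but the merge takes far too long to finish.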
import pandas as pd
import os
import time

# Build the list of input Excel files
excel_files = []
file_name = 'output_240326_pivot_{i}.xlsx'
for i in range(64):
    file_path = './our_data/240403/' + file_name.format(i=i)
    excel_files.append(file_path)

# Path of the merged output file
final_excel_path = './our_data/240403/240403_final_for_tgt.xlsx'

# Start timing the whole merge
start_time = time.time()
with pd.ExcelWriter(final_excel_path) as writer:
    # Process each Excel file
    for file in excel_files:
        # Start timing this file
        file_start_time = time.time()
        # Read every sheet in the Excel file
        xls = pd.ExcelFile(file)
        for sheet_name in xls.sheet_names:
            df = pd.read_excel(xls, sheet_name)
            # Save each sheet separately using ExcelWriter
            df.to_excel(writer, sheet_name=sheet_name, index=False)
        # Time spent merging the current file
        file_end_time = time.time()
        print(f"Completed merging {os.path.basename(file)} in {file_end_time - file_start_time:.2f} seconds.")

# Total time for merging (includes saving the workbook when the writer closes)
end_time = time.time()
print(f"All sheets combined into {final_excel_path} in {end_time - start_time:.2f} seconds.")
When I run this code, the time spent on each file seems to grow like the Fibonacci sequence. Why is that?
Completed merging output_240326_pivot_0.xlsx in 0.64 seconds.
Completed merging output_240326_pivot_1.xlsx in 1.15 seconds.
Completed merging output_240326_pivot_2.xlsx in 2.36 seconds.
Completed merging output_240326_pivot_3.xlsx in 3.98 seconds.
Completed merging output_240326_pivot_4.xlsx in 6.14 seconds.
Completed merging output_240326_pivot_5.xlsx in 8.80 seconds.
Completed merging output_240326_pivot_6.xlsx in 12.24 seconds.
Completed merging output_240326_pivot_7.xlsx in 16.37 seconds.
Completed merging output_240326_pivot_8.xlsx in 21.27 seconds.
Completed merging output_240326_pivot_9.xlsx in 27.38 seconds.
Completed merging output_240326_pivot_10.xlsx in 31.95 seconds.
Completed merging output_240326_pivot_11.xlsx in 38.43 seconds.
Completed merging output_240326_pivot_12.xlsx in 45.42 seconds.
Completed merging output_240326_pivot_13.xlsx in 53.47 seconds.
Completed merging output_240326_pivot_14.xlsx in 61.85 seconds.
Completed merging output_240326_pivot_15.xlsx in 71.27 seconds.
Completed merging output_240326_pivot_16.xlsx in 81.11 seconds.
Completed merging output_240326_pivot_17.xlsx in 91.49 seconds.
Completed merging output_240326_pivot_18.xlsx in 102.94 seconds.
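P.S. I also searched Stack Overflow for related questions, but could not find an answer that fits my case. Would it be faster to split every sheet into a separate .csv file and then merge those into the sheets of one Excel file?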
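As a quick check on the "Fibonacci" impression, the timings in the log above can be tested numerically. This is only a sketch on the 19 values that were posted: if the growth were Fibonacci-like, each time would be close to the sum of the two before it and the ratio between consecutive times would settle near 1.618. The variable names are made up for illustration.

```python
# Per-file timings in seconds, copied from the log above (files 0 to 18).
times = [0.64, 1.15, 2.36, 3.98, 6.14, 8.80, 12.24, 16.37, 21.27, 27.38,
         31.95, 38.43, 45.42, 53.47, 61.85, 71.27, 81.11, 91.49, 102.94]

# Ratio of each timing to the previous one.
ratios = [b / a for a, b in zip(times, times[1:])]

# What a Fibonacci-like rule t[n] = t[n-1] + t[n-2] would predict for times[2:].
fib_predictions = [a + b for a, b in zip(times[:-2], times[1:-1])]

# Increase from one file to the next.
diffs = [b - a for a, b in zip(times, times[1:])]

print(f"last ratio: {ratios[-1]:.3f}")
print(f"Fibonacci-style prediction for file 18: {fib_predictions[-1]:.2f} s, "
      f"measured: {times[-1]:.2f} s")
print("increase per file:", [round(d, 2) for d in diffs])
```

On these numbers the ratio keeps falling toward 1 and the Fibonacci-style prediction overshoots the measured time by a wide margin, while the increase per file climbs fairly steadily. That pattern looks more like polynomial (roughly quadratic) growth in the number of sheets already written than like exponential growth.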
I processed 3 different files and merged their sheets into one file. The only difference is the engine argument of ExcelWriter. Code snippet:
import pandas as pd
import time
import os

# Paths to the source Excel files (raw strings so the backslashes are kept literally)
source_excel_path = [r'data\Employee Data.xlsx', r'data\Employee Data 1.xlsx', r'data\Employee Data 2.xlsx']
# Path to the destination Excel file
destination_excel_path = 'new_excel_file.xlsx'

# Create a pandas Excel writer using openpyxl as the engine
with pd.ExcelWriter(destination_excel_path, engine='openpyxl') as writer:
    for file in source_excel_path:
        # Start timing this file
        file_start_time = time.time()
        xls = pd.ExcelFile(file)
        for sheet_name in xls.sheet_names:
            # Read the specific sheet into a DataFrame
            df = pd.read_excel(xls, sheet_name=sheet_name)
            # Write the DataFrame to the new Excel file using the same sheet name
            df.to_excel(writer, sheet_name=sheet_name, index=False)
        # Time spent merging the current file
        file_end_time = time.time()
        # Note: this prints the last sheet name of the file, not the file name
        print(f"Completed merging {os.path.basename(sheet_name)} in {file_end_time - file_start_time:.2f} seconds.")

print("Sheets copied successfully.")
The processing time is roughly the same for each file.
Completed merging Sheet 5 in 0.64 seconds.
Completed merging Sheet 5 in 0.65 seconds.
Completed merging Sheet 15 in 0.68 seconds.
Sheets copied successfully.
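To compare engines on the real files while still keeping every sheet separate, the loop can be wrapped in a small function that takes the engine as a parameter. This is a minimal sketch, not a confirmed fix for the slowdown: `merge_workbooks` is a hypothetical helper name, it assumes sheet names are unique across all input files, and it loads each workbook in one call with `sheet_name=None`, which makes `pd.read_excel` return a dict of DataFrames keyed by sheet name.

```python
import pandas as pd


def merge_workbooks(sources, destination, engine="openpyxl"):
    """Copy every sheet of every source workbook into one destination workbook.

    Each source sheet stays a separate sheet. Returns the sheet names written,
    in order. (Hypothetical helper, written for illustration.)
    """
    written = []
    seen = set()
    with pd.ExcelWriter(destination, engine=engine) as writer:
        for path in sources:
            # sheet_name=None loads all sheets at once as {sheet name: DataFrame}
            sheets = pd.read_excel(path, sheet_name=None)
            for sheet_name, df in sheets.items():
                # Assumption: names must be unique, as in the "1".."200" example
                if sheet_name in seen:
                    raise ValueError(f"duplicate sheet name {sheet_name!r} in {path}")
                df.to_excel(writer, sheet_name=sheet_name, index=False)
                seen.add(sheet_name)
                written.append(sheet_name)
    return written
```

Calling it once with `engine="openpyxl"` and once with `engine="xlsxwriter"` (if that package is installed) on the full set of input files, with the same per-file timing as in the question, would show whether the growth depends on the engine.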
Instead of using ExcelWriter, use pandas' built-in iteration over the Excel sheets. Here is an example snippet:
import pandas as pd
import time
import os

# The path to your Excel file (raw string so the backslash is kept literally)
excel_file_path = r"data\Employee Data.xlsx"
# Load the Excel file
xls = pd.ExcelFile(excel_file_path)
# List to hold data from each sheet
all_data = []

# Loop through each sheet in the Excel file
for sheet_name in xls.sheet_names:
    # Start timing this sheet
    file_start_time = time.time()
    # Load the sheet into a DataFrame
    df = pd.read_excel(xls, sheet_name)
    # Append the data from this sheet to the list
    all_data.append(df)
    # Time spent reading the current sheet
    file_end_time = time.time()
    print(f"Completed merging {os.path.basename(sheet_name)} in {file_end_time - file_start_time:.2f} seconds.")

# Concatenate all the DataFrames in the list into a single DataFrame
merged_data = pd.concat(all_data)
# Optional: reset the index if you want a continuous index
merged_data.reset_index(drop=True, inplace=True)
# Save the merged data to a new Excel file
merged_data.to_excel('data/merged_excel_file.xlsx', index=False)
I ran the script on 5 sheets in a single Excel file, each sheet containing 11 columns and 1000 records.
Completed merging Sheet 1 in 0.05 seconds.
Completed merging Sheet 2 in 0.05 seconds.
Completed merging Sheet 3 in 0.06 seconds.
Completed merging Sheet 4 in 0.05 seconds.
Completed merging Sheet 5 in 0.05 seconds.
The processing time is roughly the same for each sheet.
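One thing to be aware of with the concatenation approach is that the rows no longer show which sheet they came from. A small sketch of one way to keep that information: passing a dict to `pd.concat` uses the dict keys as an outer index level, which can then be turned into an ordinary column. The two tiny DataFrames and the column name `sheet` are invented for illustration; in practice the dict would come from `pd.read_excel(path, sheet_name=None)`.

```python
import pandas as pd

# Stand-ins for the {sheet name: DataFrame} dict that
# pd.read_excel(path, sheet_name=None) returns.
sheets = {
    "Sheet 1": pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}),
    "Sheet 2": pd.DataFrame({"id": [3], "name": ["c"]}),
}

# Concatenating a dict uses its keys as an outer index level, named "sheet" here.
merged = pd.concat(sheets, names=["sheet", "row"])
# Turn that level into a regular column and drop the per-sheet row numbers.
merged = merged.reset_index(level="sheet").reset_index(drop=True)

print(merged)
```

The result is still a single table, so this does not give the separate sheets asked for in the question; it only preserves the origin of each row.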
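The complexity seems to grow as the operation gets heavier. There is a nested for loop, and you are opening the same file inside the loop. You can optimize this by collecting all sheets from all files into a single DataFrame and converting it to Excel only once at the end. See the example code:
import pandas as pd

# Placeholder input paths; replace with your own list of files
files = ["excel_1.xlsx", "excel_2.xlsx"]

all_sheets = []
for file in files:
    xls = pd.ExcelFile(file)
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name)
        all_sheets.append(df)
merged_df = pd.concat(all_sheets, ignore_index=True)
# writer.save() was removed in pandas 2.0; the context manager saves on exit
with pd.ExcelWriter("excel_merged.xlsx") as writer:
    merged_df.to_excel(writer, sheet_name="Combined_Sheet", index=False)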