Efficiently copying multiple sheets from multiple Excel files into a single Excel file

Question (votes: 0, answers: 3)

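I am trying to merge multiple Excel files (.xlsx) into a single .xlsx file. Each file has about 100 sheets, and I want to keep those sheets separate in the merged file.

For example, "excel_1.xlsx" has sheets named "1" to "100", and "excel_2.xlsx" has sheets named "101" to "200". When I merge the two files, the new file "excel_merged.xlsx" should have 200 sheets, named "1" to "200".

Below is the code I wrote, but the merge takes far too long to finish.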

import pandas as pd
import os
import time

# read excel file
excel_files = []
file_name = 'output_240326_pivot_{i}.xlsx'
for i in range(64):
    file_path = './our_data/240403/' + file_name.format(i=i)
    excel_files.append(file_path)

# final merging path
final_excel_path = './our_data/240403/240403_final_for_tgt.xlsx'

# start time count for total merging
start_time = time.time()

with pd.ExcelWriter(final_excel_path) as writer:
    # for each excel file
    for file in excel_files:
        # start time count
        file_start_time = time.time()
    
        # Read every sheet in the Excel file
        xls = pd.ExcelFile(file)
        for sheet_name in xls.sheet_names:
            df = pd.read_excel(xls, sheet_name)
            # Save each sheet separately using ExcelWriter
            df.to_excel(writer, sheet_name=sheet_name, index=False)
    
        # Time spent for merging current file
        file_end_time = time.time()
        print(f"Completed merging {os.path.basename(file)} in {file_end_time - file_start_time:.2f} seconds.")

# Total time for merging
end_time = time.time()
print(f"All sheets combined into {final_excel_path} in {end_time - start_time:.2f} seconds.")

When I run this code, the time taken to merge each file seems to grow like a Fibonacci sequence. Why is that?

Completed merging output_240326_pivot_0.xlsx in 0.64 seconds.
Completed merging output_240326_pivot_1.xlsx in 1.15 seconds.
Completed merging output_240326_pivot_2.xlsx in 2.36 seconds.
Completed merging output_240326_pivot_3.xlsx in 3.98 seconds.
Completed merging output_240326_pivot_4.xlsx in 6.14 seconds.
Completed merging output_240326_pivot_5.xlsx in 8.80 seconds.
Completed merging output_240326_pivot_6.xlsx in 12.24 seconds.
Completed merging output_240326_pivot_7.xlsx in 16.37 seconds.
Completed merging output_240326_pivot_8.xlsx in 21.27 seconds.
Completed merging output_240326_pivot_9.xlsx in 27.38 seconds.
Completed merging output_240326_pivot_10.xlsx in 31.95 seconds.
Completed merging output_240326_pivot_11.xlsx in 38.43 seconds.
Completed merging output_240326_pivot_12.xlsx in 45.42 seconds.
Completed merging output_240326_pivot_13.xlsx in 53.47 seconds.
Completed merging output_240326_pivot_14.xlsx in 61.85 seconds.
Completed merging output_240326_pivot_15.xlsx in 71.27 seconds.
Completed merging output_240326_pivot_16.xlsx in 81.11 seconds.
Completed merging output_240326_pivot_17.xlsx in 91.49 seconds.
Completed merging output_240326_pivot_18.xlsx in 102.94 seconds.

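P.S. I also searched Stack Overflow for related questions but could not find an answer that fits my case. Would it be faster to split each sheet into a separate .csv file and then combine those into the sheets of one Excel file?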
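
The Fibonacci guess can be checked with a little arithmetic on the timings posted above. This is only a sketch that reads the posted log; it does not identify the cause. The two tests used here (ratio of consecutive timings, and timing divided by the squared file count) are my own choice of diagnostics, not something from the original post.

```python
# Per-file timings copied from the log above (seconds), files 0..18
times = [0.64, 1.15, 2.36, 3.98, 6.14, 8.80, 12.24, 16.37, 21.27, 27.38,
         31.95, 38.43, 45.42, 53.47, 61.85, 71.27, 81.11, 91.49, 102.94]

# Fibonacci-like growth would keep the ratio of consecutive terms near 1.618
ratios = [later / earlier for earlier, later in zip(times, times[1:])]

# Quadratic growth would keep t_k / (k + 1)**2 roughly constant
scaled = [t / (k + 1) ** 2 for k, t in enumerate(times)]

print("last ratios:", [round(r, 2) for r in ratios[-3:]])
print("t/(k+1)^2 for later files:", [round(s, 3) for s in scaled[-3:]])
```

For this log the consecutive ratios fall to roughly 1.13 by the last files, while t/(k+1)^2 stays between about 0.24 and 0.29 from file 4 onward. The growth therefore looks closer to quadratic in the number of files already written than to Fibonacci, which suggests (without proving) that each new sheet costs more the more sheets the output workbook already holds.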

python pandas excel performance pandas.excelwriter
3 Answers
0
votes

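I processed 3 different files and merged their sheets into one file. The only difference is the engine argument passed to ExcelWriter. Code snippet:

import pandas as pd
import time
import os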

# Path to the source Excel file
source_excel_path = ['data/Employee Data.xlsx', 'data/Employee Data 1.xlsx', 'data/Employee Data 2.xlsx']

# Path to the destination Excel file
destination_excel_path = 'new_excel_file.xlsx'


# Create a Pandas Excel writer using Openpyxl as the engine
with pd.ExcelWriter(destination_excel_path, engine='openpyxl') as writer:
    for file in source_excel_path:
        # start time count
        file_start_time = time.time()

        xls = pd.ExcelFile(file)

        for sheet_name in xls.sheet_names:
            # Read the specific sheet into a DataFrame
            df = pd.read_excel(xls, sheet_name=sheet_name)

            # Write the DataFrame to the new Excel file using the same sheet name
            df.to_excel(writer, sheet_name=sheet_name, index=False)

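        # Time spent merging the current file
        # Note: this prints the last sheet name of the file, not the file name;
        # pass `file` to os.path.basename to report the file instead
        file_end_time = time.time()
        print(f"Completed merging {os.path.basename(sheet_name)} in {file_end_time - file_start_time:.2f} seconds.")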

print("Sheets copied successfully.")

Each file took roughly the same time to process.

Completed merging Sheet 5 in 0.64 seconds.
Completed merging Sheet 5 in 0.65 seconds.
Completed merging Sheet 15 in 0.68 seconds.
Sheets copied successfully.
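
Since this answer attributes the difference to the writer engine, another option is to skip the pandas writer and copy cell values with openpyxl directly. The following is a minimal sketch under that assumption, not a measured fix: `merge_workbooks` is a hypothetical helper name, it copies cell values only (no formatting), and whether it is faster on the asker's 64 files has not been tested here.

```python
from openpyxl import Workbook, load_workbook


def merge_workbooks(source_paths, destination_path):
    """Copy every sheet's cell values from each source workbook into one new workbook."""
    # Write-only mode writes rows out as they are appended
    # instead of keeping every cell object in memory
    merged = Workbook(write_only=True)
    for path in source_paths:
        # Read-only mode reads rows lazily rather than loading the whole workbook
        source = load_workbook(path, read_only=True)
        try:
            for name in source.sheetnames:
                # Keep the original sheet name so the sheets stay separate
                target = merged.create_sheet(title=name)
                for row in source[name].iter_rows(values_only=True):
                    target.append(row)
        finally:
            source.close()
    merged.save(destination_path)
```

Calling `merge_workbooks(['excel_1.xlsx', 'excel_2.xlsx'], 'excel_merged.xlsx')` with the file names from the question would produce one workbook whose sheets appear in source order.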

-1
votes

Instead of using ExcelWriter, use pandas' built-in ability to iterate over the sheets of an Excel file. Here is an example snippet:

import pandas as pd
import time
import os
# The path to your Excel file
excel_file_path = "data/Employee Data.xlsx"

# Load the Excel file
xls = pd.ExcelFile(excel_file_path)

# List to hold data from each sheet
all_data = []

# Loop through each sheet in the Excel file
for sheet_name in xls.sheet_names:
    # start time count
    file_start_time = time.time()
    # Load the sheet into a DataFrame
    df = pd.read_excel(xls, sheet_name)
    # Append the data from this sheet to the list
    all_data.append(df)
    # Time spent reading the current sheet
    file_end_time = time.time()
    print(f"Completed merging {sheet_name} in {file_end_time - file_start_time:.2f} seconds.")

# Concatenate all the dataframes in the list into a single dataframe
merged_data = pd.concat(all_data)

# Optional: If you want to ignore the index or want a continuous index
merged_data.reset_index(drop=True, inplace=True)

# Save the merged data to a new Excel file
merged_data.to_excel('data/merged_excel_file.xlsx', index=False)

I ran the script on a single Excel file with 5 sheets, each containing 11 columns and 1,000 records.

Completed merging Sheet 1 in 0.05 seconds.
Completed merging Sheet 2 in 0.05 seconds.
Completed merging Sheet 3 in 0.06 seconds.
Completed merging Sheet 4 in 0.05 seconds.
Completed merging Sheet 5 in 0.05 seconds.

Each sheet took roughly the same time to process.
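
Note that concatenating everything into one sheet gives up the separate sheets the question asks for. If a single combined sheet is acceptable, a small variation can at least record which sheet each row came from. This is a sketch, with `concat_with_sheet_names` as a hypothetical helper; the in-memory dict stands in for the result of `pd.read_excel(path, sheet_name=None)`, which returns a mapping of sheet name to DataFrame.

```python
import pandas as pd


def concat_with_sheet_names(sheets):
    """Stack a {sheet_name: DataFrame} mapping, recording each row's source sheet."""
    # Passing a dict to concat uses its keys as the outer index level
    stacked = pd.concat(sheets, names=["sheet", "row"])
    # Move the sheet name out of the index into an ordinary column,
    # then replace the leftover per-sheet row index with a continuous one
    return stacked.reset_index(level="sheet").reset_index(drop=True)


# Stand-in for pd.read_excel(path, sheet_name=None)
sheets = {
    "Sheet 1": pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}),
    "Sheet 2": pd.DataFrame({"id": [3], "name": ["c"]}),
}
merged = concat_with_sheet_names(sheets)
print(merged)
```

The resulting frame has a leading `sheet` column, so the rows can still be filtered or split by their original sheet later.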


-1
votes

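The complexity seems to grow as the operation becomes more intensive. There is a nested for loop, and you are opening the same file inside the loop. You can optimize this by collecting all sheets from all files into a single DataFrame and writing it to Excel once at the end. See the sample code: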

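import pandas as pd

# Example input paths (file names taken from the question)
files = ['excel_1.xlsx', 'excel_2.xlsx']

all_sheets = []
for file in files:
    xls = pd.ExcelFile(file)
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name)
        all_sheets.append(df)
merged_df = pd.concat(all_sheets, ignore_index=True)

# writer.save() was removed in pandas 2.0; the context manager saves on exit
with pd.ExcelWriter('excel_merged.xlsx') as writer:
    merged_df.to_excel(writer, sheet_name="Combined_Sheet", index=False)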