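I need to merge multiple .xlsx files into a single workbook, where each worksheet must be named after the file it came from.
Current problem
The code below becomes slow after a few files and uses a lot of memory.
Attempted solutions
Closing the Excel file, deleting the DataFrame, and running gc manually did not help.
Code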
import pandas as pd
import openpyxl
import os
import gc as gc

print("Copying sheets from multiple files to one file")
dir_input = 'D:/MeusProjetosJava/Importacao/'
dir_output = "Integrados/combined.xlsx"
cwd = os.path.abspath(dir_input)
files = os.listdir(cwd)

df_total = pd.DataFrame()
df_total.to_excel(dir_output) #create a new file
workbook=openpyxl.load_workbook(dir_output)
ss_sheet = workbook['Sheet1']
ss_sheet.title = 'TempExcelSheetForDeleting'
workbook.save(dir_output)

for file in files: # loop through Excel files
    if file.endswith('.xls') or file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(cwd+"/"+file)
        sheets = excel_file.sheet_names
        for sheet in sheets:
            sheet_name = str(file.title())
            sheet_name = sheet_name.replace(".xlsx","").lower()
            sheet_name = sheet_name.removesuffix(".xlsx")
            print(file, sheet_name)
            df = excel_file.parse(sheet_name = sheet)
            with pd.ExcelWriter(dir_output,mode='a') as writer:
                df.to_excel(writer, sheet_name=f"{sheet_name}", index=False)
            del df
        excel_file.close()
        del excel_file
        sheets = None
        gc.collect()

workbook=openpyxl.load_workbook(dir_output)
std=workbook["TempExcelSheetForDeleting"]
workbook.remove(std)
workbook.save(dir_output)
print("all done")
** Answer **
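I think your code is more complicated than it needs to be and creates some unnecessary temporary objects. I would first try a simpler approach: open a single pandas ExcelWriter and write every sheet through it, so the template code would look like this. Are your files really large enough to cause memory problems?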
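# dir_output is a misleading name: it is the final output file, not a directory
# mode='a' appends to the existing workbook, which is opened only once here
with pd.ExcelWriter(dir_output, mode='a') as writer:
    for file in files:
        if file.endswith('.xls') or file.endswith('.xlsx'):
            # sheet name = file name without its extension
            cur_sheet = os.path.splitext(file)[0]
            # read from the input directory, not the current working directory
            df = pd.read_excel(os.path.join(cwd, file))
            df.to_excel(writer, sheet_name=cur_sheet)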
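
A likely reason the original loop slows down: with the openpyxl engine, every `pd.ExcelWriter(..., mode='a')` loads the whole existing workbook and saves it again when the `with` block exits, so each added sheet costs more than the previous one. The sketch below is one way to write the whole thing with a single writer opened in the default write mode, which also makes the temporary placeholder sheet unnecessary. The function name `combine_workbooks` is hypothetical, and the sketch assumes that only the first sheet of each .xlsx file is needed and that file names stay unique after being cut to Excel's 31-character sheet-name limit.

```python
import os

import pandas as pd


def combine_workbooks(input_dir, output_path):
    """Copy the first sheet of every .xlsx file in input_dir into one workbook.

    Hypothetical helper, not part of the original post. Each output sheet is
    named after its source file (extension removed, cut to 31 characters).
    Returns the list of sheet names that were written.
    """
    output_abs = os.path.abspath(output_path)
    names = sorted(
        name for name in os.listdir(input_dir)
        if name.lower().endswith(".xlsx")
        and not name.startswith("~$")  # skip Excel lock files
        # skip the output file if it lives in the input directory
        and os.path.abspath(os.path.join(input_dir, name)) != output_abs
    )
    if not names:
        # a workbook needs at least one sheet, so write nothing
        return []

    written = []
    # Default mode="w": the workbook is created once and saved once,
    # when the with-block exits.
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        for name in names:
            sheet_name = os.path.splitext(name)[0][:31]
            df = pd.read_excel(os.path.join(input_dir, name))
            df.to_excel(writer, sheet_name=sheet_name, index=False)
            written.append(sheet_name)
    return written
```

With the paths from the question, the call would be `combine_workbooks('D:/MeusProjetosJava/Importacao/', 'Integrados/combined.xlsx')`. Only one DataFrame is held at a time, although openpyxl still keeps the output workbook in memory until it is saved.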