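I have an Excel workbook in which some sheets contain insurance plan details as text or tables, some sheets contain images or screenshots of the details, and some sheets contain linked Excel files (inserted as objects).
Using Python, how can I iterate over each sheet and, if a sheet has linked Excel files, store them in another directory?
I tried this code: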
import os
import shutil
import pathlib
import zipfile
import openpyxl
import re

def extract_linked_excel_files(path, output_folder_name='extracted_excel_files'):
    """
    Extracts linked Excel files from an Excel file and stores them in a single folder.
    Args:
        path (pathlib.Path or str): Excel file path.
        output_folder_name (str): Name of the folder to store the extracted Excel files.
            Defaults to 'extracted_excel_files'.
    Returns:
        new_paths (list[pathlib.Path]): List of paths to the extracted Excel files.
    """
    # Convert path to pathlib.Path if it's a string
    if isinstance(path, str):
        path = pathlib.Path(path)
    # Check if the file has the '.xlsx' extension
    if path.suffix != '.xlsx':
        raise ValueError('Path must be an xlsx file')
    # Extract the filename (excluding the extension) using .stem
    name = path.stem
    # Create a new folder for the extracted Excel files
    output_folder = path.parent / output_folder_name
    output_folder.mkdir(exist_ok=True)  # Create folder if it doesn't exist
    # Open the workbook
    workbook = openpyxl.load_workbook(path, read_only=True)
    # List to store the paths of the extracted Excel files
    new_paths = []
    try:
        # Iterate through all sheets in the workbook
        for sheet_name in workbook.sheetnames:
            sheet = workbook[sheet_name]
            # Iterate through all rows in the sheet
            for row in sheet.iter_rows():
                # Iterate through all cells in the row
                for cell in row:
                    # Check if the cell contains a hyperlink
                    if cell.hyperlink:
                        # Check if the hyperlink is an Excel file
                        hyperlink_target = cell.hyperlink.target
                        if re.search(r'\.xlsx$', hyperlink_target, re.IGNORECASE):
                            linked_file_path = hyperlink_target.replace('/', os.path.sep)
                            # Construct paths for the linked Excel file and the new destination
                            linked_file_full_path = path.parent / linked_file_path
                            new_path = output_folder / linked_file_path
                            # Copy the linked Excel file to the output folder
                            shutil.copy(linked_file_full_path, new_path)
                            # Store the new path in the list
                            new_paths.append(new_path)
    finally:
        # Close the workbook
        workbook.close()
    # Return the list of paths to the extracted Excel files
    return new_paths
However, this just returns an empty result.
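Your code seems to be on the right track, but there are several possible reasons why it is not finding the linked Excel files. Here are some suggestions for troubleshooting and improving the code:
Check the hyperlink targets: Make sure the hyperlinks in the worksheets point to the correct Excel files, and verify that each target really ends with the .xlsx extension.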
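One way to check the targets without opening Excel is to read them straight from the file. An .xlsx file is a ZIP archive, and external hyperlink targets are normally recorded in the worksheet relationship parts under xl/worksheets/_rels/. The following is only a sketch under that assumption; the function name list_hyperlink_targets is made up for illustration and is not part of openpyxl:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts and the relationship type used for hyperlinks.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def list_hyperlink_targets(xlsx_path):
    """Return (rels_part_name, target) pairs for every hyperlink relationship.

    Hypothetical helper: it only inspects worksheet relationship parts.
    """
    found = []
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            # Worksheet relationship parts live under xl/worksheets/_rels/.
            if not (name.startswith("xl/worksheets/_rels/") and name.endswith(".rels")):
                continue
            root = ET.fromstring(archive.read(name))
            for rel in root.iter(f"{RELS_NS}Relationship"):
                if rel.get("Type") == HYPERLINK_TYPE:
                    found.append((name, rel.get("Target")))
    return found
```

If this returns an empty list for your workbook, the sheets probably contain no hyperlinks at all, and the .xlsx check in your loop is never reached.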
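Use full paths for comparison: When comparing paths, it is best to use full paths to avoid problems with relative paths. Do the comparison with the resolve() method on both paths:
if linked_file_full_path.resolve() == path.resolve():
    # This is the original Excel file; skip it
    continue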
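To see why resolve() matters, here is a small standalone sketch. The helper name is_same_file is hypothetical; it simply wraps the comparison so that two different spellings of the same location compare equal:

```python
from pathlib import Path


def is_same_file(candidate, original):
    """True when both paths resolve to the same absolute location."""
    # resolve() makes the path absolute and collapses '..' segments,
    # so different relative spellings of one file become identical.
    return Path(candidate).resolve() == Path(original).resolve()
```

For example, is_same_file("sub/../book.xlsx", "book.xlsx") is true, while a plain comparison of the two Path objects is not.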
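Debug output: Add a few print statements to help debug the code. For example, print linked_file_full_path and new_path to check that they are constructed correctly:
print(f"Linked file: {linked_file_full_path}")
print(f"New path: {new_path}")
This can help identify any problems with path construction.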
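Handle different path separators: Because the hyperlinks in your Excel file may use forward slashes (/) as separators, you should normalize them with os.path.sep:
linked_file_path = hyperlink_target.replace('/', os.path.sep)
Make sure you use the correct path separator for your operating system.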
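As an alternative to replacing separators by hand, pathlib can split the target for you. This is a sketch that assumes the targets are relative paths; the helper name to_relative_path is invented for illustration:

```python
from pathlib import Path, PureWindowsPath


def to_relative_path(hyperlink_target):
    """Convert a target written with forward or backward slashes to a native Path.

    Assumes a relative target; absolute paths with a drive letter are not handled.
    """
    # PureWindowsPath treats both '/' and '\' as separators on every platform,
    # so either style is split into the same components.
    parts = PureWindowsPath(hyperlink_target).parts
    return Path(*parts)
```

The result can be joined directly onto path.parent or output_folder, whichever separator style the workbook used.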
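After making these adjustments, you should be able to identify and fix whatever is preventing the linked Excel files from being extracted. If the problem persists, print additional information while debugging to help pinpoint it.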
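One more possibility, offered as an assumption rather than a diagnosis: files that were inserted as objects are usually embedded inside the workbook package rather than stored as cell hyperlinks, so a loop over cell.hyperlink may never see them. Since an .xlsx file is a ZIP archive, embedded workbooks can typically be found under xl/embeddings/. A minimal sketch using only the standard library follows; the function name extract_embedded_workbooks and the suffix set are hypothetical, and objects stored as OLE containers (for example .bin files) are skipped:

```python
import zipfile
from pathlib import Path

# Assumed set of extensions that count as "Excel files" for this sketch.
EMBEDDED_SUFFIXES = {".xlsx", ".xlsm", ".xls"}


def extract_embedded_workbooks(xlsx_path, output_dir):
    """Copy workbook-like members of xl/embeddings/ into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(xlsx_path) as archive:
        for info in archive.infolist():
            name = info.filename
            # Only look at files stored in the embeddings folder of the package.
            if not name.startswith("xl/embeddings/") or info.is_dir():
                continue
            if Path(name).suffix.lower() not in EMBEDDED_SUFFIXES:
                continue
            # Write only the base name so the archive's folder layout is not recreated.
            target = output_dir / Path(name).name
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return extracted
```

This does not tell you which sheet each embedded file belongs to; mapping files to sheets would require reading the worksheet relationship parts as well.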