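I am trying to extract data from a table in a Word .docx file and convert it to a DataFrame using Python. Note: the text is Arabic, so I am using encode('utf-8').
So far I can open the .docx file and get the table (it has 13 columns), but I cannot display the text.
Where is the error in my code?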
import pandas as pd
import docx

document = docx.Document(path)
table = document.tables[0]
print(table)

data = []
for row in table.rows:
    text = (row.cells[0].paragraphs[0].text.encode('utf-8'))
    data.append(text)
    print(data)
df = pd.DataFrame(data)
Result:
[b'']
[b'', b'']
[b'', b'', b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xb7\xd8\xa7\xd8\xaa']
[b'', b'', b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xb7\xd8\xa7\xd8\xaa', b'']
[b'', b'', b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xb7\xd8\xa7\xd8\xaa', b'', b'\xd9\x81\xd8\xb1\xd8\xb9\xd8\xa7\xd9\x84\xd8\xaa\xd8\xad\xd9\x84\xd9\x8a\xd9\x84\xc2\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n']
[b'', b'', b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xb7\xd8\xa7\xd8\xaa', b'', b'\xd9\x81\xd8\xb1\xd8\xb9\xd8\xa7\xd9\x84\xd8\xaa\xd8\xad\xd9\x84\xd9\x8a\xd9\x84\xc2\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n', b'']
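First, you should remove .encode('utf-8'). It turns each string into a bytes object, which is why your result is printed as escaped bytes instead of readable text.
Second, you are only accessing the first column of each row (row.cells[0]).
You need to add another for loop to go through the columns, like this: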
for row_index, row in enumerate(table.rows):
    data.append([])
    for col_index in range(13):  # Loop through the 13 columns
        cell_text = row.cells[col_index].paragraphs[0].text
        data[row_index].append(cell_text)
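To see why dropping .encode('utf-8') makes the text readable, here is a minimal standard-library sketch. It takes one of the byte strings from the question's output and decodes it back to a str; the variable names are illustrative only and do not come from the original code.

```python
# One of the byte strings printed in the question's result.
raw = b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xb7\xd8\xa7\xd8\xaa'

# Decoding the bytes gives back the readable Arabic text (a str).
text = raw.decode('utf-8')

# Encoding the str again reproduces the escaped bytes seen in the output.
round_trip = text.encode('utf-8')

print(type(raw).__name__, raw)    # bytes, shown as \x.. escapes
print(type(text).__name__, text)  # str, shown as Arabic characters
```

Python 3 strings are already Unicode, so the text returned by python-docx can be stored in the list directly without any encoding step.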
import pandas as pd
from docx import Document


def extract_table_to_dataframe(docx_file):
    # Open the .docx file
    doc = Document(docx_file)
    # Collect one DataFrame per table
    frames = []
    # Iterate through all the tables in the document
    for table in doc.tables:
        # Build a list of rows, each a list of stripped cell texts
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
        frames.append(pd.DataFrame(table_data))
    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
# Call the function to extract table data into a DataFrame
docx_file = 'Sample.docx' # Replace with your .docx file path
result_df = extract_table_to_dataframe(docx_file)
# Print or further process the DataFrame
print(result_df)
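The question's output also shows a non-breaking space (\xc2\xa0) and a long run of newlines at the end of one cell. strip() removes these at the ends of a string, but not in the middle. As a rough sketch, a small helper (normalize_cell_text is a hypothetical name, not part of python-docx or pandas) could collapse all whitespace before the values go into the DataFrame:

```python
def normalize_cell_text(text: str) -> str:
    """Collapse every run of whitespace (spaces, NBSP, newlines) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space U+00A0.
    return " ".join(text.split())


# Cell content taken from the question's output, decoded to a str.
raw_cell = (
    b'\xd9\x81\xd8\xb1\xd8\xb9\xd8\xa7\xd9\x84\xd8\xaa\xd8\xad\xd9\x84'
    b'\xd9\x8a\xd9\x84\xc2\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n'
).decode('utf-8')

cleaned = normalize_cell_text(raw_cell)
print(repr(raw_cell))
print(repr(cleaned))
```

In the function above, this would mean calling normalize_cell_text(cell.text) in place of cell.text.strip(), assuming that line breaks inside a cell do not need to be preserved.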