Reading tables in an Excel file that are not formatted as tables

Question

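I am currently trying to read an Excel file that contains multiple tables. They are not saved as Excel tables; they are just stored as plain cell data that is laid out as tables (I don't know if that makes sense). I have got to the point where I can read the Excel file, but it also reads the empty cells, which I am not interested in. I only need to read the tables.

This is the code I have written so far: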

def separar_tablas(df): 
    tablas = [] 
    dfs = [] 
    tabla_actual = None

    for _, row in df.iterrows():
        for i, value in row.items():
            etiqueta = str(row[2])
    
            if pd.notnull(value):          
                if etiqueta.startswith('RuleTable'):
                    if pd.notnull(etiqueta):
                        tablas.append(etiqueta)
        
                        if tabla_actual is not None:
                            dfs.append(tabla_actual)
                    
                        tabla_actual = pd.DataFrame(columns=df.columns)
                    
                    tabla_actual = pd.concat([tabla_actual, row],axis = 1, ignore_index=True)
                    
        
    dfs.append(tabla_actual)
    return dfs

Answer 1

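What you can do is make two passes:

1. Get the coordinates of the cells where the tables start.
2. Loop over those coordinates to read the tables in bulk.

Warning: if one of your tables contains empty cells, you have to adapt the code to determine the size of the table first and only then read it cell by cell; otherwise the table may be cut off.

I have only tested this code with one Excel file, so let me know if there is a problem.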

import pandas as pd

# Read your file. header=None keeps the first sheet row as data, so every
# cell keeps its positional (row, column) coordinates.
df = pd.read_excel('data/test.xlsx', sheet_name='Sheet1', header=None)

# First pass: get table start coordinates
table_starts_coordinates = []
for row_pos, (_, row) in enumerate(df.iterrows()):
    for col_pos, value in enumerate(row):
        if str(value).startswith('RuleTable'):
            table_starts_coordinates.append((row_pos, col_pos))

# Second pass: read each table starting from its top-left cell
tables = []
for r, c in table_starts_coordinates:
    table_data = []
    # Walk down from the start row until the first column of the table is blank
    for i, (_, row) in enumerate(df.iloc[r:].iterrows()):
        if pd.isnull(row.iloc[c]):
            break
        # Walk right from the start column until a blank cell is encountered
        for j, cell in enumerate(row.iloc[c:]):
            if pd.isnull(cell):
                break
            table_data.append((cell, i, j))

    # Create a DataFrame from the table data.
    # The largest row and column offsets seen define the shape of the DataFrame.
    # If your table contains empty cells, do one pass over the first row to get
    # the number of columns and one pass over the first column to get the number
    # of rows, and only then read the data.
    n_rows = max(i for _, i, _ in table_data) + 1
    n_cols = max(j for _, _, j in table_data) + 1
    table_df = pd.DataFrame(index=range(n_rows), columns=range(n_cols))
    for val, i, j in table_data:
        table_df.loc[i, j] = val

    tables.append(table_df)
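
For tables that do contain empty cells, the sizing idea described in the warning and in the code comments can be sketched roughly as follows. This is only a sketch under assumptions: extract_table and the sample grid are hypothetical names introduced here for illustration, and it assumes that the first row and the first column of each table have no blanks, so they can be used to measure the table.

```python
import pandas as pd


def extract_table(df: pd.DataFrame, r: int, c: int) -> pd.DataFrame:
    """Slice out the table whose top-left cell is at position (r, c).

    Hypothetical helper: assumes the first row and first column of the
    table contain no blanks; cells inside the table may be empty.
    """
    # Width: count non-blank cells along the first row, starting at column c.
    n_cols = 0
    for cell in df.iloc[r, c:]:
        if pd.isnull(cell):
            break
        n_cols += 1

    # Height: count non-blank cells down the first column, starting at row r.
    n_rows = 0
    for cell in df.iloc[r:, c]:
        if pd.isnull(cell):
            break
        n_rows += 1

    # Read the whole block at once, keeping any blanks inside it.
    block = df.iloc[r:r + n_rows, c:c + n_cols].reset_index(drop=True)
    block.columns = range(n_cols)
    return block


# Made-up sample sheet: one table starting at (1, 1) with a blank inside it.
grid = pd.DataFrame([
    [None, None, None, None],
    [None, "RuleTable A", "x", "y"],
    [None, "r1", None, 2],
    [None, "r2", 3, 4],
    [None, None, None, None],
])
table = extract_table(grid, 1, 1)
```

Because the block is sliced in one step instead of being walked cell by cell, the blank cell inside the table no longer cuts the row short.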

Answer 2

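read_excel does not support ranges or names. You have to use openpyxl to read the values of a range or table and create a DataFrame from them. For example, the code below: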

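import pandas as pd
from openpyxl import load_workbook

path = r"c:\projects\Spikes\Book1.xlsx"

wb = load_workbook(filename=path)
ws = wb['Sheet1']
# Look up the table by name and get the cells in its range
table = ws.tables["Table1"]
rng = ws[table.ref]

# Collect the cell values row by row; the first row holds the headers
data_rows = []
for row in rng:
    data_rows.append([cell.value for cell in row])

print(table.ref)
pd.DataFrame(data_rows[1:], columns=data_rows[0])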

Given a worksheet that contains a table named Table1, this prints:

>>> print(table.ref)
D6:F9
>>> pd.DataFrame(data_rows[1:], columns=data_rows[0])
    A  B       C
0   1  5  Banana
1   2  6  Potato
2  34  8  Tomato
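
The tables in the question are not defined as Excel tables, so ws.tables would not list them. When the cell range is already known, a plain A1-style range can be read in the same way. The following is a minimal sketch, assuming a made-up range B2:C4 and made-up sample values; the workbook is built in memory only to keep the example self-contained.

```python
import pandas as pd
from openpyxl import Workbook

# Build a small in-memory workbook so the example is self-contained;
# with a real file you would use load_workbook(filename=path) instead.
wb = Workbook()
ws = wb.active
sample = [["A", "B"], [1, "Banana"], [2, "Potato"]]
for r, row in enumerate(sample, start=2):
    for c, value in enumerate(row, start=2):
        ws.cell(row=r, column=c, value=value)

# Index the worksheet with an ordinary cell range; no Excel table is needed.
rng = ws["B2:C4"]
data_rows = [[cell.value for cell in row] for row in rng]

# The first row of the range is used as the header row.
range_df = pd.DataFrame(data_rows[1:], columns=data_rows[0])
```

The range string could also be assembled from the start coordinates and table size found in a first pass over the sheet.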
