Reading tables in an Excel file that are not saved as Excel tables

Question description Votes: 0 Answers: 2

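I am currently trying to read an Excel file that contains several tables, but these tables are not saved as Excel tables; they are just cell data laid out in table-like blocks (I don't know if that makes sense). I have reached the point where I can read the Excel file, but it also reads the empty cells, which I am not interested in. I only need to read the tables.

This is the code I have written so far: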

def separar_tablas(df): 
    tablas = [] 
    dfs = [] 
    tabla_actual = None

    for _, row in df.iterrows():
        for i, value in row.items():
            etiqueta = str(row[2])
    
            if pd.notnull(value):          
                if etiqueta.startswith('RuleTable'):
                    if pd.notnull(etiqueta):
                        tablas.append(etiqueta)
        
                        if tabla_actual is not None:
                            dfs.append(tabla_actual)
                    
                        tabla_actual = pd.DataFrame(columns=df.columns)
                    
                    tabla_actual = pd.concat([tabla_actual, row],axis = 1, ignore_index=True)
                    
        
    dfs.append(tabla_actual)
    return dfs
python pandas excel dataframe
2 answers
0
votes

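What you can do is make two passes:

  1. Get the coordinates of the cell where each table starts
  2. Loop over those coordinates to read the tables one by one

Warning: if one of your tables contains empty cells, you must first adapt the code to determine the size of the table and only then read it cell by cell; otherwise the table may be cut off.

I only tested this code with one Excel file, so let me know if there is a problem.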

import pandas as pd

# Read your file without a header row, so that columns are labelled 0, 1, 2, ...
# and every cell (including the first row) can be addressed by position
df = pd.read_excel('data/test.xlsx', sheet_name='Sheet1', header=None)

# First pass: get table start coordinates
table_starts_coordinates = []
for row_id, row in df.iterrows():
    for col_id, value in row.items():
        if str(value).startswith('RuleTable'):
            table_starts_coordinates.append((row_id, col_id))

# Second pass: read each table from its start coordinates
tables = []
for r, c in table_starts_coordinates:
    i = 0
    table_data = []
    # Iterate through the DataFrame starting from the table start coordinates,
    # downwards and to the right, until a blank cell is encountered in each direction.
    for _, row in df.iloc[r:].iterrows():
        j = 0
        if pd.isnull(row.iloc[c]):
            break
        for cell in row.iloc[c:]:
            if pd.isnull(cell):
                break
            table_data.append((cell, i, j))
            j += 1
        i += 1

    # Create a DataFrame from the table data.
    # The largest row and column offsets define its shape (+1 because offsets start at 0).
    # If your table contains empty cells you should do one pass on the first row to get the number of columns
    # and then one pass on the first column to get the number of rows and then read the data (what I do above)
    n_rows = max(i for _, i, _ in table_data) + 1
    n_cols = max(j for _, _, j in table_data) + 1
    table_df = pd.DataFrame(index=range(n_rows), columns=range(n_cols))
    for val, i, j in table_data:
        table_df.loc[i, j] = val

    tables.append(table_df)
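
The warning in this answer about tables with empty cells could be handled roughly as follows. This is only a sketch, not the answer's tested code: read_block is a hypothetical helper, the sample sheet is made up, and it assumes that the first row and the first column of each block contain no blanks, so they can be used to measure the block before it is sliced out in one piece.

```python
import pandas as pd


def read_block(df, r, c):
    """Return the rectangular block whose top-left cell is at position (r, c)."""
    # Width: count non-blank cells along the first row of the block
    n_cols = 0
    while c + n_cols < df.shape[1] and pd.notnull(df.iat[r, c + n_cols]):
        n_cols += 1
    # Height: count non-blank cells down the first column of the block
    n_rows = 0
    while r + n_rows < df.shape[0] and pd.notnull(df.iat[r + n_rows, c]):
        n_rows += 1
    # Slice the whole rectangle at once, so blanks inside it are preserved
    block = df.iloc[r:r + n_rows, c:c + n_cols].reset_index(drop=True)
    block.columns = range(n_cols)
    return block


# Hypothetical sheet contents: a 3x3 block at (1, 1) with one empty inner cell
sheet = pd.DataFrame([
    [None, None, None, None],
    [None, "id", "name", "qty"],
    [None, 1, None, 5],
    [None, 2, "pear", 7],
    [None, None, None, None],
])
block = read_block(sheet, 1, 1)
print(block.shape)
```

Because the size is measured first, the empty cell inside the block no longer stops the read early. If the start cell is a one-cell label such as RuleTable with the header row below it, the start coordinate would need to be shifted down by one row first.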

0
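votes

read_excel does not support ranges or names. You have to use openpyxl to read the values of a range or table and create a DataFrame from them. For example, the following code: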

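import pandas as pd
from openpyxl import load_workbook

path=r"c:\projects\Spikes\Book1.xlsx"

wb=load_workbook(filename=path)
ws=wb['Sheet1']
# Look up the table by name and get the cells in its range
table=ws.tables["Table1"]
rng=ws[table.ref]

# Collect the cell values row by row; the first row holds the column headers
data_rows=[]
for row in rng:
    data_rows.append([cell.value for cell in row])

print(table.ref)
pd.DataFrame(data_rows[1:], columns=data_rows[0])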

Given a worksheet that contains a table named Table1, this prints:

>>> print(table.ref)
D6:F9
>>> pd.DataFrame(data_rows[1:], columns=data_rows[0])
    A  B       C
0   1  5  Banana
1   2  6  Potato
2  34  8  Tomato
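
If the cells are not saved as an Excel table, as in the question, there is no entry in ws.tables to look up, but the same approach may still work with an explicit range string. The sketch below is an assumption-laden illustration: it builds a small workbook in memory with the values from the output above, and it assumes the range D6:F9 is already known.

```python
import pandas as pd
from openpyxl import Workbook

# Build a small in-memory workbook so the sketch runs without a file;
# with a real file you would use load_workbook(filename=path) instead.
wb = Workbook()
ws = wb.active
ws["D6"], ws["E6"], ws["F6"] = "A", "B", "C"
sample_rows = [(1, 5, "Banana"), (2, 6, "Potato"), (34, 8, "Tomato")]
for row_number, values in enumerate(sample_rows, start=7):
    for column_letter, value in zip("DEF", values):
        ws[f"{column_letter}{row_number}"] = value

# No table object is defined here, so address the cells by their range string
rng = ws["D6:F9"]
data_rows = [[cell.value for cell in row] for row in rng]

# First row of the range becomes the header, the rest become the data
range_df = pd.DataFrame(data_rows[1:], columns=data_rows[0])
print(range_df)
```

The range string has to come from somewhere, for example from a first pass that locates the start and size of each block.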