Importing XLS files with a variable structure into a DataFrame using Python

Question

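A few days ago I received a dataset that is somewhat hard to work with. The only fixed things I can see in it are that the records themselves always start on row 9 and the column names are on row 7, as shown in the image below: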

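As you can see, there are merged cells, and I highlighted some columns because sometimes they are absent from the structure, sometimes they are blank, and sometimes they show the subtotal for each class. Since the number of columns varies, the algorithm I came up with is to read up to the last column, which always contains the letters "CD", and then drop the blank columns and the columns whose names contain the word "TOTAL". The rows are not fixed across files either, so I think each file should only be read up to just before the cell in column "A" that contains the word "TOTAL". There are quite a few files, since each one corresponds to one month of a given year, and I have to build a single dataset by concatenating all of them row-wise, which I will do with pd.concat(list_dataframes, axis=0, ignore_index=True). For that, though, I first need them all to share a single structure. How can I read these kinds of files and apply the changes described? Thanks in advance for reading all of this!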

The expected output looks like this:

Answer 1

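I'm not sure this matches every sheet in your files; I based this approach on our conversation in the comments. See whether there is room for improvement or a better way to handle the logic: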

import pandas as pd

df = pd.read_excel('Downloads/test.xlsx')
# get rows between Companies and TOTAL AF
top = df.index[df.eq('Companies').any(axis=1)].item()
bottom = df.index[df.eq('TOTAL AF').any(axis=1)].item()
# get rid of completely empty rows and columns, if any
temp = df.iloc[top+1:bottom].dropna(how = 'all', axis = 1).dropna(how='all')
# move the x,y,z columns as index, for easy reshaping
temp = temp.set_index(temp.iloc[:, 0].name)
# shift the row just above x as columns
temp.columns = temp.iloc[0]
temp = temp.iloc[1:]
temp.columns.name = None
temp.index.name = 'Companies'
temp.columns = temp.columns.fillna('TOTAL CD')
temp.reset_index()
  Companies A_Subclass1 A_Subclass2 A_Subclass3 B_Subclass1 B_Subclass2 B_Subclass3 C_Subclass1 C_Subclass2 C_Subclass3 D_Subclass1 D_Subclass2 D_Subclass3 TOTAL CD
0         X     3264371      228179      303407      300768       41407       40824      342168       37129       25354      749955        5615       15923  5355100
1         Y      259926       60609       85586         739         103         981      101961        2590        8842        3974           0           0   525311
2         Z     6788748      826830      642568       86356         471        4630      588506        3850       32907      767853       15176       62557  9820452
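
To apply the same idea to many monthly files and then concatenate them, the steps above could be wrapped in a function along these lines. This is only a sketch under assumptions that are not confirmed by the question: each sheet is read with header=None, the merged "Companies" and "TOTAL CD" cells sit one row above the subclass names, the first "TOTAL..." cell in column A marks the end of the data, and subtotal columns are the ones whose header contains "TOTAL" other than "TOTAL CD". The names clean_sheet, demo_sheet and keep_total are made up for illustration, and demo_sheet only stands in for real files.

```python
import numpy as np
import pandas as pd


def clean_sheet(raw: pd.DataFrame, keep_total: str = 'TOTAL CD') -> pd.DataFrame:
    """Trim one raw sheet (read with header=None) to company rows and data columns."""
    first_col = raw.iloc[:, 0].astype(str).str.strip()
    # Positions of the header row and of the first 'TOTAL...' row in column A.
    # Both lookups raise IndexError if the marker is missing (assumed layout).
    top = np.flatnonzero(first_col.eq('Companies'))[0]
    bottom = np.flatnonzero(first_col.str.startswith('TOTAL'))[0]

    # Subclass names are on the row below 'Companies'; merged cells such as
    # 'Companies' and 'TOTAL CD' only appear on the upper row, so fill from it.
    header = raw.iloc[top + 1].fillna(raw.iloc[top])

    body = raw.iloc[top + 2:bottom].copy()
    body.columns = list(header)

    # Drop columns without a header, then fully empty columns and rows.
    body = body.loc[:, body.columns.notna()]
    body = body.dropna(how='all', axis=1).dropna(how='all', axis=0)

    # Drop per-class subtotal columns but keep the grand total (assumption).
    keep = [not ('TOTAL' in str(c) and c != keep_total) for c in body.columns]
    body = body.loc[:, keep]
    return body.reset_index(drop=True)


def demo_sheet(with_subtotal: bool) -> pd.DataFrame:
    """Build a tiny stand-in for pd.read_excel(path, header=None)."""
    nan = np.nan
    rows = [
        ['Report', nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan],
        ['Companies', 'A', nan, nan, nan, 'B', 'TOTAL CD'],
        [nan, 'A_Subclass1', 'A_Subclass2', 'TOTAL A', nan, 'B_Subclass1', nan],
        ['X', 1, 2, 3, nan, 4, 7],
        ['Y', 5, 6, 11, nan, 7, 18],
        ['TOTAL AF', 6, 8, 14, nan, 11, 25],
    ]
    raw = pd.DataFrame(rows)
    if not with_subtotal:
        # Simulate a month where the subtotal and blank columns are absent.
        raw = raw.drop(columns=[3, 4])
        raw.columns = range(raw.shape[1])
    return raw


# Two "months" with different layouts end up with the same structure.
frames = [clean_sheet(demo_sheet(True)), clean_sheet(demo_sheet(False))]
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```

With real files, each frame would presumably come from something like clean_sheet(pd.read_excel(path, header=None)) for every path in your list. If a month has genuinely different subclass columns, pd.concat will still work but will fill the missing ones with NaN, so it may be worth checking the column sets before concatenating.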
