The following code generates a DataFrame:
import pandas as pd
import tabula
page_number = "1"
pdf_url = "https://usviber.org/wp-content/uploads/2023/12/A23-OCT.pdf"
# Reads the PDF
tables = tabula.read_pdf(pdf_url, pages=page_number)
df = tables[1]
# Select the relevant columns and rows
numeric_columns = df.select_dtypes(include=["number"])
df = df.drop(numeric_columns.columns[(numeric_columns < 0).any()], axis=1)
df = df.loc[2:13, :].iloc[:, :5]
# Use the first column (the month names) as the row index
df.set_index(df.columns[0], inplace=True)
# Rename columns based on year
df.columns = pd.MultiIndex.from_product(
[["St Thomas", "St. Croix"], ["2022", "2023"]], names=["Island", "Year"]
)
# Map the index to uppercase and extract the first 3 characters
df.index = df.index.map(lambda x: str(x).upper()[:3])
df.index.set_names("Month", inplace=True)
This is the DataFrame it produces:
print(df)
Island St Thomas           St. Croix
Year        2022      2023      2022      2023
Month
JAN       55,086    60,470    11,550    12,755
FEB       57,929    56,826    12,441    13,289
MAR       72,103    64,249    14,094    15,880
APR       67,469    56,321    12,196    13,092
MAY       60,092    49,534    13,385    16,497
JUN       67,026    56,950    14,009    15,728
JUL       66,353    61,110    13,768    16,879
AUG       50,660    42,745    10,673    12,102
SEP       24,507    25,047     6,826     6,298
OCT       34,025    34,462    10,351     9,398
NOV       44,500       NaN     9,635       NaN
DEC       58,735       NaN    12,661       NaN
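What I want is to have the island names as the row index and the month and year concatenated as the column names, giving a dataset with 2 rows and 24 columns. So the first row is St Thomas: its first column is JAN2022 with a value of 55086, the next column is FEB2022 with a value of 57929, and so on through DEC2023. The second row is St. Croix, with the corresponding values and time periods as described above. How can I do this?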
If I understand correctly, you can stack and transpose, then flatten the MultiIndex columns:
out = df.stack().T
out.columns = out.columns.map(lambda x: f'{x[0]}{x[1]}')
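One caveat, as far as I can tell: stack puts Year underneath Month, so the columns come out interleaved (JAN2022, JAN2023, FEB2022, ...) rather than running through all of 2022 first, and depending on the pandas version stack may also drop the all-NaN NOV and DEC 2023 entries. If you need the strictly chronological layout with every month kept, here is a minimal sketch of an alternative that slices one year at a time with xs and joins the pieces with concat. The three-row df below is only a hand-built stand-in for the frame in the question (values copied from the printed output), and wide is a name I made up for illustration:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's frame: three months only,
# values copied from the printed output (they are strings with commas).
columns = pd.MultiIndex.from_product(
    [["St Thomas", "St. Croix"], ["2022", "2023"]], names=["Island", "Year"]
)
df = pd.DataFrame(
    [
        ["55,086", "60,470", "11,550", "12,755"],
        ["57,929", "56,826", "12,441", "13,289"],
        ["44,500", np.nan, "9,635", np.nan],
    ],
    index=pd.Index(["JAN", "FEB", "NOV"], name="Month"),
    columns=columns,
)

# For each year, pick that year's columns (one per island) and transpose,
# so islands become rows and months become columns.
per_year = {
    year: df.xs(year, axis=1, level="Year").T
    for year in ["2022", "2023"]
}

# Place the yearly blocks side by side; the dict keys become an outer
# column level, so the columns are (Year, Month) pairs in year order.
wide = pd.concat(per_year, axis=1)

# Flatten the (Year, Month) pairs into labels such as "JAN2022".
wide.columns = [f"{month}{year}" for year, month in wide.columns]

print(wide)
```

With the full 12-month frame this should give 2 rows and 24 columns, with NOV2023 and DEC2023 kept as NaN. The values are still strings such as "55,086", so they would need converting if you want numbers.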