Complex pandas DataFrame melt

Question    Votes: 0    Answers: 1

I have an Excel sheet that contains some data in the following format:

    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  Unnamed: 10 Unnamed: 11 Unnamed: 12
0   Hour 1  FI  NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
1   D2 ND: 17-12-2022   7258    2101.2  751.334 567.917 418.5   35.1    1370.9  1254.971    1854.434    1584.931    1396.1  1633.6
2   D1 ND: 16-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
3   D1 ND: 15-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
4   D1 ND: 14-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
5   D1 ND: 13-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
6   D1 ND: 10-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
7   Selected: 16-12-2022:7  4885.746    1960.018    -799.833    -76.084 628 -38.1   1356.89 1537.029    1730.735    1412.038    1960.2  1878.3

This then repeats for 24 hours in the same block format of 8 rows and 13 columns, like this:

    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  Unnamed: 10 Unnamed: 11 Unnamed: 12
8   Hour 2  FI  NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
9   D2 ND: 17-12-2022   7178.9  2167.879    785.333 524.5   452.7   -29.3   1346.132    1151.952    1818.116    1580.202    1144.8  1578.2
10  D1 ND: 16-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
11  D1 ND: 15-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
12  D1 ND: 14-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
13  D1 ND: 13-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
14  D1 ND: 10-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
15  Selected: 16-12-2022:7  4885.746    1960.018    -799.833    -76.084 628 -38.1   1356.89 1537.029    1730.735    1412.038    1960.2  1878.3

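I want to convert it to long format with the following columns:

pd.DataFrame(columns = ['Datetime','BZ','type','Horizon (D1 or D2)','Value','Day'])
  • Datetime is the date in the document name, and the hour is Hour 1, Hour 2, and so on, taken from the first row of the first column of each block (it recurs after every 7 data entries).
  • BZ is the first row: FI, NO2, etc.
  • type is either ND or 'Selected'
  • Horizon is D1 or D2
  • Value is all the numbers
  • Day is the date from the second column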

What have I done so far?

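I'm really not sure of the best way to attack this, tbh. I have considered splitting the DataFrame into column-related DataFrames or arrays and using them as id_vars in pd.melt():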

[j for j in df.iloc[:,0] if str(j).startswith('Selected')]
[j for j in df.iloc[:,0] if str(j).startswith('D1')]
[j for j in df.iloc[:,0] if str(j).startswith('D2')]

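But this doesn't work.

I guess the root of the question is: how do you melt a DataFrame when you need to extract multiple variables from single cells, across multiple rows, and in a repeating pattern?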

python pandas melt long-format-data
1 answer
0 votes

You can achieve this by first reshaping the data and then transforming it with pd.melt().
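As a quick reminder of what pd.melt() does on its own, here is a minimal sketch on a toy two-zone frame. The values are copied from the first two data rows of the question; the frame names wide and long_df are purely illustrative.

```python
import pandas as pd

# Toy wide frame: one row per label, one column per bidding zone
wide = pd.DataFrame({
    'Horizon': ['D2', 'D1'],
    'FI': [7258.0, 3702.878],
    'NO2': [2101.2, 1984.168],
})

# id_vars stay as identifier columns; every other column is unpivoted
# into one (BZ, Value) pair per original cell
long_df = pd.melt(wide, id_vars=['Horizon'], var_name='BZ', value_name='Value')
print(long_df)
```

Each identifier has to be a proper column before melting, which is why the steps below first pull the hour, type, horizon and day out of the label cells.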

Create a function to extract the hour from the 'Unnamed: 0' column:

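def extract_hour(s):
    # Return N for a header label such as 'Hour N'; None for any other cell (including NaN)
    if isinstance(s, str) and s.startswith('Hour'):
        return int(s.split(' ')[1])
    return None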

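Apply this function to the 'Unnamed: 0' column and forward-fill the missing values, so every row in a block carries its hour: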

df['Hour'] = df['Unnamed: 0'].apply(extract_hour)
df['Hour'] = df['Hour'].ffill()
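To see what the forward fill does, here is a small sketch on a hand-made label column. The labels series is an assumption standing in for df['Unnamed: 0'], and extract_hour is repeated so the snippet runs on its own.

```python
import pandas as pd

def extract_hour(s):
    # 'Hour N' -> N; anything else -> None
    if isinstance(s, str) and s.startswith('Hour'):
        return int(s.split(' ')[1])
    return None

# Stand-in for the first column: a header label followed by data labels, twice
labels = pd.Series([
    'Hour 1', 'D2 ND: 17-12-2022', 'Selected: 16-12-2022:7',
    'Hour 2', 'D2 ND: 17-12-2022',
])

# Only the header rows yield a number; ffill copies it down to the rows below
hours = labels.apply(extract_hour).ffill().astype(int)
print(hours.tolist())
```

The header rows themselves also receive an hour, which is harmless because they are dropped in the next step.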

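Then drop the header rows, derive type, Horizon and Day from the row labels, and melt the bidding-zone columns: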

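BZ = ['FI', 'NO2', 'DK1', 'DK2', 'SE1', 'SE2', 'NO4', 'NO1', 'NO3', 'NO5', 'SE3', 'SE4']
# Drop the repeated 'Hour N' header rows (they hold the zone names, not values)
df = df[~df['Unnamed: 1'].isin(BZ)].copy()
# Labels live in 'Unnamed: 0'; 'Unnamed: 1'..'Unnamed: 12' hold one value per bidding zone
df = df.rename(columns={f'Unnamed: {i}': zone for i, zone in enumerate(BZ, start=1)})
label = df['Unnamed: 0']
df['type'] = label.apply(lambda x: 'Selected' if x.startswith('Selected') else 'ND')
# 'D2 ND: 17-12-2022' -> 'D2'; 'Selected: 16-12-2022:7' -> '7' (the suffix after the last colon)
df['Horizon'] = label.apply(lambda x: x.split(':')[-1] if x.startswith('Selected') else x.split(' ')[0])
df['Day'] = label.str.extract(r'(\d{2}-\d{2}-\d{4})', expand=False)
df = df.drop(columns=['Unnamed: 0']).reset_index(drop=True)
df_long = pd.melt(df, id_vars=['Hour', 'type', 'Horizon', 'Day'], value_vars=BZ, var_name='BZ', value_name='Value')
df_long['Value'] = pd.to_numeric(df_long['Value'])
# Day is used as the date part here; substitute the date from the file name if Datetime should carry that instead
df_long['Datetime'] = pd.to_datetime(df_long['Day'], format='%d-%m-%Y') + pd.to_timedelta(df_long['Hour'] - 1, unit='h')
df_long = df_long[['Datetime', 'BZ', 'type', 'Horizon', 'Value', 'Day']]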
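To check the steps above end to end, the following is a self-contained sketch on a cut-down frame with two zones and two hourly blocks, with values copied from the question. It parses the labels with a regular expression rather than split(), which is an alternative choice, and the names raw, parts and out are illustrative. It also assumes that 'Selected' rows have no horizon, since the question only defines D1 and D2, so Horizon is left missing for them.

```python
import pandas as pd

BZ = ['FI', 'NO2']  # cut down to two zones for brevity

# Two hourly blocks in the sheet's layout: a header row, then labelled value rows
raw = pd.DataFrame(
    [
        ['Hour 1', 'FI', 'NO2'],
        ['D2 ND: 17-12-2022', 7258, 2101.2],
        ['D1 ND: 16-12-2022', 3702.878, 1984.168],
        ['Selected: 16-12-2022:7', 4885.746, 1960.018],
        ['Hour 2', 'FI', 'NO2'],
        ['D2 ND: 17-12-2022', 7178.9, 2167.879],
        ['D1 ND: 16-12-2022', 3641.45, 1921.937],
        ['Selected: 16-12-2022:7', 4885.746, 1960.018],
    ],
    columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2'],
)
label = raw['Unnamed: 0']

# Hour: parse 'Hour N' on header rows, then carry it down over the block
hour = label.str.extract(r'^Hour (\d+)$', expand=False).astype(float).ffill()

# One regex pulls the optional horizon, the type and the day out of each label
parts = label.str.extract(r'^(?:(D\d) )?(ND|Selected): (\d{2}-\d{2}-\d{4})')
parts.columns = ['Horizon', 'type', 'Day']

data = raw.rename(columns={'Unnamed: 1': 'FI', 'Unnamed: 2': 'NO2'})
data = pd.concat([data[BZ], parts, hour.rename('Hour')], axis=1)
data = data[parts['type'].notna()]  # header rows do not match the label regex

out = data.melt(id_vars=['Hour', 'type', 'Horizon', 'Day'], value_vars=BZ,
                var_name='BZ', value_name='Value')
out['Value'] = pd.to_numeric(out['Value'])
out['Datetime'] = (pd.to_datetime(out['Day'], format='%d-%m-%Y')
                   + pd.to_timedelta(out['Hour'] - 1, unit='h'))
out = out[['Datetime', 'BZ', 'type', 'Horizon', 'Value', 'Day']]
print(out)
```

On the real sheet the same idea should apply with all twelve zone columns; only the BZ list and the rename mapping would grow.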

The output should look like this:

              Datetime  BZ      type  Horizon     Value         Day
0  2022-12-17 00:00:00  FI        ND       D2    7258.0  17-12-2022
1  2022-12-16 00:00:00  FI        ND       D1  3702.878  16-12-2022
2  2022-12-15 00:00:00  FI        ND       D1  3702.878  15-12-2022
3  2022-12-14 00:00:00  FI        ND       D1  3702.878  14-12-2022
4  2022-12-13 00:00:00  FI        ND       D1  3702.878  13-12-2022
5  2022-12-10 00:00:00  FI        ND       D1  3702.878  10-12-2022
6  2022-12-16 00:00:00  FI  Selected        7  4885.746  16-12-2022
7  2022-12-17 01:00:00  FI        ND       D2    7178.9  17-12-2022
8  2022-12-16 01:00:00  FI        ND       D1   3641.45  16-12-2022
9  2022-12-15 01:00:00  FI        ND       D1   3641.45  15-12-2022