I have an Excel sheet that contains data in the following format:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Hour 1 FI NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
1 D2 ND: 17-12-2022 7258 2101.2 751.334 567.917 418.5 35.1 1370.9 1254.971 1854.434 1584.931 1396.1 1633.6
2 D1 ND: 16-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
3 D1 ND: 15-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
4 D1 ND: 14-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
5 D1 ND: 13-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
6 D1 ND: 10-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
7 Selected: 16-12-2022:7 4885.746 1960.018 -799.833 -76.084 628 -38.1 1356.89 1537.029 1730.735 1412.038 1960.2 1878.3
The same block of 8 rows by 13 columns then repeats for each of the 24 hours, like this:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
8 Hour 2 FI NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
9 D2 ND: 17-12-2022 7178.9 2167.879 785.333 524.5 452.7 -29.3 1346.132 1151.952 1818.116 1580.202 1144.8 1578.2
10 D1 ND: 16-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
11 D1 ND: 15-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
12 D1 ND: 14-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
13 D1 ND: 13-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
14 D1 ND: 10-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
15 Selected: 16-12-2022:7 4885.746 1960.018 -799.833 -76.084 628 -38.1 1356.89 1537.029 1730.735 1412.038 1960.2 1878.3
I would like to convert it to long format with the following columns:
pd.DataFrame(columns=['Datetime', 'BZ', 'type', 'Horizon (D1 or D2)', 'Value', 'Day'])
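What have I done so far?
Tbh, I'm really not sure of the best way to attack this. I considered splitting the data frame into per-column data frames or arrays and using them as id_vars in pd.melt():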
[j for j in df.iloc[:,0] if str(j).startswith('Selected')]
[j for j in df.iloc[:,0] if str(j).startswith('D1')]
[j for j in df.iloc[:,0] if str(j).startswith('D2')]
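But that doesn't work.
I guess the root of the question is: how do you melt a data frame when you need to extract multiple variables from a single cell, across multiple rows, in a repeating block?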
You can achieve this by first reshaping the data and then converting it with pd.melt().
Create a function that extracts the hour from the "Unnamed: 0" column:
def extract_hour(s):
    # Block header cells look like 'Hour 1'; any other cell (or a NaN) gives None
    if isinstance(s, str) and s.startswith('Hour'):
        return int(s.split(' ')[1])
    return None
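Apply this function to the "Unnamed: 0" column and forward-fill the missing values, so that every row carries the hour of its block: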
df['Hour'] = df['Unnamed: 0'].apply(extract_hour)
df['Hour'] = df['Hour'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
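To see what the forward fill does, here is a minimal sketch on a toy frame. The frame `toy` is a hypothetical stand-in for the first column of the sheet, shortened to two rows per block; it is not the real data.

```python
import pandas as pd

def extract_hour(s):
    # 'Hour 3' -> 3; any other cell (or a non-string) -> None
    if isinstance(s, str) and s.startswith('Hour'):
        return int(s.split(' ')[1])
    return None

# Hypothetical, shortened stand-in for the first column of the sheet
toy = pd.DataFrame({'Unnamed: 0': ['Hour 1', 'D2 ND: 17-12-2022', 'D1 ND: 16-12-2022',
                                   'Hour 2', 'D2 ND: 17-12-2022', 'D1 ND: 16-12-2022']})

# Only the 'Hour N' rows get a value; ffill copies it down to the rows of the same block
toy['Hour'] = toy['Unnamed: 0'].apply(extract_hour).ffill()
print(toy)
```

Each data row now knows which hour block it came from, which is what lets the block header rows be dropped later without losing that information.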
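Then drop the block header rows, derive the type, horizon and day columns from the row label, and melt: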
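BZ = ['FI', 'NO2', 'DK1', 'DK2', 'SE1', 'SE2', 'NO4', 'NO1', 'NO3', 'NO5', 'SE3', 'SE4']
# Drop the 'Hour N' header rows (the ones holding the zone names)
df = df[~df['Unnamed: 1'].isin(BZ)].copy()
# 'Unnamed: 1' .. 'Unnamed: 12' hold the values, so give them the zone names
df = df.rename(columns=dict(zip([f'Unnamed: {i}' for i in range(1, 13)], BZ)))
# The row label is in 'Unnamed: 0', e.g. 'D2 ND: 17-12-2022' or 'Selected: 16-12-2022:7'
label = df['Unnamed: 0'].astype(str)
df['type'] = label.apply(lambda x: 'Selected' if x.startswith('Selected') else 'ND')
# Horizon: 'D1'/'D2' for ND rows, the number after the last colon for Selected rows
df['Horizon'] = label.apply(lambda x: x.rsplit(':', 1)[-1] if x.startswith('Selected') else x.split(' ')[0])
df['Day'] = label.str.extract(r'(\d{2}-\d{2}-\d{4})', expand=False)
df = df.drop(columns=['Unnamed: 0']).reset_index(drop=True)
df_long = pd.melt(df, id_vars=['Hour', 'type', 'Horizon', 'Day'], value_vars=BZ, var_name='BZ', value_name='Value')
df_long['Value'] = pd.to_numeric(df_long['Value'])
df_long['Datetime'] = pd.to_datetime(df_long['Day'], format='%d-%m-%Y') + pd.to_timedelta(df_long['Hour'] - 1, unit='h')
df_long = df_long[['Datetime', 'BZ', 'type', 'Horizon', 'Value', 'Day']]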
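As for pulling several variables out of one cell, a vectorized alternative to splitting strings by hand is `Series.str.extract` with named groups. This is only a sketch: the pattern is an assumption inferred from the three label shapes visible in the sample, and the group name `suffix` is made up for illustration.

```python
import pandas as pd

# Labels copied from the first column of the sample data
labels = pd.Series(['D2 ND: 17-12-2022', 'D1 ND: 16-12-2022', 'Selected: 16-12-2022:7'])

# Assumed label grammar (inferred from the sample only):
#   '<D1|D2> ND: <dd-mm-yyyy>'   or   'Selected: <dd-mm-yyyy>:<n>'
pattern = (r'^(?:(?P<horizon>D\d+)\s+)?(?P<type>ND|Selected):\s*'
           r'(?P<day>\d{2}-\d{2}-\d{4})(?::(?P<suffix>\d+))?$')

# One column per named group; groups that did not take part in the match are NaN
parts = labels.str.extract(pattern)

# 'Selected' rows have no D1/D2 prefix, so reuse their trailing number as the horizon
parts['horizon'] = parts['horizon'].fillna(parts['suffix'])
parts = parts.drop(columns='suffix')
print(parts)
```

A label that does not fit the pattern comes back as all NaN, which makes unexpected rows easy to spot before melting.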
The output should look like this:
              Datetime  BZ      type  Horizon     Value         Day
0  2022-12-17 00:00:00  FI        ND       D2    7258.0  17-12-2022
1  2022-12-16 00:00:00  FI        ND       D1  3702.878  16-12-2022
2  2022-12-15 00:00:00  FI        ND       D1  3702.878  15-12-2022
3  2022-12-14 00:00:00  FI        ND       D1  3702.878  14-12-2022
4  2022-12-13 00:00:00  FI        ND       D1  3702.878  13-12-2022
5  2022-12-10 00:00:00  FI        ND       D1  3702.878  10-12-2022
6  2022-12-16 00:00:00  FI  Selected        7  4885.746  16-12-2022
7  2022-12-17 01:00:00  FI        ND       D2    7178.9  17-12-2022
8  2022-12-16 01:00:00  FI        ND       D1   3641.45  16-12-2022
9  2022-12-15 01:00:00  FI        ND       D1   3641.45  15-12-2022
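To try the whole reshape without the Excel file, here is a self-contained sketch on a hand-typed slice of the sample (two zones, two hour blocks, a few rows each). The frame `raw` and the helper `to_long` are illustrative names, and the sketch assumes the row label sits in the first column with the zone values in the columns after it, as the printed sample suggests.

```python
import pandas as pd

BZ = ['FI', 'NO2']  # only two of the twelve zones, to keep the sample short

# Hand-typed slice of the sheet as pandas would read it
raw = pd.DataFrame(
    [['Hour 1', 'FI', 'NO2'],
     ['D2 ND: 17-12-2022', 7258, 2101.2],
     ['D1 ND: 16-12-2022', 3702.878, 1984.168],
     ['Selected: 16-12-2022:7', 4885.746, 1960.018],
     ['Hour 2', 'FI', 'NO2'],
     ['D2 ND: 17-12-2022', 7178.9, 2167.879],
     ['D1 ND: 16-12-2022', 3641.45, 1921.937]],
    columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2'])

def to_long(df, zones):
    label = df['Unnamed: 0'].astype(str)
    out = df.rename(columns=dict(zip(df.columns[1:], zones)))
    # Tag every row with the hour of the block it belongs to
    out['Hour'] = pd.to_numeric(label.str.extract(r'^Hour (\d+)$', expand=False)).ffill()
    # Drop the 'Hour N' header rows now that the hour is stored on each row
    keep = ~label.str.startswith('Hour')
    out, label = out[keep].copy(), label[keep]
    out['type'] = ['Selected' if s.startswith('Selected') else 'ND' for s in label]
    # Horizon: 'D1'/'D2' prefix, or the number after the last colon on 'Selected' rows
    out['Horizon'] = [s.rsplit(':', 1)[-1] if s.startswith('Selected') else s.split(' ')[0]
                      for s in label]
    out['Day'] = label.str.extract(r'(\d{2}-\d{2}-\d{4})', expand=False)
    long = out.melt(id_vars=['Hour', 'type', 'Horizon', 'Day'], value_vars=zones,
                    var_name='BZ', value_name='Value')
    long['Value'] = pd.to_numeric(long['Value'])
    # Hour 1 maps to 00:00, hour 2 to 01:00, and so on
    long['Datetime'] = (pd.to_datetime(long['Day'], format='%d-%m-%Y')
                        + pd.to_timedelta(long['Hour'] - 1, unit='h'))
    return long[['Datetime', 'BZ', 'type', 'Horizon', 'Value', 'Day']]

df_long = to_long(raw, BZ)
print(df_long)
```

For the real sheet the same function could be called with the full list of twelve zones, provided the columns appear in the same order as the zone names in the block header rows.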