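I have a DataFrame with a key column, some value columns, and some timestamp columns. For some keys there may be multiple rows that differ in the value and timestamp columns.
I want to find the keys that have multiple rows and, for each such key, the columns whose values differ. Then I want an aggregate in which, per key, the value columns are summed or averaged and the timestamp columns take their maximum or minimum. Sample data: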
import pandas as pd

data = [['key1', 10, 10, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), 'A'],
['key1', 10, 20, 10, pd.Timestamp('2024-05-11'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-06')],
['key1', 10, 30, 10, pd.Timestamp('2024-05-11'), pd.Timestamp('2024-05-08'), pd.Timestamp('2024-05-12')],
['key2', 10, 10, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09')],
['key2', 12, 10, 10, pd.Timestamp('2024-05-13'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09')],
['key3', 14, 11, 17, pd.Timestamp('2024-06-09'), pd.Timestamp('2024-05-04'), pd.Timestamp('2024-05-01')],
['key4', 10, 10, 12, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-11'), pd.Timestamp('2024-05-29')],
['key5', 10, 10, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-11')],
['key5', 10, 10, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-11')],
['key5', 12, 11, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-11')],
['key5', 10, 11, 10, pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-09'), pd.Timestamp('2024-05-11')]
]
columns = ['Key', 'Value1', 'Value2', 'Value3', 'Timestamp1', 'Timestamp2', 'Timestamp3', 'ID']
sample_df = pd.DataFrame(data, columns=columns)
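The actual file has millions of rows and hundreds of columns.
I want an output that looks like this:
Key | CountOfrows | Value1 | Value2 | Timestamp1 | Timestamp2 | Timestamp3 | ID |
---|---|---|---|---|---|---|---|
key1 | 3 | 10 | 35 | '2024-05-09' | '2024-05-09' | '2024-05-12' | 'A' |
key2 | 2 | 22 | 10 | '2024-05-09' | '2024-05-09' | '2024-05-09' | 'B' |
key5 | 4 | 42 | 44 | '2024-05-09' | '2024-05-09' | '2024-05-11' | 'B' |
Here, if the values in a column are identical within a key, they are aggregated by averaging; if they differ, they are aggregated by summing. Timestamps are aggregated as the minimum of Timestamp1 and the maximum of Timestamp2 and Timestamp3. For the ID column, the rows are sorted by Value1 and the ID corresponding to the largest Value1 is taken.
Since Value3 has the same value across all instances of each key, it is not included in the final table.
I have gotten halfway: I have a table with the keys and only those columns whose values change.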
# keys that occur on more than one row
multiple_rows_sample = sample_df.groupby(['Key']).size().reset_index(name='counts')
multiple_rows_sample = multiple_rows_sample[multiple_rows_sample['counts']>1]
mult_val_cols_sample = pd.DataFrame()
for index, row in multiple_rows_sample.iterrows():
    joined_slice = sample_df[(sample_df['Key']==row['Key'])]
    count_slice = row.to_frame().transpose().reset_index(drop=True)
    count_slice['key']=1
    # cols_having_unique is a helper that is not shown here
    diff_cols = cols_having_unique(joined_slice)
    diff_cols['key']=1
    output_df = pd.merge(count_slice, diff_cols, how='outer')
    output_df = output_df.drop('key', axis=1)
    mult_val_cols_sample = pd.concat([mult_val_cols_sample, output_df], ignore_index=True)
mult_val_cols_sample
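The resulting table contains the key column and only those columns whose values change for at least one key.
Now, how do I run a groupby on these columns when I don't know their names in advance?
Any help would be appreciated.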
You can use groupby.agg with custom functions and post-process the output (to turn the idxmax result into the actual ID):
g = sample_df.groupby('Key', as_index=False)

def avg_or_sum(vals):
    # mean if the group holds a single distinct value, sum otherwise
    if vals.nunique() == 1:
        return vals.mean()
    else:
        return vals.sum()

out = (g
   # aggregate with custom functions
   .agg(**{'CountOfrows': ('Key', 'size'),
           'Value1': ('Value1', avg_or_sum),
           'Value2': ('Value2', avg_or_sum),
           'Value3': ('Value3', avg_or_sum),
           'Timestamp1': ('Timestamp1', 'min'),
           'Timestamp2': ('Timestamp2', 'max'),
           'Timestamp3': ('Timestamp3', 'max'),
           # row label of the largest Value1; this will need to be post-processed
           'ID': ('Value1', 'idxmax')
          })
   # only keep the rows with more than 1 item
   .query('CountOfrows > 1')
   # filter out the columns with all identical values within all groups
   .loc[:, lambda x: g.nunique().ne(1).any()
                      .reindex(x.columns, fill_value=True)]
   # replace the index with the actual ID
   .assign(ID=lambda d: d['ID'].map(sample_df['ID']))
)
Output:
    Key  CountOfrows  Value1  Value2 Timestamp1 Timestamp2 Timestamp3   ID
0  key1            3    10.0    60.0 2024-05-09 2024-05-09 2024-05-12    A
1  key2            2    22.0    10.0 2024-05-09 2024-05-09 2024-05-09  NaN
4  key5            4    42.0    42.0 2024-05-09 2024-05-09 2024-05-11  NaN
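Since the real data has millions of rows, calling a Python function once per group and per column may be slow. One possible alternative, sketched here on a small made-up frame (df, value_cols and result are illustrative names, not part of the question), is to compute the sum, mean and number of distinct values with the built-in aggregations and then choose between sum and mean with where. Whether this is actually faster on the real data would need to be measured.

```python
import pandas as pd

# Toy frame (hypothetical data) with one key column and two value columns.
df = pd.DataFrame({
    'Key': ['a', 'a', 'b', 'b', 'b'],
    'Value1': [10, 10, 1, 2, 3],
    'Value2': [5, 7, 4, 4, 4],
})
value_cols = ['Value1', 'Value2']

g = df.groupby('Key')[value_cols]
sums = g.sum().astype(float)
means = g.mean()
# True where a group holds exactly one distinct value in that column
same = g.nunique().eq(1)

# Keep the sum where the values differ, otherwise fall back to the mean.
result = sums.where(~same, means)
print(result)
```

The three intermediate frames share the same index (the keys) and columns, so where aligns them cell by cell.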
Note that if you have hundreds of columns, you will need to build the dictionary passed to agg programmatically.
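A minimal sketch of how such a dictionary could be generated from the column dtypes. The helper build_agg_spec, its ts_min_cols parameter and the demo frame are hypothetical names introduced for illustration. It assumes that every numeric column should use avg_or_sum, that datetime columns default to max unless listed in ts_min_cols, and that other dtypes (such as the ID strings) get their own rule added separately.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def avg_or_sum(vals):
    # mean if the group holds a single distinct value, sum otherwise
    return vals.mean() if vals.nunique() == 1 else vals.sum()

def build_agg_spec(df, key, ts_min_cols=()):
    """Hypothetical helper: map each non-key column to a (column, aggfunc) pair."""
    spec = {'CountOfrows': (key, 'size')}
    for col in df.columns:
        if col == key:
            continue
        if is_datetime64_any_dtype(df[col]):
            # assumption: timestamps take the max unless explicitly listed
            spec[col] = (col, 'min' if col in ts_min_cols else 'max')
        elif is_numeric_dtype(df[col]):
            spec[col] = (col, avg_or_sum)
        # other dtypes are skipped here and need their own rule
    return spec

# Small made-up frame to exercise the helper
demo = pd.DataFrame({
    'Key': ['k1', 'k1', 'k2'],
    'Value1': [10, 20, 5],
    'Timestamp1': pd.to_datetime(['2024-05-09', '2024-05-11', '2024-05-01']),
    'Timestamp2': pd.to_datetime(['2024-05-09', '2024-05-08', '2024-05-02']),
})
spec = build_agg_spec(demo, 'Key', ts_min_cols={'Timestamp1'})
out = demo.groupby('Key', as_index=False).agg(**spec)
print(out)
```

The resulting spec can be passed to agg in the same way as the hand-written dictionary, followed by the same query, column filter and ID post-processing steps.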