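The code below groups rows that share the same values in columns X and aggregates column Y into a list. I want to do the same thing, aggregating into a list, but subject to a condition:
As you can see in the example, I get 2 rows out of 4 because the age column is aggregated. My condition is that I only want to aggregate bins that are adjacent to each other.
In other words: bin [0,20] can only be aggregated with bin [21,40]; bin [21,40] can be aggregated with [0,20] and [41,60]; and so on.
I used the .agg method for this.
Any ideas or suggestions are welcome.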
# Import pandas library
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'country': ['1', '1', '1', '1'],
    'age': [5, 25, 45, 70],
    'gender': ['M', 'M', 'M', 'M'],
    'language': ['A', 'B', 'B', 'A']
})
# Define the age bins with custom groups
age_bins = [(0, 20), (21, 40), (41, 60), (61, 100)]
age_labels = ['0-20', '21-40', '41-60', '61-100']
# Define a custom function to group the data based on adjacent age ranges
def custom_age_group(age):
    for i, (start, end) in enumerate(age_bins):
        if age >= start and age <= end:
            return age_labels[i]
    return 'Unknown'
# Apply the custom function to create a new column with the custom age groups
df['age_group'] = df['age'].apply(custom_age_group)
# Group the data based on country, gender, and custom age group
df_out = df.groupby(['gender', 'country','language'])['age_group'].agg(list).reset_index()
# Print the result
print(df_out)
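You can add an extra grouping level by extracting the start and stop of each bin and, within each group, comparing each row's start with the previous row's stop plus one: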
cols = ['gender', 'country','language']
bins = df['age_group'].str.split('-', expand=True).astype(int)
g = (df.join(bins)
       .groupby(cols, group_keys=False)
       .apply(lambda g: g[0].ne(g[1].shift().add(1)).cumsum())
    )
out = df.groupby(cols+[g], as_index=False)['age_group'].agg(list)
Output:
  gender country language       age_group
0      M       1        A          [0-20]
1      M       1        A        [61-100]
2      M       1        B  [21-40, 41-60]
Intermediates:
bins
    0    1
0   0   20
1  21   40
2  41   60
3  61  100

g
0    1
1    1
2    1
3    2
dtype: int64
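
For readers who want to see the same idea step by step, here is a minimal sketch that avoids groupby.apply by using a grouped shift and a grouped cumsum instead. It is an alternative formulation, not the code from the answer above. The names start, stop, expected_start, new_run, run_id and run are made up for illustration, and the sketch assumes that within each group the rows are already sorted by bin start and that every label has the form 'start-stop'.

```python
import pandas as pd

# Same data as in the question, with age_group already computed.
df = pd.DataFrame({
    'country': ['1', '1', '1', '1'],
    'gender': ['M', 'M', 'M', 'M'],
    'language': ['A', 'B', 'B', 'A'],
    'age_group': ['0-20', '21-40', '41-60', '61-100'],
})
cols = ['gender', 'country', 'language']

# Split the 'start-stop' labels into two integer columns (illustrative names).
bounds = df['age_group'].str.split('-', expand=True).astype(int)
tmp = df.assign(start=bounds[0], stop=bounds[1])

# Start value a row would have if its bin directly followed the previous
# row's bin in the same group (NaN for the first row of each group).
expected_start = tmp.groupby(cols)['stop'].shift() + 1

# True where a new run of adjacent bins begins: first row of a group, or a gap.
new_run = tmp['start'].ne(expected_start)

# Counting run starts within each group yields a run id per row.
run_id = new_run.groupby([tmp[c] for c in cols]).cumsum()

# Group on the original keys plus the run id, then drop the helper column.
out = (tmp.groupby(cols + [run_id.rename('run')])['age_group']
          .agg(list)
          .reset_index()
          .drop(columns='run'))
print(out)
```

If the rows are not guaranteed to be ordered, sorting by the group keys and the bin start before computing expected_start would be needed for the comparison to be meaningful.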