I have the following (sample) dataset:
>>> pd.DataFrame([["001", "Apple"],["002","Strawberry"],["001", None],["002","Strawberry"], ["003", "Apple"],["003","Strawberry"],], columns = ["Deal", "Product"])
Deal Product
0 001 Apple
1 002 Strawberry
2 001 None
3 002 Strawberry
4 003 Apple
5 003 Strawberry
I want to group all the products linked to a deal into one set, like this:
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Strawberry, Apple}
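I have an attempted solution, posted as one of the answers below, and I would like to know whether I am doing this the right way (Pythonic, fastest).
I started from this answer to solve my problem: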
# Turn every element in Product into a set of one or zero elements
# Also ensure we don't have null values
# Note: set() is the empty set; {} would create an empty dict
df["Product"] = df["Product"].apply(lambda val: {val} if val not in [None, ""] else set())
# Use the answer mentioned to bring together the single sets
df = df.groupby('Deal').agg({'Product': lambda x: set.union(*x)}).reset_index('Deal')
Final result:
>>> df
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Strawberry, Apple}
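For reference, here is a self-contained sketch of the approach above, rebuilt on the sample data. The names to_singleton, singletons and result are introduced only for this illustration. Two details are my own choices rather than part of the original attempt: the empty case returns set() (a bare {} is an empty dict, and set.union fails if a dict happens to be the first argument), and the union starts from set() so the order of rows does not matter.

```python
import pandas as pd

df = pd.DataFrame(
    [["001", "Apple"], ["002", "Strawberry"], ["001", None],
     ["002", "Strawberry"], ["003", "Apple"], ["003", "Strawberry"]],
    columns=["Deal", "Product"],
)


def to_singleton(val):
    """Hypothetical helper: wrap a product in a one-element set, or return an empty set."""
    # pd.isna covers both None and NaN; "" is treated as missing too
    if pd.isna(val) or val == "":
        return set()  # set() is the empty set, {} would be an empty dict
    return {val}


# One set per row, then one union per deal
singletons = df["Product"].apply(to_singleton)
result = (
    singletons.groupby(df["Deal"])
    .agg(lambda sets: set().union(*sets))
    .reset_index()
)
print(result)
```

Starting from set().union(...) instead of set.union(*x) avoids depending on the type of the first element in each group.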
You can use SeriesGroupBy.agg and pd.Series.dropna together with set.
df.groupby('Deal')['Product'].agg(lambda x:set(x.dropna()))
Deal
001 {Apple}
002 {Strawberry}
003 {Strawberry, Apple}
Name: Product, dtype: object
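Note that dropna only removes missing values, so an empty string would still end up in the set. A minimal sketch of one way to cover that case as well, assuming "" should be treated like a missing value (the sample rows with "" and the names products and result are made up for this illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [["001", "Apple"], ["001", None], ["002", ""], ["002", "Strawberry"],
     ["003", "Apple"], ["003", "Strawberry"]],
    columns=["Deal", "Product"],
)

# Replace "" with a missing value so that dropna() removes it as well
products = df["Product"].mask(df["Product"].eq(""))

# Group the cleaned column by Deal and collect the remaining values into a set
result = products.groupby(df["Deal"]).agg(lambda x: set(x.dropna()))
print(result)
```

A deal whose products are all missing still appears in the result here, with an empty set.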
The idea is to remove the missing values first (both conditions must hold, so they are combined with &):
mask = df['Product'].notna() & df['Product'].ne('')
df = df[mask].groupby('Deal')['Product'].agg(set).reset_index()
print (df)
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Apple, Strawberry}
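A small sketch to check the filtering idea on data that contains both None and "", assuming a row should be kept only when Product is non-null and non-empty. The extra deal "004" and the names filtered and missing_deals are made up for this illustration. It also shows a side effect of filtering before grouping: a deal with no valid product disappears from the result entirely.

```python
import pandas as pd

df = pd.DataFrame(
    [["001", "Apple"], ["001", None], ["002", ""], ["002", "Strawberry"],
     ["004", None]],
    columns=["Deal", "Product"],
)

# Keep a row only if Product is both non-null AND non-empty
mask = df["Product"].notna() & df["Product"].ne("")
filtered = df[mask].groupby("Deal")["Product"].agg(set).reset_index()
print(filtered)

# Deals whose rows were all filtered out no longer appear in the result
missing_deals = set(df["Deal"]) - set(filtered["Deal"])
print(missing_deals)
```

If such deals need to be kept with an empty set, building the sets per row or per group (instead of dropping rows up front) is probably the simpler route.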
You can use notnull:
print (df.loc[df['Product'].notnull()].groupby("Deal")["Product"].apply(set))
Deal
001 {Apple}
002 {Strawberry}
003 {Strawberry, Apple}
If you want to handle both None and '' at the same time, use df.loc[~df['Product'].isin([None,""])]... instead. Alternatively, filter using the approach from the other answers.