I have a df like the one below. I am trying to find the intersection of rows based on the value of the host column.
host values
test ['A','B','C','D']
test ['D','E','B','F']
prod ['1','2','A','D','E']
prod []
prod ['2']
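The expected output is the intersection of each row with the next row, if the host values are the same. For the df above, the output is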
test=['B','D'] - intersection of row 1 and 2
prod=[] - intersection of row 3 and 4
prod=[] - intersection of row 4 and 5
The intersection of rows 2 and 3 is not computed because the host column values do not match. Any help is appreciated.
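I'm not sure what structure you want for the result, but you can use shift within each host group to create a new column holding the previous row's list. Then use apply on the rows where this new column is notna and take the intersection of the two sets.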
df['val_shift'] = df.groupby('host')['values'].shift()
df['intersect'] = df[df['val_shift'].notna()]\
.apply(lambda x: list(set(x['values'])&set(x['val_shift'])), axis=1)
print (df)
host values val_shift intersect
0 test [A, B, C, D] NaN NaN
1 test [D, E, B, F] [A, B, C, D] [B, D]
2 prod [1, 2, A, D, E] NaN NaN
3 prod [] [1, 2, A, D, E] []
4 prod [2] [] []
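For reference, here is a self-contained sketch of the same shift idea, with the sample frame rebuilt from the question. The names prev, has_prev and result are only illustrative, and sorted() is my addition so that the order inside each list is deterministic (a plain set gives no ordering guarantee).

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "host": ["test", "test", "prod", "prod", "prod"],
    "values": [["A", "B", "C", "D"], ["D", "E", "B", "F"],
               ["1", "2", "A", "D", "E"], [], ["2"]],
})

# Previous row's list within the same host; NaN for the first row of each host.
prev = df.groupby("host")["values"].shift()
has_prev = prev.notna()

# Intersect each row with its predecessor; sorted() fixes the element order.
pairs = zip(df.loc[has_prev, "values"], prev[has_prev])
intersect = pd.Series(
    [sorted(set(cur) & set(old)) for cur, old in pairs],
    index=df.index[has_prev],
    dtype=object,
)

# Keep only the rows that actually have a predecessor in their group.
result = df.loc[has_prev, ["host"]].assign(intersect=intersect)
print(result)
```

This keeps one row per compared pair (indexes 1, 3 and 4 here) instead of leaving NaN placeholders in the original frame.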
You can use df.groupby together with SeriesGroupBy.apply and a custom function.
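def f(s):
    # Pair each row with the next row of the same host; the last row has no successor and is dropped.
    s = pd.concat([s, s.shift(-1)], axis=1).dropna(how='any')
    return s.apply(lambda x: f'{set(x.iloc[0]) & set(x.iloc[1])} between row {x.name+1} and {x.name+2}', axis=1)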
df.groupby('host')['values'].apply(f)
host
prod 2 set() between row 3 and 4
3 set() between row 4 and 5
test 0 {'D', 'B'} between row 1 and 2
Name: values, dtype: object
# If you don't want index
# df.groupby('host')['values'].apply(f).reset_index(drop=True)
# 0 set() between row 3 and 4
# 1 set() between row 4 and 5
# 2 {'D', 'B'} between row 1 and 2
# Name: values, dtype: object
To get [] and ['D', 'B'] in the output instead of set() and {'D', 'B'}, try this.
def f(s):
    # Same pairing as before, but unpack the set into a list before formatting.
    s = pd.concat([s, s.shift(-1)], axis=1).dropna(how='any')
    return s.apply(lambda x: f'{[*(set(x.iloc[0]) & set(x.iloc[1]))]} between row {x.name+1} and {x.name+2}', axis=1)
df.groupby('host')['values'].apply(f).reset_index(drop=True)
0 [] between row 3 and 4
1 [] between row 4 and 5
2 ['D', 'B'] between row 1 and 2
Name: values, dtype: object
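If you would rather have structured results than formatted strings, a plain-Python sketch along these lines might also work. Note that this is a different technique from the groupby answers: it compares physically adjacent rows and skips pairs whose hosts differ, so it only matches the groupby result when the rows of each host are contiguous, as in the question. The function name adjacent_intersections and the tuple layout are hypothetical.

```python
# Rows from the question as (host, values) pairs.
# With a DataFrame this could be built as list(zip(df["host"], df["values"])).
rows = [
    ("test", ["A", "B", "C", "D"]),
    ("test", ["D", "E", "B", "F"]),
    ("prod", ["1", "2", "A", "D", "E"]),
    ("prod", []),
    ("prod", ["2"]),
]

def adjacent_intersections(rows):
    """Return (host, row_no, next_row_no, common) for each adjacent same-host pair."""
    out = []
    # Row numbers are 1-based to match the wording of the question.
    for i, ((h1, v1), (h2, v2)) in enumerate(zip(rows, rows[1:]), start=1):
        if h1 == h2:
            out.append((h1, i, i + 1, sorted(set(v1) & set(v2))))
    return out

for host, a, b, common in adjacent_intersections(rows):
    print(f"{host}={common} - intersection of row {a} and {b}")
```

The pair of rows 2 and 3 is skipped because the hosts differ, which is the behaviour the question asks for.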