I have two DataFrames:
df1 = pd.DataFrame({'col1': ['ID1', 'ID2','ID3','ID4','ID5','ID6','ID7'], 'col2': ["S3,S22,S44", "S133,S32,S334", "S13,S24,S45", "S1,S2,S4,S5", "S3,S4,S5", "S3,S2,S5", "S38,S42,S9"],'col3': ['ab', 'ac','ad','ae','af','as','ak'],})
df2 = pd.DataFrame({'name1': ['Ik3', 'Ik1','Ik2','Ik7','Ik5','Ik6','Ik5'], 'col2': ["S3, S44, S22,S54", "S133, S32,S334, S30", "S13, S24,S45", "S11, S21,S4, S5", "S3, S4,S5", "S3, S2,S5", "S3, S4,S9, S10, S13"],'col3': ['ab', 'ac','ad','ae','af','as','ak'],})
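I want to compare the comma-separated lists in col2 of the two DataFrames and merge the rows whose match is over 50%, leaving the rest empty.
Desired output: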
pd.DataFrame({'name1': ['ID1', 'ID2','ID3','ID4','ID5','ID6','ID7'], 'col2': ["S3, S22,S44", "S133, S32,S334", "S13, S2 4,S45", "S1, S2,S4 S5", "S3, S4,S5", "S3, S2,S5", "S3, S4,S9"],'col3': ['ab', 'ac','ad','ae','af','as','ak'],'nma1': ['Ik3', 'Ik1','Ik2','Ik5','Ik5','Ik6','nan'],'percentage': ['75', '50','100','50','100','100','0']})
I tried using isin:
df1[df1.col2.isin(df2.col2)]
But that did not give the desired output. Any suggestions are appreciated.
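For context, here is a minimal sketch of why isin falls short: it tests whether each whole cell string appears verbatim in the other column, so strings that share items but differ in order, spacing, or length never match. The two small Series and the helper `to_set` are illustrative assumptions, not part of the original data.

```python
import pandas as pd

left = pd.Series(["S3,S22,S44", "S3,S4,S5"])
right = pd.Series(["S3, S44, S22,S54", "S3,S4,S5"])

# isin checks exact string equality, so only the identical string matches.
exact = left.isin(right)
print(exact.tolist())  # [False, True]

# Hypothetical helper: split a cell into a set of stripped items.
def to_set(s):
    return {item.strip() for item in s.split(",")}

# Set intersection exposes the shared items that isin cannot see.
shared = to_set(left[0]) & to_set(right[0])
print(sorted(shared))  # ['S22', 'S3', 'S44']
```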
Your description and your expected output do not match. Still, here is some code that should help you get started.
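import pandas as pd

def get_ratios(df1, df2):
    # Strip whitespace around each comma-separated item
    clean = lambda s: list(map(str.strip, s.split(',')))
    # Compare row i of df1 with row i of df2
    for a, b in zip(df1.col2, df2.col2):
        vals1, vals2 = clean(a), clean(b)
        inter = set(vals1).intersection(vals2)
        # Shared items as a fraction of the longer list
        ratio = len(inter) / max(len(vals1), len(vals2))
        yield ratio

s = pd.Series(get_ratios(df1, df2))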
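As a worked example of the ratio, the sketch below runs the same arithmetic on the first row of each frame. The standalone `clean` function mirrors the lambda in the answer. The variable names are only for illustration.

```python
def clean(s):
    # Split on commas and strip surrounding whitespace from each item
    return [item.strip() for item in s.split(",")]

a = clean("S3,S22,S44")          # first row of df1.col2, 3 items
b = clean("S3, S44, S22,S54")    # first row of df2.col2, 4 items

shared = set(a) & set(b)         # items present in both lists
ratio = len(shared) / max(len(a), len(b))
print(ratio)  # 0.75
```

Three shared items divided by the longer list's length (4) gives 0.75.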
Then merge:
# df2's identifier column is name1 in the question as posted
(df1.merge(df2.rename(columns={'name1': 'nma1'}),
           on=['col3'])
    .assign(percentage=s))  # .where(s > 0.5)
   col1          col2_x  col3  nma1               col2_y  percentage
0   ID1     S3, S22,S44    ab   Ik3     S3, S44, S22,S54    0.750000
1   ID2  S133, S32,S334    ac   Ik1  S133, S32,S334, S30    0.750000
2   ID3   S13, S2 4,S45    ad   Ik2         S13, S24,S45    0.666667
3   ID4    S1, S2,S4 S5    ae   Ik7      S11, S21,S4, S5    0.000000
4   ID5       S3, S4,S5    af   Ik5            S3, S4,S5    1.000000
5   ID6       S3, S2,S5    as   Ik6            S3, S2,S5    1.000000
6   ID7       S3, S4,S9    ak   Ik5  S3, S4,S9, S10, S13    0.600000
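The commented-out `.where(s > 0.5)` hints at the remaining step: blanking the match for rows under the threshold. One possible sketch, assuming only the nma1 column should be cleared and that the cutoff is strictly greater than 0.5 as in that comment. The three-row `merged` frame is a hand-copied subset of the output above, used for illustration only.

```python
import pandas as pd

merged = pd.DataFrame({
    "col1": ["ID1", "ID4", "ID5"],
    "nma1": ["Ik3", "Ik7", "Ik5"],
    "percentage": [0.75, 0.0, 1.0],
})

threshold = 0.5  # assumption: follows the `s > 0.5` condition in the comment
keep = merged["percentage"] > threshold

# Series.where keeps values where the condition is True and sets NaN elsewhere.
merged["nma1"] = merged["nma1"].where(keep)
print(merged)
```

Here the ID4 row keeps its percentage of 0.0 but loses its nma1 value.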
Please see my answer below. I wrote a function that computes the percentage match; if the percentage is below 50%, NaN is used for column nma1. Thanks.
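import numpy as np
import pandas as pd

def get_percentage(x, y):
    '''
    Convert both comma-separated strings to lists,
    compute the percentage of y's items that also occur in x,
    and return NaN if the match is below 50%.
    '''
    x = [i.strip() for i in x.split(',')]
    y = [i.strip() for i in y.split(',')]
    percent = int(round((100.0 * len(set(x) & set(y))) / len(set(y)), 0))
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    return np.nan if percent < 50 else percent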
# Merge on the index of both frames; df2's identifier column is name1 in the question as posted
df = pd.merge(df1, df2.rename(columns={'name1': 'nma1'}), left_index=True, right_index=True, suffixes=('', '_y')).rename(columns={'col1': 'name1'})
# Get the percentage using apply/lambda functions
df['percent'] = df.apply(lambda x: get_percentage(x.col2, x.col2_y), axis=1)
# Remove not needed columns
df.drop(columns=['col2_y', 'col3_y'], inplace=True)
# Blank out nma1 wherever percent is NaN
df['nma1'] = df.apply(lambda x: np.nan if np.isnan(x.percent) else x.nma1, axis=1)
df
Result:
   name1            col2  col3  nma1  percent
0    ID1     S3, S22,S44    ab   Ik3     75.0
1    ID2  S133, S32,S334    ac   Ik1     75.0
2    ID3   S13, S2 4,S45    ad   Ik2     67.0
3    ID4    S1, S2,S4 S5    ae   NaN      NaN
4    ID5       S3, S4,S5    af   Ik5    100.0
5    ID6       S3, S2,S5    as   Ik6    100.0
6    ID7       S3, S4,S9    ak   Ik5     60.0
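Both answers compare row i of df1 with row i of df2. If the goal is instead to find, for each df1 row, the best-matching row anywhere in df2, a rough sketch could look like the following. It rests on several assumptions: `to_set` and `best_match` are hypothetical helper names, the ratio uses the larger set as the denominator, ties go to the first candidate, and a match of exactly 50 is kept because the desired output keeps such rows. The frames are three-row subsets of the question's data.

```python
import pandas as pd

def to_set(s):
    # Split a comma-separated cell into a set of stripped items
    return {item.strip() for item in s.split(",")}

def best_match(items, candidates):
    """Return (name, percent) of the candidate sharing the most items, or (None, 0)."""
    best_name, best_pct = None, 0
    for name, cand in candidates:
        pct = round(100 * len(items & cand) / max(len(items), len(cand)))
        if pct > best_pct:  # strict '>' means the first candidate wins ties
            best_name, best_pct = name, pct
    return best_name, best_pct

df1 = pd.DataFrame({"col1": ["ID1", "ID5", "ID7"],
                    "col2": ["S3,S22,S44", "S3,S4,S5", "S38,S42,S9"]})
df2 = pd.DataFrame({"name1": ["Ik3", "Ik7", "Ik5"],
                    "col2": ["S3, S44, S22,S54", "S11, S21,S4, S5", "S3, S4,S5"]})

# Parse df2 once, then score every df1 row against every df2 row.
candidates = [(n, to_set(c)) for n, c in zip(df2["name1"], df2["col2"])]
matches = [best_match(to_set(c), candidates) for c in df1["col2"]]

# Keep the name only when the best score reaches the assumed 50% cutoff.
df1["nma1"] = [name if pct >= 50 else None for name, pct in matches]
df1["percentage"] = [pct for _, pct in matches]
print(df1)
```

This compares every pair of rows, so the cost grows with len(df1) * len(df2); that is fine for small frames like these but may need a different approach for large ones.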