I want to compute the Levenshtein distance between two DataFrames (dfa and dfb), shown below.
dfa:
Name Addresss ID
Name1a Address1a ID1a
Name2a Address2a ID2a
dfb:
Name Addresss ID
Name1b Address1b ID1b
Name2b Address2b ID2b
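I understand how to compute the distance between two strings, but I'm not sure how to compare every value in one column against every value in another, so that the output lists all pairs with their scores, something like this:
Output: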
Name Name LevScore
Name1a Name1b 0.87
Name1a Name2b 0.45
Name1a Name3b 0.26
Name2a Name1b 0.92
Name2a Name2b 0.67
Name2a Name3b 0.56
etc
Thanks in advance,
Manesh
You can use the Levenshtein package together with itertools to get every combination of values from the two columns.
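import pandas as pd
import Levenshtein as lev
from itertools import product
# df1 and df2 stand for the question's dfa and dfb
# product() yields every (df1 name, df2 name) pair
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
# lev.distance returns the integer edit distance between the two strings
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
print(new_df)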
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
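The scores in the question (0.87, 0.45, ...) look like normalized similarities, whereas an edit distance is an integer count of edits. The sketch below shows one common way to turn a distance into a score between 0 and 1, using only the standard library. The helper names levenshtein_distance and normalized_score are hypothetical, and the formula 1 - distance / max(len) is an assumption about what kind of score is wanted; it is not necessarily the formula that produced the numbers in the question, and it is not claimed to match any particular library's ratio function.

```python
from itertools import product


def levenshtein_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_score(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 when the
    distance equals the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Sample names taken from the question's DataFrames
names_a = ["Name1a", "Name2a"]
names_b = ["Name1b", "Name2b"]

# One row per (name_a, name_b) pair, with the score rounded to 2 decimals
rows = [(x, y, round(normalized_score(x, y), 2))
        for x, y in product(names_a, names_b)]
for row in rows:
    print(row)
```

The same normalized_score function could be passed to the apply call in place of the raw distance if a 0-to-1 score is preferred.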
Edit
Suppose this is your df1:
df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n
Name Addresss ID
0 Name1a Address1a ID1a
1 Name2a Address2a ID2a
2 Name1a Address1a ID1a
3 Name2a Address2a ID2a
4 Name1a Address1a ID1a
5 Name2a Address2a ID2a
As you suggested, you can compute the combinations in chunks of size step taken from df1_n:
final_df = pd.DataFrame()
step = 2
# Process df1_n in chunks of `step` rows so only one chunk's pairs are built at a time
for i in range(0, df1_n.shape[0], step):
    new_df = pd.DataFrame(product(df1_n.iloc[i:i+step, 0], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
    # Append this chunk's results to the running result
    final_df = pd.concat([final_df, new_df], axis=0).reset_index(drop=True)
print(final_df)
Output:
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
4 Name1a Name1b 1
5 Name1a Name2b 2
6 Name2a Name1b 2
7 Name2a Name2b 1
8 Name1a Name1b 1
9 Name1a Name2b 2
10 Name2a Name1b 2
11 Name2a Name2b 1
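Change the step from 2 to 300 or 500 depending on your data. This should keep the intermediate pairs from filling up your memory. Let me know if it works!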
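Note that the loop above still accumulates every pair in the final DataFrame. If each chunk can be written out or filtered as soon as it is scored, a lazy batching helper keeps only one batch in memory. This is a minimal standard-library sketch; pair_batches is a hypothetical helper name, and the batch size of 5 is chosen only for illustration.

```python
from itertools import islice, product


def pair_batches(left, right, batch_size):
    """Yield lists of at most batch_size (left, right) pairs.

    itertools.product stores the two input sequences internally but
    generates the pairs one at a time, so only the current batch of
    pairs is held as a list.
    """
    pairs = product(left, right)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Six names on the left (as in df1_n) and two on the right: 12 pairs in total
names_a = ["Name1a", "Name2a"] * 3
names_b = ["Name1b", "Name2b"]

batch_sizes = []
for batch in pair_batches(names_a, names_b, batch_size=5):
    # Score, filter, or write out the batch here before moving on
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Each batch could be turned into a small DataFrame, scored, and appended to a file, so the full set of pairs never has to exist at once.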
Try this:
import pandas as pd
from textdistance import levenshtein
from itertools import product
# dfa = pd.read_clipboard() # this is just to reproduce your dataframe
# dfb = pd.read_clipboard() # this is just to reproduce your dataframe
dfc = pd.DataFrame(product(dfa['Name'], dfb['Name']), columns=['Name1', 'Name2'])
dfc['Distance'] = dfc.apply(lambda x: levenshtein.distance(x['Name1'], x['Name2']), axis=1)
Name1 Name2 Distance
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1