I have the following DataFrame df:
id1 id2 text_column
key1 220 ABC corp
key1 220 ABC Pvt Ltd
key2 300 PQR Ltd
key2 300 PQR
key2 300 PQR something else
key2 400 XYZ company
I don't know in advance what text will be in text_column, but for rows sharing the same id1 and id2 I want to identify similar strings in text_column and replace them with one standard text. I want text_column to be standardized like below:
id1 id2 text_column
key1 220 ABC corp
key1 220 ABC corp
key2 300 PQR
key2 300 PQR
key2 300 PQR
key2 400 XYZ company
I am computing similarity scores with the following code:
import pandas as pd
import numpy as np
from fuzzywuzzy import process, fuzz

df = pd.DataFrame(data={
    "id1": ["key1", "key1", "key2", "key2", "key2", "key2"],
    "id2": [220, 220, 300, 300, 300, 400],
    "text_column": ["ABC corp", "ABC Pvt Ltd", "PQR Ltd", "PQR",
                    "PQR something else", "XYZ company"],
})
filters = ['id1', 'id2']
# collect all text values per (id1, id2) group into a list column
df_text_column = (df.groupby(filters)['text_column'].apply(list)
                  .reset_index().rename(columns={"text_column": "lst_text_column"}))
lst_temp = df_text_column.columns.difference(df.columns).union(filters).to_list()
df = pd.merge(df, df_text_column[lst_temp], on=filters, how='left')
# score each value against every value in its own group
df['text_column similarity score'] = df.apply(
    lambda x: process.extract(x['text_column'], x['lst_text_column']), axis=1)
df['text_column similarity score'] = df['text_column similarity score'].apply(
    lambda x: [i[1] for i in x])
df['text_column similarity score min'] = df['text_column similarity score'].apply(np.min)
I want the text in text_column standardized across similar strings in the same way as above, so that I can then use text_column in a groupby for further calculations.
rapidfuzz is worth trying for this type of thing. `.process.cdist` compares all of the strings at once, which performs much better than the other approaches, and `workers=-1` will use all CPU cores.
>>> pd.DataFrame(scores)
0 1 2 3 4 5
0 1.000000 0.621212 0.422619 0.000000 0.453704 0.621212
1 0.621212 1.000000 0.589610 0.474747 0.445286 0.393939
2 0.422619 0.589610 1.000000 0.866667 0.664021 0.411255
3 0.000000 0.474747 0.866667 1.000000 0.805556 0.000000
4 0.453704 0.445286 0.664021 0.805556 1.000000 0.528620
5 0.621212 0.393939 0.411255 0.000000 0.528620 1.000000
The result is a "matrix" of all the scores, but we can use `.groupby().indices` to work out which values belong to each group.
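`.groupby().indices` returns a dict mapping each group key to the positional row indices of that group, which is what lets us mask out the cross-group scores:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["key1", "key1", "key2", "key2", "key2", "key2"],
    "id2": [220, 220, 300, 300, 300, 400],
})

# dict of (id1, id2) -> array of positional row indices
indices = df.groupby(["id1", "id2"]).indices
for key, rows in indices.items():
    print(key, rows.tolist())
# ('key1', 220) [0, 1]
# ('key2', 300) [2, 3, 4]
# ('key2', 400) [5]
```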
>>> pd.DataFrame(scores)
0 1 2 3 4 5
0 1.000000 0.621212 NaN NaN NaN NaN
1 0.621212 1.000000 NaN NaN NaN NaN
2 NaN NaN 1.000000 0.866667 0.664021 NaN
3 NaN NaN 0.866667 1.000000 0.805556 NaN
4 NaN NaN 0.664021 0.805556 1.000000 NaN
5 NaN NaN NaN NaN NaN 1.0
From this result we can see that our "groups" are the rows:
0, 1
2, 3, 4
5
i.e. the indices of the non-NaN columns.
Full code example:
import rapidfuzz
import pandas as pd
import numpy as np
df = pd.DataFrame({
"id1": ["key1", "key1", "key2", "key2", "key2", "key2"],
"id2": [220, 220, 300, 300, 300, 400],
"text_column": [
"ABC corp", "ABC Pvt Ltd", "PQR Ltd", "PQR",
"PQR something else", "XYZ company"
]
})
keys = ["id1", "id2"]
min_score = 0.6
scores = (
rapidfuzz.process.cdist(
df["text_column"],
df["text_column"],
scorer=rapidfuzz.distance.JaroWinkler.similarity,
workers=-1
)
)
# indices of values to "remove" - will set to NaN
mask = (
[np.repeat(rows, len(cols)), np.tile(cols, len(rows))]
for key, rows in df.groupby(keys).indices.items()
for cols in [
np.setdiff1d(np.arange(df.shape[0]), rows)
]
)
rows, cols = zip(*mask)
rows = np.hstack(rows)
cols = np.hstack(cols)
# "remove" non-group values
scores[rows, cols] = np.nan
# "remove" any matches below score
scores[scores < min_score] = np.nan
# get the indices of remaining matches
row, col = np.where(~np.isnan(scores))
matches = pd.DataFrame(dict(row=row, col=col)).groupby("row").agg(tuple)
df["standard"] = df.groupby(matches["col"])["text_column"].transform("first")
id1 id2 text_column standard
0 key1 220 ABC corp ABC corp
1 key1 220 ABC Pvt Ltd ABC corp
2 key2 300 PQR Ltd PQR Ltd
3 key2 300 PQR PQR Ltd
4 key2 300 PQR something else PQR Ltd
5 key2 400 XYZ company XYZ company
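With the `standard` column in place, the follow-up calculations the question mentions can group on it directly. A minimal sketch, assuming a hypothetical numeric column `amount` to aggregate:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["key1", "key1", "key2", "key2", "key2", "key2"],
    "id2": [220, 220, 300, 300, 300, 400],
    "standard": ["ABC corp", "ABC corp", "PQR Ltd", "PQR Ltd",
                 "PQR Ltd", "XYZ company"],
    "amount": [10, 20, 5, 5, 5, 7],  # hypothetical values for illustration
})

# one row per standardized name, with the group's total
totals = df.groupby(["id1", "id2", "standard"])["amount"].sum()
print(totals)
```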