I have a huge dataset of 2 million records, and I want to match records using fuzzy logic. My original dataframe looks like this:
+---------+---------------+
|     name|        address|
+---------+---------------+
|   Arvind|      Kathmandu|
|   Arvind|      Kathmands|
|   Arbind|      Kathmandu|
|  Arvinds|      Kathmandu|
|   Arveen|      Kathmandu|
|   Arvins|      Kathmandu|
|   Arvind|Kathmandu Nepal|
| Abhishek|        Pokhara|
|Abhisheks|        Pokhara|
|Abhishek1|        Pokhara|
|Abhishek2|        Pokhara|
|Abhishek3|        Pokhara|
+---------+---------------+
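I tried using PySpark window functions, but window functions partition on exact matches. I want to group likely matching records using fuzzy logic, and I would like my output to be a dataframe like this: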
+---------+---------------+---------------------+
|     name|        address|uuid_for_match_record|
+---------+---------------+---------------------+
|   Arvind|      Kathmandu|               uuid_1|
|   Arvind|      Kathmands|               uuid_1|
|   Arbind|      Kathmandu|               uuid_1|
|  Arvinds|      Kathmandu|               uuid_1|
|   Arveen|      Kathmandu|               uuid_1|
|   Arvins|      Kathmandu|               uuid_1|
|   Arvind|Kathmandu Nepal|               uuid_1|
| Abhishek|        Pokhara|               uuid_2|
|Abhisheks|        Pokhara|               uuid_2|
|Abhishek1|        Pokhara|               uuid_2|
|Abhishek2|        Pokhara|               uuid_2|
|Abhishek3|        Pokhara|               uuid_2|
+---------+---------------+---------------------+
I took inspiration from this blog post and wrote the following code.
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
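The cluster_names function simply clusters the strings in a list according to the cluster_threshold value. You can tune that value until the clusters look good. You can also experiment with shingling_width in name_to_features: you can create features with width=2,3,4,5 and so on and concatenate them into a single feature list.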
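To make the "several widths concatenated together" idea concrete, here is a minimal sketch. The name name_to_features_multi and the widths parameter are my own illustration and are not part of the code in this answer; the fallback for very short names is also an assumption I added so that the feature list is never empty.

```python
def name_to_features_multi(name, widths=(2, 3, 4)):
    """Concatenate character shingles of several widths into one feature list."""
    name = name.lower()
    features = []
    for width in widths:
        # Overlapping substrings of length `width`; empty if the name is too short
        shingles = [name[i:i + width] for i in range(len(name) - width + 1)]
        features.extend(shingles)
    # Assumption: fall back to the whole string when it is shorter than every width
    return features or [name]


print(name_to_features_multi("Arvind", widths=(2, 3)))
# ['ar', 'rv', 'vi', 'in', 'nd', 'arv', 'rvi', 'vin', 'ind']
```

The returned list could be passed to Simhash in place of the single-width features; whether that actually improves the clusters depends on your data, so treat it as something to try rather than a guaranteed win.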
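Once you are happy with the clusters, you can go a step further and run fuzzywuzzy matching (the library has been renamed to thefuzz) within each cluster to find more precise matches.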
https://github.com/seatgeek/thefuzz
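As a rough sketch of that second pass: the example below uses difflib.SequenceMatcher from the standard library as a stand-in for a thefuzz-style ratio, so that it runs without extra installs. It is not thefuzz itself, and scores may differ slightly from what thefuzz reports. The helper names similarity and refine_cluster and the min_score=80 cutoff are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for a thefuzz-style ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def refine_cluster(names, min_score=80):
    """Split a cluster into names that score at least min_score against its first name, and the rest."""
    anchor = names[0]
    kept, rejected = [], []
    for name in names:
        if similarity(anchor, name) >= min_score:
            kept.append(name)
        else:
            rejected.append(name)
    return kept, rejected


kept, rejected = refine_cluster(
    ["Arvind Kathmandu", "Arvind Kathmands", "Abhishek Pokhara"]
)
print(kept)      # ['Arvind Kathmandu', 'Arvind Kathmands']
print(rejected)  # ['Abhishek Pokhara']
```

Comparing only against the first name keeps the pass linear in the cluster size, at the cost of depending on which record happens to come first.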
First install the simhash Python library, then run the following code.
pip install simhash
from simhash import Simhash


def simhash_distance(hash1, hash2):
    # Hamming distance between the two Simhash fingerprints
    return hash1.distance(hash2)


def name_to_features(name, shingling_width=2):
    # Split the lowercased name into overlapping character shingles
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def cluster_names(names_list, cluster_threshold=20):
    # Greedy single pass: each name joins the first cluster whose centroid is close enough
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

    for name, hash_val in name_hashes:
        found_cluster = False
        for cluster_ele in clusters_internal:
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        if not found_cluster:
            # The first name of a new cluster stays as its centroid
            clusters_internal.append({'centroid': hash_val, 'names': [name]})

    return clusters_internal


# Example usage
names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

# The question's data, with name and address joined into one string per record
data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]
clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")
Output:
Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
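The question asks for a uuid_for_match_record column, so here is a minimal sketch of turning clusters into one shared UUID per group. The function assign_cluster_ids is hypothetical, and the input is a hand-written list in the same shape as the cluster_names result (a list of dicts with a 'names' key), so the snippet runs without simhash installed.

```python
import uuid


def assign_cluster_ids(clusters):
    """Map every record string to a UUID shared by all members of its cluster."""
    record_to_id = {}
    for cluster in clusters:
        cluster_id = str(uuid.uuid4())  # one random UUID per cluster
        for record in cluster["names"]:
            record_to_id[record] = cluster_id
    return record_to_id


# Hand-written clusters in the same shape as the cluster_names output
example_clusters = [
    {"names": ["Arvind Kathmandu", "Arbind Kathmandu"]},
    {"names": ["Abhishek Pokhara", "Abhisheks Pokhara"]},
]
ids = assign_cluster_ids(example_clusters)
for record, cluster_id in ids.items():
    print(record, cluster_id)
```

Note that identical record strings map to a single entry, which is fine here because exact duplicates belong to the same group anyway. Getting the IDs back onto the original dataframe would presumably mean joining on the same "name address" string used to build the input list; I have not shown or tested that part.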