Fuzzy logic to match records in a DataFrame

Question

I have a huge dataset of 2 million records, and I want to match records using fuzzy logic. My original DataFrame looks like this:

+---------+---------------+
|     name|        address|
+---------+---------------+
|   Arvind|      Kathmandu|
|   Arvind|      Kathmands|
|   Arbind|      Kathmandu|
|  Arvinds|      Kathmandu|
|   Arveen|      Kathmandu|
|   Arvins|      Kathmandu|
|   Arvind|Kathmandu Nepal|
| Abhishek|        Pokhara|
|Abhisheks|        Pokhara|
|Abhishek1|        Pokhara|
|Abhishek2|        Pokhara|
|Abhishek3|        Pokhara|
+---------+---------------+

I tried using PySpark window functions, but window functions partition on exact matches. I want to match likely records using fuzzy logic and get my output as a DataFrame like this:

+---------+---------------+---------------------+
|     name|        address|uuid_for_match_record|
+---------+---------------+---------------------+
|   Arvind|      Kathmandu|               uuid_1|
|   Arvind|      Kathmands|               uuid_1|
|   Arbind|      Kathmandu|               uuid_1|
|  Arvinds|      Kathmandu|               uuid_1|
|   Arveen|      Kathmandu|               uuid_1|
|   Arvins|      Kathmandu|               uuid_1|
|   Arvind|Kathmandu Nepal|               uuid_1|
| Abhishek|        Pokhara|               uuid_2|
|Abhisheks|        Pokhara|               uuid_2|
|Abhishek1|        Pokhara|               uuid_2|
|Abhishek2|        Pokhara|               uuid_2|
|Abhishek3|        Pokhara|               uuid_2|
+---------+---------------+---------------------+

How can this be achieved on a huge dataset of 2 million records?

Answer 1

I took inspiration from this blog post and wrote the following code.

https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

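The cluster_names function simply clusters the strings in a list based on the cluster_threshold value. You can tune this value to get good results. You can also experiment with shingling_width in name_to_features. For example, you can create features of width=2,3,4,5 and so on and concatenate them together.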
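The multi-width idea could be sketched roughly as follows. The name name_to_features_multi and its widths parameter are hypothetical and are not part of the answer's code; the slicing is the same as in name_to_features, repeated once per width.

```python
def name_to_features_multi(name, widths=(2, 3, 4)):
    """Return character shingles of several widths, concatenated into one list.

    Hypothetical helper: applies the same slicing as name_to_features
    once per width and joins the results.
    """
    name = name.lower()
    features = []
    for width in widths:
        # A width longer than the name yields an empty range, so no shingles
        features.extend(name[i:i + width] for i in range(len(name) - width + 1))
    return features


print(name_to_features_multi("Arvind", widths=(2, 3)))
# ['ar', 'rv', 'vi', 'in', 'nd', 'arv', 'rvi', 'vin', 'ind']
```

The resulting list can be passed to Simhash in place of the output of name_to_features. Whether the extra widths improve the clusters depends on the data, so it is worth comparing results for a few settings.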

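Once you are satisfied with the clusters, you can go a step further and run fuzzywuzzy (the library has since been renamed to thefuzz) matching within each cluster to find more precise matches.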

https://github.com/seatgeek/thefuzz
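A minimal sketch of that refinement step is shown here. To keep it dependency-free, it uses difflib.SequenceMatcher from the standard library as a stand-in, not thefuzz itself; with thefuzz installed, fuzz.ratio could play the role of the similarity function. The helper names, the choice of the first name as the reference, and the min_score value of 80 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score from 0 to 100 between two strings, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def refine_cluster(names, min_score=80):
    """Keep only the names that score at least min_score against the first name.

    Hypothetical helper: the first name of the cluster is used as the reference.
    """
    if not names:
        return []
    reference = names[0]
    return [name for name in names if similarity(reference, name) >= min_score]


cluster = ["Arvind Kathmandu", "Arvinds Kathmandu", "Abhishek Pokhara"]
print(refine_cluster(cluster))
# ['Arvind Kathmandu', 'Arvinds Kathmandu']
```

Because this pass only compares strings inside one cluster, it stays cheap even when the full dataset is large.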

First install the simhash Python library, then run the following code.

pip install simhash

from simhash import Simhash


def simhash_distance(hash1, hash2):
    # Number of differing bits between the two Simhash fingerprints
    return hash1.distance(hash2)


def name_to_features(name, shingling_width=2):
    # Split the lower-cased name into overlapping character shingles
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def cluster_names(names_list, cluster_threshold=20):
    # Greedy clustering: each name joins the first cluster whose centroid is
    # within cluster_threshold bits, otherwise it starts a new cluster.
    # The "centroid" is simply the hash of the first name in the cluster.
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

    for name, hash_val in name_hashes:
        found_cluster = False
        for cluster_ele in clusters_internal:
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        if not found_cluster:
            clusters_internal.append({'centroid': hash_val, 'names': [name]})
    return clusters_internal


# Example usage
names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]

clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")

Output:

Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
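To get from these clusters to the uuid_for_match_record column asked for in the question, one possible approach is to give every cluster a single UUID and map each clustered string to it. This is only a sketch: assign_match_ids is a hypothetical helper, and the example input mimics the shape returned by cluster_names with the centroid left out.

```python
import uuid


def assign_match_ids(clusters):
    """Map every clustered string to a UUID shared by its whole cluster."""
    record_to_id = {}
    for cluster in clusters:
        match_id = str(uuid.uuid4())  # one random UUID per cluster
        for record in cluster['names']:
            record_to_id[record] = match_id
    return record_to_id


# Same shape as the list returned by cluster_names (centroid omitted here)
example_clusters = [
    {'names': ['Arvind Kathmandu', 'Arbind Kathmandu']},
    {'names': ['Abhishek Pokhara', 'Abhisheks Pokhara']},
]
ids = assign_match_ids(example_clusters)
for record, match_id in ids.items():
    print(record, match_id)
```

The mapping is keyed by the concatenated "name address" string, so it could be joined back to the DataFrame on a column built the same way.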
