Fuzzy logic to match records in a dataframe


I have a large dataset of 2 million records, and I want to match records using fuzzy logic. My original dataframe looks like this:

+---------+---------------+
|     name|        address|
+---------+---------------+
|   Arvind|      Kathmandu|
|   Arvind|      Kathmands|
|   Arbind|      Kathmandu|
|  Arvinds|      Kathmandu|
|   Arveen|      Kathmandu|
|   Arvins|      Kathmandu|
|   Arvind|Kathmandu Nepal|
| Abhishek|        Pokhara|
|Abhisheks|        Pokhara|
|Abhishek1|        Pokhara|
|Abhishek2|        Pokhara|
|Abhishek3|        Pokhara|
+---------+---------------+

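I tried PySpark window functions, but window functions partition on exact matches. I want to group records that are likely matches using fuzzy logic, and I would like my output to be a dataframe like this: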

+---------+---------------+---------------------+
|     name|        address|uuid_for_match_record|
+---------+---------------+---------------------+
|   Arvind|      Kathmandu|               uuid_1|
|   Arvind|      Kathmands|               uuid_1|
|   Arbind|      Kathmandu|               uuid_1|
|  Arvinds|      Kathmandu|               uuid_1|
|   Arveen|      Kathmandu|               uuid_1|
|   Arvins|      Kathmandu|               uuid_1|
|   Arvind|Kathmandu Nepal|               uuid_1|
| Abhishek|        Pokhara|               uuid_2|
|Abhisheks|        Pokhara|               uuid_2|
|Abhishek1|        Pokhara|               uuid_2|
|Abhishek2|        Pokhara|               uuid_2|
|Abhishek3|        Pokhara|               uuid_2|
+---------+---------------+---------------------+

How can this be achieved on a dataset of 2 million records?

apache-spark pyspark fuzzywuzzy fuzzy-logic approximate-nn-searching
1 Answer

I wrote the following code, taking inspiration from this blog post.

https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

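The cluster_names function simply clusters the strings in a list according to the cluster_threshold value. You can tune that value to get good results. You can also experiment with shingling_width in name_to_features: for example, create features of width=2,3,4,5 etc. and concatenate them together.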
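The idea of concatenating shingles of several widths could look roughly like the sketch below. The helper names shingles and multi_width_features are hypothetical (they are not part of the simhash library); the resulting list would be passed to Simhash in place of the output of name_to_features.

```python
def shingles(name, width):
    """Return the overlapping character n-grams of `name` for one width."""
    name = name.lower()
    # range() is empty when the name is shorter than `width`, so no shingles result
    return [name[i:i + width] for i in range(len(name) - width + 1)]


def multi_width_features(name, widths=(2, 3, 4)):
    """Concatenate the shingles of several widths into one feature list."""
    features = []
    for width in widths:
        features.extend(shingles(name, width))
    return features


print(multi_width_features("Arvind", widths=(2, 3)))
# ['ar', 'rv', 'vi', 'in', 'nd', 'arv', 'rvi', 'vin', 'ind']
```

Wider shingles make the features more specific, so you will probably need to re-tune cluster_threshold after changing the widths.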

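Once you are satisfied with the clusters, you can go a step further and apply fuzzywuzzy matching (the library has since been renamed to thefuzz) to find more precise matches.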

https://github.com/seatgeek/thefuzz
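As a rough illustration of that refinement step, the sketch below re-checks every name in a cluster against the cluster's first name and drops the ones that score too low. To stay dependency-free it uses difflib.SequenceMatcher from the standard library as a stand-in scorer; with thefuzz installed, fuzz.ratio could be used in its place. The refine_cluster helper and the min_score=80 cutoff are assumptions made for this example, not something the library prescribes.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score from 0 to 100 (stdlib stand-in for a thefuzz scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def refine_cluster(names, min_score=80):
    """Split a cluster into names that match its first name and names that do not."""
    anchor = names[0]
    kept, rejected = [], []
    for name in names:
        if similarity(anchor, name) >= min_score:
            kept.append(name)
        else:
            rejected.append(name)
    return kept, rejected


# Hypothetical cluster that picked up one wrong record
kept, rejected = refine_cluster(
    ["Arvind Kathmandu", "Arvind Kathmands", "Arbind Kathmandu", "Abhishek Pokhara"]
)
print("kept:", kept)
print("rejected:", rejected)
```

Because this pairwise comparison only runs inside each cluster, it stays cheap even when the full dataset is large.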

First install the simhash Python library, then run the code below.

pip install simhash

from simhash import Simhash


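def simhash_distance(hash1, hash2):
    # Number of differing bits between the two Simhash fingerprints
    return hash1.distance(hash2)


def name_to_features(name, shingling_width=2):
    # Split the lowercased name into overlapping character n-grams (shingles)
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def cluster_names(names_list, cluster_threshold=20):
    # Greedy single-pass clustering: each name joins the first cluster whose
    # 'centroid' is within cluster_threshold bits; otherwise it starts a new one.
    # The 'centroid' is simply the hash of the first name placed in the cluster.
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

    for name, hash_val in name_hashes:
        found_cluster = False
        for cluster_ele in clusters_internal:
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        if not found_cluster:
            clusters_internal.append({'centroid': hash_val, 'names': [name]})
    return clusters_internal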


# Example usage
names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]

clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")

Output:

Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
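
To get from these clusters to the uuid_for_match_record column asked for in the question, one option is to generate one UUID per cluster and attach it to every record in that cluster. This is a minimal sketch: assign_cluster_ids is a hypothetical helper, and sample_clusters only mimics the shape returned by cluster_names (a list of dicts with a 'names' key).

```python
import uuid


def assign_cluster_ids(clusters):
    """Return (record, cluster_id) pairs; records in one cluster share one UUID."""
    rows = []
    for cluster in clusters:
        cluster_id = str(uuid.uuid4())  # one fresh id per cluster
        for name in cluster["names"]:
            rows.append((name, cluster_id))
    return rows


# Hypothetical clusters in the shape returned by cluster_names
sample_clusters = [
    {"names": ["Arvind Kathmandu", "Arbind Kathmandu"]},
    {"names": ["Abhishek Pokhara", "Abhisheks Pokhara"]},
]

rows = assign_cluster_ids(sample_clusters)
for record, cluster_id in rows:
    print(record, cluster_id)
```

The resulting pairs could then be loaded back into Spark (for example with spark.createDataFrame) and joined to the original dataframe.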