Pandas hierarchy / complex condition for dropping duplicates


I have a large dataset of locations; a sample is below.

I need to find and remove duplicates. There are some obvious duplicates that share the same address, name and borough, but there are also some nearly identical rows that actually identify 2 different locations (e.g. the last 2 rows of the sample below). To complicate things, most locations are missing a name, so I only have the address and the borough, and addresses do not always contain a number.

The boroughs are also inconsistent, but I have found a way to simplify them (see below).

import numpy as np
import pandas as pd

addresses = ['regents street', '21 regent street',
             'bishopgate 3', '3 bishopgate', 'bishop gate', 'regent',
             'hill park', 'hill park road', '10 hill park road',
             '12 hill park', 'south street', 'south street', 'east street', '2 east street', 'cup street', 'bond street',
            '80 cobbler road', '88 cobbler road']

name = ['a','a','b','b','b','c','d','e','e','d', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'g', 'h']
boroughs = ['royal borough of greenwhich', 'borough of greenwhich', 
            'royal borough of chelsea', 'chelsea', 'borough of chelsea',
            'greenwhich', 'haringey', 'haringey', 'borough of haringey', 'haringey',
           'southwark', 'southwark', 'hammersmith', 'borough of hammersmith', 
            'hackney', 'hackney', 'lambeth', 'lambeth']


df = pd.DataFrame({'address':addresses, 'name':name, 'borough':boroughs})
# to simplify the borough column I do this
# (note: the 'royal' alternative matches before 'royal borough of', so the
#  result keeps leading spaces, e.g. '  greenwhich')
df['borough'] = df['borough'].str.replace('royal|royal borough of|borough of', '', regex=True)

I thought I could also clean up the addresses like this:

df['address'] = df['address'].str.replace(r'\d+', '', regex=True)

But then I lose the information that distinguishes different locations on the same street.
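
For example (an illustration on the sample data above), after stripping the digits the two cobbler road rows, which are different locations with names 'g' and 'h', end up with the same address string:

# Illustration only: rows 16 and 17 are two different locations (names 'g' and 'h'),
# but after removing the digits both addresses read 'cobbler road'.
print(df.loc[[16, 17], ['address', 'name']])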

Any ideas on how to do this? (I have tried things like groupby([...]).max() without success.)

P.S. If an address appears only once but in different formats (e.g. 'east street' and '2 east street'), I will assume they refer to the same location, so only one record should be kept.
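
A minimal sketch of that rule on the sample data (assuming house numbers are the only formatting difference; spelling variants such as 'bishop gate' vs 'bishopgate' still need the fuzzy matching shown in the answer below):

# Sketch only: treat addresses that differ just by digits as one location,
# but keep rows whose names differ (e.g. the two cobbler road entries).
tmp = df.assign(
    base=df['address'].str.replace(r'\d+', '', regex=True).str.strip(),
    boro=df['borough'].str.strip(),
)
dedup = tmp[~tmp.duplicated(subset=['boro', 'base', 'name'])].drop(columns=['base', 'boro'])
print(dedup)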

python python-3.x pandas dataframe numpy
1 Answer

As @mozway pointed out in the comments, there is no built-in functionality that does this for you, and as he noted, you will need to write your own logic. My take is to work with a similarity score (you can decide on the threshold yourself in the code below). Here is an example of how you could do it:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

addresses = ['regents street', '21 regent street',
             'bishopgate 3', '3 bishopgate', 'bishop gate', 'regent',
             'hill park', 'hill park road', '10 hill park road',
             '12 hill park', 'south street', 'south street', 'east street', '2 east street', 'cup street', 'bond street',
            '80 cobbler road', '88 cobbler road']

name = ['a','a','b','b','b','c','d','e','e','d', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'g', 'h']
boroughs = ['royal borough of greenwhich', 'borough of greenwhich', 
            'royal borough of chelsea', 'chelsea', 'borough of chelsea',
            'greenwhich', 'haringey', 'haringey', 'borough of haringey', 'haringey',
           'southwark', 'southwark', 'hammersmith', 'borough of hammersmith', 
            'hackney', 'hackney', 'lambeth', 'lambeth']


df = pd.DataFrame({'address': addresses, 'name': name, 'borough': boroughs})
df['borough'] = df['borough'].str.replace('royal|royal borough of|borough of', '', regex=True).str.strip()


# Vectorise the addresses as word n-grams (1 to 3 words) and compute the
# pairwise cosine similarity between every pair of addresses.
vectorizer = TfidfVectorizer(min_df=1, analyzer='word', ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(df['address'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Any pair of addresses more similar than this is treated as the same location.
threshold = 0.8
rows_to_drop = set()

for i in range(len(cosine_sim)):
    for j in range(i + 1, len(cosine_sim)):
        if cosine_sim[i, j] > threshold:
            # Simple tie-break: if row j has no name, keep row i and drop j;
            # otherwise keep j and drop i.
            row_to_keep = i if pd.isna(df.iloc[j]['name']) else j
            rows_to_drop.add(j if row_to_keep == i else i)

df_deduplicated = df.drop(list(rows_to_drop))

print(df_deduplicated)

This will give you something like this:

              address name      borough
0      regents street    a   greenwhich
1    21 regent street    a   greenwhich
3        3 bishopgate    b      chelsea
4         bishop gate    b      chelsea
5              regent    c   greenwhich
6           hill park    d     haringey
7      hill park road    e     haringey
8   10 hill park road    e     haringey
9        12 hill park    d     haringey
10       south street  NaN    southwark
12        east street  NaN  hammersmith
14         cup street  NaN      hackney
15        bond street  NaN      hackney
16    80 cobbler road    g      lambeth
17    88 cobbler road    h      lambeth
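
If you want to sanity-check the threshold before dropping anything, you can list the pairs the loop treats as duplicates (this reuses the variables from the code above; 0.8 is just the example value):

# Optional check: show which address pairs exceed the similarity threshold,
# which makes it easier to pick a sensible value for threshold.
for i in range(len(cosine_sim)):
    for j in range(i + 1, len(cosine_sim)):
        if cosine_sim[i, j] > threshold:
            print(df.iloc[i]['address'], '~', df.iloc[j]['address'],
                  round(cosine_sim[i, j], 2))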