Filling missing values with a spatial clustering approach in Python

Question (votes: 0, answers: 1)

Given a DataFrame as follows:

     latitude   longitude         user_service
0  -27.496404  153.014353      02: Duhig Tower
1  -27.497107  153.014836                  NaN
2  -27.497118  153.014890                  NaN
3  -27.497154  153.014813                  NaN
4  -27.496437  153.014477      12: Duhig North
5  -27.497156  153.014813  32: Gordon Greenwod
6  -27.497097  153.014746       23: Abel Smith
7  -27.496390  153.014415  32: Gordon Greenwod
8  -27.497112  153.014780           03: Steele
9  -27.497156  153.014813  32: Gordon Greenwod
10 -27.496487  153.014622      02: Duhig Tower
11 -27.497075  153.014532                  NaN
12 -27.497103  153.014817        25: UQ Sports
13 -27.496754  153.014504      02: Duhig Tower
14 -27.496567  153.014294      02: Duhig Tower
15 -27.497156  153.014813  32: Gordon Greenwod

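Since the user_service column has missing values, I thought a spatial clustering approach might be used to fill the NaNs.

For example, take the latitude/longitude pair -27.497107, 153.014836 in the second row. If the location of 02: Duhig Tower is the closest one to it, I would like to fill the NaN in user_service with 02: Duhig Tower for this record. The same logic applies to the other rows with missing values.

How can I implement this logic in Python? Thanks.

Output of Guillermo Mosse's solution on my data, which still contains some NaNs: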

     latitude   longitude                   user_service
0  -27.499012  153.015180               51: Zelman Cowen
1  -27.497600  153.014479                     03: Steele
2  -27.500054  153.013435         50: Hawken Engineering
3  -27.495979  153.009834                            NaN
4  -27.496748  153.017507            32: Gordon Greenwod
5  -27.495695  153.016178  38: UQ Multi Faith Chaplaincy
6  -27.497015  153.012492               01: Forgan Smith
7  -27.498797  153.017267                            NaN
8  -27.500508  153.011360                       75: AIBN
9  -27.496763  153.013795               01: Forgan Smith
10 -27.494909  153.017187                            NaN
11 -27.496384  153.013810                12: Duhig North

Checking the NaNs:

var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))

Out:

2    NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1
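
One thing worth checking here, offered as a guess rather than a diagnosis: missing values loaded from a file are usually float NaN rather than None, and the two behave differently under comparison. The small sketch below uses a made-up array (the names values, kept_by_none_check and missing are only for illustration) to show that an element-wise comparison against None keeps NaN, while pd.isna flags both.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real label, one float NaN, one None
values = np.array(["02: Duhig Tower", np.nan, None], dtype=object)

# Element-wise comparison against None only excludes the literal None;
# NaN compares as "not equal to None" and is therefore kept
kept_by_none_check = values != None  # noqa: E711

# pd.isna treats both NaN and None as missing
missing = pd.isna(values)

print(kept_by_none_check)
print(missing)
```

If the real data holds NaN, a filter written with a comparison against None would leave those rows among the candidate neighbours, which might explain NaNs surviving the fill.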
python-3.x pandas scikit-learn k-means dbscan
1 Answer

Votes: 1

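Ideally, you would want to use pandas' interpolate method with a custom distance function to fill the NaN values, but there does not seem to be any way to extend that method.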
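
For reference, a custom distance for latitude/longitude pairs could look like the haversine sketch below. This is an illustration only: the function name haversine_m and the Earth-radius constant are assumptions of this sketch, and it is not used by the working example in this answer, which measures plain Euclidean distance on the raw degree values.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius in metres (assumed)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Distance between rows 1 (missing label) and 12 ("25: UQ Sports") of the sample data
d = haversine_m(-27.497107, 153.014836, -27.497103, 153.014817)
print(float(d))
```

Over an area as small as the one in the question, Euclidean distance on degrees and haversine distance will usually rank neighbours similarly, but they are not guaranteed to agree, because a degree of longitude is shorter than a degree of latitude away from the equator.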

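One possible solution is, for each data point, to take the service name of the closest data point that actually has a service name. Here is a complete working example of that approach.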

import pandas as pd
from scipy.spatial.distance import cdist
import numpy as np

df = pd.DataFrame([
  [-27.496404,  153.014353,      "02: Duhig Tower"],
  [-27.497107,  153.014836,                  None],
  [-27.497118,  153.014890,                  None],
  [-27.497154,  153.014813,                  None],
  [-27.496437,  153.014477,      "12: Duhig North"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"],
  [-27.497097,  153.014746,       "23: Abel Smith"],
  [-27.496390,  153.014415,  "32: Gordon Greenwod"],
  [-27.497112,  153.014780,           "03: Steele"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"],
  [-27.496487,  153.014622,      "02: Duhig Tower"],
  [-27.497075,  153.014532,                  None],
  [-27.497103,  153.014817,        "25: UQ Sports"],
  [-27.496754,  153.014504,      "02: Duhig Tower"],
  [-27.496567,  153.014294,      "02: Duhig Tower"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"]],
    columns = ["latitude", "longitude", "user_service"])


def closest_point_service_name(point, points, user_services):
    """ Find closest point with non null user_service """

    # First, keep only the points whose user_service is not missing.
    # pd.notna catches both None and NaN; comparing with None would let NaN through.
    has_service = pd.notna(user_services)
    points = points[has_service]
    user_services = user_services[has_service]

    #we use cdist to get all distances between pairs of points
    distances = cdist([point], points)[0]

    # ignore zero distances so a point never matches itself
    # (this also skips any other point at exactly the same coordinates)
    distances[distances == 0] = np.inf

    #we get the index of the closest point
    closest_point_index = distances.argmin()

    #we return the user_service of the closest point that has a user_service
    closest_point_user_service = user_services[closest_point_index]
    return closest_point_user_service

#we convert the lat and long to a pair
df['point'] = [(x, y) for x,y in zip(df['latitude'], df['longitude'])]

#we create the additional column
df['closest'] = [closest_point_service_name(x, np.asarray(list(df['point'])), np.asarray(list(df['user_service']))) for x in df['point']]

#finally, we fill nulls
df.user_service = df.user_service.fillna(df['closest'])

del df['closest']

df

This is the output:

     latitude   longitude         user_service                     point
0  -27.496404  153.014353      02: Duhig Tower  (-27.496404, 153.014353)
1  -27.497107  153.014836        25: UQ Sports  (-27.497107, 153.014836)
2  -27.497118  153.014890        25: UQ Sports   (-27.497118, 153.01489)
3  -27.497154  153.014813  32: Gordon Greenwod  (-27.497154, 153.014813)
4  -27.496437  153.014477      12: Duhig North  (-27.496437, 153.014477)
5  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
6  -27.497097  153.014746       23: Abel Smith  (-27.497097, 153.014746)
7  -27.496390  153.014415  32: Gordon Greenwod   (-27.49639, 153.014415)
8  -27.497112  153.014780           03: Steele   (-27.497112, 153.01478)
9  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
10 -27.496487  153.014622      02: Duhig Tower  (-27.496487, 153.014622)
11 -27.497075  153.014532       23: Abel Smith  (-27.497075, 153.014532)
12 -27.497103  153.014817        25: UQ Sports  (-27.497103, 153.014817)
13 -27.496754  153.014504      02: Duhig Tower  (-27.496754, 153.014504)
14 -27.496567  153.014294      02: Duhig Tower  (-27.496567, 153.014294)
15 -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
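
The same nearest-labelled-neighbour idea can also be sketched in vectorized form. The version below is an alternative, not the method used above: it swaps in scipy.spatial.KDTree, looks up neighbours only for the rows that are missing a label, and uses pandas' isna so that both None and NaN count as missing. The helper name fill_from_nearest_labelled and the five-row sample are made up for illustration, and distances are still plain Euclidean on the degree values.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

def fill_from_nearest_labelled(df, label_col="user_service"):
    """Return a copy of df whose missing labels come from the nearest labelled row."""
    out = df.copy()
    missing = out[label_col].isna().to_numpy()
    if not missing.any() or missing.all():
        return out  # nothing to fill, or no labelled rows to fill from
    coords = out[["latitude", "longitude"]].to_numpy()
    # Index only the labelled points, then query once for all unlabelled ones
    tree = KDTree(coords[~missing])
    _, nearest = tree.query(coords[missing], k=1)
    labels = out.loc[~missing, label_col].to_numpy()
    out.loc[missing, label_col] = labels[nearest]
    return out

# Hypothetical subset of the sample data, mixing None and NaN as missing markers
sample = pd.DataFrame(
    [
        [-27.497107, 153.014836, None],
        [-27.497156, 153.014813, "32: Gordon Greenwod"],
        [-27.497103, 153.014817, "25: UQ Sports"],
        [-27.497075, 153.014532, np.nan],
        [-27.497097, 153.014746, "23: Abel Smith"],
    ],
    columns=["latitude", "longitude", "user_service"],
)
filled = fill_from_nearest_labelled(sample)
print(filled)
```

Because the tree contains only labelled rows, there is no need to mask out zero distances, so a labelled point at exactly the same coordinates as a missing one is a valid match here.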