Given a DataFrame as follows:
latitude longitude user_service
0 -27.496404 153.014353 02: Duhig Tower
1 -27.497107 153.014836 NaN
2 -27.497118 153.014890 NaN
3 -27.497154 153.014813 NaN
4 -27.496437 153.014477 12: Duhig North
5 -27.497156 153.014813 32: Gordon Greenwod
6 -27.497097 153.014746 23: Abel Smith
7 -27.496390 153.014415 32: Gordon Greenwod
8 -27.497112 153.014780 03: Steele
9 -27.497156 153.014813 32: Gordon Greenwod
10 -27.496487 153.014622 02: Duhig Tower
11 -27.497075 153.014532 NaN
12 -27.497103 153.014817 25: UQ Sports
13 -27.496754 153.014504 02: Duhig Tower
14 -27.496567 153.014294 02: Duhig Tower
15 -27.497156 153.014813 32: Gordon Greenwod
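Since the user_service column has missing values, I thought I could use a spatial clustering approach to fill the NaNs.
For example, take the latitude and longitude pair -27.497107, 153.014836 in the second row: if the location of 02: Duhig Tower is the closest one to it, I would like to fill the NaN in user_service for this row with 02: Duhig Tower. The same logic applies to the other rows with missing values.
How can I implement this logic in Python? Thanks.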
Output of Guillermo Mosse's solution, which still contains some NaNs:
latitude longitude user_service
0 -27.499012 153.015180 51: Zelman Cowen
1 -27.497600 153.014479 03: Steele
2 -27.500054 153.013435 50: Hawken Engineering
3 -27.495979 153.009834 NaN
4 -27.496748 153.017507 32: Gordon Greenwod
5 -27.495695 153.016178 38: UQ Multi Faith Chaplaincy
6 -27.497015 153.012492 01: Forgan Smith
7 -27.498797 153.017267 NaN
8 -27.500508 153.011360 75: AIBN
9 -27.496763 153.013795 01: Forgan Smith
10 -27.494909 153.017187 NaN
11 -27.496384 153.013810 12: Duhig North
Checking the NaNs:
var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))
Out:
2 NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1
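The check above shows that the missing entry is a float NaN stored in an object column, not Python's None. A minimal sketch of why that distinction can matter when filtering for labelled rows: a test against None treats NaN as a present value, while pd.notna recognises both. The array `values` is a made-up example, not data from the question, and this is only one plausible reason for NaNs surviving a fill step.

```python
import numpy as np
import pandas as pd

# Hypothetical object array mixing a label, a float NaN and None.
values = np.array(["02: Duhig Tower", np.nan, None], dtype=object)

# A test against None only catches None; the float NaN passes as "present".
not_none = np.array([v is not None for v in values])

# pd.notna treats both None and NaN as missing.
not_missing = pd.notna(values)

print(not_none)
print(not_missing)
```

If the column is read from a file, missing entries usually arrive as NaN, so a pd.isna / pd.notna based mask is the safer choice.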
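Ideally, you would use pandas' interpolate method with a custom distance function to fill the NaN values, but that method does not seem to offer any way to be extended like this.
One possible solution is, for each data point, to take the service name of the closest data point that actually has a service name. Here is a complete working example of that approach.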
import pandas as pd
from scipy.spatial.distance import cdist
import numpy as np
df = pd.DataFrame([
    [-27.496404, 153.014353, "02: Duhig Tower"],
    [-27.497107, 153.014836, None],
    [-27.497118, 153.014890, None],
    [-27.497154, 153.014813, None],
    [-27.496437, 153.014477, "12: Duhig North"],
    [-27.497156, 153.014813, "32: Gordon Greenwod"],
    [-27.497097, 153.014746, "23: Abel Smith"],
    [-27.496390, 153.014415, "32: Gordon Greenwod"],
    [-27.497112, 153.014780, "03: Steele"],
    [-27.497156, 153.014813, "32: Gordon Greenwod"],
    [-27.496487, 153.014622, "02: Duhig Tower"],
    [-27.497075, 153.014532, None],
    [-27.497103, 153.014817, "25: UQ Sports"],
    [-27.496754, 153.014504, "02: Duhig Tower"],
    [-27.496567, 153.014294, "02: Duhig Tower"],
    [-27.497156, 153.014813, "32: Gordon Greenwod"]],
    columns=["latitude", "longitude", "user_service"])
def closest_point_service_name(point, points, user_services):
    """Return the user_service of the closest point that has a non-null user_service."""
    # Keep only the points that have a user_service. pd.notna treats both
    # None and NaN as missing, whereas "!= None" would let NaN through.
    has_service = pd.notna(user_services)
    points = points[has_service]
    user_services = user_services[has_service]
    # cdist returns all pairwise distances; row 0 holds the distances from `point`
    distances = cdist([point], points)[0]
    # we don't want to consider the current point (or any point at the same coordinates)
    distances[distances == 0] = np.inf
    # we get the index of the closest point
    closest_point_index = distances.argmin()
    # we return the user_service of the closest point that has a user_service
    return user_services[closest_point_index]

# we convert the lat and long to a pair
df['point'] = list(zip(df['latitude'], df['longitude']))
# we build the arrays once; to_numpy() keeps missing values as None/NaN
# instead of turning NaN into the string 'nan'
points = np.asarray(list(df['point']))
services = df['user_service'].to_numpy()
# we create the additional column
df['closest'] = [closest_point_service_name(p, points, services) for p in df['point']]
# finally, we fill nulls
df['user_service'] = df['user_service'].fillna(df['closest'])
del df['closest']
df
This is the output:
latitude longitude user_service point
0 -27.496404 153.014353 02: Duhig Tower (-27.496404, 153.014353)
1 -27.497107 153.014836 25: UQ Sports (-27.497107, 153.014836)
2 -27.497118 153.014890 25: UQ Sports (-27.497118, 153.01489)
3 -27.497154 153.014813 32: Gordon Greenwod (-27.497154, 153.014813)
4 -27.496437 153.014477 12: Duhig North (-27.496437, 153.014477)
5 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)
6 -27.497097 153.014746 23: Abel Smith (-27.497097, 153.014746)
7 -27.496390 153.014415 32: Gordon Greenwod (-27.49639, 153.014415)
8 -27.497112 153.014780 03: Steele (-27.497112, 153.01478)
9 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)
10 -27.496487 153.014622 02: Duhig Tower (-27.496487, 153.014622)
11 -27.497075 153.014532 23: Abel Smith (-27.497075, 153.014532)
12 -27.497103 153.014817 25: UQ Sports (-27.497103, 153.014817)
13 -27.496754 153.014504 02: Duhig Tower (-27.496754, 153.014504)
14 -27.496567 153.014294 02: Duhig Tower (-27.496567, 153.014294)
15 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)
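For larger frames, computing the distances from every row to every labelled row can get slow. As an alternative sketch (not part of the solution above), the lookup can be limited to the rows that are actually missing a label and delegated to scipy.spatial.cKDTree. The function name fill_with_nearest_label and the small `sample` frame are hypothetical, and the distance is plain Euclidean distance on the raw degree values, as with the default cdist metric.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_with_nearest_label(df, label_col="user_service"):
    """Fill missing labels with the label of the nearest labelled row."""
    out = df.copy()
    missing = out[label_col].isna().to_numpy()
    if missing.all() or not missing.any():
        return out  # nothing to copy from, or nothing to fill
    coords = out[["latitude", "longitude"]].to_numpy()
    labels = out[label_col].to_numpy()
    # Index only the rows that already carry a label.
    tree = cKDTree(coords[~missing])
    # For each unlabelled row, find its single nearest labelled neighbour.
    _, nearest = tree.query(coords[missing], k=1)
    out.loc[missing, label_col] = labels[~missing][nearest]
    return out


# Hypothetical subset of the rows above, mixing NaN and None as missing markers.
sample = pd.DataFrame(
    [
        [-27.496404, 153.014353, "02: Duhig Tower"],
        [-27.497107, 153.014836, np.nan],
        [-27.497103, 153.014817, "25: UQ Sports"],
        [-27.497075, 153.014532, None],
        [-27.497097, 153.014746, "23: Abel Smith"],
    ],
    columns=["latitude", "longitude", "user_service"],
)
filled = fill_with_nearest_label(sample)
print(filled)
```

Unlike the version above, this sketch does not skip neighbours at distance zero, so an unlabelled row that shares its exact coordinates with a labelled row takes that row's label.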