Find the nearest row in Pandas by latitude, longitude, and iteration


I am trying to build a new dataframe (df3) by comparing two dataframes, df1 and df2. My df1 looks like this:

   num    step  latitude  longitude        time  height  valid_time   windspeed
0    1  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23   1.2482048
1    1  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23  0.34045473
2    1  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23  0.63618374
3    2  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23  0.79829866
4    2  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23   0.7331676
5    2  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23   1.3981003
6    3  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23   1.0158184
7    3  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23   1.1108123
8    3  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23   3.4528110

My matching (reference) dataframe, df2, looks like this:

        site  latitude  longitude
0  Stevenson     45.69    -121.89
1  Rainier       45.00    -115.00

I am trying to write a Python script that filters df1 by each "latitude"/"longitude" pair in df2, for every "num" and "valid_time". So df3 should look like this:

I have tried many things. Here is my latest attempt and the error it produces:

from scipy.spatial.distance import cdist
df1 = pf_new
df2 = df_sites

# Function to calculate Haversine distance between two sets of coordinates
def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Radius of Earth in kilometers
    return c * r

    
distances = cdist(df1[['latitude','longitude']],df2[['latitude','longitude']],haversine)
closest_row_indices = np.argmin(distances, axis=0)
    

df3 = df1.iloc[closest_row_indices].reset_index(drop=True)
print(df3)

It fails with the error below. Thanks for any ideas on how to proceed!

Traceback (most recent call last):

  Cell In[22], line 16
    distances = cdist(df1[['latitude','longitude']],df2[['latitude','longitude']],haversine)

  File ~\Anaconda3\envs\Stats\lib\site-packages\scipy\spatial\distance.py:2933 in cdist
    return _cdist_callable(XA, XB, metric=metric, out=out, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\scipy\spatial\distance.py:2604 in _cdist_callable
    dm[i, j] = metric(XA[i], XB[j], **kwargs)

TypeError: haversine() missing 2 required positional arguments: 'lat2' and 'lon2'
pandas scikit-learn matching nearest-neighbor
1 Answer

The error is quite explicit:

TypeError: haversine() missing 2 required positional arguments: 'lat2' and 'lon2'

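lat2 and lon2 are not set when cdist calls the haversine function, so the number of arguments is insufficient. cdist passes only two arguments to the metric, one row from each input: lat1 receives a whole latitude/longitude pair from df1 and lon1 receives a whole latitude/longitude pair from df2.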
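To see this calling convention directly, here is a minimal sketch that records what cdist hands to a callable metric. The names `spy_metric` and `received` are made up for the illustration and are not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import cdist

received = []  # hypothetical helper list, used only to inspect the arguments

def spy_metric(u, v):
    # cdist calls the metric with exactly two positional arguments:
    # one row of the first input and one row of the second input.
    received.append((u.copy(), v.copy()))
    return 0.0

XA = np.array([[46.0, -122.0]])    # one (latitude, longitude) row, as in df1
XB = np.array([[45.69, -121.89]])  # one (latitude, longitude) row, as in df2
cdist(XA, XB, spy_metric)

u, v = received[0]
print(u)  # the whole df1 row: latitude and longitude together
print(v)  # the whole df2 row
```

Each argument is a 1-D array holding both coordinates, which is why a four-parameter haversine(lat1, lon1, lat2, lon2) ends up two arguments short.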

Use the code below to fix it:

def haversine(coord1, coord2):
    (lat1, lon1), (lat2, lon2) = coord1, coord2
    # the rest of your code

distances = ...
closest_row_indices = np.argmin(distances, axis=1)  # modify the axis here

# Get site name by position (and not label): iloc vs loc
df1['site'] = df2.iloc[closest_row_indices, df2.columns.get_loc('site')].values

Output:

>>> df1
   num    step  latitude  longitude        time  height  valid_time  windspeed       site
0    1  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23   1.248205  Stevenson
1    1  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23   0.340455  Stevenson
2    1  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23   0.636184  Stevenson
3    2  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23   0.798299  Stevenson
4    2  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23   0.733168  Stevenson
5    2  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23   1.398100  Stevenson
6    3  0 days      46.0     -122.0  2023-08-23    10.0  2023-08-23   1.015818  Stevenson
7    3  0 days      45.5     -121.5  2023-08-23    10.0  2023-08-23   1.110812  Stevenson
8    3  0 days      45.0     -121.0  2023-08-23    10.0  2023-08-23   3.452811  Stevenson

Details:

>>> distances
array([[ 35.50785241, 556.58217101],
       [ 36.97464075, 511.72360862],
       [103.55808238, 471.6522885 ],
       [ 35.50785241, 556.58217101],
       [ 36.97464075, 511.72360862],
       [103.55808238, 471.6522885 ],
       [ 35.50785241, 556.58217101],
       [ 36.97464075, 511.72360862],
       [103.55808238, 471.6522885 ]])
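For reference, here is a self-contained sketch that puts the pieces together on the sample data from the question. The last part is only one possible reading of the requested df3 (the nearest grid row for each site within every num/valid_time group), since the expected output is not shown in the question. The names `labelled`, `parts`, and `distance_km` are my own.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question (only the columns needed here)
df1 = pd.DataFrame({
    "num": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "latitude": [46.0, 45.5, 45.0] * 3,
    "longitude": [-122.0, -121.5, -121.0] * 3,
    "valid_time": pd.to_datetime(["2023-08-23"] * 9),
    "windspeed": [1.2482048, 0.34045473, 0.63618374, 0.79829866, 0.7331676,
                  1.3981003, 1.0158184, 1.1108123, 3.4528110],
})
df2 = pd.DataFrame({
    "site": ["Stevenson", "Rainier"],
    "latitude": [45.69, 45.00],
    "longitude": [-121.89, -115.00],
})

def haversine(coord1, coord2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [*coord1, *coord2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # 6371 km: Earth radius

cols = ["latitude", "longitude"]
distances = cdist(df1[cols], df2[cols], haversine)  # shape (9, 2)

# axis=1: for each df1 row, the position of its nearest site in df2
labelled = df1.assign(site=df2["site"].to_numpy()[np.argmin(distances, axis=1)])

# Assumed reading of df3: for every site, keep the nearest grid row
# within each (num, valid_time) group.
parts = []
for j, site in enumerate(df2["site"]):
    d = pd.Series(distances[:, j], index=df1.index)
    nearest = d.groupby([df1["num"], df1["valid_time"]]).idxmin().to_numpy()
    parts.append(df1.loc[nearest].assign(site=site, distance_km=d.loc[nearest].to_numpy()))
df3 = pd.concat(parts, ignore_index=True)
print(df3)
```

With this reading, df3 has one row per site and per num/valid_time combination, so six rows for the sample data.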

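By the way, using the haversine distance to account for the Earth's curvature is correct, but it mostly pays off over long distances. If you use the Euclidean distance (the default for cdist), the drift is 8 cm per kilometer, so for 35 km the error is less than 3 meters (2.88 m). Unless your sites are really close to each other, I don't think using the haversine distance is very useful (IMHO).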
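A quick way to check whether the choice of metric matters for a given dataset is to compare the nearest-site assignment under both metrics. The sketch below does that for the three grid points and two sites from the question. It only compares which site is picked, not the drift figure itself, and the Euclidean distances here are in degrees, not kilometers.

```python
import numpy as np
from scipy.spatial.distance import cdist

grid = np.array([[46.0, -122.0], [45.5, -121.5], [45.0, -121.0]])  # df1 points
sites = np.array([[45.69, -121.89], [45.00, -115.00]])             # df2 sites

def haversine(coord1, coord2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [*coord1, *coord2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

euclid_deg = cdist(grid, sites)                  # default metric: Euclidean, in degrees
great_circle_km = cdist(grid, sites, haversine)  # haversine, in kilometers

# Does each grid point get the same nearest site under both metrics?
same_choice = np.array_equal(euclid_deg.argmin(axis=1),
                             great_circle_km.argmin(axis=1))
print(same_choice)
```

For this sample both metrics pick the same site for every point. That is specific to this data and is worth re-checking when candidate sites lie close together.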

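One more thing: there is a more efficient way to find the nearest site. You can use spatial partitioning. I have already answered questions like this: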

BallTree (scikit-learn)
KDTree (scipy)