I am trying to build a new dataframe (df3) by comparing two dataframes, df1 and df2. My df1 looks like this:
num step latitude longitude time height valid_time windspeed
0 1 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 1.2482048
1 1 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 0.34045473
2 1 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 0.63618374
3 2 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 0.79829866
4 2 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 0.7331676
5 2 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 1.3981003
6 3 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 1.0158184
7 3 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 1.1108123
8 3 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 3.4528110
My matching (reference) dataframe, df2, looks like this:
site latitude longitude
0 Stevenson 45.69 -121.89
1 Rainier 45.00 -115.00
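I am trying to write a Python script that filters df1 on each pair of "latitude" and "longitude" in df2, for every "num" and "valid_time". So "df3" should look like this:
I have tried many things. Here is my latest attempt and the error it produces: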
from scipy.spatial.distance import cdist

df1 = pf_new
df2 = df_sites

# Function to calculate Haversine distance between two sets of coordinates
def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of Earth in kilometers
    return c * r

distances = cdist(df1[['latitude','longitude']],df2[['latitude','longitude']],haversine)
closest_row_indices = np.argmin(distances, axis=0)
df3 = df1.iloc[closest_row_indices].reset_index(drop=True)
print(df3)
I get the following error. Thanks for any ideas on how to proceed!
Traceback (most recent call last):

  Cell In[22], line 16
    distances = cdist(df1[['latitude','longitude']],df2[['latitude','longitude']],haversine)

  File ~\Anaconda3\envs\Stats\lib\site-packages\scipy\spatial\distance.py:2933 in cdist
    return _cdist_callable(XA, XB, metric=metric, out=out, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\scipy\spatial\distance.py:2604 in _cdist_callable
    dm[i, j] = metric(XA[i], XB[j], **kwargs)

TypeError: haversine() missing 2 required positional arguments: 'lat2' and 'lon2'
The error is pretty explicit:
TypeError: haversine() missing 2 required positional arguments: 'lat2' and 'lon2'
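lat2 and lon2 are not set when cdist calls the haversine function, so it receives too few arguments. cdist calls the metric with exactly two arguments, one row from each input (see metric(XA[i], XB[j]) in the traceback), so lat1 ends up holding the (latitude, longitude) pair from df1 and lon1 holds the pair from df2.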
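To see that calling convention directly, here is a small sketch. The probe function and the sample arrays are made up for illustration; the only thing it relies on is that cdist accepts a callable as its metric, as in the question.

```python
import numpy as np
from scipy.spatial.distance import cdist

seen = []  # shapes of the arguments cdist hands to the metric

def probe(u, v):
    # Hypothetical metric used only to inspect its inputs.
    seen.append((u.shape, v.shape))
    return 0.0

XA = np.array([[46.0, -122.0], [45.5, -121.5]])  # two (lat, lon) rows
XB = np.array([[45.69, -121.89]])                # one (lat, lon) row

result = cdist(XA, XB, probe)
print(seen)          # each call gets two 1-D arrays of length 2
print(result.shape)  # one distance per (row of XA, row of XB) pair
```

Each call receives two whole rows, which is why a metric with four scalar parameters cannot work here.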
Use the code below to fix it:
def haversine(coord1, coord2):
    (lat1, lon1), (lat2, lon2) = coord1, coord2
    # the rest of your code
distances = ...
closest_row_indices = np.argmin(distances, axis=1) # modify the axis here
# Get site name by position (and not label): iloc vs loc
df1['site'] = df2.iloc[closest_row_indices, df2.columns.get_loc('site')].values
Output:
>>> df1
num step latitude longitude time height valid_time windspeed site
0 1 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 1.248205 Stevenson
1 1 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 0.340455 Stevenson
2 1 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 0.636184 Stevenson
3 2 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 0.798299 Stevenson
4 2 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 0.733168 Stevenson
5 2 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 1.398100 Stevenson
6 3 0 days 46.0 -122.0 2023-08-23 10.0 2023-08-23 1.015818 Stevenson
7 3 0 days 45.5 -121.5 2023-08-23 10.0 2023-08-23 1.110812 Stevenson
8 3 0 days 45.0 -121.0 2023-08-23 10.0 2023-08-23 3.452811 Stevenson
Details:
>>> distances
array([[ 35.50785241, 556.58217101],
[ 36.97464075, 511.72360862],
[103.55808238, 471.6522885 ],
[ 35.50785241, 556.58217101],
[ 36.97464075, 511.72360862],
[103.55808238, 471.6522885 ],
[ 35.50785241, 556.58217101],
[ 36.97464075, 511.72360862],
[103.55808238, 471.6522885 ]])
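For reference, here is a self-contained sketch that puts the pieces together. The dataframes are rebuilt by hand from the tables in the question (only the columns needed for the lookup), and the cols and closest names are mine, so adapt them to your real pf_new and df_sites.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question (subset of columns).
df1 = pd.DataFrame({
    "num": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "latitude": [46.0, 45.5, 45.0] * 3,
    "longitude": [-122.0, -121.5, -121.0] * 3,
    "windspeed": [1.2482048, 0.34045473, 0.63618374,
                  0.79829866, 0.7331676, 1.3981003,
                  1.0158184, 1.1108123, 3.4528110],
})
df2 = pd.DataFrame({
    "site": ["Stevenson", "Rainier"],
    "latitude": [45.69, 45.00],
    "longitude": [-121.89, -115.00],
})

def haversine(coord1, coord2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [*coord1, *coord2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

cols = ["latitude", "longitude"]
# Rows of the result follow df1, columns follow df2.
distances = cdist(df1[cols].to_numpy(), df2[cols].to_numpy(), haversine)
# axis=1: for each df1 row, the position of the nearest site in df2.
closest = np.argmin(distances, axis=1)
df1["site"] = df2["site"].to_numpy()[closest]
print(df1)
```

With axis=0 you would instead get, for each site, the position of the nearest df1 row, which is the other direction of the lookup.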
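BTW, using the haversine distance to account for the Earth's curvature is correct, but it mainly pays off over long distances. If you use the Euclidean distance (the default for cdist), the drift is 8 cm per km, so for 35 km the error is less than 3 meters (2.88 m). Unless your sites are really close to one another, I don't think using the haversine distance is very useful (IMHO).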
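If you want to check how much a flat approximation costs on your own data, here is a rough sketch comparing haversine with a simple planar formula. Note the assumption: the planar version scales longitude by the cosine of the mean latitude and converts to kilometres, which is my choice for illustration; cdist's default Euclidean metric applied to raw degrees returns a value in degrees, not kilometres. The function names are hypothetical.

```python
import math

R = 6371.0  # mean Earth radius in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def planar_km(lat1, lon1, lat2, lon2):
    """Flat approximation: longitude scaled by cos(mean latitude)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return R * math.hypot(x, y)

grid_point = (46.0, -122.0)  # first df1 row in the question
site = (45.69, -121.89)      # Stevenson

exact = haversine_km(*grid_point, *site)
approx = planar_km(*grid_point, *site)
print(f"haversine: {exact:.3f} km, planar: {approx:.3f} km, "
      f"difference: {abs(exact - approx) * 1000:.2f} m")
```

Running it on a few representative pairs should tell you whether the difference could ever change which site comes out nearest.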
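One more thing: there is a more efficient way to find the nearest site, which is to use spatial partitioning. I have already answered questions like this with:
BallTree (scikit-learn)
KDTree (scipy)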
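As a rough idea of the KDTree route, here is a minimal sketch, not the code from those answers. It assumes a spherical Earth and converts latitude/longitude to 3-D points on the unit sphere, so that the nearest neighbour in straight-line (chord) distance is also the nearest along the great circle. The to_unit_xyz helper and the variable names are hypothetical.

```python
import numpy as np
from scipy.spatial import KDTree

def to_unit_xyz(lat_deg, lon_deg):
    """Convert degrees of latitude/longitude to points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

# Sites (df2) and the three distinct grid points (df1) from the question.
site_names = np.array(["Stevenson", "Rainier"])
site_xyz = to_unit_xyz([45.69, 45.00], [-121.89, -115.00])
grid_xyz = to_unit_xyz([46.0, 45.5, 45.0], [-122.0, -121.5, -121.0])

tree = KDTree(site_xyz)                  # build once on the sites
chord, idx = tree.query(grid_xyz, k=1)   # nearest site for every grid point

# Convert chord length on the unit sphere back to great-circle km.
dist_km = 2 * 6371 * np.arcsin(chord / 2)
nearest_site = site_names[idx]
print(nearest_site, dist_km)
```

The tree is built once and each query is then much cheaper than scanning every site, which matters once df1 or df2 gets large.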