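I have a series of city names in a pandas DataFrame. I need to look up the address for each city and store it in a separate column of the same DataFrame. The city column also contains NaN values. I can get the address for a single location/city name on its own, but it does not work on the pandas DataFrame.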
data = [['madurai',10],['NaN',12],['hosur',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
from geopy.geocoders import Nominatim
geolocator = Nominatim()
for i in df.Name:
    if i == "NaN":
        continue
    loc = geolocator.geocode(i)
    address = loc.address
print(address)
It runs on the DataFrame, but returns only the last address instead of the addresses for all 3 cities. If we change the order as below,
data = [['Nan',10],['Madurai',12],['hosur',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
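I get the error: GeocoderTimedOut: Service timed out
Questions: 1. I want the results (addresses) saved in a second column. 2. How should the NaN values be handled?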
You can add a column containing the addresses like this:
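import pandas as pd
from geopy.geocoders import Nominatim

data = [['madurai',10],['NaN',12],['hosur',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
geolocator = Nominatim(user_agent='test')  # recent geopy versions require a user_agent
for i in df.Name:
    if i == "NaN":
        continue  # skipped rows are left as NaN in the new column
    loc = geolocator.geocode(i)  # returns None when nothing is found
    # Store the address string in every row that matches this name
    df.loc[df.Name == i, 'Address'] = loc.address if loc else None
print(df)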
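The loop above compares against the string "NaN". If the column holds real missing values (`float('nan')` or `None`), that comparison never matches. A minimal sketch of one way to handle both cases, assuming a `geocode` callable that returns an address string or `None`; `add_addresses` and `fake_geocode` are hypothetical names used only for this illustration, and the stand-in geocoder lets the example run without network access:

```python
import pandas as pd

def add_addresses(df, geocode, source="Name", target="Address"):
    """Return a copy of df with an address column; missing names stay missing."""
    out = df.copy()
    names = out[source]
    # Treat real NaN/None and the literal strings "NaN"/"Nan" as missing.
    missing = names.isna() | names.astype(str).str.lower().eq("nan")
    # Look up each distinct city once, then map the results back onto the rows.
    lookup = {name: geocode(name) for name in names[~missing].unique()}
    out[target] = names.where(~missing).map(lookup)
    return out

# Stand-in for a real geocoder so the sketch runs offline (hypothetical data).
def fake_geocode(name):
    return {"madurai": "Madurai, Tamil Nadu, India"}.get(name.lower())

df = pd.DataFrame({"Name": ["madurai", float("nan"), "NaN", "hosur"],
                   "Age": [10, 12, 11, 13]})
result = add_addresses(df, fake_geocode)
print(result)
```

With geopy, the callable could be a small wrapper that calls `geolocator.geocode(name)` and returns `loc.address` when a result is found. Looking up each distinct name once also reduces the number of requests sent to the service.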
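You only get the last value because you keep replacing loc on every iteration of the loop. The GeocoderTimedOut: Service timed out error occurs because you are sending many requests to the server. You should include a sleep between requests. If you still get this error, take a look at: Link - Avoid time out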
Try:
import pandas as pd
from geopy.geocoders import Nominatim
import time

data = [['madurai',10],['NaN',12],['hosur',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
geolocator = Nominatim(user_agent='test')
address = []
for i in df.Name:
    if i == "NaN":
        address.append('NaN')  # placeholder keeps the list aligned with the rows
        continue
    time.sleep(3)  # pause before each request to avoid timeouts
    loc = geolocator.geocode(i)  # returns None when nothing is found
    address.append(loc.address if loc else None)
df['address'] = address
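If timeouts still occur with a delay in place, retrying the failed call is a common workaround. The following is a rough sketch, not geopy's own mechanism: `geocode_with_retry` is a hypothetical helper, and its `retry_on`, `attempts`, and `delay` parameters are assumptions made for this example. It takes the exception types to retry on as an argument so it can be tried without a network connection.

```python
import time

def geocode_with_retry(geocode, query, retry_on=(TimeoutError,), attempts=3, delay=1.0):
    """Call geocode(query), retrying on the given exceptions with a growing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return geocode(query)
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer after each failure
```

With geopy, this could be called as `geocode_with_retry(geolocator.geocode, i, retry_on=(GeocoderTimedOut,))` after `from geopy.exc import GeocoderTimedOut`.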
I introduced a time delay between requests as shown below, along with a few lines to display a progress bar.
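from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm

geolocator = Nominatim(user_agent='test')  # recent geopy versions require a user_agent
# Wait at least 1 second between calls
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# 'final' is my DataFrame; its 'city' column holds the place names
final['Geolocation'] = final['city'].apply(geocode)
# Same call with a progress bar (use this instead of the line above)
tqdm.pandas()
final['Geolocation'] = final['city'].progress_apply(geocode)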
It works now.