How can I speed up pandas apply when creating a new column in a DataFrame?

Question

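In my pandas DataFrame, I have a column containing user locations. I wrote a function that identifies the country from a location, and I want to create a new column with the country name. The function is: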

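import numpy as np
import pycountry
from geopy.geocoders import Nominatim

# geopy 2.x requires a custom user_agent; the value here is only a placeholder.
geolocator = Nominatim(user_agent="my-application")

def do_fuzzy_search(location):
    # Missing values arrive as float NaN; pass them through unchanged.
    if isinstance(location, float) and np.isnan(location):
        return np.nan
    try:
        # Local lookup against the pycountry database.
        result = pycountry.countries.search_fuzzy(location)
    except Exception:
        try:
            # Fall back to the Nominatim web service (one network request per call).
            loc = geolocator.geocode(str(location))
            return loc.raw['display_name'].split(', ')[-1]
        except Exception:
            return np.nan
    else:
        return result[0].name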

Given any location name, the function returns the name of the country. For example:

do_fuzzy_search("Bombay") returns 'India'.

I simply want to create a new column using the apply function.

df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)

But this takes forever. I have tried several techniques mentioned in other Stack Overflow questions and in blog posts on the same topic, such as Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, Speed up pandas using dask or swift, and Speed up pandas using cudf.

The time taken to process just the first 10 rows of the column with each technique is as follows:

%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using the Swifter library:

%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using np.vectorize:

%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
       'Italia', 'Australia', 'India', 'India', 'United Kingdom',
       'Singapore'], dtype='<U24')

I also tried Dask's map_partitions, which gave no meaningful performance gain over the apply function.

import dask.dataframe as dd
import multiprocessing

dd.from_pandas(attractions.User_loc, npartitions=4*multiprocessing.cpu_count())\
   .map_partitions(lambda df: df.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)).compute(scheduler='processes')

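Every technique takes more than 5 seconds to process 10 rows. At this rate, 100,000 rows would take forever. I also tried cudf, but that crashed my Colab notebook.

How can I improve the performance and get the result in a reasonable time?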

Answer 1

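In most cases, .apply() is slow because it calls a trivially parallelizable function once for each row of the DataFrame, but in your case you are calling an external API. Network access and API rate limiting are therefore likely to be the main factors determining the run time. Unfortunately, that means there is little you can do other than wait.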
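One way to check whether the time is going into waiting rather than computation is to compare CPU time with wall-clock time, as the %%time output in the question already hints (about 28 ms of CPU against 6.59 s of wall time). The sketch below uses only the standard library; simulated_lookup is a hypothetical stand-in for a geocoding request and the 50 ms delay is an assumed latency, not a measured one.

```python
import time

def simulated_lookup(location):
    """Hypothetical stand-in for a geocoding request: it waits, then returns."""
    time.sleep(0.05)  # assumed network latency, for illustration only
    return location.upper()

def measure(func, items):
    """Return (cpu_seconds, wall_seconds) spent calling func on every item."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    for item in items:
        func(item)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall

cpu_seconds, wall_seconds = measure(simulated_lookup, ["bombay", "sydney", "london"] * 2)
print(f"CPU: {cpu_seconds:.3f} s, wall: {wall_seconds:.3f} s")
```

When wall time is far larger than CPU time, as here, spreading the work over more CPU cores is unlikely to help much, which is consistent with Swifter, np.vectorize and Dask showing no improvement in the question.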

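If some elements repeat frequently, you might benefit from decorating do_fuzzy_search with functools.lru_cache, since the function then skips the API call whenever the location is already in the cache.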
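A minimal sketch of the caching idea, using only the standard library. Here slow_country_lookup is a hypothetical stand-in for do_fuzzy_search (a real version would query pycountry and Nominatim), and the small dictionary of known places is made up for illustration. A counter records how often the expensive function actually runs.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive lookup really runs

def slow_country_lookup(location):
    """Hypothetical stand-in for do_fuzzy_search; no network access here."""
    global call_count
    call_count += 1
    known = {"Bombay": "India", "Delhi": "India", "Sydney": "Australia"}
    return known.get(location)

@lru_cache(maxsize=None)  # unbounded cache keyed on the location argument
def cached_country_lookup(location):
    return slow_country_lookup(location)

locations = ["Bombay", "Sydney", "Bombay", "Delhi", "Sydney", "Bombay"]
countries = [cached_country_lookup(loc) for loc in locations]

print(countries)
print(call_count)  # one real lookup per distinct location
print(cached_country_lookup.cache_info())
```

With the DataFrame from the question, the analogous step would be to put @lru_cache on do_fuzzy_search itself and keep the existing apply call. The cache only helps for values that repeat, and the arguments must be hashable, which strings and NaN are.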
