How can I speed up a pandas apply function that uses a for loop? (Python)

Question  Votes: 0  Answers: 2

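I am processing text for my project. I want to replace words in my dataset's text by looking up every word in a dictionary. My dictionary looks like this: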

replacement_dict ={'t1' : 'tebir', 't2':'teki', 'number':'no', ...} 

An example text in the data frame:

"Hello my t1 is not okey, please help my number is bla bla" 
becomes:
"Hello my tebir is not okey, please help my no is bla bla" 

I wrote the following code:

import re
import pandas

def replacament(row, replacement_dict):
    # Lowercase the text, then replace each dictionary key as a whole word
    text = row['text'].lower()
    for i, j in replacement_dict.items():
        text = re.sub(r"\b%s\b" % i, j, text)
    return text

# Row-wise apply (axis=1) calls the function once per row
data['text2'] = data.apply(replacament, axis=1, args=(replacement_dict,))

But it takes 8 hours to finish. My dataset has 600,000 rows. How can I speed up this apply function? Thanks,

python replace apply
2 Answers
0 votes

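apply can be quite slow, but pandas has a replace function. To use it so that only whole words are replaced (assuming you want to replace words and not substrings inside words), first build another dictionary whose keys are regular expressions: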

d = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in replacement_dict.items()}
df['text2'] = df['text'].replace(d, regex = True)
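
A different technique that is often faster than looping over the dictionary is to compile all keys into one alternation and look up each match in the dictionary, so every text is scanned once rather than once per key. The following is only a minimal sketch, not the answerer's method. It assumes the keys are lowercase plain words that start and end with word characters, and the name replace_words is made up for illustration. Unlike sequential replacement, a single pass never re-replaces text produced by an earlier replacement.

```python
import re

# Example mapping taken from the question; a real dictionary would be larger.
replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

# One alternation over all keys. Longer keys come first so that a longer key
# is tried before a shorter key that is its prefix. re.escape guards against
# keys containing regex metacharacters.
keys = sorted(replacement_dict, key=len, reverse=True)
pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(k) for k in keys))

def replace_words(text):
    """Lowercase the text and replace every whole-word key in a single pass."""
    return pattern.sub(lambda m: replacement_dict[m.group(0)], text.lower())

sample = "Hello my t1 is not okey, please help my number is bla bla"
result = replace_words(sample)
print(result)
```

With a data frame like the one in the question, this could be applied as data['text2'] = data['text'].map(replace_words), which also avoids the row-wise apply with axis=1. Whether it is fast enough for 600,000 rows depends on the size of the dictionary and the length of the texts, so it is worth timing on a sample first.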

0 votes

Maybe swifter (https://github.com/jmcarpenter2/swifter) will help you.

I modified your input and created 600K rows for testing. It takes 1.43 seconds.

Code:

import re
import swifter
import pandas as pd
import time

data = pd.DataFrame({'text': ["t2 my t1 is not okey, number please help my  is bla bla"] * 600000})
replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

start_time = time.time()
def replace_text(text):
    for k, v in replacement_dict.items():
        text = text.replace(k, v)
    return text

data['text2_parallel'] = data['text'].swifter.apply(replace_text)
parallel_time = time.time() - start_time
print(f"Parallelized time: {parallel_time:.2f} seconds")
print(data[['text2_parallel']])

Output:

Pandas Apply: 100%|████████████████████████████████████████████████████████| 600000/600000 [00:01<00:00, 458107.50it/s]
Parallelized time: 1.43 seconds
                                           text2_parallel
0       teki my tebir is not okey, no please help my  ...
1       teki my tebir is not okey, no please help my  ...
2       teki my tebir is not okey, no please help my  ...
3       teki my tebir is not okey, no please help my  ...
4       teki my tebir is not okey, no please help my  ...
...                                                   ...
599995  teki my tebir is not okey, no please help my  ...
599996  teki my tebir is not okey, no please help my  ...
599997  teki my tebir is not okey, no please help my  ...
599998  teki my tebir is not okey, no please help my  ...
599999  teki my tebir is not okey, no please help my  ...

[600000 rows x 1 columns]
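
One caveat before adopting this: str.replace matches substrings, while the regex in the question uses \b to match whole words only, so the two approaches can give different results. The benchmark also uses 600K identical rows and a three-key dictionary, so the 1.43 seconds may not carry over to the real data. The sketch below compares the two behaviours on a made-up sample sentence. The function names are hypothetical and chosen for illustration.

```python
import re

replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

def replace_substrings(text):
    # Plain str.replace: also rewrites keys that occur inside longer words.
    for k, v in replacement_dict.items():
        text = text.replace(k, v)
    return text

def replace_whole_words(text):
    # Regex with word boundaries: only rewrites keys that stand alone as words.
    for k, v in replacement_dict.items():
        text = re.sub(r"\b%s\b" % re.escape(k), v, text)
    return text

sample = "my numbers and my number"  # made-up input for illustration
substring_result = replace_substrings(sample)
whole_word_result = replace_whole_words(sample)
print(substring_result)
print(whole_word_result)
```

Here the substring version turns "numbers" into "nos", while the whole-word version leaves it unchanged. If whole-word matching matters for the data, the function passed to swifter would need to use the regex form.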