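I am doing text processing for my project. I want to replace words in the text of my dataset by looking each word up in a dictionary. My dictionary looks like this:
replacement_dict ={'t1' : 'tebir', 't2':'teki', 'number':'no', ...}
An example text in the dataframe:
"Hello my t1 is not okey, please help my number is bla bla"
should become:
"Hello my tebir is not okey, please help my no is bla bla"
I wrote the following code: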
import re
import pandas

def replacament(row, replacement_dict):
    text = row['text']
    text = text.lower()
    # One regex pass over the text per dictionary entry, whole words only
    for i, j in replacement_dict.items():
        text = re.sub(r"\b%s\b" % i, j, text)
    return text

data['text2'] = data.apply(replacament, axis=1, args=(replacement_dict,))
But it takes 8 hours to finish. My dataset has 600,000 rows. How can I speed up this apply call? Thanks,
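apply can be quite slow, but pandas has a replace function that you can use instead. Assuming you only want to replace whole words and not substrings inside words, first build another dictionary whose keys are regular expressions with word boundaries: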
d = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in replacement_dict.items()}
df['text2'] = df['text'].replace(d, regex = True)
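If the dictionary is large, another option is to compile all keys into a single alternation pattern, so each text is scanned once instead of once per key. The following is only a minimal sketch of that idea, not something from the original answers. The names `pattern` and `replace_words` are made up for illustration, and it assumes the keys are lowercase literal words. Unlike the loop in the question, a single pass never re-replaces text that an earlier replacement produced.

```python
import re

replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

# Longest keys first, so a longer key is preferred over a shorter key that is its prefix.
keys = sorted(replacement_dict, key=len, reverse=True)

# One compiled pattern for all keys; re.escape guards against regex metacharacters in keys.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(k) for k in keys))

def replace_words(text):
    # Lowercase first (as in the question), then look up each matched word in the dictionary.
    return pattern.sub(lambda m: replacement_dict[m.group(0)], text.lower())

sample = "Hello my t1 is not okey, please help my number is bla bla"
result = replace_words(sample)
print(result)
```

With the DataFrame from the question this could be applied as data['text2'] = data['text'].map(replace_words). Whether it is actually faster depends on the dictionary size and the texts, so it is worth timing on a sample first.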
Maybe swifter (https://github.com/jmcarpenter2/swifter) will help you.
I adapted your input and created 600K rows for testing. It takes 1.43 seconds.
Code:
import swifter
import pandas as pd
import time

data = pd.DataFrame({'text': ["t2 my t1 is not okey, number please help my is bla bla"] * 600000})
replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

start_time = time.time()

def replace_text(text):
    # Note: str.replace matches substrings, not whole words, and the text is not lowercased
    for k, v in replacement_dict.items():
        text = text.replace(k, v)
    return text

data['text2_parallel'] = data['text'].swifter.apply(replace_text)
parallel_time = time.time() - start_time
print(f"Parallelized time: {parallel_time:.2f} seconds")
print(data[['text2_parallel']])
Output:
Pandas Apply: 100%|████████████████████████████████████████████████████████| 600000/600000 [00:01<00:00, 458107.50it/s]
Parallelized time: 1.43 seconds
text2_parallel
0 teki my tebir is not okey, no please help my ...
1 teki my tebir is not okey, no please help my ...
2 teki my tebir is not okey, no please help my ...
3 teki my tebir is not okey, no please help my ...
4 teki my tebir is not okey, no please help my ...
... ...
599995 teki my tebir is not okey, no please help my ...
599996 teki my tebir is not okey, no please help my ...
599997 teki my tebir is not okey, no please help my ...
599998 teki my tebir is not okey, no please help my ...
599999 teki my tebir is not okey, no please help my ...
[600000 rows x 1 columns]
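One thing to keep in mind when comparing this timing with the question: plain str.replace also replaces keys that occur inside longer words, whereas the question's regex with \b only replaces whole words. The small sketch below illustrates the difference. The function names and the sample sentence are made up for the illustration.

```python
import re

replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

def replace_substrings(text):
    # Plain substring replacement, as in the swifter example above.
    for k, v in replacement_dict.items():
        text = text.replace(k, v)
    return text

def replace_whole_words(text):
    # Whole-word replacement using word boundaries, as in the question.
    for k, v in replacement_dict.items():
        text = re.sub(r"\b%s\b" % re.escape(k), v, text)
    return text

sample = "my t1 and t10 numbers"
substring_result = replace_substrings(sample)
whole_word_result = replace_whole_words(sample)
print(substring_result)   # my tebir and tebir0 nos
print(whole_word_result)  # my tebir and t10 numbers
```

If the data can contain words such as "t10" or "numbers", the two approaches give different results, so the faster timing is only comparable when substring matches are acceptable.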