I am trying to remove stop words from the "reviews.text" column of a .csv file. When I run the code, producing the output takes 10 minutes.
How can I speed it up?
import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
chdir(path.dirname(__file__))
file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(file_path, dtype={
    'id': str, 'name': str, 'asins': str, 'brand': str, 'categories': str,
    'keys': str, 'manufacturer': str, 'reviews.date': str,
    'reviews.dateAdded': str, 'reviews.dateSeen': str,
    'reviews.didPurchase': str, 'reviews.doRecommend': str,
    'reviews.id': str, 'reviews.numHelpful': str, 'reviews.rating': str,
    'reviews.sourceURLs': str, 'reviews.text': str, 'reviews.title': str,
    'reviews.userCity': str, 'reviews.userProvince': str,
    'reviews.username': str,
})
reviews_data = dataframe['reviews.text']
clean_data = dataframe.dropna(subset=['reviews.text'])
def preprocess_text(text):
    doc = nlp(text)
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text
clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)
print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())
Edit: I ran cProfile to see which parts of the code were taking the most time. See my cProfile results below:
302681427 function calls (296741014 primitive calls) in 294.594 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
10/1 0.000 0.000 294.659 294.659 {built-in method builtins.exec}
1 0.003 0.003 294.639 294.639 test3.py:10(main)
1 0.000 0.000 293.915 293.915 series.py:4769(apply)
1 0.000 0.000 293.915 293.915 apply.py:1409(apply)
1 0.000 0.000 293.915 293.915 apply.py:1482(apply_standard)
1 0.000 0.000 293.915 293.915 base.py:891(_map_values)
1 0.121 0.121 293.915 293.915 algorithms.py:1667(map_array)
34659 0.047 0.000 293.793 0.008 test3.py:29(preprocess_text)
34659 0.465 0.000 293.253 0.008 language.py:1016(__call__)
138636 32.197 0.000 242.236 0.002 trainable_pipe.pyx:40(__call__)
138636 0.531 0.000 205.376 0.001 model.py:330(predict)
4678965/277272 1.998 0.000 203.319 0.001 model.py:307(__call__)
1628973/138636 2.245 0.000 187.239 0.001 chain.py:48(forward)
242613 0.263 0.000 180.916 0.001 with_array.py:32(forward)
519885 157.488 0.000 157.731 0.000 numpy_ops.pyx:91(gemm)
346590 2.671 0.000 145.591 0.000 maxout.py:45(forward)
103977 0.291 0.000 132.341 0.001 with_array.py:70(_list_forward)
277272 0.548 0.000 127.896 0.000 residual.py:28(forward)
69318 0.632 0.000 107.110 0.002 tb_framework.py:33(forward)
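For reference, a profile like the one above can be collected with the standard library's cProfile and pstats modules. A minimal, self-contained sketch (the `work` function is a stand-in; profile your own entry point instead):

```python
import cProfile
import io
import pstats

def work():
    # Placeholder workload standing in for the real preprocessing.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buffer = io.StringIO()
# Sort by cumulative time, as in the output above, and show the top 5 rows.
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(5)
print(buffer.getvalue())
```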
By including only the nlp pipeline components you really need, you should be able to save a lot of time!
I have a small test input set, so the results may not carry over to larger inputs, but for me this gave a huge speedup:
def preprocess_text(text):
    with nlp.select_pipes(enable="tagger"):
        return ' '.join(token.text.lower() for token in nlp(text) if token.is_alpha and not token.is_stop)
A list of the pipeline components, what they do, and how to adjust them can be found at https://spacy.io/usage/processing-pipelines.
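As a quick illustration of inspecting a pipeline's components (using spacy.blank here so no model download is needed; the question's en_core_web_sm pipeline would instead list trained components such as tok2vec, tagger, parser and ner):

```python
import spacy

# A blank pipeline starts with just a tokenizer and no components.
nlp = spacy.blank('en')
print(nlp.pipe_names)

# Components can be added (or disabled) individually; the sentencizer
# is a cheap, rule-based component, unlike the trained ones above.
nlp.add_pipe('sentencizer')
print(nlp.pipe_names)
```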
Beyond that, calling apply in pandas is always slow, but I at least don't know how to get around it in this particular problem with vectorized operations.
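One batching option that may help here (an assumption, not benchmarked on the asker's data): spaCy's nlp.pipe streams texts through the pipeline in batches, which is usually much faster than calling nlp(text) once per row inside apply. A sketch using spacy.blank('en') so it runs without a model download; is_alpha and is_stop are lexeme attributes and work without trained components:

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only; swap in your loaded pipeline

texts = ["This is a great product!", "Works exactly as described."]

# nlp.pipe batches the texts instead of making one nlp(text) call per row.
cleaned = [
    ' '.join(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    for doc in nlp.pipe(texts, batch_size=1000)
]
print(cleaned)
```

With the DataFrame from the question this would replace the apply call, e.g. clean_data['processed_reviews'] = [... for doc in nlp.pipe(clean_data['reviews.text'])].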