I am trying to remove stop words from the "reviews.text" column of a .csv file. When I run the code, producing the output takes 10 minutes.
How can I speed it up?
import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
chdir(path.dirname(__file__))
file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(file_path, dtype={
    'id': str, 'name': str, 'asins': str, 'brand': str, 'categories': str,
    'keys': str, 'manufacturer': str, 'reviews.date': str,
    'reviews.dateAdded': str, 'reviews.dateSeen': str,
    'reviews.didPurchase': str, 'reviews.doRecommend': str,
    'reviews.id': str, 'reviews.numHelpful': str, 'reviews.rating': str,
    'reviews.sourceURLs': str, 'reviews.text': str, 'reviews.title': str,
    'reviews.userCity': str, 'reviews.userProvince': str,
    'reviews.username': str,
})
reviews_data = dataframe['reviews.text']
clean_data = dataframe.dropna(subset=['reviews.text'])
def preprocess_text(text):
    doc = nlp(text)
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text
clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)
print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())
Edit: I ran cProfile to see which parts of the code were taking the most time. See my cProfile results below:
302681427 function calls (296741014 primitive calls) in 294.594 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
10/1 0.000 0.000 294.659 294.659 {built-in method builtins.exec}
1 0.003 0.003 294.639 294.639 test3.py:10(main)
1 0.000 0.000 293.915 293.915 series.py:4769(apply)
1 0.000 0.000 293.915 293.915 apply.py:1409(apply)
1 0.000 0.000 293.915 293.915 apply.py:1482(apply_standard)
1 0.000 0.000 293.915 293.915 base.py:891(_map_values)
1 0.121 0.121 293.915 293.915 algorithms.py:1667(map_array)
34659 0.047 0.000 293.793 0.008 test3.py:29(preprocess_text)
34659 0.465 0.000 293.253 0.008 language.py:1016(__call__)
138636 32.197 0.000 242.236 0.002 trainable_pipe.pyx:40(__call__)
138636 0.531 0.000 205.376 0.001 model.py:330(predict)
4678965/277272 1.998 0.000 203.319 0.001 model.py:307(__call__)
1628973/138636 2.245 0.000 187.239 0.001 chain.py:48(forward)
242613 0.263 0.000 180.916 0.001 with_array.py:32(forward)
519885 157.488 0.000 157.731 0.000 numpy_ops.pyx:91(gemm)
346590 2.671 0.000 145.591 0.000 maxout.py:45(forward)
103977 0.291 0.000 132.341 0.001 with_array.py:70(_list_forward)
277272 0.548 0.000 127.896 0.000 residual.py:28(forward)
69318 0.632 0.000 107.110 0.002 tb_framework.py:33(forward)
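For reference, a profile like the one above can be collected with the standard library's cProfile and pstats modules. A minimal, self-contained sketch (the `work` function is a stand-in; profile your own entry point instead):

```python
import cProfile
import io
import pstats

def work():
    # Placeholder workload standing in for the real preprocessing.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buffer = io.StringIO()
# Sort by cumulative time, as in the output above, and show the top 5 rows.
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(5)
print(buffer.getvalue())
```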
By including only the nlp pipeline components you really need, you should be able to save a lot of time!
I have a small test input set, so the results may not carry over to larger inputs, but for me this gave a huge speedup:
def preprocess_text(text):
    with nlp.select_pipes(enable="tagger"):
        return ' '.join(token.text.lower() for token in nlp(text) if token.is_alpha and not token.is_stop)
A list of the pipeline components, what they do, and how to adjust them can be found at https://spacy.io/usage/processing-pipelines.
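As a quick illustration of inspecting a pipeline's components (using spacy.blank here so no model download is needed; the question's en_core_web_sm pipeline would instead list trained components such as tok2vec, tagger, parser and ner):

```python
import spacy

# A blank pipeline starts with just a tokenizer and no components.
nlp = spacy.blank('en')
print(nlp.pipe_names)

# Components can be added (or disabled) individually; the sentencizer
# is a cheap, rule-based component, unlike the trained ones above.
nlp.add_pipe('sentencizer')
print(nlp.pipe_names)
```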
Beyond that, calling apply in pandas is always slow, but I at least don't know how to get around it in this particular problem with vectorized operations.
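One batching option that may help here (an assumption, not benchmarked on the asker's data): spaCy's nlp.pipe streams texts through the pipeline in batches, which is usually much faster than calling nlp(text) once per row inside apply. A sketch using spacy.blank('en') so it runs without a model download; is_alpha and is_stop are lexeme attributes and work without trained components:

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only; swap in your loaded pipeline

texts = ["This is a great product!", "Works exactly as described."]

# nlp.pipe batches the texts instead of making one nlp(text) call per row.
cleaned = [
    ' '.join(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    for doc in nlp.pipe(texts, batch_size=1000)
]
print(cleaned)
```

With the DataFrame from the question this would replace the apply call, e.g. clean_data['processed_reviews'] = [... for doc in nlp.pipe(clean_data['reviews.text'])].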