My Python code takes 10 minutes to run in Visual Studio Code

Question · votes: 0 · answers: 1

I am trying to remove stop words from the "reviews.text" column of a .csv file. When I run the code, it takes 10 minutes to produce output.

How can I make it run faster?

import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

chdir(path.dirname(__file__))

file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(
    file_path,
    dtype={
        'id': str, 'name': str, 'asins': str, 'brand': str,
        'categories': str, 'keys': str, 'manufacturer': str,
        'reviews.date': str, 'reviews.dateAdded': str, 'reviews.dateSeen': str,
        'reviews.didPurchase': str, 'reviews.doRecommend': str, 'reviews.id': str,
        'reviews.numHelpful': str, 'reviews.rating': str, 'reviews.sourceURLs': str,
        'reviews.text': str, 'reviews.title': str, 'reviews.userCity': str,
        'reviews.userProvince': str, 'reviews.username': str,
    },
)

reviews_data = dataframe['reviews.text']

clean_data = dataframe.dropna(subset=['reviews.text'])

def preprocess_text(text):
    doc = nlp(text)
    
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    
    cleaned_text = ' '.join(cleaned_tokens)
    
    return cleaned_text

clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)

print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())

Edit: I ran cProfile to see which parts of the code take the most time. See my cProfile results below:

  302681427 function calls (296741014 primitive calls) in 294.594 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     10/1    0.000    0.000  294.659  294.659 {built-in method builtins.exec}
        1    0.003    0.003  294.639  294.639 test3.py:10(main)
        1    0.000    0.000  293.915  293.915 series.py:4769(apply)
        1    0.000    0.000  293.915  293.915 apply.py:1409(apply)
        1    0.000    0.000  293.915  293.915 apply.py:1482(apply_standard)
        1    0.000    0.000  293.915  293.915 base.py:891(_map_values)
        1    0.121    0.121  293.915  293.915 algorithms.py:1667(map_array)
    34659    0.047    0.000  293.793    0.008 test3.py:29(preprocess_text)
    34659    0.465    0.000  293.253    0.008 language.py:1016(__call__)
   138636   32.197    0.000  242.236    0.002 trainable_pipe.pyx:40(__call__)
   138636    0.531    0.000  205.376    0.001 model.py:330(predict)
4678965/277272    1.998    0.000  203.319    0.001 model.py:307(__call__)
1628973/138636    2.245    0.000  187.239    0.001 chain.py:48(forward)
   242613    0.263    0.000  180.916    0.001 with_array.py:32(forward)
   519885  157.488    0.000  157.731    0.000 numpy_ops.pyx:91(gemm)
   346590    2.671    0.000  145.591    0.000 maxout.py:45(forward)
   103977    0.291    0.000  132.341    0.001 with_array.py:70(_list_forward)
   277272    0.548    0.000  127.896    0.000 residual.py:28(forward)
    69318    0.632    0.000  107.110    0.002 tb_framework.py:33(forward)
python performance spacy stop-words
1 Answer

0 votes

By including only the `nlp` pipeline components you actually need, you should be able to save a lot of time!

I only had a small test input set, so the results may not carry over to larger inputs, but for me this gave a huge speedup:

def preprocess_text(text):
    # Run only the tagger; skip the parser, NER, and other components.
    with nlp.select_pipes(enable="tagger"):
        return ' '.join(
            token.text.lower()
            for token in nlp(text)
            if token.is_alpha and not token.is_stop
        )

The list of pipeline components, what each one does, and how to adjust them can be found at https://spacy.io/usage/processing-pipelines.
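In fact, for pure stop-word removal the trained components may not be needed at all: `is_alpha` and `is_stop` are lexical attributes set by the tokenizer, not by any pipeline component. A minimal sketch of that idea (assuming the spacytextblob sentiment scores aren't needed at this step) using a blank English pipeline, which loads no trained models:

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
# Lexical attributes like is_alpha and is_stop still work.
nlp = spacy.blank("en")

def preprocess_text(text):
    return " ".join(
        token.text.lower()
        for token in nlp(text)
        if token.is_alpha and not token.is_stop
    )

print(preprocess_text("This is a great product, I love it!"))
```

If you do need other components later (e.g. for sentiment), you can still keep a separate blank pipeline around just for this cleaning pass.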

Beyond that, calling `nlp` inside a pandas `apply` is always going to be slow, but I at least don't know how to solve this particular problem with vectorized operations.
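One workaround, since the `apply` itself can't be vectorized here, is to let spaCy batch the documents with `nlp.pipe` instead of calling `nlp(text)` once per row; spaCy processes the stream in batches, which is usually much faster. A rough sketch, using a blank pipeline for illustration (the `batch_size` value is an arbitrary assumption):

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration

def preprocess_doc(doc):
    # Same filtering as before, but operating on an already-parsed Doc.
    return " ".join(t.text.lower() for t in doc if t.is_alpha and not t.is_stop)

texts = ["Works great!", "Battery life is terrible."]

# nlp.pipe streams and batches the texts instead of invoking the
# pipeline once per call, avoiding per-row overhead.
processed = [preprocess_doc(doc) for doc in nlp.pipe(texts, batch_size=1000)]
```

The resulting list lines up with the input order, so it can be assigned straight back, e.g. `clean_data['processed_reviews'] = processed` with `texts = clean_data['reviews.text'].tolist()`.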
