As part of the preprocessing for a text-classification model, I added stop-word removal and lemmatization steps using the NLTK library. The code is below:
import re
import pandas as pd
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# df is my DataFrame with a "Description" column (loaded elsewhere)

# Stopwords removal
def remove_stopwords(entry):
    sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
    return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))

# Lemmatization
lemmatizer = WordNetLemmatizer()

def punct_strip(string):
    s = re.sub(r'[^\w\s]', ' ', string)
    return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
    sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
    return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))
The problem is that when I preprocess my dataset of 27k entries (my test set), stop-word removal takes 40-45 seconds and lemmatization takes about as long. By comparison, model evaluation takes only 2-3 seconds.
How can I rewrite these functions to speed up the computation? I have read a bit about vectorization, but the example functions are much simpler than the ones shown here, and I am not sure how to apply it in this case.
A similar question was asked here, and the suggestion there was to cache the stopwords.words("english") object. In your remove_stopwords function you rebuild that object for every entry you process, so there is definitely room for improvement. As for your lemmatizer, as described here, you can also cache results to improve performance. I can imagine that your pandas operations are fairly expensive as well. You could consider converting the DataFrame to an array or a dictionary and iterating over that; if you need the DataFrame later, you can easily convert it back.
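To make that concrete, here is a minimal sketch of those three ideas (a stop-word set built once, a per-word lemmatizer cached with functools.lru_cache, and iterating over a plain list instead of .apply). It reuses the column names from your question; helper names such as cached_lemmatize are mine, not anything from NLTK or pandas, so adapt as needed:

import re
from functools import lru_cache

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Build the stop-word set once, outside any per-row function.
STOP_WORDS = set(stopwords.words("english"))

lemmatizer = WordNetLemmatizer()

def remove_stopwords_cached(entry):
    # Membership tests against a pre-built set are cheap hash lookups.
    return " ".join(word for word in entry.split() if word not in STOP_WORDS)

@lru_cache(maxsize=None)
def cached_lemmatize(word):
    # POS-tag and lemmatize each distinct word only once;
    # repeated words across the corpus hit the cache instead.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, wordnet.NOUN))

def lemmatize_rows_cached(entry):
    return " ".join(cached_lemmatize(word)
                    for word in re.sub(r'[^\w\s]', ' ', entry).split())

# Iterate over a plain Python list instead of calling .apply row by row,
# then assign the results back as new columns.
no_stop = [remove_stopwords_cached(text) for text in df["Description"].tolist()]
df["Description_no_stopwords"] = no_stop
df["Description - lemmatized"] = [lemmatize_rows_cached(text) for text in no_stop]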
# take 1: stopwords.words('english') is re-evaluated for every single word (slow)
def remove_stopwords1(text):
    new_text = []
    for word in text.split():
        if word in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    x = new_text[:]
    new_text.clear()
    return " ".join(x)

# take 2: the stop-word list is built once per call and reused inside the loop
def remove_stopwords2(text):
    new_text = []
    l = text.split()
    stopword_list = stopwords.words('english')
    for word in l:
        if word in stopword_list:
            new_text.append('')
        else:
            new_text.append(word)
    x = new_text[:]
    new_text.clear()
    return " ".join(x)
import time

# time take 1 on a single review and extrapolate to 50,000 rows
start = time.time()
remove_stopwords1(df['review'][0])
time2 = time.time() - start
print(time2 * 50000)

# time take 2 applied to the whole column
start = time.time()
df['review'] = df['review'].apply(remove_stopwords2)
time2 = time.time() - start
print(time2)
Time taken by take 1 (extrapolated): 7k+ seconds.
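One further tweak on top of take 2, which I have not timed here: since "word in some_list" is a linear scan while "word in some_set" is a hash lookup, converting the cached stop-word list to a set should cut the per-word cost again. A minimal sketch, reusing the df['review'] column from the snippet above:

import time

from nltk.corpus import stopwords

# take 3 (sketch): same loop as take 2, but the stop words live in a set,
# so each membership check is a hash lookup instead of a list scan.
stopword_set = set(stopwords.words('english'))

def remove_stopwords3(text):
    return " ".join(word for word in text.split() if word not in stopword_set)

start = time.time()
df['review'] = df['review'].apply(remove_stopwords3)
print(time.time() - start)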