Faster way to remove stop words in Python

Question

I am trying to remove stop words from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

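I am processing 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I am thinking of using something like the regex re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I am also happy to hear about other, possibly faster, methods.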

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.

Answer 1

Try caching the stopwords object, as shown below. Constructing it each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are listed below.

    nCalls  Cumulative Time
    10000   7.723   words.py:7(testFuncOld)
    10000   0.140   words.py:11(testFuncNew)

So caching the stopwords instance gives a speedup of roughly 55x (7.723 s versus 0.140 s cumulative).
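
One way to make sure the list is built only once is sketched below. This is an illustration under assumptions, not part of the measured code: load_stop_words is a hypothetical stand-in for stopwords.words("english") that returns a tiny hard-coded list so the example runs without the NLTK data, and functools.lru_cache memoizes it so every later call reuses the first result.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the (pretend) expensive load really runs


@lru_cache(maxsize=None)
def load_stop_words():
    """Hypothetical stand-in for stopwords.words("english")."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Assumption: a tiny sample list; the real NLTK list is much longer.
    return ["the", "a", "an", "is", "of"]


def remove_stop_words(text):
    stop_words = load_stop_words()  # served from the cache after the first call
    return " ".join(word for word in text.split() if word not in stop_words)


for _ in range(1000):
    cleaned = remove_stop_words("hello bye the the hi")

print(cleaned)     # hello bye hi
print(LOAD_CALLS)  # 1
```

A module-level constant, as in the answer's cachedStopWords, achieves the same thing; the decorator is only convenient when the load should be deferred until first use.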

Answer 2

Use a regular expression to remove every word that matches a stop word:

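import re
from nltk.corpus import stopwords

# Compile once: an alternation of all stop words between word boundaries, plus any trailing whitespace
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)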

This will probably be much faster than looping yourself, especially for large input strings.

If the last word in the text gets removed this way, you may be left with trailing whitespace. I suggest handling that separately.
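
A slightly more defensive variant of the same idea, as a sketch: sample_stop_words is an assumed tiny list standing in for stopwords.words('english'). The words are passed through re.escape and sorted longest first, so that a longer entry such as "don't" is tried before its prefix "don", and .strip() takes care of the trailing whitespace mentioned above.

```python
import re

# Assumption: a tiny sample list standing in for stopwords.words('english').
sample_stop_words = ["don", "don't", "the", "t", "a"]

# Longest words first so "don't" is tried before "don";
# re.escape guards against regex metacharacters inside a word.
alternation = "|".join(
    re.escape(word) for word in sorted(sample_stop_words, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternation + r")\b\s*")


def remove_stop_words(text):
    # .strip() drops the space left behind when the last word is removed.
    return pattern.sub("", text).strip()


print(remove_stop_words("I don't like the rain"))     # I like rain
print(remove_stop_words("hello bye the the hi the"))  # hello bye hi
```

The match is case-sensitive, as in the original pattern; passing re.IGNORECASE to re.compile would be one way to change that.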

Answer 3

First, you are creating the stop word list for every string. Create it once. A set would indeed be great here.

forbidden_words = set(stopwords.words('english'))

After that, get rid of the [] inside join. Use a generator instead.

' '.join([x for x in ['a', 'b', 'c']])

becomes

' '.join(x for x in ['a', 'b', 'c'])
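
Whether the generator form actually helps is worth measuring. In CPython, str.join turns its argument into a sequence before joining, so the gain may be small or absent. A minimal timing sketch, assuming a tiny sample set in place of set(stopwords.words('english')):

```python
import timeit

# Assumption: a tiny sample set standing in for set(stopwords.words('english')).
forbidden_words = {"the", "a", "an", "is", "of"}
text = "hello bye the the hi " * 200


def with_list():
    return " ".join([w for w in text.split() if w not in forbidden_words])


def with_generator():
    return " ".join(w for w in text.split() if w not in forbidden_words)


# Both forms produce the same string; only the timing may differ.
assert with_list() == with_generator()

list_seconds = timeit.timeit(with_list, number=200)
generator_seconds = timeit.timeit(with_generator, number=200)
print(f"list comprehension:   {list_seconds:.4f} s")
print(f"generator expression: {generator_seconds:.4f} s")
```

The numbers depend on the machine and Python version, so none are quoted here.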

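The next thing to deal with would be making .split() yield values instead of returning a list. I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.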
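
A sketch of what a lazily yielding split could look like with re.finditer. The helper name iter_words and the sample set are assumptions made for illustration, and this is not claimed to be faster than str.split():

```python
import re

# Assumption: a tiny sample set standing in for set(stopwords.words('english')).
forbidden_words = {"the", "a", "an", "is", "of"}

token_pattern = re.compile(r"\S+")  # a token is a run of non-whitespace


def iter_words(text):
    # finditer yields matches one at a time instead of building a full list.
    for match in token_pattern.finditer(text):
        yield match.group()


def remove_stop_words(text):
    return " ".join(w for w in iter_words(text) if w not in forbidden_words)


print(remove_stop_words("hello bye the the hi"))  # hello bye hi
```

This mainly saves memory on very long strings; for short strings like the ones in the question, plain .split() is hard to beat.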

Lastly, do a job like this (removing stop words from 6 million strings) in parallel. That is a whole different topic.
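
For orientation only, a minimal sketch of the parallel approach using the standard library's multiprocessing.Pool. The function names, the sample stop word set and the chunksize value are assumptions, not tuned recommendations:

```python
from multiprocessing import Pool

# Assumption: a tiny sample set standing in for set(stopwords.words('english')).
FORBIDDEN_WORDS = frozenset(["the", "a", "an", "is", "of"])


def strip_stop_words(text):
    # Module-level function so worker processes can import and call it.
    return " ".join(w for w in text.split() if w not in FORBIDDEN_WORDS)


def strip_stop_words_parallel(texts, processes=None, chunksize=10_000):
    # processes=None lets Pool pick the number of workers from the CPU count;
    # a large chunksize reduces the per-item inter-process overhead.
    with Pool(processes=processes) as pool:
        return pool.map(strip_stop_words, texts, chunksize=chunksize)
```

Call strip_stop_words_parallel(list_of_strings) from under an if __name__ == "__main__": guard, since some platforms start workers by re-importing the main module. Because each string needs very little work, the cost of shipping strings between processes can eat much of the gain, so it is worth timing against the single-process version.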

Answer 4

Sorry for the late reply. This may prove useful for new users.

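Create a dictionary of stop words using the collections library.
Use that dictionary for very fast lookups (O(1) on average) rather than searching a list (O(n) in the number of stop words).

from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])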
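
To check the lookup-cost claim on your own machine, here is a small timing sketch. It assumes 200 made-up words in place of stopwords.words('english') and also times a plain set, which uses the same kind of hash lookup as Counter without storing counts:

```python
import timeit
from collections import Counter

# Assumption: 200 made-up words standing in for stopwords.words('english').
stop_words = [f"word{i}" for i in range(200)]
stopwords_dict = Counter(stop_words)  # hash-based lookup, as in the answer
stopwords_set = set(stop_words)       # also hash-based, without the counts

probe = "not_a_stop_word"  # worst case for the list: every element is compared

list_seconds = timeit.timeit(lambda: probe in stop_words, number=100_000)
dict_seconds = timeit.timeit(lambda: probe in stopwords_dict, number=100_000)
set_seconds = timeit.timeit(lambda: probe in stopwords_set, number=100_000)

print(f"list:    {list_seconds:.4f} s")
print(f"Counter: {dict_seconds:.4f} s")
print(f"set:     {set_seconds:.4f} s")
```

Timings vary by machine, so none are quoted here.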
