Adding stop words in Gensim

Question (votes: 0, answers: 2)

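Thanks for stopping by! I have a quick question about appending stop words. A few words keep showing up in my data set, and I was hoping I could add them to gensim's stop word list. I've seen plenty of examples that use nltk, and I'm hoping there is a way to do the same thing in gensim. My code is posted below: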

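import gensim
import nltk

def preprocess(text):
    # lemmatize_stemming() is a helper defined elsewhere in the asker's code
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            nltk.bigrams(token)  # the return value is discarded, so this has no effect
            result.append(lemmatize_stemming(token))
    return result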

python windows nlp gensim stop-words

2 Answers

1
vote

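gensim.parsing.preprocessing.STOPWORDS is predefined for your convenience and happens to be a frozenset, so you can't add to it directly. You can, however, easily build a larger set that contains those words plus your additions. For example: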

from gensim.parsing.preprocessing import STOPWORDS
my_stop_words = STOPWORDS.union(set(['mystopword1', 'mystopword2']))

Then use the new, larger my_stop_words in your subsequent stop-word removal code. (The gensim function simple_preprocess() does not remove stop words automatically.)

0
votes

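import gensim
import nltk

# Extra stop words, built once as a set rather than once per token
newStopWords = {'stopword1', 'stopword2'}

def preprocess(text):
    # lemmatize_stemming() is the asker's own helper, defined elsewhere
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            nltk.bigrams(token)  # kept from the question; the result is discarded
            result.append(lemmatize_stemming(token))
    return result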
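A possible variation on this answer, offered only as a sketch: merge the extra words into a single set up front so each token needs one membership test instead of two. BASE_STOP_WORDS below is a tiny stand-in for gensim's STOPWORDS (which is also a frozenset) so that the example runs without gensim installed. filter_tokens is a hypothetical helper name.

```python
# Stand-in for gensim.parsing.preprocessing.STOPWORDS (also a frozenset),
# kept tiny so this sketch runs without gensim installed.
BASE_STOP_WORDS = frozenset({"the", "and", "with"})

# A frozenset is immutable, so it has no add() method.
try:
    BASE_STOP_WORDS.add("stopword")
except AttributeError:
    can_add_in_place = False
else:
    can_add_in_place = True

# The | operator builds a new frozenset and leaves the original untouched.
ALL_STOP_WORDS = BASE_STOP_WORDS | {"stopword", "another"}


def filter_tokens(tokens, stop_words=ALL_STOP_WORDS, min_length=4):
    """Keep tokens that are not stop words and have at least min_length characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_length]


kept = filter_tokens(["the", "stopword", "another", "topic", "model", "and"])
print(kept)  # ['topic', 'model']
```

With the real list, replacing BASE_STOP_WORDS with the imported STOPWORDS should be the only change needed.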
