Add/remove custom stop words with spaCy

Question

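What is the best way to add/remove stop words with spaCy? I am using the token.is_stop function and would like to make some custom changes to the set. I looked through the documentation but could not find anything about stop words. Thanks!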

Answer 1

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This appears to work for versions <= v1.8. For newer versions, see the other answers.

Answer 2

With spaCy 2.0.11, you can update the stop word set using any of the following:

To add a single stop word:

import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stop words at once:

import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}

To remove a single stop word:

import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stop words at once:

import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stop words, use:

import spacy
nlp = spacy.load("en")
print(nlp.Defaults.stop_words)

Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the nlp.to_disk("/path") and nlp.from_disk("/path") methods (described in more detail at https://spacy.io/usage/saving-loading).
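Because nlp.Defaults.stop_words is an ordinary Python set, the four updates in this answer behave like the built-in set operations. The sketch below uses a small stand-in set so that it runs without spaCy or a model; the name stop_words and the words in it are illustrative assumptions, not spaCy's actual list. It also shows one detail worth knowing: remove raises KeyError for a word that is not in the set, while discard and -= silently ignore it.

```python
# Stand-in for nlp.Defaults.stop_words; the contents are illustrative only.
stop_words = {"the", "a", "whatever", "whenever"}

# Add a single stop word.
stop_words.add("my_new_stopword")

# Add several stop words at once (in-place union).
stop_words |= {"my_new_stopword1", "my_new_stopword2"}

# Remove a single stop word; raises KeyError if it is absent.
stop_words.remove("whatever")

# Remove several stop words at once; words that are absent are ignored.
stop_words -= {"whenever", "not_in_the_set"}

# Removing the same word a second time fails with remove() ...
try:
    stop_words.remove("whatever")
    second_remove_succeeded = True
except KeyError:
    second_remove_succeeded = False

# ... but discard() is a no-op when the word is missing.
stop_words.discard("whatever")
```

If the words to remove come from user input, discard or -= may be the safer choice, since neither fails on unknown words.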

Answer 3

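For version 2.0 I used this:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

print(STOP_WORDS)  # <- set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")

# Flag every word in the set as a stop word in the vocabulary.
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

This loads all stop words into a set.

You can add your own stop words to STOP_WORDS, or use your own list in the first place.

Answer 4

For 2.0, use the following:

import spacy
nlp = spacy.load("en")

# Flag every default stop word as a stop word in the vocabulary.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

Answer 5

For the latest version, the following removes a word from the list:

import spacy.lang.en.stop_words
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')

Answer 6

This collects the stop words too :)

import spacy.lang.en.stop_words
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS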
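Note that STOP_WORDS is a module-level set, so removing a word from it changes the set for everything in the same process that uses it. If that side effect is unwanted, one option is to derive a separate set and filter against that instead. A minimal sketch, assuming a stand-in DEFAULT_STOP_WORDS and two hypothetical helpers, build_stop_words and filter_tokens, none of which are part of spaCy:

```python
# Stand-in for spaCy's STOP_WORDS; the contents are illustrative only.
DEFAULT_STOP_WORDS = frozenset({"the", "is", "not", "a"})


def build_stop_words(base, add=(), remove=()):
    """Return a new set derived from base without mutating it."""
    return (set(base) | set(add)) - set(remove)


def filter_tokens(tokens, stop_words):
    """Keep the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Add "movie" as a stop word and keep "not" so negation survives filtering.
custom = build_stop_words(DEFAULT_STOP_WORDS, add={"movie"}, remove={"not"})
kept = filter_tokens("The movie is not a bad one".split(), custom)
```

Here kept retains "not", and the original set is left untouched for any other code that relies on it.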
