I am using the spaCy NLP library, and I wrote a function that returns a list of tokens from a text:
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
This function is incorrect because the stop-word removal doesn't work. Everything is fine only if I remove the last condition, `and not in stop_words`.
How can I fix this function so that, in addition to all the other conditions, it removes stop words based on a defined list?
I think you intend `not in stop_words` to be a boolean expression. What does your stop_words look like? If stop_words is a list, this is a syntax error.
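A minimal, spaCy-free sketch of the membership syntax: `in` and `not in` are binary operators, so an element must always appear on their left-hand side.

```python
stop_words = ["a", "the", "is", "are"]

# `not in` needs an element on its left-hand side;
# writing `not in stop_words` on its own is a SyntaxError.
print("the" in stop_words)         # True
print("sample" not in stop_words)  # True
```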
You need to pass the stop words into the function as a parameter, and then modify the condition that adds a word to the token list so that it checks whether the word is in the stop-word list:
def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens
Sample:
text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)
Output:
['this', 'sample', 'text', 'to', 'demonstrate', 'function']
You wrote your condition incorrectly. Your last `elif` is equivalent to this:

elif condA and condB and not in stop_words:
    ...

If you try to execute this code, you will get a SyntaxError. To check whether an element is in an iterable, you need to provide that element on the left-hand side of the `in` keyword. Here you just need to write `word`:

elif condA and condB and ... and str(word) not in stop_words:
    ...
Your code works fine for me with one small change:
at the end of the `elif`, put `and str(word) not in stop_words`:
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
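Why the `str(word)` conversion matters: spaCy yields `Token` objects rather than plain strings, so comparing a token object directly against a list of strings never matches. A minimal sketch with a hypothetical stand-in class (`FakeToken` below is not part of spaCy, it just mimics a token's string conversion):

```python
# Hypothetical stand-in for a spaCy Token, only to illustrate
# why str() is needed before the membership test.
class FakeToken:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

stop_words = ["a", "the", "is", "are"]
word = FakeToken("the")

print(word in stop_words)       # False: a token object is not a string
print(str(word) in stop_words)  # True: compare the token's text instead
```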