Python NLP: handling an if statement that checks "not in stop words list"

Problem description (votes: 0, answers: 4)

I am using the spaCy NLP library, and I have written a function that returns a list of tokens from a text:

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)

This function is not correct because the stop word removal does not work. Everything is fine only if I remove the last condition:

and not in stop_words

How can I improve this function so that, on top of all the other conditions, it also removes stop words based on the defined list?

python nlp spacy
4 Answers
0 votes

I think you mean "not in stop_words" to produce a boolean value. What does your stop_words look like? If stop_words is a list, this is a syntax error.
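To illustrate the point, here is a minimal sketch (my own example, not from the original answer): a bare "not in stop_words" is rejected by the parser, because the in operator needs an element on its left-hand side.

# Minimal sketch: a membership test needs an element on the left of "in"
stop_words = ["a", "the", "is", "are"]

# SyntaxError: nothing on the left of "in"
# if not in stop_words:
#     ...

word = "the"
if word not in stop_words:         # valid: element on the left, iterable on the right
    print(word, "is kept")
else:
    print(word, "is a stop word")  # this branch runs: "the is a stop word"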


0 votes

You need to pass the stop words into the function, i.e. make the function take the stop word list as an input, and then modify the condition that adds a word to the token list so that it checks whether the word is in the stop word list:

import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        # keep the word only if it is not punctuation/space/quote/bracket
        # and its lowercase form is not in the stop word list
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens

Sample:

text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)

Output:

['this', 'sample', 'text', 'to', 'demonstrate', 'function']

0 votes

You wrote your condition incorrectly. Your last elif boils down to something like this:

elif condA and condB and ... and not in stop_words:
    ...

If you try to execute this code, you will get a syntax error. To check whether an element is contained in an iterable, you have to put that element on the left-hand side of the in keyword. You just need to write the word itself:

elif condA and condB and ... and str(word) not in stop_words:
   ...
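As a small side note of my own (not part of the original answer): str(word) compares the token's original casing, so "The" would not match the lowercase stop word "the", whereas word.lower_ (used in the earlier answer) keeps the comparison case-insensitive. A tiny sketch:

import spacy

nlp = spacy.load('en_core_web_sm')
stop_words = ["a", "the", "is", "are"]

doc = nlp("The cat is on the mat")

# lower_ gives the lowercased token text, so "The" is filtered out as well
print([t.lower_ for t in doc if t.lower_ not in stop_words])
# ['cat', 'on', 'mat']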

0 votes

Your code works fine for me, with one small change:

At the end of the elif, put and str(word) not in stop_words:

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
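For completeness, a quick call of this version, reusing the sample sentence from the earlier answer (note that the function collects tokens in a set and returns list(tokens), so the order of the output may vary):

text = "This is a sample text to demonstrate the function."
tokens = preprocess_text_spacy(text)
print(tokens)
# contains 'this', 'sample', 'text', 'to', 'demonstrate', 'function' (order may vary)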