I am using the spaCy NLP library, and I wrote a function that returns a list of tokens from a text:
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
This function is incorrect because the stop-word removal doesn't work. Everything is fine only if I remove the last condition, `and not in stop_words`.
How can I fix this function so that, in addition to all the other conditions, it removes stop words based on a defined list?
I think you intend `not in stop_words` to be a boolean expression. What does your stop_words look like? If stop_words is a list, this is a syntax error.
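A minimal, spaCy-free sketch of the membership syntax: `in` and `not in` are binary operators, so an element must always appear on their left-hand side.

```python
stop_words = ["a", "the", "is", "are"]

# `not in` needs an element on its left-hand side;
# writing `not in stop_words` on its own is a SyntaxError.
print("the" in stop_words)         # True
print("sample" not in stop_words)  # True
```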
You need to pass the stop words into the function as a parameter, and then modify the condition that adds a word to the token list so that it checks whether the word is in the stop-word list:
def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens
Sample:
text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)
Output:
['this', 'sample', 'text', 'to', 'demonstrate', 'function']
You wrote your condition incorrectly. Your last `elif` is equivalent to this:

elif condA and condB and not in stop_words:
    ...

If you try to execute this code, you will get a SyntaxError. To check whether an element is in an iterable, you need to provide that element on the left-hand side of the `in` keyword. Here you just need to write `word`:

elif condA and condB and ... and str(word) not in stop_words:
    ...
Your code works fine for me with one small change:
at the end of the `elif`, put `and str(word) not in stop_words`:
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
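Why the `str(word)` conversion matters: spaCy yields `Token` objects rather than plain strings, so comparing a token object directly against a list of strings never matches. A minimal sketch with a hypothetical stand-in class (`FakeToken` below is not part of spaCy, it just mimics a token's string conversion):

```python
# Hypothetical stand-in for a spaCy Token, only to illustrate
# why str() is needed before the membership test.
class FakeToken:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

stop_words = ["a", "the", "is", "are"]
word = FakeToken("the")

print(word in stop_words)       # False: a token object is not a string
print(str(word) in stop_words)  # True: compare the token's text instead
```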