Why isn't my word lemmatization working as expected?

Question · Votes: 0 · Answers: 1

Hello Stack Overflow community! Long-time reader, first-time poster. I'm currently experimenting with NLP, and after reading through some forum posts on the topic, I can't seem to get the lemmatizer to work properly (function pasted below). Comparing my raw text against the preprocessed text, every cleaning step works as expected except lemmatization. I even tried specifying the part of speech 'v' so that a word doesn't default to being treated as a noun and I still get the base form of verbs (e.g. turned -> turn, are -> be, reading -> read) ... but that doesn't seem to work.

Would appreciate another pair of eyes and any feedback - thanks!

# key imports

import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions


# cleaning functions

def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()

def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords(text):
    '''
    Removes stop words which don't have meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']

    stop_words = set(stopwords.words('english')) - set(['not','out','in']) 
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in word_tokenize(text) if w not in stop_words])

def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)



# preprocessing pipeline

def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []      
            
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)                     
    return word_list

# lemmatize not properly working to get base form of word
# ex: 'turned' still remains 'turned' without returning base form 'turn'
# ex: 'running' still remains 'running' without getting base form 'run'



sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
python nlp lemmatization
1 Answer
0 votes

Lemmatization is very similar to stemming, in that it reduces a set of inflected words down to a common word. The difference is that lemmatization reduces inflections down to their true root word, which is called the lemma. If we take the words 'amaze', 'amazing', 'amazingly', the lemma of all of these is 'amaze'. Compare that to stemming, which typically returns 'amaz'. Generally, lemmatization is seen as more advanced than stemming.
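To make that contrast concrete, here is a quick sketch using NLTK's PorterStemmer (not part of the code below, just for comparison):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# the stemmer strips suffixes mechanically, without a dictionary lookup,
# so it produces truncated stems rather than real words
print([stemmer.stem(w) for w in ['amaze', 'amazed', 'amazing']])
# ['amaz', 'amaz', 'amaz']
```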

words = ['amaze', 'amazed', 'amazing']

Again, we will be using NLTK for the lemmatization. We also need to make sure the WordNet database is downloaded; it acts as the lookup for our lemmatizer, ensuring that it produces real lemmas.

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print([lemmatizer.lemmatize(word) for word in words])
# outputs ['amaze', 'amazed', 'amazing']

Clearly, nothing has happened, and that is because lemmatization requires that we also provide the part-of-speech (POS) tag, which is the grammatical category of a word, e.g. noun, adjective, or verb. In our case, we can treat every word as a verb, which we can implement like so:

from nltk.corpus import wordnet

print([lemmatizer.lemmatize(word, wordnet.VERB) for word in words])
# outputs ['amaze', 'amaze', 'amaze']