Why isn't my word lemmatization working as expected?

Question · Votes: 0 · Answers: 1

Hello Stack Overflow community! Long-time reader, first-time poster. I'm currently experimenting with NLP, and after reading through some forum posts on the topic, I can't seem to get the lemmatizer to work properly (function pasted below). Comparing my raw text against the preprocessed text, every cleaning step works as expected except lemmatization. I even tried specifying the part of speech 'v' so that a word doesn't default to being treated as a noun and I still get the base form of verbs (e.g. turned -> turn, are -> be, reading -> read) ... but that doesn't seem to work.

Would appreciate another pair of eyes and any feedback - thanks!

# key imports

import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions


# cleaning functions

def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()

def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords(text):
    '''
    Removes stop words which don't have meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']

    stop_words = set(stopwords.words('english')) - set(['not','out','in']) 
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in word_tokenize(text) if w not in stop_words])

def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)



# preprocessing pipeline

def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []      
            
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)                     
    return word_list

# lemmatize not properly working to get base form of word
# ex: 'turned' still remains 'turned' without returning base form 'turn'
# ex: 'running' still remains 'running' without getting base form 'run'



sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
python nlp lemmatization
1 Answer
0 votes

Lemmatization is very similar to stemming, in that it reduces a set of inflected words down to a common word. The difference is that lemmatization reduces inflections down to their true root word, which is called the lemma. If we take the words 'amaze', 'amazing', 'amazingly', the lemma of all of these is 'amaze'. Compare that to stemming, which typically returns 'amaz'. Generally, lemmatization is seen as more advanced than stemming.
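To make that contrast concrete, here is a quick sketch using NLTK's PorterStemmer (not part of the code below, just for comparison):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# the stemmer strips suffixes mechanically, without a dictionary lookup,
# so it produces truncated stems rather than real words
print([stemmer.stem(w) for w in ['amaze', 'amazed', 'amazing']])
# ['amaz', 'amaz', 'amaz']
```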

words = ['amaze', 'amazed', 'amazing']

Again, we will be using NLTK for the lemmatization. We also need to make sure the WordNet database is downloaded; it acts as the lookup for our lemmatizer, ensuring that it produces real lemmas.

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print([lemmatizer.lemmatize(word) for word in words])
# outputs ['amaze', 'amazed', 'amazing']

Clearly, nothing has happened, and that is because lemmatization requires that we also provide the part-of-speech (POS) tag, which is the grammatical category of a word, e.g. noun, adjective, or verb. In our case, we can treat every word as a verb, which we can implement like so:

from nltk.corpus import wordnet

print([lemmatizer.lemmatize(word, wordnet.VERB) for word in words])
# outputs ['amaze', 'amaze', 'amaze']