Lemmatize a pandas column using WordNet after POS tagging


I have a pandas column containing text, df_travail['line_text'].

I want to lemmatize every word in this column.

First I lowercase the text:

df_travail['lowercase'] = df_travail['line_text'].str.lower()

Then I tokenize it and apply POS tagging (because the WordNet lemmatizer treats every word as a noun by default).

from nltk import word_tokenize, pos_tag
tok_and_tag = lambda x: pos_tag(word_tokenize(x))
df_travail['tok_and_tag'] = df_travail['lowercase'].apply(tok_and_tag)

This gives me the following (an excerpt of df_travail['tok_and_tag']):

[('so', 'RB'), ('you', 'PRP'), ("'ve", 'VBP'), ('come', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('master', 'NN'), ('for', 'IN'), ('guidance', 'NN'), ('?', '.'), ('is', 'VBZ'), ('this', 'DT'), ('what', 'WP'), ('you', 'PRP'), ("'re", 'VBP'), ('saying', 'VBG'), (',', ','), ('grasshopper', 'NN'), ('?', '.')]
[('actually', 'RB'), (',', ','), ('you', 'PRP'), ('called', 'VBD'), ('me', 'PRP'), ('in', 'IN'), ('here', 'RB'), (',', ','), ('but', 'CC'), ('yeah', 'UH'), ('.', '.')]

However, now that I have applied POS tagging, I am lost as to which lemmatization function to apply (with WordNet) to this output.

Edit: the following link does not address the POS part of my question: Lemmatization of all pandas cells

python pandas nltk wordnet lemmatization
1 answer

Try the following example. It lemmatizes adjectives with pos="a" and everything else as a noun:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Penn Treebank tags that mark adjectives
adjective_tags = ['JJ', 'JJR', 'JJS']

def convert(tagged_tokens):
    # tagged_tokens is a list of (word, tag) pairs, as produced by pos_tag
    lemmatized_text = []

    for word, tag in tagged_tokens:
        if tag in adjective_tags:
            lemmatized_text.append(wordnet_lemmatizer.lemmatize(word, pos="a"))
        else:
            lemmatized_text.append(wordnet_lemmatizer.lemmatize(word))  # default POS = noun

    return ' '.join(lemmatized_text)

# Apply to the tagged column from the question; 'lemmatized' is an illustrative column name
df_travail['lemmatized'] = df_travail['tok_and_tag'].apply(convert)
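
The same idea can be extended beyond adjectives. WordNetLemmatizer.lemmatize takes a pos argument of "n", "v", "a" or "r", while pos_tag returns Penn Treebank tags such as 'VBD' or 'RB', so the tags have to be translated first. Below is a minimal sketch of that translation, not a definitive implementation. The names PENN_TO_WORDNET, penn_to_wordnet and lemmatize_tagged are hypothetical helpers introduced here for illustration, and the lemmatizer is passed in as a callable so the mapping logic can be checked without the WordNet corpus installed.

```python
# Two-letter Penn Treebank tag prefix -> WordNet POS letter.
# (Assumption: any tag not listed here falls back to noun, the lemmatizer's default.)
PENN_TO_WORDNET = {
    "JJ": "a",  # JJ, JJR, JJS       -> adjective
    "VB": "v",  # VB, VBD, VBG, ...  -> verb
    "NN": "n",  # NN, NNS, NNP, ...  -> noun
    "RB": "r",  # RB, RBR, RBS       -> adverb
}


def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    return PENN_TO_WORDNET.get(tag[:2], "n")


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize a list of (word, tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example the
    bound method WordNetLemmatizer().lemmatize.
    """
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]
```

With NLTK and its WordNet corpus available, this could be applied to the column from the question as df_travail['tok_and_tag'].apply(lambda toks: lemmatize_tagged(toks, wordnet_lemmatizer.lemmatize)), which returns a list of lemmas per row; wrap the result in ' '.join(...) if a single string is preferred.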