I have a pandas column, df_travail['line_text'], that contains text.
I want to lemmatize every word in this column.
First I lowercase the text:
df_travail['lowercase'] = df_travail['line_text'].str.lower()
Then I tokenize it and apply POS tagging (because the WordNet lemmatizer treats every word as a noun by default).
from nltk import word_tokenize, pos_tag
tok_and_tag = lambda x: pos_tag(word_tokenize(x))
df_travail ['tok_and_tag'] = df_travail['lowercase'].apply(tok_and_tag)
This gives me the following (an excerpt of df_travail['tok_and_tag']):
[('so', 'RB'), ('you', 'PRP'), ("'ve", 'VBP'), ('come', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('master', 'NN'), ('for', 'IN'), ('guidance', 'NN'), ('?', '.'), ('is', 'VBZ'), ('this', 'DT'), ('what', 'WP'), ('you', 'PRP'), ("'re", 'VBP'), ('saying', 'VBG'), (',', ','), ('grasshopper', 'NN'), ('?', '.')]
[('actually', 'RB'), (',', ','), ('you', 'PRP'), ('called', 'VBD'), ('me', 'PRP'), ('in', 'IN'), ('here', 'RB'), (',', ','), ('but', 'CC'), ('yeah', 'UH'), ('.', '.')]
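But now that I have the POS tags, I am lost as to which lemmatization function I should apply (with WordNet).
Edit: the following link does not address the POS part of my question: Lemmatization of all pandas cells
Try the following example: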
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
adjective_tags = ['JJ', 'JJR', 'JJS']

def convert(text):
    # Tokenize and tag the raw text so adjectives can be told apart from other words.
    lemmatized_text = []
    for word, tag in pos_tag(word_tokenize(text)):
        if tag in adjective_tags:
            lemmatized_text.append(wordnet_lemmatizer.lemmatize(word, pos="a"))
        else:
            lemmatized_text.append(wordnet_lemmatizer.lemmatize(word))  # default POS = noun
    return ' '.join(lemmatized_text)

df['text'] = df['text'].apply(convert)
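
The example above only distinguishes adjectives, so verbs and adverbs are still looked up as nouns. As a rough sketch of one way to extend it, the code below maps the Penn Treebank tags produced by pos_tag to the single-letter POS codes that WordNetLemmatizer.lemmatize accepts ('a', 'v', 'r', 'n') and works directly on the already-tagged tok_and_tag column from the question. The names penn_to_wordnet, lemmatize_tagged and add_lemma_column, as well as the 'lemmas' column, are hypothetical and chosen for illustration; falling back to noun for every other tag is an assumption that simply mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS letter."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun; also the assumed fallback for all other tags


def lemmatize_tagged(tagged, lemmatizer):
    """Lemmatize a list of (word, tag) tuples.

    `lemmatizer` is any object exposing lemmatize(word, pos=...),
    for example an nltk.stem.WordNetLemmatizer instance.
    """
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged]


def add_lemma_column(df, source='tok_and_tag', target='lemmas'):
    """Hypothetical helper: store the lemmas of df[source] in df[target]."""
    # Imported here so the mapping functions above work without NLTK installed.
    # WordNetLemmatizer needs the 'wordnet' corpus (nltk.download('wordnet')).
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    df[target] = df[source].apply(lambda tagged: lemmatize_tagged(tagged, lemmatizer))
    return df
```

With the question's data this would be called as df_travail = add_lemma_column(df_travail). A token tagged 'VBD' is then looked up as a verb rather than as a noun, which is the point of carrying the POS tags through to the lemmatizer.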