Comparison between stemming and lemmatization


Based on several studies I have read, here is the comparison as I understand it:

If we look at the text, lemmatization should most likely return the more correct output, right? And not just correct, but also a shortened form. I ran an experiment on this line:

sentence ="having playing  in today gaming ended with greating victorious"

But when I run the stemming and lemmatization code, I get the following results:

['have', 'play', 'in', 'today', 'game', 'end', 'with', 'great', 'victori']
['having', 'playing', 'in', 'today', 'gaming', 'ended', 'with', 'greating', 'victorious']

The first list is the stemmed output, and everything there looks fine except victori (it should be victorious). The second is the lemmatized output, where every word is correct but left in its original form (the sketch after the code below shows why). So which option is better in this case: the short versions, some of which are not real words, or the long versions, which are correct?

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Resources needed by word_tokenize, the lemmatizer, and the stopword list
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

mylematizer = WordNetLemmatizer()
mystemmer = PorterStemmer()

sentence = "having playing in today gaming ended with greating victorious"
words = word_tokenize(sentence)
# print(words)

stemmed = [mystemmer.stem(w) for w in words]
lematized = [mylematizer.lemmatize(w) for w in words]
print(stemmed)
print(lematized)

# mycounter = CountVectorizer()
# mysentence = "i love ibsu. because ibsu is great university"
# # print(word_tokenize(mysentence))
# # print(sent_tokenize(mysentence))
# individual_words = word_tokenize(mysentence)
# stops = list(stopwords.words('english'))
# words = [w for w in individual_words if w not in stops and w.isalnum()]
# reduced = [mystemmer.stem(w) for w in words]

# new_sentence = ' '.join(words)
# frequencies = mycounter.fit_transform([new_sentence])
# print(frequencies.toarray())
# print(mycounter.vocabulary_)
# print(mycounter.get_feature_names_out())
# print(new_sentence)
# print(words)
# # print(list(stopwords.words('english')))
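
The detail that explains the second list: WordNetLemmatizer.lemmatize takes a pos argument that defaults to 'n', so every word is looked up as a noun unless you say otherwise, and verb forms such as "having" or "ended" come back untouched. A minimal sketch of that default behavior:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# With no POS hint the lemmatizer assumes 'n' (noun):
print(wnl.lemmatize("ended"))         # 'ended'   -- no noun entry, word returned as-is
print(wnl.lemmatize("ended", "v"))    # 'end'     -- verb entry found
print(wnl.lemmatize("playing"))       # 'playing' -- WordNet has the noun 'playing'
print(wnl.lemmatize("playing", "v"))  # 'play'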
1 Answer

Here is an example showing the part of speech that the lemmatizer uses for each word in your string:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

# Map the first letter of each Penn Treebank tag to a WordNet POS;
# anything unrecognized falls back to noun, the lemmatizer's default.
tag_map = defaultdict(lambda: wordnet.NOUN)
tag_map['J'] = wordnet.ADJ
tag_map['V'] = wordnet.VERB
tag_map['R'] = wordnet.ADV

sentence = "having playing in today gaming ended with greating victorious"
tokens = word_tokenize(sentence)
wnl = WordNetLemmatizer()
for token, tag in pos_tag(tokens):
    print('found tag', tag[0])
    lemma = wnl.lemmatize(token, tag_map[tag[0]])
    print(token, "lemmatized to", lemma)

Output:

found tag V
having lemmatized to have
found tag N
playing lemmatized to playing
found tag I
in lemmatized to in
found tag N
today lemmatized to today
found tag N
gaming lemmatized to gaming
found tag V
ended lemmatized to end
found tag I
with lemmatized to with
found tag V
greating lemmatized to greating
found tag J
victorious lemmatized to victorious

Lemmatization distills words down to their base form. It is similar to stemming, but it brings context to the words, linking words with similar meanings to one word. The fancy linguistics term is "morphology": how do the words in a given language relate to one another? If you look at the output above, you will see that the -ing verbs were resolved as nouns. An -ing verb can in fact serve as a noun: "I love swimming." The verb is love; the noun is swimming. That is how the tags above were interpreted.

To be honest, what you have there is not really a sentence, either. I would not say that one approach is more correct than the other, but lemmatization is generally considered the more powerful of the two when parts of speech are used correctly, in sentences made of independent clauses, or of dependent and independent clauses.
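
To see why the untagged lemmatizer leaves most of these words alone, you can ask WordNet directly. A small sketch (assuming the same NLTK setup as above) using wordnet.morphy, which returns the base form of a word for a given part of speech, or None when WordNet has no entry for that combination:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

# morphy(form, pos) -> base form if WordNet knows the word under that POS, else None
for word in ["gaming", "playing", "greating", "victorious"]:
    print(word,
          "| noun:", wordnet.morphy(word, wordnet.NOUN),
          "| verb:", wordnet.morphy(word, wordnet.VERB),
          "| adjective:", wordnet.morphy(word, wordnet.ADJ))

"greating" is not an English word, so every lookup returns None and the lemmatizer passes it through unchanged, while the Porter stemmer strips it to "great" by suffix rule alone; "victorious" exists only as an adjective, so the lemmatizer leaves it intact while the stemmer mangles it to "victori". That is exactly the trade-off the question asks about.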
