How do I unstem a word in Python?

Question

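I want to know whether I can unstem words back to their normal form.

The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) should count toward eat, so I used stemming.

But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate the Wu-Palmer similarity between words. The problem is that nltk's synsets do not work on stemmed words, or at least they do not in this code: "check if two words are related to each other".

What should I do? Is there a way to unstem a word?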

Answer 1

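I suspect that what you really mean by "stem" is "tense": you want every tense of a word to count toward the verb's base form.

Take a look at the pattern package: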

pip install pattern

Then use the en.lemma function to return the base form of a verb.

import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"

Answer 2

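No, there isn't. With stemming you lose information, not only about the word form (as in eat vs. eats or eaten) but also about the word itself (as in tradition vs. traditional). Unless you use a prediction method that tries to recover this information from the word's context, there is no way to get it back.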

Answer 3

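In theory, the only way to unstem is to keep a dictionary of terms, or a mapping of some kind, before stemming and to carry that mapping through the rest of your computation. The mapping should capture the position of each unstemmed token. Then, when you need to unstem a token and you know the original position of the stemmed token, you can trace back through the mapping and recover the original unstemmed form.

For a Bag of Words representation this seems computationally expensive, and it somewhat defeats the purpose of the statistical nature of the BoW approach.

Still, in theory I believe it could work. I have not seen it in any implementation, though.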
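
The position-based mapping described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical stand-in for a real stemmer (such as NLTK's PorterStemmer), and `unstem_at` and `originals_for` are names invented for this example.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = "fishing fished fish cats cat".split()

# Stem each token but keep the original list, so index i links both forms.
stemmed = [toy_stem(t) for t in tokens]

# stem -> list of positions in `tokens` where that stem occurred
positions = defaultdict(list)
for i, stem in enumerate(stemmed):
    positions[stem].append(i)

def unstem_at(i):
    """Recover the original token for the stemmed token at position i."""
    return tokens[i]

def originals_for(stem):
    """List every original token that was reduced to this stem."""
    return [tokens[i] for i in positions.get(stem, [])]

print(stemmed)                 # ['fish', 'fish', 'fish', 'cat', 'cat']
print(unstem_at(1))            # fished
print(originals_for("fish"))   # ['fishing', 'fished', 'fish']
```

Note that a stem alone maps back to several originals, so the position is what makes the recovery unambiguous.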

Answer 4

I think a reasonable approach is something like what is described in https://stackoverflow.com/a/30670993/7127519.

A possible implementation could look like this:

import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()

That sets up the stemmer to use. Here is some text to work with:

complete_text = ''' cats catlike catty cat 
stemmer stemming stemmed stem 
fishing fished fisher fish 
argue argued argues arguing argus argu 
argument arguments argument '''

Create a list of the distinct words:

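my_list = []
# Python 2 byte strings need decoding first; a Python 3 str has no .decode()
try:
    aux = complete_text.decode().split()
except AttributeError:
    aux = complete_text.split()
for i in aux:
    word = i.lower()  # lowercase before the membership test to avoid duplicates
    if word not in my_list:
        my_list.append(word)
my_list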

Output:

['cats',
 'catlike',
 'catty',
 'cat',
 'stemmer',
 'stemming',
 'stemmed',
 'stem',
 'fishing',
 'fished',
 'fisher',
 'fish',
 'argue',
 'argued',
 'argues',
 'arguing',
 'argus',
 'argu',
 'argument',
 'arguments']

Now create the dictionary:

aux = pd.DataFrame(my_list, columns=['word'])
aux['word_stemmed'] = aux['word'].apply(stemmer.stem)
# Join all original words that share a stem, then map stem -> joined string
my_dict = aux.groupby('word_stemmed')['word'].agg(', '.join).to_dict()
my_dict

Which outputs:

{'argu': 'argue, argued, argues, arguing, argus, argu',
 'argument': 'argument, arguments',
 'cat': 'cats, cat',
 'catlik': 'catlike',
 'catti': 'catty',
 'fish': 'fishing, fished, fish',
 'fisher': 'fisher',
 'stem': 'stemming, stemmed, stem',
 'stemmer': 'stemmer'}

Companion notebook here.

Answer 5

tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.

You might like this open-source project, which uses stemming and includes an algorithm for inverse stemming:

https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA

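On this page of the project, there is an explanation of how to do the inverse stemming. In short, it works as follows.

First, you stem some documents. Here they are short French strings with their stop words removed, for example: ['sup chat march trottoir', 'sup chat aiment ronron', 'chat ronron', 'sup chien aboi', 'deux sup chien', 'combien chien train aboi']

Then the trick is to have kept, for each stemmed word, a count of the original words that produced it: {'aboi': {'aboie': 1, 'aboyer': 1}, 'aiment': {'aiment': 1}, 'chat': {'chat': 1, 'chats': 2}, 'chien': {'chien': 1, 'chiens': 2}, 'combien': {'Combien': 1}, 'deux': {'Deux': 1}, 'march': {'marche': 1}, 'ronron': {'ronronner': 1, 'ronrons': 1}, 'sup': {'super': 4}, 'train': {'train': 1}, 'trottoir': {'trottoir': 1}}

Finally, you can probably guess how to implement this yourself: for a given stemmed word, simply take the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project: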

https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py

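This could be improved by discarding the non-top original words (for example by using a heap), which would leave a single dict at the end instead of a dict of dicts.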
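
The counting idea in this answer can be sketched with `collections.Counter`. This is not the linked project's code, only a minimal illustration: the function names `build_counts` and `most_popular_originals` are made up here, and the `(stem, original)` pairs are illustrative data assumed to have been recorded while stemming.

```python
from collections import Counter, defaultdict

def build_counts(pairs):
    """Count how often each original word was reduced to each stem."""
    counts = defaultdict(Counter)
    for stem, original in pairs:
        counts[stem][original] += 1
    return counts

def most_popular_originals(counts):
    """Collapse the dict of counters into a flat stem -> top original dict."""
    return {stem: originals.most_common(1)[0][0]
            for stem, originals in counts.items()}

# Illustrative (stem, original) pairs, as recorded during stemming.
pairs = [
    ("chat", "chats"), ("chat", "chat"), ("chat", "chats"),
    ("chien", "chiens"), ("chien", "chien"), ("chien", "chiens"),
    ("sup", "super"), ("sup", "super"),
]

counts = build_counts(pairs)
unstem = most_popular_originals(counts)
print(unstem)  # {'chat': 'chats', 'chien': 'chiens', 'sup': 'super'}
```

The flat `unstem` dict is the "single dict" form mentioned above. When two originals tie, which one wins depends on insertion order, so a real implementation may want an explicit tie-breaking rule.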
