How can I treat derived words that have the same meaning as the same token?

Question

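I would like to count unrelated words in an article, but I am having trouble grouping words that are derived from one another and have the same meaning.

For instance, I would like gasoline and gas to be treated as the same token in sentences like The price of gasoline has risen. and "Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol". Therefore, if these two sentences made up the entire article, the count for gas (or gasoline) would be 3 (petrol would not be counted).

I have tried NLTK's stemmers and lemmatizers, but to no avail. Most of them reduce gas to gas and gasoline to gasolin, which does not help for my purposes at all. I understand that this is the usual behaviour. I checked out a thread that seems somewhat similar, but the answers there do not fully apply to my case, because I require the words to be derived from one another.

How can I treat derived words that have the same meaning as the same token?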

Answer 1

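I propose a two-step approach:

First, find synonyms by comparing word embeddings (non-stop-words only). This should filter out words that are spelled similarly but mean something else, such as gasoline and gaseous.

Then, check whether the synonyms share part of their stem. Essentially, if "gas" is in "gasolin" or vice versa. This is sufficient because you only compare words that were already identified as synonyms.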

import itertools

import spacy
from nltk.stem.porter import PorterStemmer

# similarity above which two words are treated as synonym candidates
threshold = 0.6

stemmer = PorterStemmer()

# compare the stems of the synonyms: True if one stem contains the other
def compare_stems(a, b):
    stem_a, stem_b = stemmer.stem(a), stemmer.stem(b)
    return stem_a in stem_b or stem_b in stem_a

candidate_synonyms = {}

# add a candidate pair to the candidate dictionary of sets
def add_to_synonym_dict(a, b):
    if a not in candidate_synonyms:
        if b not in candidate_synonyms:
            candidate_synonyms[a] = {a, b}
            return
        a, b = b, a  # b is already a key, so add a to b's set
    candidate_synonyms[a].add(b)

# requires the large English model: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'

words = nlp(text)

for a, b in itertools.combinations(words, 2):
    # skip the pair if either token is a stop word or punctuation
    if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
        continue
    if a.similarity(b) > threshold:
        if compare_stems(a.text.lower(), b.text.lower()):
            add_to_synonym_dict(a.text.lower(), b.text.lower())

print(candidate_synonyms)
# output: {'gasoline': {'gas', 'gasoline'}}

You can then count your synonym candidates based on how often they appear in the text.
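As a rough sketch of that counting step (it is not part of the code above), the snippet below folds every member of a synonym group into one token and counts occurrences with `collections.Counter`. The regex tokenizer and the helper names `build_canonical_map` and `count_tokens` are assumptions made for illustration; in practice you would probably reuse the spaCy tokens and filter stop words first.

```python
import re
from collections import Counter

def build_canonical_map(synonym_groups):
    """Map every group member to its group key, e.g. 'gas' -> 'gasoline'."""
    canonical = {}
    for key, members in synonym_groups.items():
        for member in members:
            canonical[member] = key
    return canonical

def count_tokens(text, synonym_groups):
    """Count lowercase word tokens, folding synonyms into a single token."""
    canonical = build_canonical_map(synonym_groups)
    # simplistic tokenizer (assumption): runs of ASCII letters only
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(canonical.get(word, word) for word in words)

# the dictionary printed by the code above
candidate_synonyms = {'gasoline': {'gas', 'gasoline'}}
text = ('The price of gasoline has risen. "Gas" is a colloquial form of the '
        'word gasoline in North American English. Conversely in BE the term '
        'would be petrol. A gaseous state has nothing to do with oil.')

counts = count_tokens(text, candidate_synonyms)
print(counts['gasoline'])  # gasoline (2) + gas (1) = 3
print(counts['gaseous'])   # 1, counted separately
```

Here gas is counted under gasoline, while gaseous and petrol keep their own counts, which matches what the question asks for.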

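Note: I picked the synonym threshold of 0.6 arbitrarily. You will probably want to test which threshold suits your task. Also, my code is just a quick and dirty example; it could be cleaned up considerably.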
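One possible way to test thresholds is to sweep a few values over word pairs you have judged by hand and see which pairs survive. The sketch below uses made-up toy vectors in place of real embeddings (with spaCy you would call `a.similarity(b)` instead), and `cosine_similarity` and `pairs_above` are hypothetical helper names, so treat it as an illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration (not real embeddings)
toy_vectors = {
    'gasoline': [1.0, 0.0, 0.0],
    'gas':      [0.8, 0.6, 0.0],
    'gaseous':  [0.0, 0.6, 0.8],
}

def pairs_above(threshold, vectors):
    """Return the word pairs whose similarity exceeds the threshold."""
    words = sorted(vectors)
    kept = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) > threshold:
                kept.append((a, b))
    return kept

# Try several thresholds and inspect which pairs remain
for threshold in (0.3, 0.6, 0.9):
    print(threshold, pairs_above(threshold, toy_vectors))
```

With these toy vectors, a low threshold also lets the gas/gaseous pair through, 0.6 keeps only gas/gasoline, and 0.9 keeps nothing. Real embeddings will behave differently, so the useful part is the sweep itself, not these numbers.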
