Clarification on words in the WordNet corpus

Question (votes: 0, answers: 1)

I want to get the length of words in the WordNet corpus.

Code:

from nltk.corpus import wordnet as wn

len_wn = len([word.lower() for word in wn.words()])
print(len_wn)

I get 147306 as the output.

My questions:

  • Am I getting the total length of words in WordNet?
  • Do tokens such as zoom_in count as a word?
nlp nltk wordnet nltk-book
1 Answer (0 votes)

Am I getting the total length of words in WordNet?

It depends on the definition of "word". The wn.words() function iterates over all lemma names:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701
https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191

def words(self, lang="eng"):
    """return lemmas of the given language as list of words"""
    return self.all_lemma_names(lang=lang)


def all_lemma_names(self, pos=None, lang="eng"):
    """Return all lemma names for all synsets for the given
    part of speech tag and language or languages. If pos is
    not specified, all synsets for all parts of speech will
    be used."""

    if lang == "eng":
        if pos is None:
            return iter(self._lemma_pos_offset_map)
        else:
            return (
                lemma
                for lemma in self._lemma_pos_offset_map
                if pos in self._lemma_pos_offset_map[lemma]
            )
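
As a rough illustration of what the filter in all_lemma_names does, the sketch below uses a tiny hand-made dictionary in place of the reader's _lemma_pos_offset_map. The entries and offsets are invented placeholders, not real WordNet data, and toy_map and lemma_names_for are hypothetical names used only here.

```python
# Toy stand-in for _lemma_pos_offset_map: lemma name -> {pos: [synset offsets]}.
# The offsets are made-up placeholders, not real WordNet offsets.
toy_map = {
    "new_york": {"n": [1, 2, 3]},
    "zoom_in": {"v": [4]},
    "run": {"n": [5], "v": [6]},
}


def lemma_names_for(lemma_map, pos=None):
    """Yield lemma names, optionally restricted to one part of speech."""
    if pos is None:
        return iter(lemma_map)
    return (lemma for lemma in lemma_map if pos in lemma_map[lemma])


all_names = list(lemma_names_for(toy_map))
verb_names = list(lemma_names_for(toy_map, pos="v"))
print(all_names)   # every key of the mapping, each exactly once
print(verb_names)  # only names that have at least one verb entry
```

Because the names are the keys of a mapping, each lemma name is produced once, no matter how many synsets it points to.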

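So if a "word" is defined as any possible lemma name, then yes: the 147306 printed by the code in the question is the number of lemma names in WordNet. If "total length" instead means the total number of characters across those lemma names, sum the individual lengths: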

>>> sum(len(lemma_name) for lemma_name in wn.words())
1692291
>>> sum(len(lemma_name.lower()) for lemma_name in wn.words())
1692291

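There is no need to lowercase, because the lemma names returned by wn.words() should already be lowercased, even for named entities, e.g.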

>>> 'new_york' in wn.words()
True
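
To make the difference between counting lemma names and summing their character lengths concrete, here is a minimal sketch. count_and_total_length is a hypothetical helper, and the sample list is hand-picked rather than taken from the corpus.

```python
def count_and_total_length(names):
    """Return (number of names, total number of characters across all names)."""
    names = list(names)  # materialise once so iterators can be measured twice
    return len(names), sum(len(name) for name in names)


sample = ["new_york", "new_york_city", "zoom_in"]  # hand-picked sample, not the full corpus
n_names, n_chars = count_and_total_length(sample)
print(n_names, n_chars)

# With NLTK and the WordNet data installed, the same helper could be applied as:
# from nltk.corpus import wordnet as wn
# n_names, n_chars = count_and_total_length(wn.words())
```

The first number corresponds to what len([...]) reports in the question, the second to the sum(len(...)) expression in this answer.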

Note, however, that the same meaning can appear under very similar lemma names:

>>> 'new_york' in wn.words()
True
>>> 'new_york_city' in wn.words()
True

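This is because of how WordNet is structured. The NLTK API organizes "meanings" as synsets; each synset links to multiple lemmas, and each lemma comes with at least one name: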

>>> wn.synset('new_york.n.1')
Synset('new_york.n.01')

>>> wn.synset('new_york.n.1').lemmas()
[Lemma('new_york.n.01.New_York'), Lemma('new_york.n.01.New_York_City'), Lemma('new_york.n.01.Greater_New_York')]

>>> wn.synset('new_york.n.1').lemma_names()
['New_York', 'New_York_City', 'Greater_New_York']

But each "word" you query can have multiple synsets (i.e. multiple meanings), e.g.

>>> wn.synsets('new_york')
[Synset('new_york.n.01'), Synset('new_york.n.02'), Synset('new_york.n.03')]

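Do tokens such as zoom_in count as a word?

Again, it depends on the definition of "word". As in the example above, if you iterate over wn.words(), you are iterating over lemma names, and the new_york example shows that multi-word expressions do occur in the list of lemma names of each synset.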
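
If multi-word expressions should be counted separately, one possible approach is to partition the lemma names on the underscore. This is only a sketch: split_by_multiword is a hypothetical helper, it assumes (as in the new_york example) that the parts of a multi-word expression are joined by underscores, and "dog" is an illustrative single-word entry added to the sample.

```python
def split_by_multiword(names):
    """Partition lemma names into single-word names and multi-word expressions.

    Assumes the words inside a multi-word expression are joined by underscores.
    """
    single, multi = [], []
    for name in names:
        (multi if "_" in name else single).append(name)
    return single, multi


# Three names from the new_york.n.01 example plus one illustrative single word.
sample = ["New_York", "New_York_City", "Greater_New_York", "dog"]
single, multi = split_by_multiword(sample)
print(len(single), len(multi))
```

Passing wn.words() instead of the sample list would give the split for the whole corpus, provided NLTK and the WordNet data are installed.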
