List comprehension creates the wrong number of items

Question

I am building a list of documents in which each document is a tuple made up of a list of tuples and a string, so it looks like this:

[([('NOUN', 'ADP'), ('ADP', 'NOUN'), ('NOUN', 'PROPN'), ('PROPN', 'ADJ'), ('ADJ', 'DET')], 'M'),
 ([('NOUN', 'ADP'), ('ADP', 'NOUN'), ('NOUN', 'PROPN'), ('PROPN', 'ADJ'), ('ADJ', 'DET')], 'F'), ...]

I am using nltk to generate the list:

from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams
# Raw strings keep the backslashes in the Windows path and the regex literal
corpus = PlaintextCorpusReader(r'C:\CorpusData\Polit_Speeches_by_Gender_POS', r'.*\.txt')
documents = [(list(ngrams(corpus.words(fileid), 2)), gender)
    for gender in [f[47] for f in corpus.fileids()]
    for fileid in corpus.fileids()]

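The problem is that len(corpus.fileids()) is 84 (correct), but len(documents) is 7056. So somehow I have managed to square the number of documents. I want the list to contain only 84 items.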

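I noticed that documents[0] and documents[84] are identical (and likewise documents[1] and documents[85], and so on). I could of course slice the full 7056-item list, but that would not explain anything... I am new to Python and to programming, so any help would be appreciated.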

Answer 1

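If I am reading your program correctly, you are trying to store each document's list of bigrams in a tuple together with the document's "gender" (that is, the character at index 47 of the file ID).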

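The list comprehension that builds documents iterates first over the inner list comprehension and then over corpus.fileids(). When a Python list comprehension has two for clauses, it iterates over the entire second iterable for every value of the first, so 84 genders times 84 file IDs gives 84 * 84 = 7056 items. A small example shows this: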

>>> print([(a, b) for a in [1, 2] for b in [1, 2]])
[(1, 1), (1, 2), (2, 1), (2, 2)]
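
To make the iteration order explicit, here is a minimal sketch of the same comprehension written as ordinary nested loops. The variable name pairs is only for illustration.

```python
# Nested-loop form of [(a, b) for a in [1, 2] for b in [1, 2]]
pairs = []
for a in [1, 2]:          # first `for` clause becomes the outer loop
    for b in [1, 2]:      # second `for` clause runs in full for every a
        pairs.append((a, b))

print(pairs)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

The result has 2 * 2 = 4 items, which is the same multiplication that turns 84 files into 7056 documents.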

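Instead, in this case the repeated iteration can be avoided by applying [47] directly to each file ID drawn from corpus.fileids(). That way each file name is considered only once.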

documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]
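
As a side note, if the gender labels lived in a separate list rather than inside the file name, the built-in zip would pair the two lists in lockstep instead of combining every element with every other. The sketch below assumes made-up file names and labels; neither comes from the original corpus.

```python
# Hypothetical data: file names and labels kept in two parallel lists
fileids = ['speech_01.txt', 'speech_02.txt', 'speech_03.txt']
genders = ['F', 'M', 'F']

# zip walks both lists together, yielding one pair per position
paired = list(zip(fileids, genders))

print(paired)       # [('speech_01.txt', 'F'), ('speech_02.txt', 'M'), ('speech_03.txt', 'F')]
print(len(paired))  # 3
```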

The whole program then becomes:

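from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams
# Raw strings keep the backslashes in the Windows path and the regex literal
corpus = PlaintextCorpusReader(r'C:\CorpusData\Polit_Speeches_by_Gender_POS', r'.*\.txt')
# One tuple per file: (list of bigrams, gender character taken from the file ID)
documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]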
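
The difference in length can be checked without nltk or the corpus. This is a rough sketch using three made-up file IDs, with the gender character assumed to sit at index 0 rather than index 47.

```python
# Hypothetical file IDs; the gender marker is at index 0 here for brevity
fileids = ['F_speech_01.txt', 'M_speech_02.txt', 'F_speech_03.txt']

# Original pattern: two `for` clauses give every (gender, fileid) combination
buggy = [(fileid, gender)
         for gender in [f[0] for f in fileids]
         for fileid in fileids]

# Corrected pattern: a single `for` clause, gender read from the same file ID
fixed = [(fileid, fileid[0]) for fileid in fileids]

print(len(buggy))  # 9, i.e. 3 * 3
print(len(fixed))  # 3
```

With 84 file IDs the same two patterns would give 7056 and 84 items respectively.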
