Preprocessing task for cosine similarity

Question  Votes: -1  Answers: 1

I recently started working with NLP. As part of computing cosine similarity, I have to do the following:

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(sent_1)
sent_2 = dictionary.doc2bow(sent_2)
sent_3 = dictionary.doc2bow(sent_3)

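I have more than 10,000 different sentences (documents), so I want to write code that iterates over the documents automatically. I tried the following, but it does not work: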

sent_X = []
for i in documents:
    sent_X.append(dictionary.doc2bow(simple_preprocess(i)))

Thanks

python-3.x nlp cosine-similarity
1 Answer
0 votes

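I think your code works correctly; the problem is more likely that the output is not what you expected. Let's walk through a simple example to see how it works: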

>>> from gensim import corpora
>>> from gensim.utils import simple_preprocess

>>> documents = ["apple apple apple banana",
...              "hello hello this is a document"]
>>> dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
>>>
>>> sent_X = []
>>> for i in documents:
...     sent_X.append(dictionary.doc2bow(simple_preprocess(i)))
...
>>> sent_X
[[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]

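I suspect this result (the value of sent_X) is what confused you. Each pair is (token id, count), where the id is the token's index in dictionary. Mapping the ids back to their tokens gives a more readable result: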

>>> for doc in sent_X:
...     print([[dictionary[id_], freq] for id_, freq in doc])
...
[['apple', 3], ['banana', 1]]
[['document', 1], ['hello', 2], ['is', 1], ['this', 1]]
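
Since the end goal is cosine similarity, here is a minimal pure-Python sketch of how it could be computed directly on (token id, count) lists like the ones above. The helper name sparse_cosine and the query vector are assumptions made for illustration, not part of the original code; gensim also provides gensim.matutils.cossim for the same purpose.

```python
import math


def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two bag-of-words vectors given as (token_id, count) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only token ids present in both vectors contribute to the dot product.
    dot = sum(count * b.get(token_id, 0) for token_id, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty vector has no direction; treat its similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# The two vectors printed in the example above.
sent_X = [[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]

# Hypothetical query: one "apple" and one "banana" (ids 0 and 1 in the example).
query = [(0, 1), (1, 1)]

scores = [sparse_cosine(query, doc) for doc in sent_X]
print(scores)
```

The first score is close to 0.89 because the query shares both of its tokens with the first document, and the second is 0.0 because the two vectors have no token ids in common.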