Preprocessing task for cosine similarity

Question

I recently started working with NLP. As part of a cosine similarity calculation, I need to do the following:

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(sent_1)
sent_2 = dictionary.doc2bow(sent_2)
sent_3 = dictionary.doc2bow(sent_3)

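I have more than 10,000 different sentences (documents), so I would like to write code that iterates over the documents automatically. I tried the following, but it does not work: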

sent_X = []
for i in documents:
    sent_X.append(dictionary.doc2bow(simple_preprocess(i)))

Thanks.

Answer 1

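I think your code works fine. The problem is more likely that the output is not what you expected. Let's walk through a simple example to see how it works: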

>>> from gensim import corpora
>>> from gensim.utils import simple_preprocess

>>> documents = ["apple apple apple banana",
...              "hello hello this is a document"]
>>> dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
>>>
>>> sent_X = []
>>> for i in documents:
...     sent_X.append(dictionary.doc2bow(simple_preprocess(i)))
>>> sent_X
[[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]

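I think this result (the output of sent_X) is what confused you. Each pair is (token id, count), not the word itself. Mapping the ids back to words through the dictionary gives a clearer view: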

>>> for doc in sent_X:
...     print([[dictionary[id_], freq] for id_, freq in doc])
[['apple', 3], ['banana', 1]]
[['document', 1], ['hello', 2], ['is', 1], ['this', 1]]
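
Since the end goal is cosine similarity, here is a minimal sketch of how two of these bag-of-words vectors could be compared. It uses only the standard library and assumes each vector is a list of (token id, count) pairs like the ones doc2bow returned above. The function name cosine_similarity is my own and is not part of gensim; for 10,000+ documents gensim's own similarity utilities would likely be a better fit.

```python
import math


def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two sparse bag-of-words vectors.

    Each vector is a list of (token_id, count) pairs, the same shape
    as the doc2bow output shown above. (Hypothetical helper, not a
    gensim function.)
    """
    a = dict(bow_a)
    b = dict(bow_b)
    # Only token ids present in both vectors contribute to the dot product.
    dot = sum(count * b[token_id] for token_id, count in a.items() if token_id in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    # An empty vector has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two vectors from the example above share no words.
sent_X = [[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]
print(cosine_similarity(sent_X[0], sent_X[1]))  # 0.0
```

Two documents with no tokens in common score 0.0, and a document compared with itself scores approximately 1.0 (up to floating-point rounding).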
