When applying word2vec, only characters are shown instead of words?

Question

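Here is my code. As you can see, I tokenize my sentences into words, but I still have a problem when I apply the word2vec model to them. I am working with Arabic text, on Anaconda version 4.7.12.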

sentences = nltk.sent_tokenize(str(sentences1))
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stopwords.words('arabic')]

sentences = re.sub(r'[^\w\s]','',(str(sentences)))
sentences = re.sub("\d+", "", sentences)
sentences =sentences.strip()
sentences = nltk.word_tokenize(sentences)
from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count=1)
words1 = model.wv.vocab

In words1, the vocab shows only single letters.

Answer 1

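Your tokenization and preprocessing look somewhat confused. After word-tokenizing at the top, you turn the result back into one giant string on this line: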

sentences = re.sub(r'[^\w\s]','',(str(sentences)))

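Running another word tokenization on that string leaves you with a flat list in which every item is a single word. Each supposed "sentence" is therefore just one word (a string), so what the model sees per "sentence" is a list of characters, not a list of words.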
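The effect described above can be reproduced without any library. The sketch below is only an illustration: `units_seen_by_model` is a hypothetical helper, not part of gensim, and it assumes the trainer treats every item of every sentence as one token, which is consistent with the symptom in the question.

```python
def units_seen_by_model(corpus):
    """Mimic a trainer that iterates over sentences, then over the items of each sentence."""
    return [token for sentence in corpus for token in sentence]

# A flat list of words, which is what the question's code ends up passing in.
flat_tokens = ["ذهب", "الولد"]

# One sentence given as a list of words, which is the expected shape.
nested_tokens = [["ذهب", "الولد"]]

# Iterating over a str yields its characters, so the flat list produces letters.
print(units_seen_by_model(flat_tokens))

# Iterating over a list of words yields the words themselves.
print(units_seen_by_model(nested_tokens))
```

The first call prints individual characters and the second prints whole words. The only difference between the two inputs is the extra level of nesting.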

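I would suggest doing any string-based cleanup on the strings before you word-tokenize, and making word tokenization the very last step. Also, before passing sentences to Word2Vec, confirm that it has the required format. For example, if you were to run: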

print(next(iter(sentences)))

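You should see the first item in sentences, and it should be a list of words, not just a string. (If it is just a string, you will get exactly the symptom you are seeing: Word2Vec learns only single characters.)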
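A minimal sketch of that ordering follows: clean each sentence while it is still a string, and tokenize last. Several things here are assumptions made for illustration. `clean_and_tokenize` and `train_word2vec` are hypothetical names, the regex sentence split and `str.split()` stand in for `nltk.sent_tokenize` and `nltk.word_tokenize`, and the one-word stop list stands in for `stopwords.words('arabic')`.

```python
import re

def clean_and_tokenize(text, stop_words=frozenset()):
    """Return a list of sentences, each one a list of word tokens."""
    tokenized = []
    # Crude sentence split on ., !, ? and the Arabic question mark.
    for raw_sentence in re.split(r"[.!?؟]+", text):
        # String-based cleanup happens while the sentence is still a string.
        cleaned = re.sub(r"[^\w\s]", "", raw_sentence)  # drop punctuation
        cleaned = re.sub(r"\d+", "", cleaned)            # drop digits
        # Tokenization is the last step; the result stays a list of words.
        words = [w for w in cleaned.split() if w not in stop_words]
        if words:
            tokenized.append(words)
    return tokenized

def train_word2vec(sentences):
    # Assumes gensim is installed; imported here so the cleanup above runs without it.
    from gensim.models import Word2Vec
    return Word2Vec(sentences, min_count=1)

text = "ذهب الولد الى المدرسة. قرأ الولد كتابا جديدا 123!"
sentences = clean_and_tokenize(text, stop_words={"الى"})

# The check suggested above: the first item should be a list of words.
print(next(iter(sentences)))
```

Calling `train_word2vec(sentences)` on this output should then build a vocabulary of words rather than letters. Note that `model.wv.vocab`, as used in the question, belongs to gensim 3.x. In gensim 4.x the vocabulary is exposed as `model.wv.key_to_index`.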
