Converting a list to a tuple when building a vocabulary with doc2vec


When I try to train a doc2vec model on the SQuAD 2.0 dataset:

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])

I get this error:

TypeError: unhashable type: 'list'

I tried converting the list to a tuple like this, but it did not help:

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
tuples = tuple([x for x in tqdm(train_tagged.values)])
model_dbow.build_vocab(tuples)

Part of the code:

import nltk
from nltk.corpus import stopwords
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_clean, test_size=0.2, random_state=42)


def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

train_tagged = df_clean.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['Context']), tags=[[r.Question], [r.Answer]]), axis=1)
test_tagged = df_clean.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['Context']), tags=[[r.Question], [r.Answer]]), axis=1)

import multiprocessing
cores = multiprocessing.cpu_count()
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
tuples = tuple([x for x in tqdm(train_tagged.values)])
model_dbow.build_vocab(tuples)

Tags: python nltk text-mining text-classification

1 answer

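The canonical way to build a lexicon (a vocabulary reference) is to key it on the word plus any other distinguishing information, such as the part of speech to tell different usages apart. The lexicon usually goes into some O(1) lookup table, such as a dict. For any such table the key must be hashable, which for Python's built-in types translates nicely to "immutable".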
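To illustrate the hashability requirement, here is a minimal sketch using a plain dict. The lexicon and the part-of-speech strings are made up for the example and are not taken from gensim. It also shows why wrapping a list in a tuple is not enough: a tuple is only hashable if everything inside it is.

```python
# Hypothetical lexicon keyed on (word, part of speech); the labels are made up.
lexicon = {}

# A tuple of strings is immutable and hashable, so it works as a dict key.
lexicon[("set", "VERB")] = 0
lexicon[("set", "NOUN")] = 1

# A list is mutable and unhashable, so using it as a key raises TypeError.
try:
    lexicon[["set", "ADJ"]] = 2
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)

# An outer tuple does not help if one of its elements is still a list.
try:
    hash((["set", "ADJ"],))
    nested_error = None
except TypeError as exc:
    nested_error = str(exc)

print(list_key_error)
print(nested_error)
```

Both messages mention the unhashable list type, and the lexicon still holds only the two tuple-keyed entries.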

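You need to check the values you are passing to build_vocab. Since you did not provide the expected MRE (minimal reproducible example), I can only speculate: even if your top-level sequence is a tuple, its elements may still be some non-atomic, unhashable type, such as a plain list: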

(
    ["set", speech.VERB_TRANS],
    ["set", speech.NOUN],
    ["set", speech.ADJ],
    ...
)

Print tuples to see what it contains: I suspect the problem is the contents, not the outer container.
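One way to run that check is sketched below, using only the standard library. TaggedDoc is a stand-in with the same two fields (words, tags) as gensim's TaggedDocument, and unhashable_tags is a hypothetical helper written for this example; the sample rows are made up. In the question's code the tags are built as [[r.Question], [r.Answer]], a list of lists, so each tag is itself a list. Assuming gensim uses each tag as a dictionary key, that would explain the error, and a flat list such as [r.Question, r.Answer] would avoid it.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument, used here
# only so the sketch runs without gensim installed (assumption).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def unhashable_tags(docs):
    """Return (doc_index, tag) pairs for every tag that cannot be hashed."""
    bad = []
    for i, doc in enumerate(docs):
        for tag in doc.tags:
            try:
                hash(tag)
            except TypeError:
                bad.append((i, tag))
    return bad


# Made-up rows shaped like the question's data.
nested = (TaggedDoc(["some", "context"], [["a question"], ["an answer"]]),)
flat = (TaggedDoc(["some", "context"], ["a question", "an answer"]),)

print(unhashable_tags(nested))  # every tag is a list, so both are reported
print(unhashable_tags(flat))    # nothing reported: string tags are hashable
```

Running the same helper over the real tuples value from the question should show whether its tags are nested lists.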
