When trying to train a doc2vec model on the SQuAD 2.0 dataset:
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])
I get this error:
Python, TypeError: unhashable type: 'list'
I tried converting the list to a tuple like this, but it did not help:
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
tuples = tuple([x for x in tqdm(train_tagged.values)])
model_dbow.build_vocab(tuples)
Relevant part of the code:
import nltk
from nltk.corpus import stopwords
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_clean, test_size=0.2, random_state=42)
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
train_tagged = df_clean.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['Context']), tags=[[r.Question], [r.Answer]]), axis=1)
test_tagged = df_clean.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['Context']), tags=[[r.Question], [r.Answer]]), axis=1)
import multiprocessing
cores = multiprocessing.cpu_count()
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
tuples = tuple([x for x in tqdm(train_tagged.values)])
model_dbow.build_vocab(tuples)
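The canonical way to build a vocabulary (a lexical reference) is to use the word, plus any other information (perhaps the part of speech, to distinguish usages), as the key. The vocabulary is usually stored in some O(1) lookup table, such as a dict. For any of these, the key must be hashable, which in practice means "immutable".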
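As a minimal sketch of that hashability rule (the key values here are made up for illustration, not taken from the question's data), a tuple works as a dict key while a list raises the same kind of TypeError:

```python
# A dict key must be hashable. Tuples of hashable items qualify; lists do not.
vocab = {}

# A tuple key is accepted.
vocab[("set", "NOUN")] = 1

# A list key is rejected with a TypeError.
try:
    vocab[["set", "VERB"]] = 1
    message = ""
except TypeError as err:
    message = str(err)

print(message)
```

The exact wording of the message varies a little between Python versions, but it names the unhashable type.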
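You need to examine the values you are sending to build_vocab. Since you did not provide the expected MRE (minimal reproducible example), I can only surmise that, even though your top-level sequence is a tuple, its elements are some non-atomic, unhashable type, such as a simple list: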
(
    ["set", speech.VERB_TRANS],
    ["set", speech.NOUN],
    ["set", speech.ADJ],
    ...
)
Print out tuples to see its contents: I suspect the problem is the content, not the outer form.
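One way to do that inspection is sketched below. It is only a sketch under assumptions: TaggedDoc is a hypothetical stand-in for gensim's TaggedDocument (a named tuple with words and tags fields), unhashable_tags is a helper name I made up, and the sample strings are invented. In the question's code each tag is itself a list ([r.Question]), so a check like this would flag every document.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument: same two fields, no gensim needed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def unhashable_tags(docs):
    """Return (doc_index, tag) pairs for every tag that cannot be hashed."""
    bad = []
    for i, doc in enumerate(docs):
        for tag in doc.tags:
            try:
                hash(tag)
            except TypeError:
                bad.append((i, tag))
    return bad


# Tags built as in the question: each tag is a one-element list.
nested = [TaggedDoc(words=["some", "context"], tags=[["a question"], ["an answer"]])]

# Flattened variant: each tag is a plain string.
flat = [TaggedDoc(words=["some", "context"], tags=["a question", "an answer"])]

bad_nested = unhashable_tags(nested)
bad_flat = unhashable_tags(flat)
print(bad_nested)
print(bad_flat)
```

If the check does flag the tags, passing a flat list such as tags=[r.Question, r.Answer] would probably avoid the error, assuming those columns hold plain strings.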