I am using a Doc2Vec model to compute cosine similarities between observations in a dataset of website text. I want to make sure my measurements are "roughly" consistent if I instead use Fasttext (trained on my data) or Longformer (pretrained) [I know they won't be identical]. However, the pairwise cosine similarity measure is strongly negatively correlated between Doc2Vec and Longformer, and between Doc2Vec and Fasttext. The measure is positively correlated between Longformer and Fasttext. Is this something one could reasonably expect? Or have I done something in my code that could be causing this?
import numpy as np
import pandas as pd
import nltk
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import remove_stopwords
import fasttext
from numpy import dot
from numpy.linalg import norm
from transformers import LongformerTokenizer, LongformerModel

# PREPARE DATA
website_df = pd.read_csv(data_path + 'cleaned_docdf_may2023.csv')
website_df[['documents_cleaned', 'website']] = website_df[['documents_cleaned', 'website']].astype(str)
website_df['documents_cleaned'] = website_df['documents_cleaned'].str.lower()
website_df['documents_cleaned'] = website_df['documents_cleaned'].str.strip()
#######################
# Train Doc2vec model
#######################
# Clean data for model input (trim long docs, lower case, tokenize):
counter = 0
all_docs = []
all_docs_simple = []
for train_doc in website_df.documents_cleaned:
    # trim very long docs
    doc = train_doc[:150000] if len(train_doc) > 150000 else train_doc
    # clean using simple_preprocess for Fasttext model input (use the trimmed doc)
    simple_pre = gensim.utils.simple_preprocess(doc)
    doc = remove_stopwords(doc)
    doc_tokens = nltk.word_tokenize(doc.lower())
    all_docs.append(doc_tokens)
    all_docs_simple.append(simple_pre)
    if (counter % 100) == 0:
        print("{0} .. len: {1}".format(counter, len(doc)))
    counter += 1
# Creating all tagged documents
documents_websites = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_docs)]
documents_simplepre_websites = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_docs_simple)]
print("\t. Run model")
doc2vec_model_websites = Doc2Vec(documents=documents_websites,
                                 vector_size=700,
                                 window=7,
                                 min_count=3)
print("\t. Done")
doc2vec_model_websites.save(data_path + "doc2vec_websites.model")
# Grab document level vectors
vectors_d2v_websites = doc2vec_model_websites.dv.get_normed_vectors()
#########################
# FASTTEXT MODEL
#########################
# create and save Fasttext input
# write one whitespace-joined document per line
# (fasttext's train_unsupervised expects the path of a plain-text file,
#  not a pandas Series of TaggedDocument objects)
with open(data_path + 'sentences_websites', 'w') as f:
    for tokens in all_docs_simple:
        f.write(' '.join(tokens) + '\n')
# Skipgram model (use comparable model parameters to doc2vec model):
ft_model_sg_websites = fasttext.train_unsupervised(input=data_path + 'sentences_websites', model='skipgram', ws=7, epoch=10, minCount=3)
ft_model_sg_websites.save_model(data_path + "ft_websites_sg.bin")
# cbow model (use comparable model parameters to doc2vec model) :
ft_model_cbow_websites = fasttext.train_unsupervised(input=data_path + 'sentences_websites', model='cbow', ws=7, epoch=10, minCount=3)
ft_model_cbow_websites.save_model(data_path + "ft_websites_cbow.bin")
def generateVector(sentence):
    # fasttext's get_sentence_vector rejects strings containing newlines
    return ft.get_sentence_vector(sentence.replace('\n', ' '))

ft = ft_model_sg_websites
website_df['embeddings_sg'] = website_df['documents_cleaned'].apply(generateVector)
embeddings_sg_website = website_df['embeddings_sg']
ft = ft_model_cbow_websites
website_df['embeddings_cbow'] = website_df['documents_cleaned'].apply(generateVector)
embeddings_cbow_website = website_df['embeddings_cbow']
#########################
# LONGFORMER
#########################
model_name = 'allenai/longformer-base-4096'
tokenizer = LongformerTokenizer.from_pretrained(model_name)
model = LongformerModel.from_pretrained(model_name)
def get_longformer_embeddings(text):
    encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
    output = model(**encoded_input, output_hidden_states=True)
    embeddings = output.last_hidden_state
    avg_emb = embeddings.mean(dim=1)
    return avg_emb.cpu().detach().numpy()

def get_cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))
# subset data for speed during pilot
website_subset = website_df[:100].copy()  # .copy() avoids SettingWithCopy warnings
website_subset['embeddings'] = website_subset['documents_cleaned'].apply(get_longformer_embeddings)
#########################
# EVALUATE CONSISTENCY BETWEEN MODELS
#########################
# create dataframe of random pairwise combinations
rand = np.random.randint(0, 100, size=(500, 2))  # indices into the 100-doc subset
df = pd.DataFrame(rand, columns=['rand1', 'rand2'])
df['sim_lf'] = 0.0
df['sim_dv'] = 0.0
df['sim_ft_sg'] = 0.0
df['sim_ft_cbow'] = 0.0
for ind in df.index:
    a_loc = df['rand1'][ind]
    b_loc = df['rand2'][ind]
    a_vec_dv = vectors_d2v_websites[a_loc]
    b_vec_dv = vectors_d2v_websites[b_loc]
    a_vec_ft_sg = embeddings_sg_website[a_loc]
    b_vec_ft_sg = embeddings_sg_website[b_loc]
    a_vec_ft_cbow = embeddings_cbow_website[a_loc]
    b_vec_ft_cbow = embeddings_cbow_website[b_loc]
    # flatten the (1, 768) Longformer outputs to 1-D before taking the dot product
    a_vec_lf = website_subset['embeddings'][a_loc].flatten()
    b_vec_lf = website_subset['embeddings'][b_loc].flatten()
    cos_sim_lf = get_cosine_sim(a_vec_lf, b_vec_lf)
    cos_sim_dv = get_cosine_sim(a_vec_dv, b_vec_dv)
    cos_sim_ft_sg = get_cosine_sim(a_vec_ft_sg, b_vec_ft_sg)
    cos_sim_ft_cbow = get_cosine_sim(a_vec_ft_cbow, b_vec_ft_cbow)
    # use .loc to avoid chained-assignment warnings
    df.loc[ind, 'sim_lf'] = cos_sim_lf
    df.loc[ind, 'sim_dv'] = cos_sim_dv
    df.loc[ind, 'sim_ft_sg'] = cos_sim_ft_sg
    df.loc[ind, 'sim_ft_cbow'] = cos_sim_ft_cbow
print('MY WEBSITE DATA SIMILARITY')
print('corr(Longformer, Fasttext (skipgram)) = ',df['sim_lf'].corr(df['sim_ft_sg']))
print('corr(Longformer, Fasttext (cbow)) = ',df['sim_lf'].corr(df['sim_ft_cbow']))
print('corr(Longformer, d2v) = ',df['sim_lf'].corr(df['sim_dv']))
print('corr(Fasttext (skipgram), d2v) = ',df['sim_ft_sg'].corr(df['sim_dv']))
print('corr(Fasttext (cbow), d2v) = ',df['sim_ft_cbow'].corr(df['sim_dv']))
Training a model with a full 700 dimensions requires a lot of data. Your Fasttext models, by not specifying a dimensionality, accept the default of 100 dimensions. And it appears you are using a pretrained Longformer model that someone else trained on a large corpus.

So my first guess would be that you may not have enough data to train a Doc2Vec model of that size. It would also make more sense to try a 100-dimensional Doc2Vec model, for comparability with the 100-dimensional Fasttext models trained on the same amount of text. (Depending on how much text you have, you may want to go larger or smaller.)
Finally: separate from any cross-correlations between the different techniques, I'd also want to know which technique gives doc-to-doc similarity results that look best in ad hoc or formal evaluations. Knowing which performs best — even if it correlates with none of the others! — seems more important than the similarity between the different approaches. (And if, perhaps, you have several approaches that serve your true end goal equally well but are uncorrelated with each other, an ensemble of them might offer a big lift.)
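One lightweight way to run the kind of ad hoc evaluation described above is to pick a probe document, rank all others by cosine similarity under each model, and eyeball whether the top hits make sense. A minimal sketch, using random stand-in matrices where the question would use `vectors_d2v_websites`, `embeddings_sg_website`, etc.:

```python
import numpy as np

def top_k_similar(vectors, probe_idx, k=3):
    """Rank all other rows by cosine similarity to the probe row."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v[probe_idx]                      # cosine sims to the probe
    order = np.argsort(-sims)                    # most similar first
    return [i for i in order if i != probe_idx][:k]

rng = np.random.default_rng(0)
fake_d2v = rng.normal(size=(100, 100))  # stand-in for Doc2Vec vectors
fake_ft = rng.normal(size=(100, 100))   # stand-in for Fasttext vectors

# For each model, which docs does it consider closest to doc 0?
print('d2v neighbours:', top_k_similar(fake_d2v, 0))
print('ft  neighbours:', top_k_similar(fake_ft, 0))
```

If one model's neighbour lists consistently look right for your documents and the others' don't, that model is the one to trust, regardless of how the pairwise-similarity columns correlate.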