How can I combine my own dataset with the "wiki_dpr" dataset provided by Hugging Face?


I am trying to add documents of my own to the "wiki_dpr" dataset provided at "https://huggingface.co/datasets/wiki_dpr", but I cannot find a way to reproduce the embedding values stored in that dataset.

I am currently trying to merge wiki_dpr with my own dataset, but I don't know how to compute embeddings that match the ones in wiki_dpr.

As an experiment, I embedded the text of the wiki_dpr passage with id="7", but the result differs substantially from the embedding stored in wiki_dpr.

I ran the code below.

!pip install datasets evaluate transformers[sentencepiece]
!apt install libomp-dev
!pip install faiss-cpu
!pip install -U sentence-transformers
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# data
add_dpr_index = "7"
add_dpr_title = "Aaron"
add_dpr_text = "in literature dating to the Babylonian captivity and later. The books of Judges, Samuel and Kings mention priests and Levites, but do not mention the Aaronides in particular. The Book of Ezekiel, which devotes much attention to priestly matters, calls the priestly upper class the Zadokites after one of King David's priests. It does reflect a two-tier priesthood with the Levites in subordinate position. A two-tier hierarchy of Aaronides and Levites appears in Ezra, Nehemiah and Chronicles. As a result, many historians think that Aaronide families did not control the priesthood in pre-exilic Israel. What is clear is that high"

tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
input_ids = tokenizer(add_dpr_text, return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output
print(embeddings)

I expected the code to output [0.12092622369527817, 0.4741949737071991, -0.30444947385787964, ...], which is the embedding stored in wiki_dpr for id="7", but the values are completely different.

Where can I get the specific embedding model that was used to build wiki_dpr?

python deep-learning nlp tensor wiki