How do I use a Hugging Face biomedical model to get text embeddings?

Question · Votes: 0 · Answers: 1

I have biomedical text, and I am trying to get embeddings for it using a biomedical transformer:

my_text = ["Chocolate has a history of human consumption tracing back to 400 AD and is rich in polyphenols such as catechins, anthocyanidins, and pro anthocyanidins. As chocolate and cocoa product consumption, along with interest in them as functional foods, increases worldwide, there is a need to systematically and critically appraise the available clinical evidence on their health effects. A systematic search was conducted on electronic databases such as MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials (CENTRAL) using a search strategy and keywords. Among the many health effects assessed on several outcomes (including skin, cardiovascular, anthropometric, cognitive, and quality of life), we found that compared to controls, chocolate or cocoa product consumption significantly improved lipid profiles (triglycerides), while the effects of chocolate on all other outcome parameters were not significantly different. In conclusion, low-to-moderate-quality evidence with short duration of research (majority 4-6 weeks) showed no significant difference between the effects of chocolate and control groups on parameters related to skin, blood pressure, lipid profile, cognitive function, anthropometry, blood glucose, and quality of life regardless of form, dose, and duration among healthy individuals. It was generally well accepted by study subjects, with gastrointestinal disturbances and unpalatability being the most reported concerns."]

I found that I can get text embeddings fairly easily with sentence-transformers (I assume I can then average the sentence embeddings over all sentences of a document). I found this SO answer that uses that framework and seems to work with any (unless I am wrong) biomedical model (e.g., this):

import pandas as pd
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
document_embeddings = sbert_model.encode(pd.Series(['hello', 'cell type', 'protein']))
document_embeddings
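For reference, the "averaging" step described above is just a per-dimension mean over the per-sentence vectors. A minimal numpy sketch of that step, with placeholder vectors (not real model outputs):

```python
import numpy as np

# Hypothetical per-sentence embeddings; in practice these would come from
# something like sbert_model.encode(list_of_sentences)
sentence_embeddings = np.array([
    [0.2, 0.4, 0.6],
    [0.0, 0.8, 0.4],
])

# Document embedding = element-wise mean of its sentence embeddings
document_embedding = sentence_embeddings.mean(axis=0)
print(document_embedding)  # [0.1 0.6 0.5]
```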

But when I run the code, I get:

No sentence-transformers model found with name /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

If I understand correctly, this means some of the model's weights were either unused or randomly initialized, which means I cannot trust the resulting embeddings.

What is the correct way to do this if, say, I want to use that PubMedBERT model, or another model like BioBERT?

machine-learning pytorch word-embedding huggingface language-model
1 Answer

0 votes
# Load the base encoder directly. Use AutoModel rather than
# AutoModelForMaskedLM: the masked-LM head returns vocabulary logits,
# not the last_hidden_state we need for embeddings.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
model = AutoModel.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")

Load the model on your system, then use the following code to convert the text into embeddings.

import torch

# Assuming my_text is your input text
my_text = ["Chocolate has a history of human consumption tracing back to 400 AD and is rich in polyphenols such as catechins, anthocyanidins, and pro anthocyanidins. As chocolate and cocoa product consumption, along with interest in them as functional foods, increases worldwide, there is a need to systematically and critically appraise the available clinical evidence on their health effects. A systematic search was conducted on electronic databases such as MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials (CENTRAL) using a search strategy and keywords. Among the many health effects assessed on several outcomes (including skin, cardiovascular, anthropometric, cognitive, and quality of life), we found that compared to controls, chocolate or cocoa product consumption significantly improved lipid profiles (triglycerides), while the effects of chocolate on all other outcome parameters were not significantly different. In conclusion, low-to-moderate-quality evidence with short duration of research (majority 4-6 weeks) showed no significant difference between the effects of chocolate and control groups on parameters related to skin, blood pressure, lipid profile, cognitive function, anthropometry, blood glucose, and quality of life regardless of form, dose, and duration among healthy individuals. It was generally well accepted by study subjects, with gastrointestinal disturbances and unpalatability being the most reported concerns."]

# Tokenize the text
encoded_input = tokenizer(my_text, padding=True, truncation=True, return_tensors='pt')

# Pass the tokenized input through the model (no gradients needed for inference)
with torch.no_grad():
    model_output = model(**encoded_input)

# Extract the per-token embeddings: shape (batch, tokens, hidden_size)
embeddings = model_output.last_hidden_state

# Mean-pool over real tokens (using the attention mask) to get one vector per text
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# If you want to convert the embeddings to a numpy array
embeddings_np = pooled.numpy()

# Now, embeddings_np contains the embeddings for your text
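Note that `last_hidden_state` holds one vector per token. A common way to reduce it to a single vector per text is attention-mask-weighted mean pooling (the same MEAN pooling the sentence-transformers warning mentions), which averages only over real tokens and ignores padding. A self-contained numpy sketch of the arithmetic, with toy values:

```python
import numpy as np

# Toy token embeddings: batch of 1, 4 token positions, hidden size 3
token_embeddings = np.array([[[1.0, 2.0, 3.0],
                              [3.0, 2.0, 1.0],
                              [5.0, 5.0, 5.0],    # padding position
                              [7.0, 7.0, 7.0]]])  # padding position
attention_mask = np.array([[1, 1, 0, 0]])  # 1 = real token, 0 = padding

# Weighted mean: zero out padding, then divide by the count of real tokens
mask = attention_mask[..., None]                # (1, 4, 1)
summed = (token_embeddings * mask).sum(axis=1)  # (1, 3)
counts = mask.sum(axis=1)                       # (1, 1)
mean_pooled = summed / counts
print(mean_pooled)  # [[2. 2. 2.]] -- padding rows did not contribute
```

The same formula, written with torch tensors, is what produces one embedding per input text in the answer's code.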