How can I use a Hugging Face biomedical model to get text embeddings?

my_text = ["Chocolate has a history of human consumption tracing back to 400 AD and is rich in polyphenols such as catechins, anthocyanidins, and pro anthocyanidins. As chocolate and cocoa product consumption, along with interest in them as functional foods, increases worldwide, there is a need to systematically and critically appraise the available clinical evidence on their health effects. A systematic search was conducted on electronic databases such as MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials (CENTRAL) using a search strategy and keywords. Among the many health effects assessed on several outcomes (including skin, cardiovascular, anthropometric, cognitive, and quality of life), we found that compared to controls, chocolate or cocoa product consumption significantly improved lipid profiles (triglycerides), while the effects of chocolate on all other outcome parameters were not significantly different. In conclusion, low-to-moderate-quality evidence with short duration of research (majority 4-6 weeks) showed no significant difference between the effects of chocolate and control groups on parameters related to skin, blood pressure, lipid profile, cognitive function, anthropometry, blood glucose, and quality of life regardless of form, dose, and duration among healthy individuals. It was generally well accepted by study subjects, with gastrointestinal disturbances and unpalatability being the most reported concerns."]


import pandas as pd
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
# encode() is documented for a string or a list of strings, so convert the Series first
document_embeddings = sbert_model.encode(pd.Series(['hello', 'cell type', 'protein']).tolist())


No sentence-transformers model found with name /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



How can I get text embeddings with this model, or with another model such as BioBERT?

# Load the model directly. AutoModel returns the bare encoder, whose output
# has last_hidden_state; AutoModelForMaskedLM would return vocabulary logits instead.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
model = AutoModel.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")

# Assuming my_text is your input text (a string or a list of strings)
# Tokenize the text
encoded_input = tokenizer(my_text, padding=True, truncation=True, return_tensors='pt')

# Pass the tokenized input through the model without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Extract the token-level embeddings, shape (batch_size, sequence_length, hidden_size)
embeddings = model_output.last_hidden_state

# If you want to convert the embeddings to a NumPy array
embeddings_np = embeddings.detach().numpy()

# Now, embeddings_np contains the embeddings for your text
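
Note that embeddings_np holds one vector per token, not one per text. If a single vector per text is wanted, one common option (and what the "MEAN pooling" in the sentence-transformers log message refers to) is to average the token vectors while ignoring padded positions. A minimal sketch, assuming NumPy inputs; `mean_pool` and the toy arrays are hypothetical names used for illustration and are not part of either library:

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of each text, skipping padded positions.

    token_embeddings: (batch_size, sequence_length, hidden_size)
    attention_mask:   (batch_size, sequence_length), 1 for real tokens, 0 for padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy stand-ins for embeddings_np and the tokenizer's attention mask
toy_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],
])
toy_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled.shape)  # (2, 2): one vector per text
```

With the real outputs this would be called as mean_pool(embeddings_np, encoded_input['attention_mask'].numpy()), giving one row per input text. Taking the first ([CLS]) token vector instead is another common choice; which works better depends on the downstream task.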
