In Python, I have a text query variable and a dataset structured as follows:
text = "hey how are you doing today love"
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
I am trying to use the following pipeline to compute the cosine similarity between the Dolly embeddings of the text and of the dataset:
# Import Pipeline
from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
# Create Feature Extraction Object
feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b',
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True,
                              device_map="auto")
# Define Inputs
text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
# Create Embeddings
text_embeddings = feature_extraction(text)[0]
dataset_embeddings = feature_extraction(dataset)
text_embeddings = np.array(text_embeddings)
dataset_embeddings = np.array(dataset_embeddings)
text_embeddings = normalize(text_embeddings, norm='l2')
dataset_embeddings = normalize(dataset_embeddings, norm='l2')
cosine_similarity = np.dot(text_embeddings, dataset_embeddings.T)
angular_distance = np.arccos(cosine_similarity) / np.pi
The L2 normalization fails, and if I comment it out I get the following error:
ValueError: shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)
I understand that the error is related to the shapes of text_embeddings and dataset_embeddings not being aligned. However, I am not sure what I can do to fix it.
Help!
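A few things are going on here:
1. dolly-v2-3b returns multiple embeddings for a given text input, one per token, so the number of embeddings depends on the input you provide. For example, the model returns 7 embeddings (also called vectors) for the first sentence in dataset, while the count for each of the two subsequent sentences depends on that sentence's own length.
2. Cosine similarity measures the similarity between two vectors. The code you provided tries to compare multiple vectors for one sentence against multiple vectors for another, which is not what cosine similarity is defined for. So before computing the similarity, we need to "compress" each sentence's embeddings into a single vector. The full listing below uses a technique called "vector averaging", which simply takes the mean of the vectors.
3. np.average (for the vector averaging) and normalize (from sklearn.preprocessing) need to be called separately for each sentence in dataset.
The full listing below runs without error and returns a cosine similarity of 1 for the first comparison, where the sentence is compared with itself, as expected. The np.nan angular distance for that same comparison is a floating-point artifact rather than a genuinely undefined angle: the computed similarity is 1.0000000000000007, slightly above 1, and np.arccos returns nan for inputs outside [-1, 1]. Clipping the similarity to that range before calling np.arccos would give the expected distance of 0. Hope this helps!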
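To make the shape issue concrete without downloading the model, here is a minimal sketch that uses random arrays as stand-ins for the per-token embeddings. The hidden size of 8 and the 4-token second sentence are made-up values for illustration only; the real model produces 2560-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8  # illustrative stand-in for the model's 2560-dimensional hidden size

# Fake per-token embeddings: one row per token, so the row count differs by sentence.
sentence_a = rng.normal(size=(7, hidden))  # a 7-token sentence
sentence_b = rng.normal(size=(4, hidden))  # a shorter sentence (token count assumed)

# Averaging over the token axis yields one fixed-size vector per sentence,
# regardless of how many tokens the sentence had.
vec_a = sentence_a.mean(axis=0)
vec_b = sentence_b.mean(axis=0)
print(vec_a.shape, vec_b.shape)  # (8,) (8,)

# With one vector per sentence, cosine similarity is a single number.
cos = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(cos)
```

Before averaging, the two arrays have 7 and 4 rows, so they cannot be stacked into one regular matrix or compared row for row; after averaging, both are plain vectors of the same length.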
# Installations required in Google Colab
# %pip install transformers
# %pip install torch
# %pip install accelerate
from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
# Create Feature Extraction Object
feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b',
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True,
                              device_map="auto")
# Define Inputs
text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
# Create Embeddings
text_embeddings = feature_extraction(text)
dataset_embeddings = feature_extraction(dataset)
# Perform Vector Averaging
text_embeddings_avg = np.average(text_embeddings[0], axis=1)
dataset_embeddings_avg = np.array(
    [
        np.average(text_embedding, axis=1)
        for text_embedding
        in dataset_embeddings
    ]
)
print(text_embeddings_avg.shape) # (1, 2560)
print(dataset_embeddings_avg.shape) # (3, 1, 2560)
# Perform Normalization
text_embeddings_avg_norm = normalize(text_embeddings_avg, norm='l2')
dataset_embeddings_avg_norm = np.array(
    [
        normalize(text_embedding, norm='l2')
        for text_embedding
        in dataset_embeddings_avg
    ]
)
print(text_embeddings_avg_norm.shape) # (1, 2560)
print(dataset_embeddings_avg_norm.shape) # (3, 1, 2560)
# Cosine Similarity
cosine_similarity = np.array(
    [
        np.dot(text_embeddings_avg_norm, text_embedding.T)
        for text_embedding
        in dataset_embeddings_avg_norm
    ]
)
angular_distance = np.arccos(cosine_similarity) / np.pi
print(cosine_similarity.tolist()) # [[[1.0000000000000007]], [[0.7818918337438344]], [[0.7921756683919716]]]
print(angular_distance.tolist()) # [[[nan]], [[0.21425490131377858]], [[0.2089483418862303]]]
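The nan in the first angular distance comes from the similarity landing slightly above 1 through rounding. A possible variant, sketched below, pools each sentence into a flat vector, keeps everything in a 2-D matrix, and clips the similarities before np.arccos. The helpers mean_pool and angular_distances are hypothetical names introduced here, and the random arrays only imitate the nested-list shape (1, n_tokens, hidden) that the listing above gets from the pipeline.

```python
import numpy as np


def mean_pool(pipeline_output):
    """Average token vectors; expects nested lists shaped (1, n_tokens, hidden)."""
    return np.asarray(pipeline_output, dtype=np.float64)[0].mean(axis=0)


def angular_distances(query_output, dataset_outputs):
    """Return (cosine similarities, angular distances) of one query vs. a dataset."""
    query = mean_pool(query_output)
    corpus = np.stack([mean_pool(out) for out in dataset_outputs])  # (n_sentences, hidden)

    # L2-normalize so that a dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

    # Clip to [-1, 1] so rounding error cannot push arccos outside its domain.
    sims = np.clip(corpus @ query, -1.0, 1.0)
    return sims, np.arccos(sims) / np.pi


# Synthetic stand-in for the pipeline output (token counts and hidden size assumed).
rng = np.random.default_rng(0)
fake_outputs = [rng.normal(size=(1, n, 16)).tolist() for n in (7, 4, 4)]

sims, dists = angular_distances(fake_outputs[0], fake_outputs)
print(sims.shape)             # (3,)
print(np.isnan(dists).any())  # False
```

With the real model, and assuming the output shapes printed in the listing above, the call would look like angular_distances(feature_extraction(text)[0], feature_extraction(dataset)). Because the results are flat arrays rather than nested (3, 1, 1) arrays, they are also easier to sort or index afterwards.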