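I have created a solr index (version 9.3.0) of some poems and nursery rhymes. I am trying to search for related poems and nursery rhymes, and I would like to get the dot-product distance for each matching document. I can't find any way to retrieve that information. This is the field I added to solr's driven-schema file: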
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="384"
similarityFunction="dot_product" knnAlgorithm="hnsw"
hnswMaxConnections="16" hnswBeamWidth="50"/>
<field name="bge_small_vector" type="knn_vector" indexed="true" stored="true"/>
This is the python code I use to query the solr index:
import pysolr
from encoder import Encoder
from sentence_transformers import SentenceTransformer
import pprint
pp = pprint.PrettyPrinter(indent=4, width=100)
solr = pysolr.Solr('http://localhost:8983/solr/docindex')
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
document = '''Three blind mice. Three blind mice.
See how they run. See how they run.
They all ran after the farmer's wife,
Who cut off their tails with a carving knife.
Did you ever see such a sight in your life
As three blind mice?'''
embedding = model.encode(document, normalize_embeddings=True, convert_to_numpy=True)
solr_response = solr.search(
    q=r'{!knn f=bge_small_vector topK=10}[' + ",".join([f'{a:.12f}' for a in embedding]) + ']',
    rows=10,
    start=0,
    debugQuery="true",
    wt='json')
for item in solr_response:
    pp.pprint(item)
pp.pprint(solr_response.debug)
The only reference to a distance I can find is in the debug response, and it is not tied to any particular document:
{ 'QParser': 'KnnQParser',
'explain': {'': '\n**0.81944466 = within top 10**\n'},
'parsedquery': 'KnnVectorQuery(KnnVectorQuery:bge_small_vector[-0.02721269,...][10])',
'parsedquery_toString': 'KnnVectorQuery:bge_small_vector[-0.02721269,...][10]',
...
}
Does anyone know how to get solr to return the distance for each document in a DenseVectorField query?
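The documentation at https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/ shows how OpenSearch converts distances to scores. I just tested this for L2: score = 1 / (1 + distance), so distance = (1 / score) - 1. To get the Euclidean distance itself, you may need to take the square root of the result.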
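As a rough sketch of the conversion described above, the helpers below invert the L2 mapping in plain Python. The function names are hypothetical. The dot-product helper rests on an assumption that is not stated in this thread: that Lucene, which Solr uses underneath, scores `dot_product` fields as (1 + dot product) / 2 for normalized vectors. Please check that against your own Solr version before relying on it.

```python
import math


def l2_score_to_squared_distance(score: float) -> float:
    """Invert score = 1 / (1 + d), where d is the squared L2 distance."""
    if not 0.0 < score <= 1.0:
        raise ValueError("an L2 score is expected to lie in (0, 1]")
    return 1.0 / score - 1.0


def l2_score_to_distance(score: float) -> float:
    """Take the square root to recover the plain Euclidean distance."""
    return math.sqrt(l2_score_to_squared_distance(score))


def dot_product_score_to_similarity(score: float) -> float:
    """ASSUMPTION: score = (1 + dot) / 2, so dot = 2 * score - 1."""
    return 2.0 * score - 1.0


if __name__ == "__main__":
    # 0.81944466 is the score shown in the debug output of the question.
    print(dot_product_score_to_similarity(0.81944466))
    print(l2_score_to_distance(0.5))
```

The per-document score can usually be requested by adding `score` to the field list, for example by passing `fl='*,score'` to `solr.search(...)`. Each returned document should then carry a `score` key that can be fed into these helpers.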