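I am working on some POCs for an LLM-based project, and I am using a vector database for document retrieval (IR).
Recently I came across blog posts from some of the best-known vector databases that recommend hybrid search (vector search + keyword search) for better IR. It is said to help mainly with domain-specific keywords.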
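For context, one common way to combine a vector ranking and a keyword ranking is reciprocal rank fusion. The following is only a minimal sketch of that idea, not what any particular vector database does internally; the function name, the document ids, and the constant k=60 are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more score the higher it appears in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings, not taken from a real run.
vector_hits = ["zinsape", "zinsace", "zisava"]  # order from vector search
keyword_hits = ["zinsace", "zisada"]            # order from keyword search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

In this made-up case the document that appears in both lists moves to the top, which is the effect hybrid search is meant to provide when the vector ranking alone is slightly off.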
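So before implementing hybrid search I wanted to run some tests, and I was surprised to find that all those blogs appeared to be wrong, because with vector search alone I was able to match domain-specific keywords from the query.
My tests
I generated some keywords that have no meaning and do not exist as real words.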
I am using ChromaDB as the vector database, which uses hnswlib for ANN (approximate nearest neighbor) search.
Sample documents
{
"document_name": "Return Policy",
"Category": "Fashion",
"Product Name": "Zinsace",
"Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsace products are non-returnable."
},
{
"document_name": "Return Policy",
"Category": "Electronics",
"Product Name": "Zisava",
"Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisava products that have been used or show signs of damage are non-returnable."
},
{
"document_name": "Return Policy",
"Category": "Fashion",
"Product Name": "Zinsape",
"Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsape products are non-returnable."
},
{
"document_name": "Return Policy",
"Category": "Electronics",
"Product Name": "Zisada",
"Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisada products that have been used or show signs of damage are non-returnable."
}
Indexing and search script
import uuid

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

from hybrid.dummy_data import DUMMY_DATA

# Persistent local client.
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./hybrid"
))

# Embedding functions under test; swap the one passed to the collection below.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="XXXX",
    model_name="text-embedding-ada-002"
)
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name='all-mpnet-base-v2')
# st_ef_mini = embedding_functions.SentenceTransformerEmbeddingFunction()

# Only the policy text is embedded; the remaining fields are stored as metadata.
texts = [doc['Policy'] for doc in DUMMY_DATA]
metadatas = [{k: v for k, v in d.items() if k != 'Policy'} for d in DUMMY_DATA]

collection = client.get_or_create_collection(
    name="mpnet",
    metadata={'hnsw:space': 'l2'},
    embedding_function=st_ef
)

ids = [str(uuid.uuid4()) for _ in texts]
collection.add(
    documents=texts,
    metadatas=metadatas,
    ids=ids
)

res = collection.query(
    query_texts=["I want to return Zinsace"],
    n_results=10
)
print(res.get('documents'))
Output
[['Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized **Zinsace** products are non-returnable.', 'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisada** products that have been used or show signs of damage are non-returnable.', 'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisava** products that have been used or show signs of damage are non-returnable.']]
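Output analysis
I used 3 models for the embeddings.
I indexed a few documents about refund policies, with product names that are completely random (meaningless and non-existent).
When I query with
I want to return Zinsace
or I want to buy Zinsace
the first result returned by all 3 embedding models is always the correct one, as if they were doing exact keyword matching.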
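To make this check repeatable across models and queries, something like the helper below could be used. It is only a sketch: `top_hit_contains` is a hypothetical name, and it assumes the result has the same shape as the output shown earlier, that is, a `documents` key holding one inner list of documents per query.

```python
def top_hit_contains(result, keyword, query_index=0):
    """Return True if the first document returned for a query mentions keyword."""
    documents = result.get("documents") or []
    if query_index >= len(documents) or not documents[query_index]:
        return False
    return keyword.lower() in documents[query_index][0].lower()


# Hypothetical, shortened result with the same nesting as the real output.
sample = {
    "documents": [[
        "... customized or personalized Zinsace products are non-returnable.",
        "... Zisada products that have been used ... are non-returnable.",
    ]]
}
print(top_hit_contains(sample, "Zinsace"))
```

With the real script, the `res` object would be passed in place of `sample`.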
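This confuses me: how can these models produce embeddings that allow exact keyword matching, even for words they have never seen before?
If vector search can do keyword matching, why do all the vector database vendors recommend hybrid search? Did they not test it properly, or are they biased in some way?
How is vector search able to match exact keywords (even randomly generated, meaningless words)?
Because the embeddings use subword segmentation, for example WordPieces:
To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces").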
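As a rough illustration of what subword segmentation does to an unseen product name, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style. The vocabulary is a made-up toy, not the real vocabulary of any model mentioned in the question, and `wordpiece_tokenize` is a hypothetical helper rather than a library function.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"zin", "zis", "##sa", "##ce", "##pe", "##va", "##da", "##a"}

for name in ["zinsace", "zinsape", "zisava", "zisada"]:
    print(name, wordpiece_tokenize(name, TOY_VOCAB))
```

Even though none of the four names is in the vocabulary as a whole word, each one maps to a distinct sequence of known pieces, so a query and a document that share a name also share those pieces.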
Your examples are very "similar" to one another, which means there is not much noise, so your ANN query is spot on.
Another possible reason: if you generate random words that do not exist in the dictionary, those words are ignored.