How can I use Haystack to identify the top-k sentences that best match a user query, and then return the documents containing those sentences?


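I have a set of 1000 documents (plain text) and a user query. I want to use the Python library Haystack with Faiss to retrieve the top-k documents most relevant to the user query. Specifically, I want the system to identify the top-k sentences that best match the user query and then return the documents that contain those sentences. How can I do this?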

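The following code identifies the top-k documents that best match the user query. How can I change it so that it identifies the top-k sentences closest to the user query and returns the documents containing those sentences?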

# Note: Most of the code is from https://haystack.deepset.ai/tutorials/07_rag_generator

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

import pandas as pd
from haystack.utils import fetch_archive_from_http

# Download sample
doc_dir = "data/tutorial7/"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/small_generator_dataset.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Create dataframe with columns "title" and "text"
#df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",")
df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",",nrows=10)
# Minimal cleaning
df.fillna(value="", inplace=True)

print(df.head())

from haystack import Document

# Use data to initialize Document objects
titles = list(df["title"].values)
texts = list(df["text"].values)
documents = []
for title, text in zip(titles, texts):
    documents.append(Document(content=text, meta={"name": title or ""}))

from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)

from haystack.nodes import RAGenerator, DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)

# Delete existing documents in documents store
document_store.delete_documents()

# Write documents to document store
document_store.write_documents(documents)

# Add documents embeddings to index
document_store.update_embeddings(retriever=retriever)

from haystack.pipelines import GenerativeQAPipeline
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])

from haystack.utils import print_answers

QUESTIONS = [
    "who got the first nobel prize in physics",
    "when is the next deadpool movie being released",
]

for question in QUESTIONS:
    res = pipeline.run(query=question, params={"Retriever": {"top_k": 5}})
    print(res)
    #print_answers(res, details="all")

To run the code:

conda create -y --name haystacktest python==3.9
conda activate haystacktest
pip install --upgrade pip
pip install farm-haystack
conda install pytorch -c pytorch
pip install sentence_transformers
pip install farm-haystack[colab,faiss]==1.17.2

For example, I wonder whether there is a way to modify the Faiss indexing strategy.

python indexing information-retrieval faiss haystack
1 Answer

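As Stefano Fiorucci (anakin87) suggested, metadata can be added to the documents indexed in the vector database. One can therefore index each sentence as its own entry in the vector database and use the metadata to link each sentence back to its original document.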
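
The idea can be sketched in plain Python, independent of Haystack. The code below is a rough illustration, not the answerer's implementation. It splits each document into sentences, attaches the parent document's id as metadata, ranks the sentences against the query, and maps the top-k sentences back to their parent documents. The names `split_sentences`, `build_sentence_records`, `toy_score`, `top_k_parent_documents`, and the `parent_id` metadata key are all hypothetical. `toy_score` is a simple word-overlap stand-in for the embedding similarity that the dense retriever would compute. In Haystack, each record would presumably become a `Document(content=sentence, meta={...})` written to the document store, and the ranking step would be handled by the retriever.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_sentence_records(documents):
    """Create one record per sentence; the metadata points back to the parent document."""
    records = []
    for doc_id, doc in enumerate(documents):
        for sent_idx, sentence in enumerate(split_sentences(doc["text"])):
            records.append(
                {
                    "content": sentence,
                    "meta": {
                        "parent_id": doc_id,  # hypothetical key linking to the source document
                        "name": doc["title"],
                        "sentence_idx": sent_idx,
                    },
                }
            )
    return records


def toy_score(query, sentence):
    """Stand-in for embedding similarity: fraction of query words found in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def top_k_parent_documents(query, records, documents, k):
    """Rank sentences, keep the best k, and return their parent documents without duplicates."""
    ranked = sorted(records, key=lambda r: toy_score(query, r["content"]), reverse=True)
    parents = {}  # dicts keep insertion order, so the best-ranked parent comes first
    for record in ranked[:k]:
        parent_id = record["meta"]["parent_id"]
        parents.setdefault(parent_id, documents[parent_id])
    return list(parents.values())


# Toy corpus for illustration only
documents = [
    {"title": "Nobel Prize", "text": "The Nobel Prize in Physics was first awarded in 1901. Wilhelm Roentgen got the first prize."},
    {"title": "Deadpool", "text": "Deadpool is a superhero movie. A sequel was released in 2018."},
    {"title": "Cooking", "text": "Pasta should be boiled in salted water. Drain it before serving."},
]

records = build_sentence_records(documents)
hits = top_k_parent_documents("who got the first nobel prize in physics", records, documents, k=2)
print([doc["title"] for doc in hits])
```

Because several of the top-k sentences can belong to the same document, this approach may return fewer than k distinct documents. If exactly k documents are needed, one option is to retrieve more than k sentences and stop once k distinct parents have been collected.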
