I get an empty array from vector search in MongoDB using LangChain


I have this code:

loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = "vector_index"

vector_search = MongoDBAtlasVectorSearch.from_documents(
  documents=docs,
  embedding=OpenAIEmbeddings(disallowed_special=()),
  collection=atlas_collection,
  index_name=vector_search_index,
)

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)

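The result is an empty array every time, and I don't understand what I am doing wrong. Here is the link to the LangChain documentation with the example. The data is saved to the database correctly, but I cannot search it in that collection.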

python mongodb langchain py-langchain vector-search
1 Answer

I was able to get this working in MongoDB with the following code:

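# Imports assume the split LangChain packages are installed:
# langchain-community, langchain-mongodb, langchain-openai, langchain-text-splitters
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)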

loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)

DB_NAME = "langchain_db"
COLLECTION_NAME = "atlas_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
# Connection string is read from the environment
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")

client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)

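At this point, I did get the same result that you did. Before it would work, I had to create the vector search index and make sure its name matched the one specified in ATLAS_VECTOR_SEARCH_INDEX_NAME.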
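
For reference, here is a rough sketch of what creating that index from Python might look like, rather than through the Atlas UI. It is not from my original setup, so treat the details as assumptions: the field path "embedding" is the default embedding key used by LangChain's MongoDB integration, 1536 is the output size of OpenAI's text-embedding-ada-002 model, "cosine" is just one of the supported similarity options, and the helper names are made up for illustration. It also assumes a recent PyMongo release whose SearchIndexModel accepts the `type` argument.

```python
def build_vector_index_definition(path="embedding", num_dimensions=1536, similarity="cosine"):
    """Return an Atlas Vector Search index definition as a plain dict.

    Defaults are assumptions: "embedding" is LangChain's default embedding key,
    1536 matches OpenAI's text-embedding-ada-002. Adjust if your setup differs.
    """
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, index_name="vector_index"):
    """Create the index on an existing pymongo Collection (needs a live Atlas cluster)."""
    # Imported lazily so the definition helper above works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=index_name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)
```

You would call create_vector_index(MONGODB_COLLECTION, ATLAS_VECTOR_SEARCH_INDEX_NAME) once. The index takes a little while to build, so a search issued immediately afterwards can still come back empty.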

FWIW, I had an easier time doing this in Astra DB (I tried this first, as I am a DataStax employee):

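# Same imports as the MongoDB example, plus (assuming the langchain-astradb package):
from langchain_astradb import AstraDBVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)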

loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)
atlas_collection = "atlas_collection"

ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")

vector_search = AstraDBVectorStore.from_documents(
  documents=docs,
  embedding=OpenAIEmbeddings(disallowed_special=()),
  collection_name=atlas_collection,
  api_endpoint=ASTRA_DB_API_ENDPOINT,
  token=ASTRA_DB_APPLICATION_TOKEN,
)

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)

It's worth noting that Astra DB creates the vector index automatically, based on the dimensions of the embedding model.
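
Since Atlas does not do this for you, a quick check before searching can tell you whether the empty result is down to a missing or still-building index. The snippet below is only a sketch: is_index_queryable is a made-up helper name, and it assumes a PyMongo version that provides Collection.list_search_indexes and that the returned index documents carry the "name" and "queryable" fields.

```python
def is_index_queryable(collection, index_name):
    """Return True if a search index with this name exists and reports queryable.

    `collection` is expected to be a pymongo Collection on an Atlas cluster;
    anything with a compatible list_search_indexes(name) method also works.
    """
    for index in collection.list_search_indexes(index_name):
        if index.get("name") == index_name and index.get("queryable"):
            return True
    return False
```

If is_index_queryable(MONGODB_COLLECTION, ATLAS_VECTOR_SEARCH_INDEX_NAME) returns False, either the index does not exist under that name or it has not finished building yet.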
