I have this code:
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = "vector_index"
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=atlas_collection,
    index_name=vector_search_index,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
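Every time, the result is just an empty array. I don't understand what I'm doing wrong. I am following the example in the LangChain documentation. The documents are saved to the database correctly, but I can't search them in that collection.
I was able to get it working in MongoDB with the following code: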
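# Imports assume the split LangChain packages (langchain-community,
# langchain-mongodb, langchain-openai, langchain-text-splitters) plus pymongo;
# older LangChain releases expose the same classes under different module paths.
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)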
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)
DB_NAME = "langchain_db"
COLLECTION_NAME = "atlas_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
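At this point, I did get the same empty result as you. Before it would work, I had to create the vector search index in Atlas and make sure its name matched the value of ATLAS_VECTOR_SEARCH_INDEX_NAME.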
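For reference, here is a minimal sketch of creating that index from code instead of the Atlas UI. It rests on several assumptions that are not in the code above: PyMongo 4.7 or newer (for the type argument of SearchIndexModel), vectors stored in the "embedding" field (the default embedding key of MongoDBAtlasVectorSearch), and 1536 dimensions (what the default OpenAIEmbeddings model, text-embedding-ada-002, produces). The helper names are made up for illustration. The index builds asynchronously, so the sketch also waits until Atlas reports it as queryable, since searching before that point can also come back empty.

```python
import time


def build_vector_index_definition(num_dimensions=1536, path="embedding", similarity="cosine"):
    """Return an Atlas Vector Search index definition for a single vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,  # document field holding the embedding (assumed default)
                "numDimensions": num_dimensions,  # must match the embedding model
                "similarity": similarity,
            }
        ]
    }


def is_queryable(indexes, index_name):
    """True if an index with this name is listed and reported as queryable."""
    return any(ix.get("name") == index_name and ix.get("queryable") for ix in indexes)


def ensure_vector_index(collection, index_name, timeout_s=120):
    """Create the index if it is missing, then wait until it is queryable."""
    # Imported lazily so the two helpers above work without pymongo installed.
    from pymongo.operations import SearchIndexModel

    if not list(collection.list_search_indexes(index_name)):
        model = SearchIndexModel(
            definition=build_vector_index_definition(),
            name=index_name,
            type="vectorSearch",
        )
        collection.create_search_index(model)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_queryable(collection.list_search_indexes(index_name), index_name):
            return True
        time.sleep(5)  # index builds can take a while on a fresh collection
    return False
```

With the variables from the code above, this would be called as ensure_vector_index(MONGODB_COLLECTION, ATLAS_VECTOR_SEARCH_INDEX_NAME) before running similarity_search.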
FWIW, I found this easier to do in Astra DB (I tried that first, since I'm a DataStax employee):
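# Imports assume the langchain-astradb, langchain-community, langchain-openai
# and langchain-text-splitters packages; module paths differ in older releases.
import os
from langchain_astradb import AstraDBVectorStore
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)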
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)
atlas_collection = "atlas_collection"
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
vector_search = AstraDBVectorStore.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection_name=atlas_collection,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
It's worth noting that Astra DB automatically creates the vector index based on the dimensions of the embedding model.
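In Atlas, by contrast, the dimension has to be written into the index definition by hand, so it can be useful to check what the embedding model actually returns. The following is only a sketch: the function names are hypothetical, and it assumes nothing more than an embeddings object with the standard LangChain embed_query method. Note that with OpenAIEmbeddings the probe makes one real API call.

```python
def embedding_dimension(embeddings, probe_text="dimension probe"):
    """Embed one short string and return the length of the resulting vector."""
    return len(embeddings.embed_query(probe_text))


def check_index_dimension(embeddings, index_dimensions):
    """Raise if the model's vector size differs from the index's numDimensions."""
    actual = embedding_dimension(embeddings)
    if actual != index_dimensions:
        raise ValueError(
            f"embedding model returns {actual} dimensions, "
            f"but the index expects {index_dimensions}"
        )
    return actual
```

For example, check_index_dimension(OpenAIEmbeddings(disallowed_special=()), 1536) would confirm that the model matches an index defined with numDimensions set to 1536.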