Similarity search: "Number of requested results 3 is greater than number of elements in index 0"

Problem description (votes: 0, answers: 1)

I have been facing an issue for a while now, and although I have read the ChromaDB documentation and tested different approaches, I still cannot resolve it.

When I try to run a similarity search, I get the following error:

Number of requested results 3 is greater than number of elements in index 0

Below is my script:

import os
import openai
import sys
import pypdf

# Set the OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"]  # a restart was needed after the variable was set through the terminal

os.getcwd()

# Start with LangChain
# Import and use YouTube document loader

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

from langchain.document_loaders import PyPDFLoader

#start with one and then scale

url1="https://www.youtube.com/watch?v=wXj7Hzd8dOI" #Should you Change your job? J P Explains the risks of (not) quitting your job
url2="https://www.youtube.com/shorts/BnYK848GcAA" #How to handle emotional pain
url3="https://www.youtube.com/watch?v=wXj7Hzd8dOI" #https://www.youtube.com/shorts/4qMyHwmnQHk
save_dir="docs/youtube/"



loader = GenericLoader(
    YoutubeAudioLoader([url1],save_dir),
    OpenAIWhisperParser()
)
loader2 = GenericLoader(
    YoutubeAudioLoader([url2],save_dir),
    OpenAIWhisperParser()
)
loader3 = GenericLoader(
    YoutubeAudioLoader([url3],save_dir),
    OpenAIWhisperParser()
)

videos = []

videos.extend(loader.load())
videos.extend(loader2.load())
videos.extend(loader3.load())


print(len(videos))

#document splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 30
)

splits = text_splitter.split_documents(videos)

print(len(splits))
print(splits)

#embeddings
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

#persist_directory = 'chroma/'
#!rm -rf ./docs/chroma  # remove old database files if any


vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="docs/youtube/chroma/"
)

print(vectordb._collection.count())

# Similarity search: initial checks

question = "What is the main topic of the text?"

sim1 = vectordb.similarity_search(question,k=3)

print(len(sim1))

Is the problem with the embeddings, or with the ChromaDB index?
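
One way to narrow this down is to check each side separately: confirm that the embedding call returns a vector, and confirm how many elements the persisted Chroma collection actually holds. A minimal diagnostic sketch, assuming the embedding and vectordb objects from the script above:

# Hypothetical diagnostic sketch; reuses `embedding` and `vectordb` from the script above.

# 1) Check the embedding side: a non-empty vector means OpenAIEmbeddings works.
test_vector = embedding.embed_query("What is the main topic of the text?")
print("embedding length:", len(test_vector))

# 2) Check the index side: how many elements does the persisted collection hold?
count = vectordb._collection.count()
print("elements in collection:", count)

# The error means k (here 3) exceeds the number of stored elements,
# so clamp k to the collection size before searching.
if count > 0:
    results = vectordb.similarity_search(
        "What is the main topic of the text?",
        k=min(3, count)
    )
    print(len(results))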

openai-api embedding langchain large-language-model chromadb
1 Answer

Votes: 0

The problem was caused by a miscommunication around this post: https://github.com/imartinez/privateGPT/issues/1012

Please do not comment out line 73 in C:\Users\phyln\AppData\Local\Programs\Python\Python311\Lib\site-package as suggested there.
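
Rather than editing the installed package, a safer route is to make sure the collection is actually populated and never to request more results than it holds; if a stale, empty persisted index from an earlier run is being picked up, rebuilding the persist directory also avoids the error. A minimal sketch along those lines, reusing splits and the persist directory from the question's script:

import os
import shutil
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = "docs/youtube/chroma/"

# Optionally remove a stale/empty index left over from an earlier run.
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    documents=splits,            # `splits` comes from the question's script
    embedding=embedding,
    persist_directory=persist_directory,
)

count = vectordb._collection.count()
question = "What is the main topic of the text?"
# Never request more results than the index holds.
sim1 = vectordb.similarity_search(question, k=min(3, count)) if count else []
print(len(sim1))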
