How do I check for duplicate documents in a vector store before adding new ones?
Currently I'm doing something like this:
```python
from langchain.vectorstores import Chroma

vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings
)

# Fetch the texts already in the store and drop incoming duplicates
documents = vectorstore.get()['documents']
final_docs = list(filter(lambda x: x not in documents, final_docs))
vectorstore.add_documents(documents=final_docs, embedding=embeddings)
```
However, I'm wondering how this performs on large datasets.
Also, do duplicate documents actually cause problems in practice? As I understand it, they would be embedded to the same vector, so the only overhead seems to be the wasted work (i.e., latency).
You can try computing a hash of each document and adding it to a Python `set()`, which ensures no duplicate documents are added:
```python
import hashlib

# Set holding the hashes of documents already added
document_hashes = set()

def add_document(doc):
    # Compute the SHA-256 hash of the document text
    doc_hash = hashlib.sha256(doc.encode()).hexdigest()
    # Skip the document if this hash has been seen before
    if doc_hash in document_hashes:
        print("Duplicate document detected!")
        return False
    else:
        document_hashes.add(doc_hash)
        # Your code for adding the document to the vector store
        print("Document added successfully!")
        return True

# Sample documents
doc1 = "This is document 1"
doc2 = "This is document 2"
doc3 = "This is document 1"

add_document(doc1)
add_document(doc2)
add_document(doc3)
```
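Applied to your original snippet, the same idea also answers the performance question: hashing the existing documents once into a set makes each membership check O(1) on average, instead of rescanning the whole `documents` list for every incoming document. A minimal sketch, assuming documents are plain strings (`filter_new_docs` and `sha256_hash` are hypothetical helper names, and the actual `add_documents` call to the store is left out):

```python
import hashlib

def sha256_hash(text):
    """Return the hex SHA-256 digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def filter_new_docs(existing_docs, incoming_docs):
    """Drop incoming docs whose hash matches an already-stored doc.

    Building the set is O(n) once; each lookup is O(1) on average,
    unlike `x not in documents`, which rescans the list per document.
    """
    seen = {sha256_hash(d) for d in existing_docs}
    new_docs = []
    for doc in incoming_docs:
        h = sha256_hash(doc)
        if h not in seen:
            seen.add(h)  # also deduplicates within the incoming batch
            new_docs.append(doc)
    return new_docs

existing = ["This is document 1"]
incoming = ["This is document 1", "This is document 2", "This is document 2"]
print(filter_new_docs(existing, incoming))  # ['This is document 2']
```

You would pass `filter_new_docs(vectorstore.get()['documents'], final_docs)` to `add_documents` instead of the `filter(...)` expression. Note that the in-memory set does not persist across runs, so it needs to be rebuilt from the store (or stored alongside it) each time.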