How do I check for duplicate documents in a vector store before adding new ones?
Currently I'm doing something like this:
```python
from langchain.vectorstores import Chroma

vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings
)

# Fetch the texts already in the store and drop incoming duplicates
documents = vectorstore.get()['documents']
final_docs = list(filter(lambda x: x not in documents, final_docs))
vectorstore.add_documents(documents=final_docs, embedding=embeddings)
```
However, I'm wondering how this performs on large datasets.
Also, do duplicate documents actually cause problems in practice? As I understand it, they would be embedded to the same vector, so the only overhead seems to be the wasted work (i.e., latency).
You can try computing a hash of each document and adding it to a Python `set()`, which ensures no duplicate documents are added:
```python
import hashlib

# Set holding the hashes of documents already added
document_hashes = set()

def add_document(doc):
    # Compute the SHA-256 hash of the document text
    doc_hash = hashlib.sha256(doc.encode()).hexdigest()
    # Skip the document if this hash has been seen before
    if doc_hash in document_hashes:
        print("Duplicate document detected!")
        return False
    else:
        document_hashes.add(doc_hash)
        # Your code for adding the document to the vector store
        print("Document added successfully!")
        return True

# Sample documents
doc1 = "This is document 1"
doc2 = "This is document 2"
doc3 = "This is document 1"

add_document(doc1)
add_document(doc2)
add_document(doc3)
```
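Applied to your original snippet, the same idea also answers the performance question: hashing the existing documents once into a set makes each membership check O(1) on average, instead of rescanning the whole `documents` list for every incoming document. A minimal sketch, assuming documents are plain strings (`filter_new_docs` and `sha256_hash` are hypothetical helper names, and the actual `add_documents` call to the store is left out):

```python
import hashlib

def sha256_hash(text):
    """Return the hex SHA-256 digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def filter_new_docs(existing_docs, incoming_docs):
    """Drop incoming docs whose hash matches an already-stored doc.

    Building the set is O(n) once; each lookup is O(1) on average,
    unlike `x not in documents`, which rescans the list per document.
    """
    seen = {sha256_hash(d) for d in existing_docs}
    new_docs = []
    for doc in incoming_docs:
        h = sha256_hash(doc)
        if h not in seen:
            seen.add(h)  # also deduplicates within the incoming batch
            new_docs.append(doc)
    return new_docs

existing = ["This is document 1"]
incoming = ["This is document 1", "This is document 2", "This is document 2"]
print(filter_new_docs(existing, incoming))  # ['This is document 2']
```

You would pass `filter_new_docs(vectorstore.get()['documents'], final_docs)` to `add_documents` instead of the `filter(...)` expression. Note that the in-memory set does not persist across runs, so it needs to be rebuilt from the store (or stored alongside it) each time.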