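Has anyone used LangChain or LlamaIndex imports to process a single document longer than 512 tokens? Yes, I know there are other ways to handle this, but it is hard to find documentation online that explains in detail how to use LangChain with a private LLM that is accessible through API calls. Most of the documentation covers commercial LLMs. If you have any, I would appreciate strategies or sample code showing how to handle an LLM wrapper with LangChain, especially for summarization and topic extraction.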
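Here is sample code that uses LangChain to orchestrate an open-source LLM for embeddings and text-to-text generation (txt2txtGen). It does not matter whether the document has more than 512 tokens: you can use the loader.load_and_split() function to load a large document and split it into smaller chunks (for PDF documents, see https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf).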
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval_qa.base import RetrievalQA
# embeddings = HuggingFaceEmbeddings(model_name='bert-base-uncased')
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
# docsearch = FAISS.from_documents(texts, embeddings)
docsearch = FAISS.from_texts(
    ["harry potter's owl is in the castle. The book is about 'To Kill A Mocking Swan'. There is another monkey"],
    embeddings,
)
# HuggingFaceHub reads the API token from the HUGGINGFACEHUB_API_TOKEN environment variable
llm = HuggingFaceHub(
    repo_id="google/flan-t5-base",
    model_kwargs={"temperature": 0.6, "max_length": 500, "max_new_tokens": 200},
)
prompt_template = """
Compare the book given in question with others in the retriever based on genre and description.
Return a complete sentence with the full title of the book and describe the similarities between the books.
question: {question}
context: {context}
"""
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
retriever = docsearch.as_retriever()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)
print(qa.run({"query": "Which book except 'To Kill A Mocking Bird' is similar to it?"}))
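
Since the question was specifically about summarizing documents longer than 512 tokens with a private LLM, here is a rough plain-Python sketch of the chunk-then-summarize idea (split the text, summarize each chunk, then summarize the summaries). This is not LangChain's API: split_into_chunks, summarize_long_text, and call_llm are hypothetical names, call_llm stands in for whatever function sends a prompt to your private LLM endpoint, and counting whitespace-separated words is only an approximation of the model's real token count.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_tokens: int = 512, overlap: int = 32) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words approximate tokens. A real tokenizer usually yields
    more tokens than words, so pick max_tokens with some headroom.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap  # consecutive chunks share `overlap` words
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def summarize_long_text(
    text: str,
    call_llm: Callable[[str], str],  # hypothetical: sends a prompt to your private LLM
    max_tokens: int = 512,
    overlap: int = 32,
) -> str:
    """Summarize each chunk (map), then summarize the partial summaries (reduce)."""
    chunks = split_into_chunks(text, max_tokens, overlap)
    if not chunks:
        return ""
    # Map step: one LLM call per chunk
    partial = [call_llm(f"Summarize the following text:\n{chunk}") for chunk in chunks]
    if len(partial) == 1:
        return partial[0]
    # Reduce step: a single LLM call over the joined partial summaries
    combined = "\n".join(partial)
    return call_llm(f"Combine these partial summaries into one summary:\n{combined}")
```

For very long documents the joined partial summaries can themselves exceed the model's limit, in which case the reduce step would need to be applied repeatedly. To plug a private endpoint into LangChain chains like the RetrievalQA example above, one option is to subclass langchain.llms.base.LLM and implement its _call method so that it forwards the prompt to your API.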