Summarization and topic extraction with flan-t5-small via a private LLM and LangChain or LlamaIndex


Has anyone used LangChain or LlamaIndex to process a single document longer than 512 tokens? Yes, I know there are other ways to handle this, but it is hard to find documentation online that explains in detail how to use LangChain with a private LLM accessible through an API call; most documentation covers commercial LLMs. I would appreciate any strategies or sample code explaining how to work with the LangChain LLM wrappers, specifically for summarization and topic extraction.

python langchain topic-modeling summarization
1 Answer

Here is sample code that uses LangChain to orchestrate an open-source LLM for embeddings and text-to-text generation. It does not matter whether the document has more than 512 tokens: you can load a large document and split it into smaller chunks with the loader.load_and_split() function (for PDF documents, see https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf).
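Independent of LangChain, the chunking idea behind load_and_split() can be sketched in plain Python. This is only an illustration: the 512-token budget is approximated by word count, and the chunk and overlap sizes are arbitrary choices; in practice you would count tokens with your private LLM's own tokenizer.

```python
def split_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping word-based chunks that fit a ~512-token budget.

    Word count is only a rough proxy for tokenizer output; replace with a
    real token count for your model when precision matters.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by `overlap` words for context
    return chunks

long_doc = ("word " * 1000).strip()   # a 1000-word stand-in document
chunks = split_text(long_doc)
print(len(chunks))                     # 3 chunks of at most 400 words each
```

Each chunk can then be embedded and indexed on its own, so no single LLM call ever exceeds the context window.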

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval_qa.base import RetrievalQA

# Embedding model used to index the documents
# embeddings = HuggingFaceEmbeddings(model_name='bert-base-uncased')
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Build a FAISS vector store from raw texts
# (use FAISS.from_documents(texts, embeddings) for loader output)
docsearch = FAISS.from_texts(
    ["harry potter's owl is in the castle. The book is about 'To Kill A Mocking Swan'. There is another monkey"],
    embeddings)

# Open-source LLM served through the Hugging Face Hub inference API
llm = HuggingFaceHub(repo_id="google/flan-t5-base",
                     model_kwargs={"temperature": 0.6,
                                   "max_length": 500,
                                   "max_new_tokens": 200})

prompt_template = """
Compare the book given in question with others in the retriever based on genre and description.
Return a complete sentence with the full title of the book and describe the similarities between the books.

question: {question}
context: {context}
"""

prompt = PromptTemplate(template=prompt_template,
                        input_variables=["context", "question"])
retriever = docsearch.as_retriever()
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 chain_type_kwargs={"prompt": prompt})
print(qa.run("Which book except 'To Kill A Mocking Bird' is similar to it?"))
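For the summarization part of the question, LangChain's load_summarize_chain with chain_type="map_reduce" applies the same chunking idea: summarize each chunk, then summarize the combined summaries. The underlying pattern can be sketched in plain Python; call_private_llm below is a hypothetical stand-in for an HTTP call to your API-served model, included only so the sketch runs.

```python
def call_private_llm(prompt):
    """Hypothetical stand-in for a request to a private, API-served LLM.

    It just echoes the first few words so the sketch is runnable; replace it
    with an actual HTTP call (e.g. requests.post) to your model's endpoint.
    """
    return " ".join(prompt.split()[:8])

def map_reduce_summarize(chunks):
    # Map step: summarize each chunk independently,
    # so every call stays within the 512-token window
    partial = [call_private_llm("Summarize: " + c) for c in chunks]
    # Reduce step: summarize the concatenated partial summaries
    return call_private_llm("Combine these summaries: " + " ".join(partial))

chunks = ["First part of a long report about quarterly sales figures.",
          "Second part covering regional growth and staffing changes."]
print(map_reduce_summarize(chunks))
```

Topic extraction fits the same loop: change the map prompt to something like "List the main topics of this passage" and collect the results instead of reducing them.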