Querying a local GPT4All model with LangChain over many .txt files - KeyError: 'input_variables'

Question · Votes: 0 · Answers: 2

python 3.8, Windows 10, neo4j==5.14.1, langchain==0.0.336

I am trying to use a local LangChain model (GPT4All) to help me convert a loaded corpus of `.txt` files into a `neo4j` data structure via queries. Below is a minimal reproducible example, along with references to the article/repository I am trying to emulate. I also provide a "context" that should be included in the query, along with all of the `Document` objects. I am still learning how to use LangChain, so I don't really know what I'm doing, but the current traceback I get looks like this:

Traceback (most recent call last):
  File ".\neo4jmain.py", line xx, in <module>
    prompt_template = PromptTemplate(
  File "C:\Users\chalu\AppData\Local\Programs\Python\Python38\lib\site-packages\langchain\load\serializable.py", line 97, in __init__
    super().__init__(**kwargs)
  File "pydantic\main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic\main.py", line 1102, in pydantic.main.validate_model
  File "C:\Users\chalu\AppData\Local\Programs\Python\Python38\lib\site-packages\langchain\schema\prompt_template.py", line 76, in validate_variable_names
    if "stop" in values["input_variables"]:
KeyError: 'input_variables'

As you can see, I don't actually define `input_variables` anywhere, so I assume this is LangChain's default behavior, but again, I'm not sure. I also get this error:

LLaMA ERROR: The prompt is 5161 tokens and the context window is 2048!
ERROR: The prompt size exceeds the context window size and cannot be processed.

...which is apparently the result of the query string itself being too large. I want to be able to query my documents for answers while giving the model the documents to reference. How can I do that? The LangChain documentation is not great for a newbie in this area; it is all over the place and lacks many simple beginner use cases, so I am asking here.

# https://medium.com/neo4j/enhanced-qa-integrating-unstructured-and-graph-knowledge-using-neo4j-and-langchain-6abf6fc24c27
# https://github.com/sauravjoshi23/ai/blob/main/retrieval%20augmented%20generation/integrated-qa-neo4j-langchain.ipynb

# Script to convert a corpus of many text files into a neo4j graph

# Imports
import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms.gpt4all import GPT4All
from langchain.prompts import PromptTemplate
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bert_len(text):
    """Return the length of a text in BERT tokens."""
    tokens = tokenizer.encode(text)
    return len(tokens)

def get_files(path: str) -> list:
    """Return a list of all files in a directory, recursively."""
    files = []
    for file in os.listdir(path):
        file_path = os.path.join(path, file)
        if os.path.isdir(file_path):
            files.extend(get_files(file_path))
        else:
            files.append(file_path)
    return files

# Get the text files
all_txt_files = get_files('data')
raw_txt_files = []
for current_file in all_txt_files:
    raw_txt_files.extend(TextLoader(current_file, encoding='utf-8').load())

# Create a text splitter object that will help us split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1024, # 200,
    chunk_overlap = 128, # 20
    length_function = bert_len,
    separators=['\n\n', '\n', ' ', ''],
)

# Split the text into "documents"
documents = text_splitter.create_documents([raw_txt_files[0].page_content])

# Utilizing these Document objects, we want to query the GPT4All model to help us create
# a JSON object that contains the ontology of terms mentioned in the given context,
# while mitigating "max_tokens" error.
# Create a PromptTemplate object that will help us create the prompt for GPT4All(?)
prompt_template = PromptTemplate(
    template = """
    You are a network graph maker who extracts terms and their relations from a given context.
    You are provided with a context chunk (delimited by ```). Your task is to extract the ontology
    of terms mentioned in the given context. These terms should represent the key concepts as per the context.
    
    Thought 1: While traversing through each sentence, Think about the key terms mentioned in it.
        Terms may include object, entity, location, organization, person,
        condition, acronym, documents, service, concept, etc.
        Terms should be as atomistic as possible
    
    Thought 2: Think about how these terms can have one on one relation with other terms.
        Terms that are mentioned in the same sentence or the same paragraph are typically related to each other.
        Terms can be related to many other terms
        
    Thought 3: Find out the relation between each such related pair of terms.

    Format your output as a list of json. Each element of the list contains
    a pair of terms and the relation between them, like the following:
    [Dict("node_1": "A concept from extracted ontology",
            "node_2": "A related concept from extracted ontology",
            "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences",
        ),
    Dict("node_1": "A concept from extracted ontology",
        "node_2": "A related concept from extracted ontology",
        "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences",
    ),
    Dict(...)]
    Context Documents: {documents}
    """,
    variables = {
        "documents": documents,
    }
)

# Create a GPT4All object that will help us query the GPT4All model
llm = GPT4All(
    model=r"C:\Users\chalu\AppData\Local\nomic.ai\GPT4All\gpt4all-falcon-q4_0.gguf",
    n_threads=3,
    max_tokens=5162, # <-- attempt to mitigate "max_tokens" error
    verbose=True,
)

# Get the response from GPT-4-All
response = llm(prompt_template)
print(response)
python neo4j langchain

2 Answers

2 votes

Regarding the `KeyError: 'input_variables'` error: as documented, the `PromptTemplate` parameter for input variables is named `input_variables`. You therefore need to change the `variables` parameter name to `input_variables`.

And yes, the error message could be worded better.


0 votes

Make sure your text is formatted correctly. I solved this problem by using more curly braces.

prompt = """If the other prompt contains: "dont do this {task}" then do {other task}"""  # bad
prompt = """If the other prompt contains: "dont do this {{task}}" then do {other task}"""  # probably fine
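LangChain's default templates use Python `str.format`-style placeholders, so the escaping rule can be demonstrated with plain Python: doubled braces render as a literal brace, while single braces are treated as template variables:

```python
# {{task}} renders as the literal text "{task}"; {other_task} is a real placeholder.
template = 'If the other prompt contains: "dont do this {{task}}" then do {other_task}'
print(template.format(other_task="something else"))
# With single braces around task, format() would raise KeyError: 'task',
# which is why unescaped literal braces in a prompt break template parsing.
```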
