How can I run an NLP/Transformers LLM on a low-memory GPU?


I am trying to load a pretrained model from Intel on Hugging Face. I tried Colab and ran out of resources, tried Kaggle with the resource increase, and tried Paperspace, and each time I get this error:

The kernel for Text_Generation.ipynb appears to have died. It will restart automatically.

Here is the model-loading script:

import transformers


model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""

My goal is simply to load this pretrained model. I have done some research and found a few suggested fixes, but none of them worked for me, for example:

using CUDA instead of pip to download the packages

python nlp gpu huggingface-transformers huggingface-tokenizers
1 Answer

I suggest looking into model quantization, since it is one of the techniques aimed precisely at this kind of problem: loading a large model for inference on limited hardware. A 7B-parameter model in full fp32 precision needs roughly 28 GB for the weights alone (about 14 GB in fp16), which is more than the free Colab/Kaggle instances provide, hence the dead kernel.
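For illustration only (this is not part of the original answer): one common way to apply quantization is on-the-fly 4-bit loading with bitsandbytes, which lets you keep the original Intel/neural-chat-7b-v3-1 checkpoint. A minimal sketch, assuming bitsandbytes and accelerate are installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'Intel/neural-chat-7b-v3-1'

# Quantize the weights to 4 bits while loading; compute in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The answer below takes a different route: a model that has already been quantized ahead of time with AWQ.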

TheBloke provides a quantized version of this model, available here: neural-chat-7B-v3-1-AWQ. To use it you need AutoAWQ, and, per the linked Hugging Face notebook, on Colab you need to install an earlier AutoAWQ release built against Colab's CUDA version.

You should also append .cuda() to the input tensor after creating it, to make sure your model runs on the GPU rather than the CPU:

!pip install -q transformers accelerate
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'

### Use AutoAWQ and from_quantized instead of transformers here
model = AutoAWQForCausalLM.from_quantized(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    ### ADD .cuda() so the input tensor lives on the GPU
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
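For completeness, a minimal usage sketch, calling the function with the same prompt as in the question:

# Example usage (same prompt as in the question)
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems."
user_input = "calculate 100 + 520 + 60"
print(generate_response(system_input, user_input))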
    