I'm trying to load a pretrained AI model from Intel on Hugging Face. I used Colab and ran out of resources, switched to Kaggle with its larger resource allowance, then tried Paperspace, which shows me this error:
The kernel for Text_Generation.ipynb appears to have died. It will restart automatically.
Here is the model-loading script:
import transformers
model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
def generate_response(system_input, user_input):
    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)
# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:
1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60
Step 1: Add 100 and 520
100 + 520 = 620
Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680
So, the sum of 100, 520, and 60 is 680.
"""
My goal is to load this pretrained model. I've done some research and found some suggested solutions, but they didn't work for me, such as downloading CUDA-specific builds of the packages instead of the plain pip versions.
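For context, a back-of-the-envelope memory estimate shows why a plain `from_pretrained` load (fp32 by default) can exhaust a free-tier kernel's RAM. This is only a sketch; the ~7B parameter count is assumed from the model name, and real usage adds overhead on top of the raw weights:

```python
# Rough weight-memory estimate for a ~7B-parameter model
# (parameter count assumed from the model name, not exact).
params = 7_000_000_000

bytes_fp32 = params * 4    # default dtype when loading with from_pretrained
bytes_fp16 = params * 2    # with torch_dtype=torch.float16
bytes_awq4 = params * 0.5  # ~4-bit AWQ weights, ignoring quantization overhead

for label, b in [("fp32", bytes_fp32), ("fp16", bytes_fp16), ("4-bit AWQ", bytes_awq4)]:
    print(f"{label}: ~{b / 1024**3:.1f} GiB")
# fp32: ~26.1 GiB, fp16: ~13.0 GiB, 4-bit AWQ: ~3.3 GiB
```

A free Colab session typically offers roughly 12 GB of RAM, so the fp32 load alone is enough to get the kernel killed, while the 4-bit quantized weights fit comfortably.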
I suggest looking into model quantization, since it is one of the standard remedies for exactly this kind of problem: loading a large model for inference. TheBloke provides a quantized version of this model, available here: neural-chat-7B-v3-1-AWQ. To use it you need AutoAWQ, and according to the Hugging Face notebook, on Colab you need to install an earlier AutoAWQ release built for Colab's CUDA version. You should also append `.cuda()` to the input tensor after creating it, to make sure it is on the GPU rather than the CPU:
!pip install -q transformers accelerate
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'
### Use AutoAWQ and from quantized instead of transformers here
model = AutoAWQForCausalLM.from_quantized(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
def generate_response(system_input, user_input):
    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    ### ADD .cuda() to move the encoded prompt to the GPU
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
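The extraction step at the end of `generate_response` can be sanity-checked without loading the model at all, since it is plain string manipulation. The response string below is invented for illustration:

```python
# Mock check of the "### Assistant:" extraction step, independent of the model.
mock_response = (
    "### System:\nYou are a math expert assistant.\n"
    "### User:\ncalculate 100 + 520 + 60\n"
    "### Assistant:\nThe sum is 680."
)

# Same split logic used in generate_response: keep only the assistant's part.
assistant_part = mock_response.split("### Assistant:\n")[-1]
print(assistant_part)  # -> The sum is 680.
```

This also shows why `[-1]` is used: everything before the last `### Assistant:` marker (the echoed system and user turns) is discarded.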