How can I run an NLP/Transformers LLM on a low-memory GPU?


I am trying to load a pretrained model from Intel on Hugging Face. I tried Colab and ran out of resources, tried Kaggle with the resource increase, and tried Paperspace, and each time I get this error:

The kernel for Text_Generation.ipynb appears to have died. It will restart automatically.

Here is the model-loading script:

import transformers


model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""

My goal is simply to load this pretrained model. I have done some research and found a few suggested fixes, but none of them worked for me, for example:

using CUDA instead of pip to download the packages

python nlp gpu huggingface-transformers huggingface-tokenizers
1 Answer

I suggest looking into model quantization, since it is one of the techniques aimed precisely at this kind of problem: loading a large model for inference on limited hardware. A 7B-parameter model in full fp32 precision needs roughly 28 GB for the weights alone (about 14 GB in fp16), which is more than the free Colab/Kaggle instances provide, hence the dead kernel.
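For illustration only (this is not part of the original answer): one common way to apply quantization is on-the-fly 4-bit loading with bitsandbytes, which lets you keep the original Intel/neural-chat-7b-v3-1 checkpoint. A minimal sketch, assuming bitsandbytes and accelerate are installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'Intel/neural-chat-7b-v3-1'

# Quantize the weights to 4 bits while loading; compute in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The answer below takes a different route: a model that has already been quantized ahead of time with AWQ.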

TheBloke provides a quantized version of this model, available here: neural-chat-7B-v3-1-AWQ. To use it you need AutoAWQ, and, per the linked Hugging Face notebook, on Colab you need to install an earlier AutoAWQ release built against Colab's CUDA version.

You should also append .cuda() to the input tensor after creating it, to make sure your model runs on the GPU rather than the CPU:

!pip install -q transformers accelerate
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'

### Use AutoAWQ and from_quantized instead of transformers here
model = AutoAWQForCausalLM.from_quantized(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    ### ADD .cuda() so the input tensor lives on the GPU
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
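For completeness, a minimal usage sketch, calling the function with the same prompt as in the question:

# Example usage (same prompt as in the question)
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems."
user_input = "calculate 100 + 520 + 60"
print(generate_response(system_input, user_input))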
    