I am running Whisper inference with Hugging Face Transformers, using load_in_8bit quantization provided by bitsandbytes.
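For reference, the same 8-bit load can also be written with an explicit BitsAndBytesConfig (a sketch, assuming a recent transformers release with bitsandbytes installed):

from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

# Explicit form of load_in_8bit=True; both spellings go through bitsandbytes.
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)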
When Whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU, inference on a sample file takes much longer (about 5x), and nvidia-smi shows GPU utilization of only 33%.
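The 5x figure is wall-clock time around generate(); a minimal sketch of how I measured it (timed_generate is my own helper, not anything from transformers):

import time
import torch

def timed_generate(model, feats):
    # Synchronize so queued GPU kernels are included in the wall-clock measurement.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(input_features=feats)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start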
Shouldn't quantization improve inference speed on the GPU? https://pytorch.org/docs/stable/quantization.html
Similar question:
import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    WhisperFeatureExtractor,
    WhisperTokenizerFast,
)
from transformers.pipelines.audio_utils import ffmpeg_read

MODEL_NAME = "openai/whisper-large-v3"

tokenizer = WhisperTokenizerFast.from_pretrained(MODEL_NAME)
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)

# Load the model in 8-bit via bitsandbytes.
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    load_in_8bit=True,
)

sample = "sample.mp3"  # 27 s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        inputs = f.read()
    # Decode the mp3 into a float waveform at the model's sampling rate.
    inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)
    input_features = feature_extractor(
        inputs,
        sampling_rate=feature_extractor.sampling_rate,
        return_tensors="pt",
    )["input_features"]
    # Move to the GPU in fp16 (torch.tensor() on an existing tensor copies and warns).
    input_features = input_features.to("cuda", dtype=torch.float16)
    predicted_ids = model_8bit.generate(input_features=input_features, return_timestamps=False)
    out = tokenizer.decode(predicted_ids.squeeze(), skip_special_tokens=True)
    print(out)
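For comparison, this is the fp16 baseline the 5x figure is measured against (a sketch; it reuses MODEL_NAME, tokenizer, and input_features from the script above):

model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)

with torch.inference_mode():
    # Same inputs, unquantized fp16 weights; this path is ~5x faster on the T4.
    predicted_ids = model_fp16.generate(input_features=input_features, return_timestamps=False)
    print(tokenizer.decode(predicted_ids.squeeze(), skip_special_tokens=True))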