I'm working on an LLM project on Google Colab (V100 GPU, high-RAM runtime). These are my dependencies:
git+https://github.com/pyannote/pyannote-audio
git+https://github.com/huggingface/[email protected]
openai==0.28
ffmpeg-python
pandas==1.5.0
tokenizers==0.14
torch==2.1.1
torchaudio==2.1.1
tqdm==4.64.1
EasyNMT==2.0.2
psutil==5.9.2
requests
pydub
docxtpl
faster-whisper==0.10.0
git+https://github.com/openai/whisper.git
Here is everything I import:
from faster_whisper import WhisperModel
from datetime import datetime, timedelta
from time import time
from pathlib import Path
import pandas as pd
import os
from pydub import AudioSegment
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import requests
import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import wave
import contextlib
import psutil
import openai
from codecs import decode
from docxtpl import DocxTemplate
I used to run the latest versions of torch and torchaudio, but they were updated yesterday (December 15, 2023, when v2.1.2 was released). I suspected the error was caused by that update, so I pinned them to the versions my code ran on two days ago (v2.1.1). Apparently, that didn't help.
Everything worked two days ago and I haven't changed anything in the notebook. The only thing that could have changed is the dependencies, but pinning the previous versions did not solve my problem. Here is the snippet that raises the error:
def EETDT(audio_path, whisper_model, num_speakers, output_name="diarization_result", selected_source_lang="eng", transcript=None):
    """
    Uses Whisper to separate audio into segments and generate a transcript
    for each segment.
    Speech recognition is based on models from OpenAI Whisper https://github.com/openai/whisper
    Speaker diarization model and pipeline from https://github.com/pyannote/pyannote-audio
    audio_path : str -> path to wav file
    whisper_model : str -> small/medium/large/large-v2/large-v3
    num_speakers : int -> number of speakers in audio (0 to let the function determine it)
    output_name : str -> desired name of the output file
    selected_source_lang : str -> language code
    """
    if audio_path is None:
        raise ValueError("Error: no audio input")
    print("Input file:", audio_path)
    audio_name = audio_path.split("/")[-1].split(".")[0]
    model = WhisperModel(whisper_model, compute_type="int8")
    time_start = time()
    if not audio_path.endswith(".wav"):
        print("Submitted audio isn't in wav format. Starting conversion...")
        audio = AudioSegment.from_file(audio_path)
        audio_suffix = audio_path.split(".")[-1]
        new_path = audio_path.replace(audio_suffix, "wav")
        audio.export(new_path, format="wav")
        audio_path = new_path
        print("Converted to wav:", new_path)
    try:
        # Get duration
        with contextlib.closing(wave.open(audio_path, 'r')) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration = frames / float(rate)
        if duration < 30:
            raise ValueError(f"Audio has to be longer than 30 seconds. Current: {duration}")
        print(f"Duration of audio file: {duration}")
        # Transcribe audio
        options = dict(language=selected_source_lang, beam_size=5, best_of=5)
        transcribe_options = dict(task="transcribe", **options)
        segments_raw, info = model.transcribe(audio_path, **transcribe_options)
        # Convert back to the original openai format
        segments = []
        i = 0
        full_transcript = list()
        if not isinstance(transcript, pd.DataFrame):
            for segment_chunk in segments_raw:  # <-- THROWS ERROR
                chunk = {}
                chunk["start"] = segment_chunk.start
                chunk["end"] = segment_chunk.end
                chunk["text"] = segment_chunk.text
                full_transcript.append(segment_chunk.text)
                segments.append(chunk)
                i += 1
            full_transcript = "".join(full_transcript)
            print("Transcribe audio done with faster-whisper")
        else:
            for i in range(len(transcript)):
                full_transcript.append(transcript["text"].iloc[i])
            full_transcript = "".join(full_transcript)
            print("You inputted pre-transcribed audio")
    except Exception as e:
        # Chain the original exception so the real error stays visible
        raise RuntimeError("Error during transcription") from e
...The code never leaves the try block...
I ran into the same problem today trying faster-whisper on Google Colab. This custom Whisper implementation still requires CUDA 11 and does not work with CUDA 12.
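As a side note on why the failure surfaces at the `for segment_chunk in segments_raw` line rather than at `model.transcribe(...)`: faster-whisper's `transcribe()` returns a lazy generator, so decoding (and therefore any missing-CUDA-library error) only happens once you iterate over the segments. A minimal sketch with a hypothetical failing decoder illustrates the pattern:

```python
# Illustrative only: _decode stands in for faster-whisper's internal
# decoding step, which is where a missing CUDA 11 library would blow up.
def _decode():
    raise OSError("libcublas.so.11: cannot open shared object file")

def lazy_transcribe():
    # Mirrors faster-whisper: returns a generator, does no work yet
    yield from _decode()

segments = lazy_transcribe()       # no error here: nothing has run
try:
    next(segments)                 # error only surfaces on iteration
except OSError as e:
    print("failed at iteration:", e)
```

This is why the traceback points at your for-loop even though the root cause is in the model setup underneath.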
I looked inside the Colab instance, and it has indeed been switched to CUDA 12, which means faster-whisper cannot run because its CUDA 11 dependencies are missing.
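You can confirm this from the runtime itself. The sketch below (my own illustration, not from the original post) asks the dynamic linker whether it can load the CUDA 11 cuBLAS library that the prebuilt CTranslate2 wheels of this era link against, versus the CUDA 12 variant that current Colab images ship:

```python
import ctypes

def cuda_lib_available(name):
    """Return True if the dynamic linker can load the shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False

# On a CUDA 12 Colab image you'd expect the .so.11 variant to be missing
# while the .so.12 variant is available, which is exactly the mismatch
# that breaks the prebuilt CTranslate2/faster-whisper wheels.
for lib in ("libcublas.so.11", "libcublas.so.12"):
    print(lib, "->", "available" if cuda_lib_available(lib) else "missing")
```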
If you want to try to get it working with CUDA 12, it should be possible by rebuilding CTranslate2 from source. Here is a reference issue on the topic: https://github.com/OpenNMT/CTranslate2/issues/1250
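For reference, a source build along the lines of the CTranslate2 installation docs would look roughly like this. This is a sketch I have not tested on Colab; the cmake flags shown (`WITH_CUDA`, `WITH_CUDNN`) are the ones documented by the project, and you may need to adjust them for your image:

```shell
# Build CTranslate2 against whatever CUDA toolkit is installed locally
# (CUDA 12 on current Colab images)
git clone --recursive https://github.com/OpenNMT/CTranslate2.git
cd CTranslate2 && mkdir build && cd build
cmake .. -DWITH_CUDA=ON -DWITH_CUDNN=ON
make -j"$(nproc)" && make install

# Build and install the Python wheel against the freshly built library
cd ../python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl
```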