I'm trying to transcribe an audio file with OpenAI's open-source Whisper library.
Here is the source code of my script:
import whisper
model = whisper.load_model("large-v2")
# load the entire audio file
audio = whisper.load_audio("/content/file.mp3")
# When I add `audio = whisper.pad_or_trim(audio)` here, the first 30 seconds are transcribed without any problem.
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
# print the recognized text if available
try:
    if hasattr(result, "text"):
        print(result.text)
except Exception as e:
    print(f"Error while printing transcription: {e}")
# write the recognized text to a file
try:
    with open("output_of_file.txt", "w") as f:
        f.write(result.text)
    print("Transcription saved to file.")
except Exception as e:
    print(f"Error while saving transcription: {e}")
Here:
# load the entire audio file
audio = whisper.load_audio("/content/file.mp3")
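When I add "audio = whisper.pad_or_trim(audio)" below it, the first 30 seconds of the audio file are transcribed without any problem, and language detection works as well,
but when I remove it and try to transcribe the whole file, I get the following error:
AssertionError: incorrect audio shape
What should I do? Should I change the structure of the audio file? If so, which library should I use, and what kind of script should I write?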
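I ran into the same problem, and after some digging I found that whisper.decode is meant to extract metadata about the input, such as the language, and works on a single 30-second segment, so it is limited to 30 seconds. (See the source code of the decode function here.)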
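To see why the full file trips the assertion, it helps to count spectrogram frames. The sketch below is plain arithmetic and does not call Whisper at all. The constants mirror the values in whisper/audio.py as I understand them (16 kHz sample rate, hop length 160, 30-second chunks), and the helper name mel_frames is made up for illustration.

```python
# Constants as I understand them from whisper/audio.py
# (assumption: unchanged in the version you have installed).
SAMPLE_RATE = 16000   # load_audio resamples everything to 16 kHz
HOP_LENGTH = 160      # samples between successive spectrogram frames
CHUNK_LENGTH = 30     # seconds of audio the model expects per input
N_FRAMES = CHUNK_LENGTH * SAMPLE_RATE // HOP_LENGTH  # frames in one 30 s chunk


def mel_frames(duration_seconds):
    """Hypothetical helper: approximate frame count for a clip of this length."""
    n_samples = int(duration_seconds * SAMPLE_RATE)
    return n_samples // HOP_LENGTH


for seconds in (30, 95):
    frames = mel_frames(seconds)
    status = "ok" if frames == N_FRAMES else "shape mismatch"
    print(f"{seconds} s -> {frames} frames ({status})")
```

Only the 30-second clip matches the frame count the model expects, which is consistent with pad_or_trim making the error go away.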
To transcribe audio (including audio longer than 30 seconds), you can use whisper.transcribe, as in the following snippet:
import whisper
model = whisper.load_model("large-v2")
# load the entire audio file
audio = whisper.load_audio("/content/file.mp3")
options = {
    "language": "en",     # input language; auto-detected if omitted
    "task": "translate",  # or "transcribe" if you just want a transcription
}
result = whisper.transcribe(model, audio, **options)
print(result["text"])
You can find some documentation for the transcribe method, along with some documentation for the DecodingOptions structure, in the source code.
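If you would rather stay with whisper.decode, the audio has to be fed to it in 30-second pieces. The sketch below shows only the slicing and padding step, using a plain Python list as a stand-in for the sample array. The helper split_into_windows is hypothetical, and it is simpler than what transcribe does, which as far as I can tell moves its window forward based on predicted timestamps rather than in fixed steps.

```python
SAMPLE_RATE = 16000           # Whisper works on 16 kHz audio
N_SAMPLES = 30 * SAMPLE_RATE  # samples in one 30-second window


def split_into_windows(samples, window=N_SAMPLES):
    """Hypothetical helper: cut samples into fixed windows, zero-padding the last."""
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        # Pad the final, shorter chunk with silence up to the full window length.
        chunk.extend([0.0] * (window - len(chunk)))
        windows.append(chunk)
    return windows


# 70 seconds of silence stands in for real audio here.
fake_audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(fake_audio)
print(len(windows), [len(w) for w in windows])
```

With real audio, each window would then go through whisper.log_mel_spectrogram and whisper.decode in turn. Cutting at fixed boundaries can split a word in half, which is one reason transcribe is the easier route.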