Can Python speech recognition be used with WebRTC as an audio source?


I need to implement continuous, real-time speech-to-text that can use WebRTC as its audio source. I would love to use the speech_recognition library (here), because it has this wonderful .listen() method that works perfectly as a VAD, and it also makes it easy to create a wav file to feed later to the STT model of my choice. In short, I want to build it like this (credit to Ashutosh Dongare):

import os
import re
from glob import glob

import speech_recognition as sr

# stt_model, decoder, device, split_into_batches, read_batch and
# prepare_model_input come from the STT model setup (e.g. Silero's utilities)

r_audio = sr.Recognizer()

audio_input_file = "micAudioInput.wav"
audio_output_file = "ttsAudioOutput.wav"

while True:

    try:
        with sr.Microphone() as SR_AudioSource:
            print("Say something...")
            mic_audio = r_audio.listen(SR_AudioSource,1,4) # timeout=1, phrase_time_limit=4

    except Exception:  # required if there is no audio input or some other error
        print("Could not capture mic input or no audio...")
        continue  # continue to the next mic input loop iteration

    with open(audio_input_file, "wb") as file:
        file.write(mic_audio.get_wav_data())
        # no explicit flush/close needed: the with block handles both

    # Check whether an input audio file was saved; otherwise continue listening
    if not os.path.exists(audio_input_file):
        print("no input file exists")
        continue

    # Read the audio file into batches and create the input pipeline for STT
    batches = split_into_batches(glob(audio_input_file), batch_size=10)
    model_input = prepare_model_input(read_batch(batches[0]), device=device)

    # feed to the STT model and get the text output
    output = stt_model(model_input)
    you_said = decoder(output[0].cpu())
    print(you_said)

    if you_said == "":
        print("No speech recognized...")
        continue

    # check if the user wants to stop
    if re.search("exit|stop|quit", you_said):
        break

It works perfectly locally with a physical microphone. The only problem is that I am quite sure speech_recognition only accepts microphones, i.e. devices that act as an AudioSource (and I know PyAudio only works that way), because the documentation states that the Microphone class

[c]reates a new Microphone instance, which represents a physical microphone on the computer.
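That quote describes the built-in Microphone, but recognizer.listen itself only touches a small duck-typed surface: an object with SAMPLE_RATE, SAMPLE_WIDTH, and CHUNK attributes and a stream whose read(size) returns raw bytes. A minimal pure-Python mock (no speech_recognition import; the class and attribute names mirror what the library's Microphone exposes, everything else here is illustrative) shows the contract:

```python
class FakeStream:
    """Stands in for a PyAudio stream: read(n) returns n frames of raw bytes."""
    def __init__(self, sample_width):
        self.sample_width = sample_width

    def read(self, num_frames):
        # return silence: num_frames samples, sample_width bytes each
        return b"\x00" * (num_frames * self.sample_width)


class FakeAudioSource:
    """The duck-typed surface recognizer.listen() actually relies on."""
    def __init__(self, sample_rate=16_000, chunk_size=1024, sample_width=2):
        self.SAMPLE_RATE = sample_rate    # sampling rate in Hertz
        self.SAMPLE_WIDTH = sample_width  # bytes per sample
        self.CHUNK = chunk_size           # samples per read
        self.stream = FakeStream(sample_width)


src = FakeAudioSource()
buf = src.stream.read(src.CHUNK)
print(len(buf))  # 1024 samples * 2 bytes = 2048
```

Anything that satisfies this interface can, in principle, be fed to listen() in place of a physical microphone, which is what the accepted answer below exploits.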

Does this mean there is no way to use it with WebRTC as a source? If not, what would be a good replacement for its .listen()? Also: is the infinite-loop concept generally viable in a browser setting, or a bad idea? I plan to implement this in Django and handle the WebSocket server with Django Channels.

python webrtc speech-recognition django-channels
1 Answer

A bit late, but in case anyone stumbles upon this: yes, SpeechRecognition can be used with WebRTC as an audio source.

Here is how I did it:

  • Use aiortc to implement a custom MediaStreamTrack
    • The custom MediaStreamTrack holds a custom AudioSource (from SpeechRecognition) as a member
    • On receive, it resamples the AudioFrame and writes the frames to the AudioFifo that lives inside the custom AudioSource
  • The custom AudioSource is just a pass-through to the FIFO, with a little extra logic
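The data flow above can be modeled end-to-end in plain Python to see how the pieces fit, with stand-ins for the aiortc/av objects (the `receive_loop`, `Fifo`, and the fake "resampling" here are illustrative stubs, not the real APIs):

```python
import threading
import queue


class Fifo:
    """Queue-backed stand-in for av.AudioFifo with a blocking read."""
    def __init__(self):
        self.chunks = queue.Queue()

    def write(self, chunk: bytes):
        self.chunks.put(chunk)

    def read(self) -> bytes:
        # blocks until a chunk arrives, like the Event-based loop in the answer
        return self.chunks.get()


def receive_loop(track_frames, fifo):
    """Models AudioTransformTrack.recv: take a frame, 'resample', push to FIFO."""
    for frame in track_frames:
        resampled = frame.lower().encode()  # stand-in for av.AudioResampler
        fifo.write(resampled)


fifo = Fifo()
frames = ["FRAME1", "FRAME2"]  # stand-ins for incoming WebRTC audio frames
producer = threading.Thread(target=receive_loop, args=(frames, fifo))
producer.start()
received = [fifo.read() for _ in frames]  # the recognizer side of the pipe
producer.join()
print(received)  # [b'frame1', b'frame2']
```

The real code below replaces these stubs with aiortc's MediaStreamTrack, av's AudioResampler/AudioFifo, and SpeechRecognition's AudioSource.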

First, here is the code for the custom MediaStreamTrack, inspired by aiortc's VideoTransformTrack example:

import av
from aiortc import MediaStreamTrack


class AudioTransformTrack(MediaStreamTrack):
    kind = "audio"

    def __init__(self, track):
        super().__init__()
        self.track = track
        rate = 16_000  # Whisper has a sample rate of 16000
        audio_format = 's16p'
        sample_width = av.AudioFormat(audio_format).bytes
        self.resampler = av.AudioResampler(format=audio_format, layout='mono', rate=rate)
        self.source = WebRTCSource(sample_rate=rate, sample_width=sample_width)

    async def recv(self):
        in_frame: av.AudioFrame = await self.track.recv()

        # resample to 16 kHz mono s16 before handing off to the recognizer
        resampled_frames = self.resampler.resample(in_frame)

        for frame in resampled_frames:
            self.source.stream.write(frame)

        # pass the original frame through unchanged
        return in_frame
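Because the resampler targets 16 kHz mono s16, the byte math the downstream reader sees is fixed. A quick sanity check (pure arithmetic, no av needed; the constants match the values used above):

```python
SAMPLE_RATE = 16_000   # target rate set in AudioTransformTrack.__init__
SAMPLE_WIDTH = 2       # bytes per sample for the s16 format
CHUNK = 1024           # samples per read in the custom source below

bytes_per_chunk = CHUNK * SAMPLE_WIDTH
chunk_duration_ms = CHUNK / SAMPLE_RATE * 1000

print(bytes_per_chunk)    # 2048 bytes handed to the recognizer per read
print(chunk_duration_ms)  # 64.0 ms of audio per chunk
```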

Next, here is the implementation of the custom AudioSource, inspired by the Microphone class:

import threading

import av
import numpy as np
from speech_recognition import AudioSource


class WebRTCSource(AudioSource):
    def __init__(self, sample_rate=None, chunk_size=1024, sample_width=4):
        # Those are the only 4 properties required by the recognizer.listen method
        self.stream = WebRTCSource.MicrophoneStream()
        self.SAMPLE_RATE = sample_rate  # sampling rate in Hertz
        self.CHUNK = chunk_size  # number of frames stored in each buffer
        self.SAMPLE_WIDTH = sample_width  # size of each sample

    class MicrophoneStream(object):
        def __init__(self):
            self.stream = av.AudioFifo()
            self.event = threading.Event()

        def write(self, frame: av.AudioFrame):
            assert type(frame) is av.AudioFrame, "Tried to write something that is not AudioFrame"
            self.stream.write(frame=frame)
            self.event.set()

        def read(self, size) -> bytes:
            frames: av.AudioFrame = self.stream.read(size)

            # while no frame, wait until some is written using an event
            while frames is None:
                self.event.wait()
                self.event.clear()
                frames = self.stream.read(size)

            # convert the frame to bytes
            data: np.ndarray = frames.to_ndarray()
            return data.tobytes()
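The wait-loop pattern in read (block on a threading.Event until a writer signals) can be exercised on its own with just the stdlib. Here `BlockingBuffer` is an illustrative byte-level mimic of MicrophoneStream, without av:

```python
import threading


class BlockingBuffer:
    """Mimics MicrophoneStream: read() blocks on an Event until write() signals."""
    def __init__(self):
        self.data = bytearray()
        self.lock = threading.Lock()
        self.event = threading.Event()

    def write(self, chunk: bytes):
        with self.lock:
            self.data.extend(chunk)
        self.event.set()  # wake any blocked reader

    def read(self, size: int) -> bytes:
        while True:
            with self.lock:
                if len(self.data) >= size:
                    out = bytes(self.data[:size])
                    del self.data[:size]
                    return out
            self.event.wait()   # sleep until a writer signals
            self.event.clear()


buf = BlockingBuffer()
# write arrives ~50 ms later, from another thread, while read() is blocked
writer = threading.Timer(0.05, buf.write, args=(b"\x01\x02\x03\x04",))
writer.start()
out = buf.read(4)
print(out)  # b'\x01\x02\x03\x04'
```

This is the same producer/consumer handshake the answer uses to bridge aiortc's async receive loop and the recognizer's blocking read.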

When listening on the source, make sure you do it off the event loop. In my case, I started a new thread that does the listening.

import threading

import speech_recognition as sr


def listen(source: sr.AudioSource):
    recognizer = sr.Recognizer()

    while True:
        audio = recognizer.listen(source)
        # do something with the audio...

# on receiving a WebRTC track (inside the aiortc "track" handler):
t = AudioTransformTrack(relay.subscribe(track))
thread = threading.Thread(target=listen, args=(t.source,))
thread.start()