How can I read audio samples while recording simultaneously in Python, to convert speech to text in real time?


Basically, I have trained some models using Keras for isolated word recognition. Currently I can record audio for a fixed duration using the sounddevice record function and save it as a wav file. I have implemented silence detection to trim away unwanted samples, but all of this only works after the whole recording is finished. I would like to get the trimmed audio segments immediately while recording is still in progress, so that speech recognition can run in real time. I am using Python 2 and TensorFlow 1.14.0. Below is a snippet of the code I currently have:

import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np 
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing 

models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    # Unpack a (model, input) tuple and return the model's output as a list
    model, t = model_ip
    ret_val = model.predict(t).tolist()[0]
    return ret_val

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs, myrecording = read(wav_file)  # re-read the saved wav as (rate, samples)
# Now the silence-removal function is called; it trims and saves only the
# useful audio samples as a wav file. Each trimmed segment contains a full
# word that can be recognized.
trimmed_audio = end_points(wav_file, thresh, 50)

# The loop below pairs each loaded model (I'm using multiple models) with the input in a tuple
final_ans = ''  # accumulates the predicted class index of each trimmed word
for trimmed_aud in trimmed_audio:
    ...
    ...  # The trimmed audio is processed further here; the input the
    ...  # model can predict on is t
    ...
    modelon=[]
    for md in loaded_models:
        modelon.append((md,t))
    start_time=time.time()
    with closing(multiprocessing.Pool()) as p:
        predops=p.map(prediction,modelon)
    print('Total time taken: {}'.format(time.time() - start_time))          
    actops = []
    for predop in predops:
        actops.append(predop.index(max(predop)))  # argmax of each model's output
    print(actops)
    max_freqq = max(set(actops), key=actops.count)  # majority vote across models
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))

Please note that my code above contains only what is relevant to the question and will not run as-is; I just wanted to give an overview of what I have so far. What I would really appreciate input on is how to record and trim the audio at the same time, based on the threshold, so that multiple words can be spoken within the 10-second recording window (the seconds variable in the code). As I said, when the energy of the samples in a 50 ms window falls below a certain threshold, I cut the audio at those two points, trim it, and use it for prediction. Recording and prediction on the trimmed segments must run concurrently, so that each output word is displayed right after it is spoken rather than only after the 10 seconds of recording are over. Any suggestions on this would be really appreciated.
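One possible direction, sketched below: sounddevice's InputStream delivers blocks to a callback while the recording is still running, so the 50 ms energy analysis can run on each block as it arrives, and every trimmed word segment can be handed to the models immediately. This is a minimal sketch under the question's stated assumptions (0.025 energy threshold, 50 ms windows); the segments generator, the max_silence_windows cutoff, and the queue hand-off are illustrative choices, not code from the question.

try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2, which the asker is using

import numpy as np
import sounddevice as sd

fs = 44100
window = int(0.05 * fs)       # 50 ms analysis window
thresh = 0.025                # energy threshold from the question
audio_q = queue.Queue()

def callback(indata, frames, time_info, status):
    # Runs on the audio thread for every recorded block; just hand the
    # mono samples off so the stream is never blocked by analysis.
    audio_q.put(indata[:, 0].copy())

def segments(q, max_silence_windows=6):
    # Generator: yields one trimmed word segment as soon as roughly
    # 300 ms of silence follows speech (6 windows is an assumption).
    buf, voiced, silent = [], False, 0
    leftover = np.empty(0, dtype='float32')
    while True:
        leftover = np.concatenate([leftover, q.get()])
        while len(leftover) >= window:
            win, leftover = leftover[:window], leftover[window:]
            if np.sqrt(np.mean(win ** 2)) > thresh:   # RMS says speech
                buf.append(win)
                voiced, silent = True, 0
            elif voiced:                              # trailing silence
                buf.append(win)
                silent += 1
                if silent >= max_silence_windows:     # word has ended
                    yield np.concatenate(buf[:-silent])
                    buf, voiced, silent = [], False, 0

with sd.InputStream(samplerate=fs, channels=1, blocksize=window,
                    callback=callback):
    for word_audio in segments(audio_q):
        # word_audio is a trimmed word; preprocess it into t and run the
        # multiprocessing prediction step here while recording continues.
        print('word segment: {} samples'.format(len(word_audio)))

Because the callback only copies data into a queue, model inference in the main thread never stalls the audio stream, which is what allows each word to be recognized while the next one is still being spoken.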

python tensorflow multiprocessing speech-recognition real-time
1 Answer

It is hard to say what your model architecture is, but some models are designed specifically for streaming recognition, such as Facebook's streaming convnets. You will not be able to implement them easily in Keras, though.
