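I'm trying to stream OpenAI text-to-speech audio (https://platform.openai.com/docs/guides/text-to-speech/streaming-real-time-audio) to a Twilio websocket, which accepts mulaw at 8 kHz.
If I wait for the whole WAV buffer to finish streaming from OpenAI and then send it to the Twilio websocket in one go, the audio sounds fine, but to reduce latency I want to send chunks as soon as they become available. Here is the code that sends the whole buffer: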
function stream2buffer(stream) {
  return new Promise((resolve, reject) => {
    const _buf = [];
    stream.on("data", (chunk) => _buf.push(chunk));
    stream.on("end", () => resolve(Buffer.concat(_buf)));
    stream.on("error", (err) => reject(err));
  });
}
async function speakAll(text) {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
    response_format: "wav",
  });
  return await stream2buffer(response.body);
}
...
import { WaveFile } from 'wavefile';
const openAIAudio = await speakAll(response);
const wav = new WaveFile();
wav.fromBuffer(openAIAudio);
wav.toSampleRate(8000);
wav.toMuLaw();
const mulaw = Buffer.from(wav.data.samples);
const payload = mulaw.toString("base64");
...
this.ws.send(
  JSON.stringify({
    event: "media",
    streamSid: this.streamSid,
    media: {
      payload,
    },
  })
);
this.ws.send(
  JSON.stringify({
    event: "mark",
    streamSid: this.streamSid,
    mark: {
      name: "response",
    },
  })
);
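However, if I try to convert the WAV chunks to mulaw and send them as they arrive, I get heavy static and the original audio is barely recognizable. Here is the code I'm using: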
import { WaveFile } from 'wavefile';
import { encodeWav } from "wav-converter";
const response = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: text,
  response_format: "wav",
});
response.body.on("data", (chunk) => {
  // add WAV headers to chunk, or else WaveFile will throw error
  const wavFile = encodeWav(chunk, {
    numChannels: 1,
    sampleRate: 24000,
    byteRate: 16,
  });
  const wav = new WaveFile(wavFile);
  wav.toSampleRate(8000);
  wav.toMuLaw();
  const mulaw = Buffer.from(wav.data.samples);
  let payload = mulaw.toString("base64");
  try {
    this.ws.send(
      JSON.stringify({
        event: "media",
        streamSid: this.streamSid,
        media: {
          payload,
        },
      })
    );
    this.ws.send(
      JSON.stringify({
        event: "mark",
        streamSid: this.streamSid,
        mark: {
          name: "response",
        },
      })
    );
  } catch (e) {
    this.L.error("failed to send voice response to ws: " + e);
  }
});
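If I concatenate a few WAV chunks before converting to mulaw, I get slightly better results, but there is still a lot of static. Am I missing something about chunk size alignment?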
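On the alignment question, one plausible cause (a guess, not a confirmed diagnosis): stream chunks are cut at arbitrary byte boundaries, so a chunk can end in the middle of a 16-bit sample, and only the start of the stream carries the WAV header. Wrapping every chunk with a fresh header would then turn the original header bytes into noise and shift the sample boundary in any chunk that follows an odd-length one. The sketch below shows one way to skip the header once and carry a dangling byte over to the next chunk. `createPcmAligner` is a hypothetical helper, and the 44-byte header length is an assumption about the stream (a plain RIFF/fmt/data layout), not something verified against OpenAI's output.

```javascript
// Hypothetical helper, not part of the openai, wavefile, or Twilio APIs.
// Assumption: a canonical 44-byte WAV header followed by 16-bit
// little-endian mono PCM. Adjust headerBytes if the stream differs.
const WAV_HEADER_BYTES = 44;

function createPcmAligner(headerBytes = WAV_HEADER_BYTES) {
  let toSkip = headerBytes;       // header bytes still to discard
  let leftover = Buffer.alloc(0); // dangling byte from the previous chunk

  return function align(chunk) {
    let data = Buffer.from(chunk);

    // Drop the header, which may itself be split across chunks.
    if (toSkip > 0) {
      const skip = Math.min(toSkip, data.length);
      data = data.subarray(skip);
      toSkip -= skip;
    }

    // Prepend the byte carried over, then keep only whole 16-bit samples.
    data = Buffer.concat([leftover, data]);
    const usable = data.length - (data.length % 2);
    leftover = data.subarray(usable);
    return data.subarray(0, usable);
  };
}
```

Each call returns a buffer containing only complete samples, which may be empty while the header is still being consumed. OpenAI's docs also list a `pcm` response format (raw 24 kHz, 16-bit samples without a header); if that is available to you, passing `0` as `headerBytes` would cover it.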
I'm facing the same issue. Did you find a solution for the streaming part?
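For the streaming part, here is a minimal sketch of converting PCM to mulaw chunk by chunk without building a WAV file per chunk. It assumes the input is headerless 16-bit little-endian mono PCM at 24 kHz, already cut on whole-sample boundaries. `linearToMuLaw` follows the usual G.711 mu-law encoding steps, and `createMuLawDownsampler` is a hypothetical helper that averages each group of three samples to get from 24 kHz to 8 kHz. That averaging is a crude stand-in for a proper low-pass filter and resampler, so expect lower quality than `wavefile`'s `toSampleRate`.

```javascript
// Sketch only: names below are illustrative, not library APIs.
const MULAW_BIAS = 0x84;  // 132, standard G.711 bias
const MULAW_CLIP = 32635; // largest magnitude before adding the bias

// Encode one signed 16-bit sample as one mu-law byte.
function linearToMuLaw(sample) {
  const sign = (sample >> 8) & 0x80;
  if (sign !== 0) sample = -sample;
  if (sample > MULAW_CLIP) sample = MULAW_CLIP;
  sample += MULAW_BIAS;

  // Find the segment (exponent) from the highest set bit.
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Assumption: factor 3 maps 24 kHz input to 8 kHz output.
function createMuLawDownsampler(factor = 3) {
  let pending = []; // samples waiting for a full group of `factor`

  return function convert(pcm) {
    const sampleCount = Math.floor(pcm.length / 2);
    for (let i = 0; i < sampleCount; i++) {
      pending.push(pcm.readInt16LE(i * 2));
    }

    const groups = Math.floor(pending.length / factor);
    const out = Buffer.alloc(groups);
    for (let g = 0; g < groups; g++) {
      let sum = 0;
      for (let k = 0; k < factor; k++) sum += pending[g * factor + k];
      out[g] = linearToMuLaw(Math.round(sum / factor));
    }

    // Keep the incomplete group for the next chunk.
    pending = pending.slice(groups * factor);
    return out;
  };
}
```

The returned buffer holds raw mulaw bytes with no file header, so `convert(pcm).toString("base64")` could go straight into `media.payload` of the Twilio `media` message. Keeping the state inside the closure means samples that do not fill a group of three are not lost between chunks.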