Crackling at the start and end of playback when using Microsoft Azure Text to Speech with Unity

I am using Microsoft Azure Text to Speech with Unity, but the synthesized audio crackles at the start and end of playback. Is this normal, or is result.AudioData corrupted? My code is below.

    public AudioSource audioSource;

    void Start()
    {
    }

    public void SynthesisToSpeaker(string text)
    {
        var config = SpeechConfig.FromSubscription("[redacted]", "southeastasia");
        config.SpeechSynthesisLanguage = "zh-CN";
        config.SpeechSynthesisVoiceName = "zh-CN-XiaoxiaoNeural";

        // Creates a speech synthesizer.
        // Make sure to dispose the synthesizer after use!
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
        Task<SpeechSynthesisResult> task = synthesizer.SpeakTextAsync(text);
        StartCoroutine(CheckSynthesizer(task, config, synthesizer));
    }

    private IEnumerator CheckSynthesizer(Task<SpeechSynthesisResult> task,
        SpeechConfig config,
        SpeechSynthesizer synthesizer)
    {
        yield return new WaitUntil(() => task.IsCompleted);
        var result = task.Result;
        // Checks result.
        string newMessage = string.Empty;
        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            var sampleCount = result.AudioData.Length / 2;
            var audioData = new float[sampleCount];
            for (var i = 0; i < sampleCount; ++i)
            {
                audioData[i] = (short)(result.AudioData[i * 2 + 1] << 8
                        | result.AudioData[i * 2]) / 32768.0F;
            }
            // The default output audio format is 16K 16bit mono
            var audioClip = AudioClip.Create("SynthesizedAudio", sampleCount,
                    1, 16000, false);
            audioClip.SetData(audioData, 0);
            audioSource.clip = audioClip;
        }
        else if (result.Reason == ResultReason.Canceled)
        {
            var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
        }
    }

azure unity3d text-to-speech microsoft-cognitive

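The default audio format is Riff16Khz16BitMonoPcm, which puts a RIFF header at the start of result.AudioData. If you pass that data to the audioClip unchanged, the header bytes are played as if they were samples, and you hear noise.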
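To make the header issue concrete, here is a minimal sketch of one way to skip it by hand before converting to floats. It assumes the buffer follows the standard RIFF/WAVE chunk layout (a 12-byte "RIFF....WAVE" preamble followed by tagged chunks, with the samples in the "data" chunk). `PcmDecoder`, `FindPcmStart` and `ToSamples` are hypothetical names for illustration, not part of the Speech SDK or Unity.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Speech SDK or Unity.
public static class PcmDecoder
{
    // Returns the byte offset where PCM samples begin.
    // Buffers that do not start with a RIFF/WAVE header are treated as raw PCM (offset 0).
    public static int FindPcmStart(byte[] audio)
    {
        if (audio.Length < 12 || !HasTag(audio, 0, "RIFF") || !HasTag(audio, 8, "WAVE"))
            return 0;

        long pos = 12; // first chunk after "RIFF" + size + "WAVE"
        while (pos + 8 <= audio.Length)
        {
            int p = (int)pos;
            if (HasTag(audio, p, "data"))
                return p + 8; // skip the 4-byte tag and the 4-byte chunk size

            // Chunk size is a little-endian 32-bit value; chunks are padded to even length.
            uint size = (uint)(audio[p + 4] | (audio[p + 5] << 8)
                             | (audio[p + 6] << 16) | (audio[p + 7] << 24));
            pos += 8L + size + (size & 1);
        }
        return audio.Length; // RIFF header but no "data" chunk: nothing to play
    }

    // Converts 16-bit little-endian mono PCM to floats in [-1, 1), skipping any RIFF header.
    public static float[] ToSamples(byte[] audio)
    {
        int start = FindPcmStart(audio);
        int sampleCount = (audio.Length - start) / 2;
        var samples = new float[sampleCount];
        for (int i = 0; i < sampleCount; i++)
        {
            int lo = audio[start + 2 * i];
            int hi = audio[start + 2 * i + 1];
            samples[i] = (short)((hi << 8) | lo) / 32768f;
        }
        return samples;
    }

    private static bool HasTag(byte[] buf, int offset, string tag)
    {
        for (int i = 0; i < 4; i++)
            if (buf[offset + i] != (byte)tag[i]) return false;
        return true;
    }
}
```

In the question's coroutine, this would stand in for the manual conversion loop, for example `var audioData = PcmDecoder.ToSamples(result.AudioData);`, with `audioData.Length` passed to `AudioClip.Create` as the sample count.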

You can avoid this by setting the output to a raw format that has no header: speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm); See this sample for details.
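Applied to the question's setup, the change is one extra call on the config object before the synthesizer is created. The sketch below is only an illustration: `SpeechConfigFactory` and `CreateRawPcmConfig` are hypothetical names, and the key and region are whatever the caller supplies.

```csharp
using Microsoft.CognitiveServices.Speech;

// Hypothetical wrapper for illustration; only the SDK calls inside it come from the Speech SDK.
public static class SpeechConfigFactory
{
    public static SpeechConfig CreateRawPcmConfig(string subscriptionKey, string region)
    {
        var config = SpeechConfig.FromSubscription(subscriptionKey, region);
        config.SpeechSynthesisLanguage = "zh-CN";
        config.SpeechSynthesisVoiceName = "zh-CN-XiaoxiaoNeural";

        // Request headerless 16 kHz, 16-bit, mono PCM so that result.AudioData
        // contains only samples and no RIFF header.
        config.SetSpeechSynthesisOutputFormat(
            SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm);
        return config;
    }
}
```

With this format, the existing conversion loop and the `AudioClip.Create(..., 1, 16000, false)` call in the question should not need to change, since the sample rate, bit depth and channel count stay the same.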

