If audio is already given as input, why do we need Google's WaveNet model to generate audio?


I have spent a lot of time trying to understand how Google's WaveNet works (it is also used in their DeepVoice model), but I am still confused about some very basic aspects. I am referring to this TensorFlow implementation of WaveNet.

Page 2 of the paper says:

"在本文中,我们介绍了一种新的直接在原始音频波形上操作的生成模型。".

If we already have the raw audio waveform, why do we need WaveNet? Isn't that exactly what the model is supposed to generate?

When I print out the model, it shows that the input to the input_convolution kernel is just a single float, since its shape is 1x1x128. What does that single float in the input represent? Am I missing something?

`inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]`
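
If I read TensorFlow's 1-D convolution kernel layout correctly ([filter_width, in_channels, out_channels]), a 1x1x128 kernel takes one channel per time step and projects it to 128 channels. A tiny sketch of that behaviour (my own, TF 2 eager mode, not code from the repo):

```python
import tensorflow as tf

# Kernel shaped like the one in the dump: [filter_width=1, in_channels=1, out_channels=128].
kernel = tf.random.normal([1, 1, 128])

# Fake mono audio: 1 example, 16000 time steps, 1 value (channel) per step.
audio = tf.random.normal([1, 16000, 1])

# A width-1 convolution just projects each single sample to 128 features.
features = tf.nn.conv1d(audio, kernel, stride=1, padding='SAME')
print(features.shape)  # (1, 16000, 128)
```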

More of the layers are listed below:

---------
Variables: name (type shape) [size]
---------
inference/ConvTranspose1D_layer_0/kernel:0 (float32_ref 1x11x80x80) [70400, bytes: 281600]
inference/ConvTranspose1D_layer_0/bias:0 (float32_ref 80) [80, bytes: 320]
inference/ConvTranspose1D_layer_1/kernel:0 (float32_ref 1x25x80x80) [160000, bytes: 640000]
inference/ConvTranspose1D_layer_1/bias:0 (float32_ref 80) [80, bytes: 320]
inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]
inference/input_convolution/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
1 Answer

Generative networks usually operate on a conditional probability: they produce a new_element given the old_element(s). In mathematical terms:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

as defined in the Google paper. As you can see, the network needs to start from something (the x1...xt-1, i.e. the past values); it cannot start from nothing. You can think of it as the network needing a theme that tells it what style you are interested in; heavy metal and country have slightly different vibes.
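
To make the factorization concrete, here is a toy sketch (my own, not code from the paper or the repo) of how the joint probability of a sequence breaks into per-step conditionals:

```python
import numpy as np

# Stand-in for the network: given all past samples, return a distribution over
# the next sample. In WaveNet this is the dilated causal-convolution stack;
# here it is just a hypothetical uniform placeholder.
def next_sample_distribution(past_samples, num_levels=256):
    return np.full(num_levels, 1.0 / num_levels)

# Chain rule: p(x) = p(x_1) * p(x_2 | x_1) * ... * p(x_T | x_1..x_{T-1})
x = np.random.randint(0, 256, size=10)       # a quantized toy "waveform"
log_p = 0.0
for t in range(len(x)):
    dist = next_sample_distribution(x[:t])   # condition only on past values
    log_p += np.log(dist[x[t]])              # probability the model gives x_t
print("log p(x) =", log_p)
```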

If you like, you can generate this seed waveform yourself: a sine wave, white noise, or something more complex. Once you run the network, it starts producing new values, which in turn become part of its own input.
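
A minimal sketch of that generation loop (again my own illustration, with a hypothetical predict_next_distribution standing in for the trained network, not the repo's API):

```python
import numpy as np

# Placeholder for the trained WaveNet stack: maps the past samples to a
# distribution over the next quantized sample.
def predict_next_distribution(past_samples, num_levels=256):
    return np.full(num_levels, 1.0 / num_levels)  # uniform stand-in

# Seed waveform: 100 samples of a 440 Hz sine at 16 kHz, quantized to 8 bits.
t = np.arange(100) / 16000.0
seed = np.sin(2 * np.pi * 440.0 * t)
samples = list(np.round((seed + 1.0) / 2.0 * 255).astype(int))

for _ in range(200):                           # generate 200 new samples
    dist = predict_next_distribution(np.array(samples))
    nxt = np.random.choice(len(dist), p=dist)  # sample x_t ~ p(x_t | x_1..x_{t-1})
    samples.append(int(nxt))                   # the output becomes future input
```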
