Dimension definitions in the attention decoder of the NMT and image captioning tutorials

Question

I have been studying the models in the following tutorials:

https://www.tensorflow.org/tutorials/text/nmt_with_attention

and

https://www.tensorflow.org/tutorials/text/image_captioning

In both tutorials, I don't understand the decoder part as it is defined.

The NMT decoder with attention is defined as follows:

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

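Here, at the comment "x shape after passing through embedding == (batch_size, 1, embedding_dim)" followed by x = self.embedding(x), what is x supposed to be? Is it just the target input?
Also, in the code above I don't understand why the output shape has to be (batch_size * 1, hidden_size). Why batch_size * 1?

And here is the decoder part of the image captioning tutorial: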

class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x, features, hidden):
    # defining attention as a separate model
    context_vector, attention_weights = self.attention(features, hidden)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # shape == (batch_size, max_length, hidden_size)
    x = self.fc1(output)

    # x shape == (batch_size * max_length, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size * max_length, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

Why does the output have to be reshaped to (batch_size * max_length, hidden_size)?

Could someone explain this in detail?

It would help me a lot.

Answer 1

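The reshape is there for the call to the fully connected layer, which in these tutorials is fed a two-dimensional input. (Strictly speaking, tf.keras.layers.Dense also accepts higher-rank input and applies to the last axis, much like PyTorch's Linear, so the reshape mainly ensures that the logits come out as (batch_size, vocab), the shape the loss is computed on.)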
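As a sanity check on the reshape, here is a minimal sketch in NumPy (standing in for TensorFlow tensors) suggesting that flattening before the dense transform and flattening after it give the same numbers. The names output, W and b and all sizes are illustrative assumptions, not taken from the tutorials.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, length, hidden_size, vocab_size = 4, 3, 8, 10

# Hypothetical stand-ins: `output` for a GRU output,
# `W` and `b` for the kernel and bias of a dense layer.
output = rng.normal(size=(batch_size, length, hidden_size))
W = rng.normal(size=(hidden_size, vocab_size))
b = rng.normal(size=(vocab_size,))

# Option A: flatten to 2-D first, then apply the dense transform.
flat = output.reshape(-1, hidden_size)   # (batch_size * length, hidden_size)
logits_a = flat @ W + b                  # (batch_size * length, vocab_size)

# Option B: apply the transform on the last axis, then flatten.
logits_b = (output @ W + b).reshape(-1, vocab_size)

same = np.allclose(logits_a, logits_b)
print(logits_a.shape, same)  # (12, 10) True
```

Either order yields one row of logits per (example, time step) pair, so the reshape changes only the layout, not the values.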

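In the first example, the decoder's call method is assumed to be executed inside a for loop, once per time step (at both training and inference time). The GRU, however, requires input of shape batch × length × dim, and when it is called step by step the length is 1. That is where batch_size * 1 comes from.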
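The shapes for a single decoding step can be traced with a small NumPy sketch. The arrays below are hypothetical stand-ins for the GRU output and the dense kernel, and the sizes are illustrative; tf.reshape treats -1 the same way as NumPy's reshape does here.

```python
import numpy as np

batch_size, hidden_size, vocab_size = 64, 16, 50

# Stand-in for the GRU output at one decoding step: with
# return_sequences=True the length axis is kept even though it is 1.
gru_output = np.zeros((batch_size, 1, hidden_size))

# -1 lets reshape infer the first axis: batch_size * 1 == batch_size.
step_output = gru_output.reshape(-1, gru_output.shape[2])

# Stand-in for the dense layer's kernel (bias omitted for brevity).
kernel = np.zeros((hidden_size, vocab_size))
logits = step_output @ kernel

print(step_output.shape)  # (64, 16)
print(logits.shape)       # (64, 50)
```

The length axis of size 1 is simply folded into the batch axis, leaving one logits row per example for the current step.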

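In the second example, you could call the decoder on the entire ground-truth sequence at training time, but it still works with length 1, so it can also be used in a for loop at inference time.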
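To illustrate why the same reshape covers both cases, here is a hedged NumPy sketch. The helper flatten_time is a hypothetical name introduced for this example, and the sizes are arbitrary.

```python
import numpy as np

def flatten_time(x):
    """Merge the batch and length axes: (batch, length, dim) -> (batch * length, dim)."""
    return x.reshape(-1, x.shape[2])

batch_size, max_length, hidden_size = 64, 20, 16

# Distinct values so individual rows can be traced through the reshape.
full_sequence = np.arange(batch_size * max_length * hidden_size, dtype=np.float64)
full_sequence = full_sequence.reshape(batch_size, max_length, hidden_size)
single_step = full_sequence[:, :1, :]          # length axis of size 1

flat_full = flatten_time(full_sequence)
flat_step = flatten_time(single_step)

print(flat_full.shape)  # (1280, 16)
print(flat_step.shape)  # (64, 16)

# Row i * max_length + t of the flattened array is time step t of example i.
i, t = 2, 5
row_matches = np.array_equal(flat_full[i * max_length + t], full_sequence[i, t])
print(row_matches)  # True
```

With length 1 the result collapses to (batch_size, hidden_size), which matches the step-by-step case, so one decoder definition can serve both modes.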
