Why won't my trivial LSTM fit?


I created a very simple LSTM to try to predict a short sequence, but it won't overfit and approach zero loss the way I expect.

Instead, it just converges to around 1.5, even though it definitely has enough degrees of freedom to learn this sequence verbatim.

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [[0, 1, 2, 3, 4, 5, 0, 6, 7, 0]]
train_output = [[1, 2, 3, 4, 5, 0, 6, 7, 8, 0]]

#
# Training metadata
#
batch_size = 1
sequence_length = 10
n_classes = 9

# Network size
rnn_cell_size = 10
rnn_layers = 2
embedding_rank = 3

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None, sequence_length], name="features")
embeddings = tf.Variable(tf.random_uniform([n_classes, embedding_rank], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, features)
cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_cell_size) for i in range(rnn_layers)])
initial_state = cell.zero_state(batch_size, tf.float32)
cell, _ = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# Convert sequences x batches x outputs to (sequences * batches) x outputs
flat_lstm_output = tf.reshape(cell, [-1, rnn_cell_size])
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes)
softmax = tf.nn.softmax(output)

#
# Training
#
targets = tf.placeholder(tf.int32, [None, sequence_length])
# Convert sequences x batches x targets to (sequences * batches) x targets
flat_targets = tf.reshape(targets, [-1])
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, softmax)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)

I feel like I'm missing something very basic here - what am I doing wrong?

EDIT

I tried to simplify it even further, and I'm now down to the following even simpler example (which also doesn't converge):

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [0, 1, 2, 3, 4]
train_output = [1, 2, 3, 4, 5]

#
# Training metadata
#
batch_size = 1
sequence_length = 5
n_classes = 6

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None])
one_hot = tf.contrib.layers.one_hot_encoding(features, n_classes)
output = tf.contrib.layers.fully_connected(inputs=one_hot, num_outputs=10)
output = tf.contrib.layers.fully_connected(inputs=output, num_outputs=n_classes)

#
# Training
#
targets = tf.placeholder(tf.int32, [None])
one_hot_targets = tf.one_hot(targets, depth=n_classes)
loss = tf.losses.softmax_cross_entropy(one_hot_targets, output)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)
machine-learning tensorflow lstm rnn
3 Answers
0 votes

Have you tried lower values for the learning rate (e.g., 0.001 or 0.0001)?
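If you want to try several values without editing the constant each time, one option (an illustrative sketch, not part of the original post; it reuses the loss, features, targets and hyperparameter names already defined in the question's script) is to feed the learning rate through a placeholder:

lr = tf.placeholder(tf.float32, [], name="lr")
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

for candidate in [0.01, 0.001, 0.0001]:
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # fresh weights for each run
        for _ in range(epochs * n_batches):
            sess.run(train_op, {features: train_input,
                                targets: train_output,
                                lr: candidate})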


0 votes

Your network isn't fitting (let alone overfitting) because you don't have enough data. The LSTM has only a single sequence, and the MLP has only 5 data points.

Compare that with the number of parameters you need to estimate: your MLP has about 120 parameters (if I counted correctly). There is no way you can estimate all of them from just 5 data points unless you get very lucky. (You may be able to make it converge more easily by splitting your sequence into smaller batches, but even then it won't converge very often.)
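For reference, a quick back-of-the-envelope count (my own, assuming the shapes from the simplified example) of the MLP's parameters, counting only the weight matrices of the two dense layers; the biases add another 16 on top:

# one_hot(6) -> dense(10) -> dense(6), weights only
n_classes, hidden = 6, 10
weights = n_classes * hidden + hidden * n_classes  # 60 + 60 = 120
biases = hidden + n_classes                        # 10 + 6  = 16
print(weights, weights + biases)                   # 120 136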

In short, neural networks need a fair amount of data to be usable.


0 votes

The answer turned out to be threefold:

1) The example without the RNN converges if I replace the default activation (relu) in the fully connected layers with tanh.

This seems to be because relu ignores a large part of its input (everything below zero) and provides no gradient there at all. With more inputs it might have worked.
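Applied to the simplified example, the change is just passing the activation explicitly; a minimal sketch (the activation_fn=None on the output layer here follows the same reasoning as points 2 and 3 below):

# Hidden layer: tanh instead of the default relu
output = tf.contrib.layers.fully_connected(inputs=one_hot, num_outputs=10,
                                           activation_fn=tf.nn.tanh)
# Output layer: no activation, the cross-entropy loss applies the softmax
output = tf.contrib.layers.fully_connected(inputs=output, num_outputs=n_classes,
                                           activation_fn=None)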

2) The example with the RNN needed the activation removed from the final fully connected layer (the one before the softmax) by passing activation_fn=None - with an activation on that layer it doesn't converge well (or at all).

3) The RNN example also needed the explicit softmax removed, since sparse_softmax_cross_entropy already applies a softmax itself.
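In other words, for the RNN example the layer feeding the loss boils down to this difference:

# Before (from the question): default relu activation plus an explicit
# softmax, fed into a loss that expects raw logits
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes)
softmax = tf.nn.softmax(output)
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, softmax)

# After: raw logits straight into the loss, which applies softmax internally
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes,
                                           activation_fn=None)
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, output)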

The final working code:

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [[0, 1, 2, 3, 4, 5, 0, 6, 7, 0]]
train_output = [[1, 2, 3, 4, 5, 0, 6, 7, 8, 0]]

#
# Training metadata
#
batch_size = 1
sequence_length = 10
n_classes = 9

# Network size
rnn_cell_size = 10
rnn_layers = 2
embedding_rank = 3

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None, sequence_length], name="features")
embeddings = tf.Variable(tf.random_uniform([n_classes, embedding_rank], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, features)
cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_cell_size) for i in range(rnn_layers)])
initial_state = cell.zero_state(batch_size, tf.float32)
cell, _ = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# Convert [batch_size, sequence_length, rnn_cell_size] to [(batch_size * sequence_length), rnn_cell_size]
flat_lstm_output = tf.reshape(cell, [-1, rnn_cell_size])
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes, activation_fn=None)

#
# Training
#
targets = tf.placeholder(tf.int32, [None, sequence_length])
# Convert [batch_size, sequence_length] to [batch_size * sequence_length]
flat_targets = tf.reshape(targets, [-1])
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, output)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)