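The code below produces a perfectly normal-looking loss chart after 20 epochs, but when I test it with seed text it always outputs blanks (""). Either I fundamentally misunderstand how the "text seed" is prepared or how the predictions are processed, or there is some subtle bug I have not noticed, so I am appealing to this community.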
The data used here is a small excerpt of nietzsche.txt, which is available at:
https://s3.amazonaws.com/text-datasets/nietzsche.txt
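The main problem is that the prediction always has `argmax() = 1`. The dictionary built from the test data has " " at that index, so the output is always a run of blank characters. The data fed to the model does look reasonable, though, as shown by the integer `x sequence` in the [output] section below. You can see that after each iteration a new (different) character is appended to the `pattern` list, and the list is sliced by dropping its first character to keep a constant length.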
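I have been trying to work out where the generation of non-blank text goes wrong, but I cannot pin the problem down. Any help would be greatly appreciated.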
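The window update described above can be checked in isolation. The following is a minimal sketch, not part of the original script: `slide` is a hypothetical helper, and the ten starting indices are the ones shown in the `pattern[:10]` comment of the script.

```python
def slide(window, new_index):
    """Append new_index and drop the oldest entry, keeping the length fixed."""
    return window[1:] + [new_index]

# First ten indices of the seed, as shown in the script's `pattern[:10]` comment.
window = [13, 37, 1, 33, 28, 27, 26, 1, 35, 27]

# Feed back index 1 (the blank character) three times, as the script ends up doing.
for _ in range(3):
    window = slide(window, 1)

print(window)       # [33, 28, 27, 26, 1, 35, 27, 1, 1, 1]
print(len(window))  # 10
```

The length stays constant, and every repeated prediction pushes one more blank into the model's next input.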
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
# load ascii text and convert to lowercase
filename = "/content/drive/MyDrive/Colab Notebooks/TexGen/nietzsche-short.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
# define the LSTM model
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(n_vocab, 50, input_length=seq_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),  # CNN layer
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.2),
    # y is 2-D (n_patterns, n_vocab), so the class count is y.shape[1]
    tf.keras.layers.Dense(y.shape[1], activation='softmax')
], name="LSTM_Model")
model.compile(loss='categorical_crossentropy', optimizer='adam')
history = model.fit(X, y,
                    epochs=25,
                    batch_size=128
                    )
# Test the model with a seed
start = np.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy of one entry of dataX (a list of lists, 100 character indices each)
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# seed:
# " ay upon words, a deception on the part of grammar, or an
# audacious generalization of very restricted "
# pattern[:10]
# [13, 37, 1, 33, 28, 27, 26, 1, 35, 27]
# generate characters
for i in range(5):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    print("\n==================")
    print("x[:10] : ", x[:, :10, :])
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    print("index = ", index)
    print("result = ", result)
    # seq_in = [int_to_char[value] for value in pattern]
    # sys.stdout.write(result)
    # slide the window: append the new index, drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]
print("\nDone.")
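One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the model starts with an `Embedding` layer, which looks up integer indices, yet both `X` and `x` are divided by `n_vocab` before being passed in. As far as I know, Keras' `Embedding` casts non-integer input to int32, so every scaled value in [0, 1) would become index 0 and every input sequence would look identical to the network. The sketch below reproduces only that cast in plain Python, with no Keras involved; `as_embedding_index` is a hypothetical stand-in, and `n_vocab = 39` is read off the `Dense` row of the model summary.

```python
n_vocab = 39  # vocabulary size, taken from the Dense layer in the model summary

def as_embedding_index(value):
    """Stand-in for an int32 cast: truncates toward zero."""
    return int(value)

seed = [13, 37, 1, 33, 28, 27, 26, 1, 35, 27]   # pattern[:10] from the script
scaled = [i / n_vocab for i in seed]             # what the script feeds the model
seen = [as_embedding_index(v) for v in scaled]   # what an index lookup would receive

print(seen)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

If this is what happens, the network cannot learn anything beyond overall character frequencies, which would fit a prediction that always peaks at the blank. Passing the raw integer indices with shape `(n_patterns, seq_length)` and no division, both in training and at prediction time, would be the thing to try.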
[output]
# predictions:
array([[0.01481258, 0.1349412 , 0.00109681, 0.00254168, 0.00037268,
0.00085235, 0.00087828, 0.01556777, 0.00960505, 0.00228236,
0.00174035, 0.00123438, 0.00138978, 0.07334321, 0.01076234,
0.01881236, 0.03297085, 0.0944486 , 0.0203624 , 0.01628518,
0.04297792, 0.05854145, 0.00125222, 0.00374453, 0.02957868,
0.0199816 , 0.05518206, 0.05479056, 0.02143795, 0.000657 ,
0.04608261, 0.06542768, 0.08481915, 0.02293939, 0.00776505,
0.01505006, 0.00033998, 0.01451185, 0.00062015]], dtype=float32)
==================
x[:10] : [[[0.33333333]
[0.94871795]
[0.02564103]
[0.84615385]
[0.71794872]
[0.69230769]
[0.66666667]
[0.02564103]
[0.8974359 ]
[0.69230769]]]
index = 1
result =
==================
x[:10] : [[[0.94871795]
[0.02564103]
[0.84615385]
[0.71794872]
[0.69230769]
[0.66666667]
[0.02564103]
[0.8974359 ]
[0.69230769]
[0.76923077]]]
index = 1
result =
==================
x[:10] : [[[0.02564103]
[0.84615385]
[0.71794872]
[0.69230769]
[0.66666667]
[0.02564103]
[0.8974359 ]
[0.69230769]
[0.76923077]
[0.41025641]]]
index = 1
result =
==================
x[:10] : [[[0.84615385]
[0.71794872]
[0.69230769]
[0.66666667]
[0.02564103]
[0.8974359 ]
[0.69230769]
[0.76923077]
[0.41025641]
[0.79487179]]]
index = 1
result =
==================
x[:10] : [[[0.71794872]
[0.69230769]
[0.66666667]
[0.02564103]
[0.8974359 ]
[0.69230769]
[0.76923077]
[0.41025641]
[0.79487179]
[0.17948718]]]
index = 1
result =
Done.
[/output]
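To see how decisive the printed prediction actually is, the sketch below ranks the 39 probabilities copied from the `# predictions:` array above. It is a plain-Python illustration; `top_k` is a hypothetical helper and not part of the original script.

```python
# Probabilities copied verbatim from the printed prediction array.
probs = [
    0.01481258, 0.1349412, 0.00109681, 0.00254168, 0.00037268,
    0.00085235, 0.00087828, 0.01556777, 0.00960505, 0.00228236,
    0.00174035, 0.00123438, 0.00138978, 0.07334321, 0.01076234,
    0.01881236, 0.03297085, 0.0944486, 0.0203624, 0.01628518,
    0.04297792, 0.05854145, 0.00125222, 0.00374453, 0.02957868,
    0.0199816, 0.05518206, 0.05479056, 0.02143795, 0.000657,
    0.04608261, 0.06542768, 0.08481915, 0.02293939, 0.00776505,
    0.01505006, 0.00033998, 0.01451185, 0.00062015,
]

def top_k(values, k):
    """Return the k (index, value) pairs with the largest values, best first."""
    ranked = sorted(enumerate(values), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

best = top_k(probs, 3)
print(len(probs))                     # 39
print([index for index, _ in best])  # [1, 17, 32]
```

The winning index holds only about 13% of the probability mass, so this is a fairly flat distribution rather than a confident vote for the blank. Given the mapping implied by the seed (13 is "a" and 37 is "y"), indices 17 and 32 would be "e" and "t", which resembles plain English character frequencies more than a context-dependent prediction.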
model.summary()
Model: "LSTM_Model"
_________________________________________________________________
Layer (type)                  Output Shape              Param #
=================================================================
embedding (Embedding)         (None, 100, 50)           1950
conv1d (Conv1D)               (None, 96, 128)           32128
max_pooling1d (MaxPooling1D)  (None, 24, 128)           0
lstm (LSTM)                   (None, 24, 256)           394240
dropout (Dropout)             (None, 24, 256)           0
lstm_1 (LSTM)                 (None, 256)               525312
dropout_1 (Dropout)           (None, 256)               0
dense (Dense)                 (None, 39)                10023
=================================================================
Total params: 963653 (3.68 MB)
Trainable params: 963653 (3.68 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
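The shapes and parameter counts in the summary can be re-derived by hand, which is a quick way to confirm that the layers are wired as intended. The sketch below uses the standard formulas for these layer types (for an LSTM: four gates, each with input weights, recurrent weights, and a bias). `lstm_params` is a hypothetical helper, and the vocabulary size of 39 is read from the `Dense` row.

```python
n_vocab, embed_dim, seq_length = 39, 50, 100

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

conv_len = seq_length - 5 + 1   # 'valid' Conv1D with kernel size 5
pool_len = conv_len // 4        # MaxPooling1D with pool_size=4

params = {
    "embedding": n_vocab * embed_dim,      # one 50-d vector per character
    "conv1d": 5 * embed_dim * 128 + 128,   # kernel weights plus biases
    "lstm": lstm_params(128, 256),
    "lstm_1": lstm_params(256, 256),
    "dense": 256 * n_vocab + n_vocab,
}

print(conv_len, pool_len)     # 96 24
print(params)
print(sum(params.values()))   # 963653
```

All five counts and the total agree with the summary, so the architecture itself is built as described.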
[1]: https://i.stack.imgur.com/QSWz3.jpg
[2]: https://s3.amazonaws.com/text-datasets/nietzsche.txt
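I found the following on tensorflow.org under "Text generation with an RNN":
To get actual predictions from the model you need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.
Note: It is important to sample from this distribution as taking the argmax of the distribution can easily get the model stuck in a loop.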
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
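I still do not fully understand what this is saying, but the code sample provided did improve the results considerably.
If anyone can clearly explain how this works, and exactly why "argmax can easily get the model stuck in a loop", I would be very grateful.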
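As a rough intuition for the loop remark, offered as a sketch rather than an authoritative answer: argmax decoding is deterministic, so once the same context recurs the same continuation follows forever, whereas sampling picks each index with its predicted probability and can leave the cycle. The toy below uses a made-up next-symbol table (`toy_model`, hypothetical, standing in for `model.predict`) over a three-symbol vocabulary. Note also that `tf.random.categorical` expects logits, so with a softmax output like the one in this model the probabilities would need a log applied first.

```python
import random

# Hypothetical stand-in for model.predict: next-symbol probabilities
# that depend only on the last symbol (vocabulary: 0, 1, 2).
toy_model = {
    0: [0.1, 0.5, 0.4],
    1: [0.6, 0.1, 0.3],
    2: [0.3, 0.3, 0.4],
}

def greedy(probs, rng=None):
    """Argmax decoding: always the single most likely index."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(pick, start, steps, rng=None):
    """Generate `steps` symbols, feeding each choice back in as the next context."""
    out, current = [], start
    for _ in range(steps):
        current = pick(toy_model[current], rng)
        out.append(current)
    return out

print(generate(greedy, 0, 8))                    # [1, 0, 1, 0, 1, 0, 1, 0]
print(generate(sample, 0, 8, random.Random(0)))  # depends on the random seed
```

Greedy decoding never emits symbol 2, even though it carries 30 to 40% of the probability in every row, and it bounces between 1 and 0 indefinitely. In the real model the context is the whole 100-character window, but the effect is similar: once greedy output fills the window with a repeating pattern, the input stops changing in any meaningful way and so does the output.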