Text generation always produces whitespace characters


The code below shows a perfectly normal-looking loss curve after 20 epochs, but when I try to test it with a seed text it always outputs blank lines (""). Either I fundamentally misunderstand the process of preparing the "text seed" or of handling the predictions, or there is some subtle bug I have not noticed, so I am turning to this community for help.

[Loss function plot][1]

The data used here is a small portion of nietzsche.txt, located here:

https://s3.amazonaws.com/text-datasets/nietzsche.txt

The main problem is that the prediction always has argmax() = 1. The dictionary associated with the test data has " " (a space) at that index, so the output is always a sequence of whitespace characters. Yet the data fed into the model does look reasonable, as shown by the integer x sequence printed in the [output] section below. You can see that after each iteration a new (different) character is appended to the pattern list, and the list is sliced to drop its first element so that its length stays constant.
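
As a quick sanity check: the vocabulary below is built by sorting the unique characters, so the newline and the space land at the lowest indices. Something like the following, run after the mapping code below, confirms that index 1 is the space character:

# Sanity check: run after chars / char_to_int / int_to_char are built below.
# Index 1 of the sorted vocabulary is the space character.
print(chars[:3])             # e.g. ['\n', ' ', '!'] for this corpus
print(repr(int_to_char[1]))  # "' '" -> every argmax() == 1 prediction is a space
print(char_to_int[' '])      # 1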

I have been trying to figure out where I am going wrong in generating non-blank text, but cannot pin down the problem. Any help would be greatly appreciated.

import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# load ascii text and convert to lowercase
filename = "/content/drive/MyDrive/Colab Notebooks/TexGen/nietzsche-short.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = to_categorical(dataY)

# define the LSTM model
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(n_vocab, 50, input_length=seq_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),  # CNN layer
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(y.shape[1], activation='softmax')
], name="LSTM_Model")

model.compile(loss='categorical_crossentropy', optimizer='adam')

history = model.fit(X, y,
          epochs=25,
          batch_size=128
)

# Test the model with a seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start] # dataX is a list of lists, 100 integers each
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# seed:
# " ay upon words, a deception on the part of grammar, or an
# audacious generalization of very restricted "

# pattern[:10]
# [13, 37, 1, 33, 28, 27, 26, 1, 35, 27]

# generate characters
for i in range(5):
  x = np.reshape(pattern, (1, len(pattern), 1))
  x = x / float(n_vocab)
  print("\n==================")
  print("x[:10] : ", x[:, :10, :])
  prediction = model.predict(x, verbose=0)
  index = np.argmax(prediction)
  result = int_to_char[index]
  print("index = ", index)
  print("result = ", result)
  # seq_in = [int_to_char[value] for value in pattern]
  #sys.stdout.write(result)
  pattern.append(index)
  pattern = pattern[1:len(pattern)]
print("\nDone.")


[output]

# predictions:

array([[0.01481258, 0.1349412 , 0.00109681, 0.00254168, 0.00037268,
        0.00085235, 0.00087828, 0.01556777, 0.00960505, 0.00228236,
        0.00174035, 0.00123438, 0.00138978, 0.07334321, 0.01076234,
        0.01881236, 0.03297085, 0.0944486 , 0.0203624 , 0.01628518,
        0.04297792, 0.05854145, 0.00125222, 0.00374453, 0.02957868,
        0.0199816 , 0.05518206, 0.05479056, 0.02143795, 0.000657  ,
        0.04608261, 0.06542768, 0.08481915, 0.02293939, 0.00776505,
        0.01505006, 0.00033998, 0.01451185, 0.00062015]], dtype=float32)

==================
x[:10] :  [[[0.33333333]
  [0.94871795]
  [0.02564103]
  [0.84615385]
  [0.71794872]
  [0.69230769]
  [0.66666667]
  [0.02564103]
  [0.8974359 ]
  [0.69230769]]]
index =  1
result =   

==================
x[:10] :  [[[0.94871795]
  [0.02564103]
  [0.84615385]
  [0.71794872]
  [0.69230769]
  [0.66666667]
  [0.02564103]
  [0.8974359 ]
  [0.69230769]
  [0.76923077]]]
index =  1
result =   

==================
x[:10] :  [[[0.02564103]
  [0.84615385]
  [0.71794872]
  [0.69230769]
  [0.66666667]
  [0.02564103]
  [0.8974359 ]
  [0.69230769]
  [0.76923077]
  [0.41025641]]]
index =  1
result =   

==================
x[:10] :  [[[0.84615385]
  [0.71794872]
  [0.69230769]
  [0.66666667]
  [0.02564103]
  [0.8974359 ]
  [0.69230769]
  [0.76923077]
  [0.41025641]
  [0.79487179]]]
index =  1
result =   

==================
x[:10] :  [[[0.71794872]
  [0.69230769]
  [0.66666667]
  [0.02564103]
  [0.8974359 ]
  [0.69230769]
  [0.76923077]
  [0.41025641]
  [0.79487179]
  [0.17948718]]]
index =  1
result =   

Done.
[/output]
model.summary()

Model: "LSTM_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 100, 50)           1950      
                                                                 
 conv1d (Conv1D)             (None, 96, 128)           32128     
                                                                 
 max_pooling1d (MaxPooling1  (None, 24, 128)           0         
 D)                                                              
                                                                 
 lstm (LSTM)                 (None, 24, 256)           394240    
                                                                 
 dropout (Dropout)           (None, 24, 256)           0         
                                                                 
 lstm_1 (LSTM)               (None, 256)               525312    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense (Dense)               (None, 39)                10023     
                                                                 
=================================================================
Total params: 963653 (3.68 MB)
Trainable params: 963653 (3.68 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


  [1]: https://i.stack.imgur.com/QSWz3.jpg
  [2]: https://s3.amazonaws.com/text-datasets/nietzsche.txt
nlp text-generation
1 Answer

I found the following under "Text generation with an RNN" on tensorflow.org:

To get actual predictions from the model you need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.

Note: It is important to sample from this distribution as taking the argmax of the distribution can easily get the model stuck in a loop.

sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

I still don't entirely understand what this is saying, but the code example it provides did improve the results considerably.
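
For reference, a minimal sketch of how that sampling step might be wired into the generation loop above (assuming the same model, pattern, n_vocab and int_to_char variables; since this model outputs softmax probabilities rather than logits, the log is taken before calling tf.random.categorical):

import sys
import numpy as np
import tensorflow as tf

# Sketch: sample the next character from the output distribution instead of
# taking argmax. Assumes model, pattern, n_vocab, int_to_char from the code above.
for i in range(200):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    prediction = model.predict(x, verbose=0)   # shape (1, n_vocab), softmax probabilities
    logits = np.log(prediction + 1e-8)         # categorical() expects logits; avoid log(0)
    index = int(tf.random.categorical(logits, num_samples=1)[0, 0])
    sys.stdout.write(int_to_char[index])
    pattern.append(index)
    pattern = pattern[1:]                      # keep the input window at a constant length
print("\nDone.")

The same effect can be had without TensorFlow via np.random.choice(n_vocab, p=prediction[0] / prediction[0].sum()).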

I would still appreciate it if someone could explain clearly how this works, and exactly why argmax can easily get the model stuck in a loop.
