I'm using an LSTM to generate news headlines. It should predict the next character based on the preceding characters in the sequence. I have a file of over a million news headlines, but for speed I chose to work with 100,000 randomly selected ones.
When I try to train the model, it reaches 1.0 validation accuracy and 0.9986 training accuracy in just the first epoch. That can't be right. I don't think it's a lack of data, since 90,000 training samples should be more than enough, and this doesn't look like ordinary overfitting. Training also seems to take an excessive amount of time (about 2.5 minutes per epoch), but I've never worked with LSTMs before, so I don't know what training times to expect. What is causing my model to behave like this?
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Import Libraries Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import csv
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
import datetime
import matplotlib.pyplot as plt
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Load Data Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
headlinesFull = []
with open("abcnews-date-text.csv", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for lines in csv_reader:
        headlinesFull.append(lines['headline_text'])
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Pretreat Data Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# shuffle and select 100000 headlines
np.random.shuffle(headlinesFull)
headlines = headlinesFull[:100000]
# add spaces to ensure each headline is the same length as the longest headline
max_len = max(map(len, headlines))
headlines = [i + " "*(max_len-len(i)) for i in headlines]
# integer encode sequences of words
# create the tokenizer
t = Tokenizer(char_level=True)
# fit the tokenizer on the headlines
t.fit_on_texts(headlines)
sequences = t.texts_to_sequences(headlines)
# vocabulary size
vocab_size = len(t.word_index) + 1
# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_len = X.shape[1]
# split data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Define Model Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_len))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Train Model Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# fit model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=128, epochs=1)
Train on 90000 samples, validate on 10000 samples
Epoch 1/1
90000/90000 [==============================] - 161s 2ms/step - loss: 0.0493 - acc: 0.9986 - val_loss: 2.3842e-07 - val_acc: 1.0000
Looking at the code, I can infer that the culprit is this line:
headlines = [i + " "*(max_len-len(i)) for i in headlines]
Because every headline is padded with spaces at the end, the last character of almost every sequence (which is exactly the target y the model has to predict) is the pad space. The model only needs to learn to always predict a space to score near-perfect accuracy after a single epoch.
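A quick sanity check makes this concrete (a minimal sketch with a toy list of headlines, since the real abcnews-date-text.csv corpus isn't needed to see the effect; the 0.75 figure is specific to this toy data):

```python
# Toy reproduction of the padding bug: pad headlines at the end,
# then use the last character as the prediction target.
headlines = ["police probe fire", "man charged", "rain hits city", "ab"]
max_len = max(map(len, headlines))
padded = [h + " " * (max_len - len(h)) for h in headlines]

# The target for each sequence is its final character.
targets = [p[-1] for p in padded]

# Every headline shorter than max_len ends in the pad character,
# so the model can score near-100% accuracy by always predicting " ".
frac_space = sum(t == " " for t in targets) / len(targets)
print(frac_space)  # 0.75 here; close to 1.0 on a real corpus of 100,000 headlines
```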
Solution:
Pad each headline at the beginning instead of at the end:
headlines = [" "*(max_len-len(i)) + i for i in headlines]
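With pre-padding, the final character of every padded string is the headline's true last character, so the target y actually varies (a minimal sketch with illustrative toy headlines):

```python
# Pre-pad: spaces go at the start, the real text ends the string.
headlines = ["police probe fire", "man charged", "rain hits city"]
max_len = max(map(len, headlines))
padded = [" " * (max_len - len(h)) + h for h in headlines]

# The last character is now the headline's real final character,
# never the pad, so the prediction target carries real information.
targets = [p[-1] for p in padded]
print(targets)  # ['e', 'd', 'y']
```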
Alternatively, split the headlines into X and y first, and only then add the padding to the end of each input, so that the target y is always a real character.
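That alternative can be sketched as follows (a minimal example on toy integer-encoded sequences; it hand-rolls with NumPy what Keras's pad_sequences(..., padding='post') would do, so the toy data and the choice of 0 as the pad value are illustrative):

```python
import numpy as np

# Toy integer-encoded headlines of unequal length (0 is reserved for padding).
sequences = [[5, 3, 8, 2], [7, 1], [4, 9, 6]]

# Take the real last token of each headline as the target *before* padding,
# so y is never the pad value.
X = [s[:-1] for s in sequences]
y = np.array([s[-1] for s in sequences])

# Pad the inputs only, appending zeros at the end up to a common length.
seq_len = max(map(len, X))
X = np.array([s + [0] * (seq_len - len(s)) for s in X])
print(X)
# [[5 3 8]
#  [7 0 0]
#  [4 9 0]]
print(y)  # [2 1 6]
```

Either way, the key point is the same: the character the model is asked to predict must come from the headline itself, not from the padding.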