Invalid argument: indices[0,0] = -4 is not in [0, 40405)

Question · Votes: 2 · Answers: 1

I have a model that processes some data. I've added some tokenized word data to the dataset (trimmed here for brevity):

# imports implied by the snippet
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split

comment_texts = df.comment_text.values

tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(comment_texts)
comment_seq = tokenizer.texts_to_sequences(comment_texts)
maxtrainlen = max_length(comment_seq)  # helper defined elsewhere in the notebook
comment_train = pad_sequences(comment_seq, maxlen=maxtrainlen, padding='post')
vocab_size = len(tokenizer.word_index) + 1

df.comment_text = comment_train

x = df.drop('label', axis=1)  # the thing I'm training

labels = df['label'].values  # also known as y

x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.2, random_state=1337)

n_cols = x_train.shape[1]

embedding_dim = 100  # TODO: why?

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=(n_cols,)),
    LSTM(32),
    Dense(32, activation='relu'),
    Dense(512, activation='relu'),
    Dense(12, activation='softmax'),  # for an unknown type, we don't account for that while training
])
model.summary()

Now, I get this error:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 12, 100)           4040500   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 32)                1056      
_________________________________________________________________
dense_1 (Dense)              (None, 512)               16896     
_________________________________________________________________
dense_2 (Dense)              (None, 12)                6156      
=================================================================
Total params: 4,081,632
Trainable params: 4,081,632
Non-trainable params: 0
_________________________________________________________________
Train on 4702 samples
Epoch 1/10
2020-03-04 22:37:59.499238: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: indices[0,0] = -4 is not in [0, 40405)

I think this must come from my comment_text column, since that's the only thing I added.
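For context, the check that aborts training is the Embedding layer's requirement that every input id lie in [0, input_dim). The following is a toy pure-Python stand-in for that bounds check (not TensorFlow's actual implementation), showing how a negative id such as the -4 in the log triggers it:

```python
def embedding_lookup(table, indices):
    """Toy embedding lookup mirroring TF's bounds check on input ids."""
    vocab_size = len(table)
    for row in indices:
        for idx in row:
            if not (0 <= idx < vocab_size):
                raise ValueError(f"indices = {idx} is not in [0, {vocab_size})")
    return [[table[idx] for idx in row] for row in indices]

# a tiny 5-word vocabulary with 2-dim vectors
table = [[i, i] for i in range(5)]

embedding_lookup(table, [[1, 4]])       # fine: ids within [0, 5)
try:
    embedding_lookup(table, [[-4, 0]])  # a negative id, as in the error above
except ValueError as e:
    print(e)                            # indices = -4 is not in [0, 5)
```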

My full code (before these changes) is here: https://colab.research.google.com/drive/1y8Lhxa_DROZg0at3VR98fi5WCcunUhyc#scrollTo=hpEoqR4ne9TO

keras word-embedding implementation
1 Answer

You should train with comment_train, not with x, since x picks up whatever else is in df, which is unknown.
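Concretely, that means splitting the padded token matrix itself rather than the whole DataFrame. A minimal sketch, assuming the variables from the question (the random data here is a stand-in for comment_train, and the split is done by hand for illustration; train_test_split works just as well):

```python
import numpy as np

# stand-in for comment_train: 10 padded sequences of length 12,
# with ids already in [0, vocab_size) -- sizes are illustrative
rng = np.random.default_rng(1337)
vocab_size = 40405
comment_train = rng.integers(0, vocab_size, size=(10, 12))
labels = rng.integers(0, 12, size=10)

# train on the token matrix itself, not on df.drop('label', axis=1):
split = int(0.8 * len(comment_train))       # 80/20 split
x_train, x_test = comment_train[:split], comment_train[split:]
y_train, y_test = labels[:split], labels[split:]

assert x_train.min() >= 0 and x_train.max() < vocab_size
# model.fit(x_train, y_train, ...) now only ever sees valid ids
```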

embedding_dim=100 is a free choice, just like the number of units in a hidden layer. You can tune it, as you would the hidden-unit counts, to find what works best for your model.
