I generate an embedding matrix from BERT embeddings as follows:
import torch
import tensorflow as tf
from tqdm import tqdm
from transformers import BertTokenizer, BertModel

# Load pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# Define batch size
batch_size = 1

# Tokenize and encode input data in batches
encoded_inputs = []
for i in range(0, len(labeled_data), batch_size):
    inputs = labeled_data[i:i+batch_size]
    encoded_inputs.append(tokenizer.batch_encode_plus(inputs, padding=True, truncation=True, return_tensors="pt"))

# Generate embeddings for each batch (mean-pool the last hidden state)
embeddings_new = []
for encoded_input in tqdm(encoded_inputs):
    with torch.no_grad():
        model_output = model(**encoded_input)
    batch_embeddings = model_output.last_hidden_state.mean(dim=1)
    embeddings_new.append(batch_embeddings)
embeddings_new = torch.cat(embeddings_new, dim=0)  # these are torch tensors, so torch.cat, not tf.concat

# Extract BERT's input word-embedding matrix and convert it to a TF tensor
embedding_matrix = model.embeddings.word_embeddings.weight
embedding_matrix = embedding_matrix.cpu().detach().numpy()
embed_tensor = tf.convert_to_tensor(embedding_matrix, dtype=tf.float32)
I then use this matrix as the initial weights of an LSTM:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

lstm_out1 = 150
embed_dim = 768

model = Sequential()
model.add(Embedding(embedding_matrix.shape[0], embed_dim, weights=[embed_tensor], input_length=50, trainable=False))
model.add(LSTM(lstm_out1, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)  # 'lr' and 'decay' are deprecated
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])
model.summary()
Now, when I call model.fit, it raises an error.
model.fit(tokenized_sentences, labels, batch_size=5, epochs=1, shuffle=True)
The error is:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
The contents of tokenized_sentences is a list:
[[101, 10406, 10161, 170, 10350, 87881, 10113, 21681, 11460, 13080, 10114, 10380, 10201, 10113, 108850, 10138, 12718, 10112, 10126, 15694, 10269, 62137, 13173, 10483, 102].....]
The contents of labels is also a list:
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]
embed_tensor looks like this:
tf.Tensor(
[[ 0.02595074 -0.00617341 -0.00409975 ... 0.02965234 0.02417551
0.01970279]
[ 0.01038065 -0.0136286 0.00672081 ... 0.01237162 0.0267217
0.03370738]
[ 0.0220679 -0.00360613 0.01932366 ... 0.0069061 0.026809
0.00498276]
...
[ 0.00684139 0.01885802 0.02666426 ... 0.02292391 0.06465269
0.04373793]
[ 0.0183579 0.01480132 0.02434449 ... 0.03205629 0.00708906
0.02039703]
[ 0.02139908 0.01879423 -0.01343376 ... -0.00597953 0.00583893
-0.00586251]], shape=(119547, 768), dtype=float32)
Why does model.fit() raise this error?
Since I can see you explicitly cast the other inputs to TF tensors, you could also try forcing the labels to be cast:
labels = np.array(labels).astype('float32')
or
labels = tf.cast(labels, dtype=tf.float32)
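Casting the labels is one possible fix; the same ValueError ("Unsupported object type list") also appears whenever the inner lists of tokenized_sentences have different lengths, because NumPy cannot build a rectangular array from a ragged list of lists. A minimal sketch of both steps, using hypothetical short token-id lists in place of the real data (Keras' pad_sequences performs the same padding):

```python
import numpy as np

# Hypothetical ragged token-id lists standing in for tokenized_sentences
tokenized_sentences = [[101, 10406, 10161, 102], [101, 10350, 102]]
labels = [1, 0]

# Pad every sentence to the Embedding layer's input_length (50) so the
# list of lists becomes one rectangular int array; a ragged list is what
# triggers "Unsupported object type list" during tensor conversion.
maxlen = 50
padded = np.zeros((len(tokenized_sentences), maxlen), dtype='int32')
for i, ids in enumerate(tokenized_sentences):
    padded[i, :min(len(ids), maxlen)] = ids[:maxlen]

# Cast the labels to a float array, as suggested above
labels = np.array(labels).astype('float32')

# model.fit(padded, labels, batch_size=5, epochs=1, shuffle=True)
# would then receive rectangular arrays instead of Python lists.
```
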