My TensorFlow model has 230 MB of parameters and my dataset is 300 MB, but training crashes after one epoch. It's a CNN for a binary classification problem.
The system has 16 GB of RAM and an RTX 4070 Ti.
After one epoch I get a message saying it is trying to allocate 12.5 GB:
102/102 [==============================] - ETA: 0s - loss: 1.0180 - accuracy: 0.6293 - precision: 0.0000e+00 - recall: 0.0000e+002024-04-12 08:04:29.166412: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 12582912000 exceeds 10% of free system memory.
Killed
Reducing the batch size to 1 didn't help, and I need to be able to train with batch sizes up to 64. Given how small my dataset is, that shouldn't be a problem.
Model (total params: 60,130,177 / 229.38 MB):
from tensorflow.keras.layers import (Input, Conv1D, MaxPooling1D, Flatten,
                                     Dense, concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import L2

def create_dual_stream_cnn_model(input_shape):
    # Define the shared input for both streams
    input = Input(shape=input_shape)
    # Stream 1
    x = Conv1D(64, 3, activation='relu', padding='same')(input)
    x = Conv1D(64, 3, activation='relu', padding='same')(x)
    x = MaxPooling1D(3, strides=3)(x)
    x = Conv1D(128, 3, activation='relu', padding='same')(x)
    x = Conv1D(128, 3, activation='relu', padding='same')(x)
    x = MaxPooling1D(3, strides=3)(x)
    x = Conv1D(256, 3, activation='relu', padding='same')(x)
    x = Conv1D(256, 3, activation='relu', padding='same')(x)
    x = MaxPooling1D(2, strides=2)(x)
    x = Conv1D(512, 3, activation='relu', padding='same')(x)
    x = Conv1D(512, 3, activation='relu', padding='same')(x)
    x = MaxPooling1D(2, strides=2)(x)
    x = Conv1D(512, 3, activation='relu', padding='same')(x)
    x = Conv1D(512, 3, activation='relu', padding='same')(x)
    x = MaxPooling1D(2, strides=2)(x)
    # Stream 2
    y = Conv1D(64, 7, activation='relu', padding='same')(input)
    y = Conv1D(64, 7, activation='relu', padding='same')(y)
    y = MaxPooling1D(3, strides=3)(y)
    y = Conv1D(128, 7, activation='relu', padding='same')(y)
    y = Conv1D(128, 7, activation='relu', padding='same')(y)
    y = MaxPooling1D(3, strides=3)(y)
    y = Conv1D(256, 3, activation='relu', padding='same')(y)
    y = Conv1D(256, 3, activation='relu', padding='same')(y)
    y = MaxPooling1D(2, strides=2)(y)
    y = Conv1D(512, 3, activation='relu', padding='same')(y)
    y = Conv1D(512, 3, activation='relu', padding='same')(y)
    y = MaxPooling1D(2, strides=2)(y)
    y = Conv1D(512, 3, activation='relu', padding='same')(y)
    y = Conv1D(512, 3, activation='relu', padding='same')(y)
    y = MaxPooling1D(2, strides=2)(y)
    concatenated = concatenate([x, y])
    z = Flatten()(concatenated)
    z = Dense(1024, activation='relu', kernel_regularizer=L2(0.0001))(z)
    z = Dense(1024, activation='relu', kernel_regularizer=L2(0.0001))(z)
    z = Dense(256, activation='relu', kernel_regularizer=L2(0.0001))(z)
    z = Dense(1, activation='sigmoid')(z)
    model = Model(inputs=input, outputs=z)
    optimizer = SGD()
    metrics = ['accuracy', 'Precision', 'Recall']
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=metrics)
    model.summary()
    return model
Training loop:
import os
from datetime import datetime

import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

BATCH_SIZE = 64
EPOCHS = 400
K_FOLDS = 10

X = np.array(cropped_records)
y = np.array(dup_labels)
y = y[:, 0].astype(int)
X = np.expand_dims(X, -1)

kf = KFold(n_splits=K_FOLDS, shuffle=True)
test_scores = []
fold_id = 0
train_time = datetime.now().strftime("%Y%m%d_%H%M%S")

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1)

    train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    validation_dataset = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
    test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    train_dataset = train_dataset.shuffle(buffer_size=100).batch(BATCH_SIZE).prefetch(buffer_size=BATCH_SIZE*3)
    validation_dataset = validation_dataset.batch(BATCH_SIZE).prefetch(buffer_size=BATCH_SIZE*3)
    test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(buffer_size=BATCH_SIZE*3)

    logs_dir = 'logs/' + train_time + f'/{fold_id}'
    if not os.path.exists(logs_dir):
        os.makedirs(logs_dir)

    model = create_dual_stream_cnn_model((X_train.shape[1], 1))
    print_gpu_availability()
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs_dir, histogram_freq=1)
    lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
    model.fit(train_dataset,
              epochs=EPOCHS, verbose=1,
              callbacks=[lr_scheduler, tensorboard_callback])

    test_loss, test_accuracy, test_precision, test_recall = model.evaluate(test_dataset)
    y_scores = model.predict(X_test, verbose=0)
    y_scores = y_scores.flatten()
    test_fpr, test_tpr, _ = roc_curve(y_test, y_scores)
    test_auc = roc_auc_score(y_test, y_scores)
    test_scores.append({'loss': test_loss,
                        'acc': test_accuracy,
                        'prec': test_precision,
                        'rec': test_recall,
                        'auc': test_auc,
                        'fpr': test_fpr,
                        'tpr': test_tpr})
    fold_id += 1
There are many ways to reduce memory consumption during training; this is a common problem for machine learning engineers.
First, if I'm reading this right, it looks like you may be training on the CPU. You need the NVIDIA CUDA libraries installed for TensorFlow to train on the GPU and use its VRAM. This can be tricky on Windows, but it's doable; look for a guide on installing CUDA for TensorFlow. I believe you need to download about three different packages from the NVIDIA website, and you need an NVIDIA developer account.
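A quick way to confirm whether TensorFlow can see your GPU at all is to list the physical devices it detects; an empty list means training is silently falling back to the CPU (and system RAM):

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means CPU-only training
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
```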
Your model is quite large; shrinking it a bit could help, so consider removing some of the Conv1D layers. How far you can go depends on the complexity of your classification task, of course. I know your model and dataset look small on disk, but during training you need a lot of memory to hold all the intermediate tensors of a deep model. You could also switch to float16 or another smaller dtype. In particular, try mixed-precision training: TensorFlow supports it via the tf.keras.mixed_precision API, and it can cut memory use and improve speed by using the Tensor Cores on your RTX 4070 Ti.
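A minimal sketch of enabling mixed precision globally (assumes TensorFlow 2.4+; with the policy active, layers compute in float16 while keeping their variables in float32):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 on Tensor Cores, keep weights in float32
mixed_precision.set_global_policy('mixed_float16')

policy = mixed_precision.global_policy()
print(policy.compute_dtype, policy.variable_dtype)  # float16 float32

# With this policy active, build the model as usual, but it's recommended
# to force the final sigmoid output back to float32 for numerical stability:
# z = tf.keras.layers.Dense(1, activation='sigmoid', dtype='float32')(z)
```

Set the policy once at the top of the script, before building the model, so every layer picks it up.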
If you haven't tried it, you can also use Google Colab, where you can usually get a free T4 GPU; its 16 GB of VRAM is more than the 12 GB on your RTX 4070 Ti, which helps with memory-bound training. It also comes with all the drivers TensorFlow needs already installed. The only downside is that Colab sessions time out, which makes long-running training jobs difficult.
Hope this helps. :)