How do I resolve this InternalError: Graph execution error when optimizing hyperparameters in Optuna?


I have been optimizing the hyperparameters of several TensorFlow neural network models with Optuna, in a Jupyter Notebook (Python 3.x) running in WSL. Hundreds of trials ran without any problems, until I decided I should save my study for future reference. I have a class in which I define objective_function() and optimize() methods, and I modified the optimize_study() method so that I can dump the study to a .pkl file:

def optimize_study(self):
    # optuna and joblib are assumed to be imported at module level
    from optuna.visualization import plot_optimization_history
    from optuna.importance import get_param_importances
    study = optuna.create_study(direction = "minimize", sampler = optuna.samplers.TPESampler(),
                                pruner = optuna.pruners.HyperbandPruner(), study_name=self.study_name)
    study.optimize(self.objective_function, n_trials = self.n_trials, gc_after_trial=True)
    # gc_after_trial added later
    plot_optimization_history(study).show()
    print(get_param_importances(study))
    joblib.dump(study, f"{self.study_name}.pkl")  # Line added later
    return study.best_params, study.best_value
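
For completeness, a study dumped this way can be loaded back later with joblib and inspected; a minimal sketch, assuming the .pkl file produced above is in the working directory (the file name here is illustrative, matching f"{study_name}.pkl"):

import joblib

# Reload the previously dumped Optuna study and inspect it
study = joblib.load("my_study.pkl")  # file name assumed
print(study.best_params, study.best_value)
print(f"{len(study.trials)} trials recorded")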

When I now run the hyperparameter optimization trials (set to 5), the trials run fine until the third one, where I get:

E tensorflow/stream_executor/dnn.cc:868] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(2683): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'

W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at cudnn_rnn_ops.cc:1563 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 91, 81, 1, 1, 128, 81] 

Trial 2 failed with parameters: {'units': 81, 'activation': 'softsign', 'dropout': 0.07633939325087957, 'optimizer': 'Adam', 'adam_learning_rate': 0.01799516104446331, 'filters': 91} because of the following error: InternalError().

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/tmp/ipykernel_3632372/3676345196.py", line 167, in objective_function
    self.neural_network.train_model(test_model)
  File "/tmp/ipykernel_3632372/227766830.py", line 178, in train_model
    history = model.fit(self.x_train, self.y_train, epochs = epoch_size, batch_size = BATCH_SIZE, callbacks = [early_stop],
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 91, 81, 1, 1, 128, 81] 
     [[{{node CudnnRNN}}]]
     [[sequential/lstm/PartitionedCall]] [Op:__inference_train_function_121807]

For denser neural network architectures I got a similar error on the very first trial, but with a long memory-allocation log indicating that the available memory had been exhausted, for example:

E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 27
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 8
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 272, 3
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 332, 3
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 512, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 544, 6
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 7968, 4
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 12288, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 18496, 6
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 42496, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 45152, 6
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 93908, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 751264, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16819712, 1
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:90] CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT: 67108864
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:92] CU_MEMPOOL_ATTR_USED_MEM_CURRENT: 18140216
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:93] CU_MEMPOOL_ATTR_RESERVED_MEM_HIGH: 67108864
E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:94] CU_MEMPOOL_ATTR_USED_MEM_HIGH: 34937704
E tensorflow/stream_executor/dnn.cc:868] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(2683): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at cudnn_rnn_ops.cc:1563 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 83, 34, 1, 1, 128, 34] 
E tensorflow/stream_executor/dnn.cc:868] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(2683): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at cudnn_rnn_ops.cc:1563 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 83, 34, 1, 1, 128, 34] 
Trial 0 failed with parameters: {'units': 34, 'activation': 'softsign', 'dropout': 0.1391979014457847, 'optimizer': 'Adam', 'adam_learning_rate': 0.07514111264388643, 'filters': 83} because of the following error: InternalError().

I tried restoring the code to its previous, problem-free state by commenting out

joblib.dump(study, f"{self.study_name}.pkl")
gc_after_trial=True

but I still get the same errors as above. I did not change anything in the model-training functions, which are implemented in a separate class; I instantiate that object inside the class that contains optimize_study().

I had never hit this error before; I have optimized more than 8-10 models at 500 trials each in a single session within the GPU memory I have (~5 GB), so I don't understand why GPU memory is now running out. It feels as if some variable in some module/file is now set differently, but I can't pin it down. I looked at this question, but I only started running out of memory after adding the two code changes above to optimize_study(), so this seems different.
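
For reference, the suggestion I keep seeing for this kind of CUDNN_STATUS_INTERNAL_ERROR with memory exhaustion is to enable GPU memory growth before any model is built, so TensorFlow allocates memory on demand rather than reserving most of the GPU up front; a minimal sketch, assuming TensorFlow 2.x (I have not confirmed this addresses my case):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it eagerly;
# this must run before any GPU has been initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)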

Any ideas on why this is happening and how to fix it?

Edit: The memory-allocation log shown above is a truncated version of the original. The exact code is too long to repost, but here is how the code flows between the optimizer class O and the NN class N (take the GRU model as an example):

O.optimize() => O.optimize_study() => O.objective_function() => N.build_GRU_model() => N.train_model() => N.predict() => N.evaluate_loss_function()

Optimizer functions

def optimize(self):
    best_params, best_values = self.optimize_study()
    print(f"Best params: {best_params}\n Best value: {best_values}")
    return self


def objective_function(self, trial):
    units = trial.suggest_int('units', 10, 50)
    activation = trial.suggest_categorical("activation", ['relu', 'tanh', 'softsign'])
    dropout = trial.suggest_float('dropout', 0.01, 0.5)
    # the suggested values are presumably consumed inside build_deep_GRU_model via the trial object
    test_model = self.build_deep_GRU_model(trial)
    self.neural_network.train_model(test_model)
    y_true, y_pred = self.neural_network.predict(test_model)
    return self.neural_network.evaluate_loss_function(y_true, y_pred)

Neural network functions

def build_GRU_model(self, hidden_neurons, activator, drop_out, OPTIMIZER = 'adam'):
    keras.backend.clear_session()  # drop any graph state left over from the previous trial
    GRU_layer = keras.layers.GRU(hidden_neurons, dropout = drop_out, activation = activator)
    gru_model = keras.Sequential(layers = (GRU_layer, keras.layers.Dense(self.output_neurons)))
    gru_model.reset_states()
    gru_model.compile(optimizer = OPTIMIZER, loss = self.mae)
    return gru_model


def train_model(self, model, epoch_size = 150, BATCH_SIZE = BATCH_SIZE):  # BATCH_SIZE is a constant defined elsewhere
    early_stop = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 30, mode = 'min')
    history = model.fit(self.x_train, self.y_train, epochs = epoch_size, batch_size = BATCH_SIZE,
                        callbacks = [early_stop], validation_data = (self.x_valid, self.y_valid), shuffle = False)
    print(model.summary())
    return history

The predictor and loss-evaluator functions called from objective_function() simply undo the scaling and output the loss value used for hyperparameter optimization. The exception is raised at the history = model.fit(...) line.

python-3.x tensorflow keras optuna
1 Answer

It turns out that I had multiple Jupyter kernel sessions running in the background on the remote server, and they kept holding GPU memory even after I disconnected from the server. The network architecture and the added joblib.dump() had nothing to do with this graph execution error. I verified that the problem did not reproduce on my laptop's GPU, and that running extra sessions on the server GPU was what led to the OOM.
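
For anyone hitting the same thing: a quick way to confirm that stray sessions are holding on to GPU memory is to list per-process GPU usage on the server; a minimal sketch, assuming nvidia-smi is available inside the WSL/remote environment:

import subprocess

# Show which processes (e.g. leftover Jupyter kernels) are holding GPU memory
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True).stdout)

Shutting down the stale kernels (or restarting the remote Jupyter server) should release that memory.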
