How do I fix this runtime error in this Databricks distributed training tutorial notebook?

Problem description

I'm following this notebook, which I found via this article. I'm trying to fine-tune a model on a single node with multiple GPUs, so I ran everything up through the "Run local training" section and then skipped ahead to "Run distributed training on a single node with multiple GPUs". When I run the first cell there, I get this error:

RuntimeError: TorchDistributor failed during training. View stdout logs for detailed error message.

Here is the full output I see from the code cell:

We're using 4 GPUs
Started local training with 4 processes
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2023-08-22 19:31:47.794586: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.809864: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.824423: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.828933: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/tmp/tmpz1ss252g/train.py", line 8, in <module>
    output = train_fn(*args)
  File "<command-2821949673242075>", line 46, in train_model
  File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1855, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 1021, in on_train_begin
    self.setup(args, state, model)
  File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 990, in setup
    self._ml_flow.start_run(run_name=args.run_name, nested=self._nested_run)
  File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 363, in start_run
    active_run_obj = client.create_run(
  File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/client.py", line 326, in create_run
    return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
  File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 133, in create_run
    return self.store.create_run(
  File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 178, in create_run
    response_proto = self._call_endpoint(CreateRun, req_body)
  File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/databricks/python/lib/python3.10/site-packages/mlflow/utils/databricks_utils.py", line 422, in get_databricks_host_creds
    config = provider.get_config()
  File "/databricks/python/lib/python3.10/site-packages/databricks_cli/configure/provider.py", line 134, in get_config
    raise InvalidConfigurationError.for_profile(None)
databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! Please configure by entering `/tmp/tmpz1ss252g/train.py configure`
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2572 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2573 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2571) of binary: /local_disk0/.ephemeral_nfs/envs/pythonEnv-3b3dff80-496a-4c7d-9684-b04a17a299d3/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/tmp/tmpz1ss252g/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-22_19:31:58
  host      : 0821-144503-em46c4jc-10-52-237-200
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2571)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Do I need to enable more of the traceback to see more of the error? Do I need to "configure the CLI" (whatever that means)? Is there something blindingly obvious that I'm missing?
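The PyTorch page linked in the "Root Cause" block suggests that child tracebacks can be surfaced by wrapping the entry point with the @record decorator. A minimal sketch of my understanding, applied to the notebook's train_model function:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def train_model():
    # ... fine-tuning code from the notebook's train_model ...
    pass

I haven't confirmed whether TorchDistributor preserves the decorator when it ships the function to the workers, so I'm treating this as an experiment rather than a fix.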

I'm using a g5.12xlarge with 4 GPUs, and my Databricks Runtime version is "13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)". I'm running this in a Databricks notebook.

pyspark gpu databricks huggingface-transformers distributed-computing
1 Answer

It turns out I did need to configure the CLI. I never imagined I'd have to do that inside a Databricks notebook, but Spark is a harsh mistress. To do so, click the icon in the bottom-right corner that looks like a terminal; hovering over it shows "Open bottom panel". Then enter

databricks configure
and follow the prompts from there. Don't close the panel: it's ephemeral, so only close it once your fine-tuning is done.
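If you'd rather not rely on the ephemeral web terminal, the same credentials can be supplied programmatically. Here's a minimal sketch with placeholder host and token values you'd substitute yourself; the databricks-cli config provider that raised InvalidConfigurationError above also accepts the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and setting them inside the training function should make them visible to every TorchDistributor worker:

import os

def train_model():
    # Placeholders -- substitute your workspace URL and a personal access
    # token. These variables are read by the same databricks-cli provider
    # chain that otherwise looks for ~/.databrickscfg.
    os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
    os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"
    # ... rest of the fine-tuning code from the notebook ...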

Running it from there led me to another error. I had commented out this line

report_to=['tensorboard'], # REMOVE MLFLOW INTEGRATION FOR NOW

which turned out to be a mistake. This was the error:

mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.

The code only worked once I had un-commented that line and configured the CLI.
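For context, here's a minimal sketch of the relevant TrainingArguments (the output_dir value is a hypothetical placeholder). In the transformers version on this runtime, leaving report_to unset defaults to every installed integration, MLflow included, which is why commenting the line out brought the MLflow callback back and triggered the RESOURCE_DOES_NOT_EXIST error:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/distilbert-finetune",  # hypothetical path
    report_to=["tensorboard"],  # left unset, this defaults to all installed integrations
)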
