"train": executable file not found in $PATH

Problem description

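I cannot start a SageMaker training job, either locally or on an AWS instance. Even after checking the output logs, I could not find any hint on how to solve this problem.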

To reproduce

I am using the SageMaker PyTorch Estimator, a custom Docker image stored in AWS ECR, and source code from GitHub.

from sagemaker.pytorch.estimator import PyTorch

role = "arn:..."

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams
)
estimator.fit()

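Expected behavior

Basically, the estimator.fit() call should clone the model training script published on GitHub (from the sagemaker branch, as given in the git_config parameter), pull the Docker image defined in image_uri, and start training from main.py, as given in the entry_point parameter.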

Logs

Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'

[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
 Container 1l7x1nzly6-algo-1-55row  Creating
 Container 1l7x1nzly6-algo-1-55row  Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
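
In the log above, the line that matters is the Docker daemon error: the generated compose file sets command: train, and the container runtime cannot find an executable with that name in the image. The Python tracebacks that follow only report that docker compose exited with code 1. As a minimal sketch (missing_executable is a hypothetical helper, not part of the SageMaker SDK), the name of the missing executable can be pulled out of such a log line like this:

```python
import re

# Pattern of the runc error emitted when the container command is not on $PATH.
_MISSING_EXEC = re.compile(r'exec: "([^"]+)": executable file not found in \$PATH')


def missing_executable(log_text):
    """Return the name of the executable Docker could not find, or None."""
    match = _MISSING_EXEC.search(log_text)
    return match.group(1) if match else None


# The daemon error line copied from the log above.
log_line = (
    'Error response from daemon: failed to create task for container: '
    'failed to create shim task: OCI runtime create failed: runc create failed: '
    'unable to start container process: exec: "train": '
    'executable file not found in $PATH: unknown'
)

print(missing_executable(log_line))  # train
```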


System information

  • SageMaker Python SDK version: sagemaker 2.192.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch 2.0.1
  • Python version: Python 3.10
  • Docker: 24.0.6
  • Custom Docker image (yes/no): yes, on ECR.

Dockerfile (used to build the image stored in AWS ECR)

FROM python:3.10.12-buster

COPY requirements.txt /tmp/
RUN pip install --requirement /tmp/requirements.txt

requirements.txt

hydra-core~=1.2.0
nmslib~=2.1.1
numpy~=1.22.0
omegaconf~=2.2.2
pandas~=1.4.3
pytorch-lightning~=1.6.5
pytorch-metric-learning~=1.5.2
ranx~=0.3.6
torch==1.13.1
torchmetrics~=0.9.2
tqdm~=4.64.0
transformers~=4.27.2

1 Answer

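I noticed that the default images used by SageMaker Estimators include the SageMaker Training Toolkit, which provides the train command that is missing from your container. So you could try installing it in your Dockerfile:

RUN pip install sagemaker-training

or adding it to your requirements.txt:

sagemaker-training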
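
After rebuilding the image, one way to confirm that a train executable is now on $PATH is to run a small lookup inside the container. This is a minimal sketch, assuming that installing sagemaker-training adds a console script named train; find_executable is a hypothetical helper name, not part of any SageMaker package.

```python
import shutil


def find_executable(name="train", search_path=None):
    """Return the full path of `name` if it is on the search path, else None.

    When search_path is None, shutil.which falls back to the $PATH
    environment variable, which is what the container runtime consults too.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("train not found on $PATH; the image is likely missing sagemaker-training")
    else:
        print(f"train found at {location}")
```

Running this script with the image's Python interpreter before pushing to ECR should surface the problem earlier than a failed estimator.fit() call.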