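I am unable to start a SageMaker training job (either locally or on an AWS instance). Even after checking the output logs, I cannot find any indication of how to fix the problem.
To reproduce
I am using the SageMaker PyTorch Estimator, a custom Docker image stored in AWS ECR, and source code from GitHub.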
from sagemaker.pytorch.estimator import PyTorch
role = "arn:..."
estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams
)
estimator.fit()
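Expected behavior
Basically, the estimator.fit() call should clone the model training script published on GitHub (on the sagemaker branch, as specified by the git_config parameter), pull the Docker image defined in image_uri, and start training from main.py, as specified by the entry_point parameter.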
Logs
Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
Container 1l7x1nzly6-algo-1-55row Creating
Container 1l7x1nzly6-algo-1-55row Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
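The Python traceback above only reports that docker compose exited with code 1; the actual cause is the earlier "Error response from daemon" line. As a minimal sketch, assuming the runc message keeps the wording shown in this log (other Docker versions may phrase it differently), the name of the missing executable can be pulled out of a captured log like this. The helper name missing_executable is hypothetical and not part of the SageMaker SDK.

```python
import re
from typing import Optional

# Pattern taken from the runc error text in the log above. This wording is an
# assumption about the format and may differ between Docker versions.
_MISSING_EXEC = re.compile(r'exec: "([^"]+)": executable file not found in \$PATH')


def missing_executable(log_text: str) -> Optional[str]:
    """Return the name of the executable Docker could not find, or None."""
    match = _MISSING_EXEC.search(log_text)
    return match.group(1) if match else None


# The daemon error line copied from the log above.
log_line = (
    'Error response from daemon: failed to create task for container: '
    'failed to create shim task: OCI runtime create failed: runc create failed: '
    'unable to start container process: exec: "train": '
    'executable file not found in $PATH: unknown'
)
print(missing_executable(log_line))
```

For this log the helper reports train, which matches the "command: train" entry in the generated docker compose file: the container is started with a train command that the image does not provide.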
System information
Dockerfile (used to build the image stored in AWS ECR)
FROM python:3.10.12-buster
COPY requirements.txt /tmp/
RUN pip install --requirement /tmp/requirements.txt
requirements.txt
hydra-core~=1.2.0
nmslib~=2.1.1
numpy~=1.22.0
omegaconf~=2.2.2
pandas~=1.4.3
pytorch-lightning~=1.6.5
pytorch-metric-learning~=1.5.2
ranx~=0.3.6
torch==1.13.1
torchmetrics~=0.9.2
tqdm~=4.64.0
transformers~=4.27.2
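I noticed that the default images used by SageMaker Estimators include the SageMaker Training Toolkit, which provides the train command that is missing from your container. So you could try including it in your Dockerfile:
RUN pip install sagemaker-training
or in your requirements.txt:
sagemaker-training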
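After rebuilding the image, it may help to confirm that a train executable is actually on the PATH before launching another job. The following is a minimal sketch, assuming you can run a Python script inside the container (for example by mounting it into a docker run of your image); the function name find_train_command is hypothetical and only wraps the standard library's shutil.which.

```python
import shutil
from typing import Optional


def find_train_command(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of a `train` executable, or None if none is found.

    search_path is a PATH-style string; None means use the PATH environment
    variable of the current process.
    """
    return shutil.which("train", path=search_path)


if __name__ == "__main__":
    location = find_train_command()
    if location is None:
        # Same condition that made Docker fail with
        # 'exec: "train": executable file not found in $PATH'.
        print("`train` was not found on PATH")
    else:
        print(f"`train` found at {location}")
```

If the script reports that train was not found, the image would most likely fail in the same way as in the log above.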