AWS SageMaker Training - Container Arguments - Boto3 API


I need guidance on passing command-line arguments to a SageMaker training job using the Boto3 API. Here is my Dockerfile:

FROM public.ecr.aws/ubuntu/ubuntu:22.04

LABEL version="2.0"

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         build-essential \
         python3-dev \
         python3-pip \
         python3-setuptools \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN python3.10 -m pip install pip --upgrade && pip install --upgrade cython
RUN ln -s /usr/bin/python3 /usr/bin/python
COPY requirements.txt .
RUN pip --no-cache-dir install -r requirements.txt

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/ml/code/:${PATH}"
ENV PYTHONPATH="/opt/ml/code/:${PYTHONPATH}"

COPY src/ /opt/ml/code/
WORKDIR /opt/ml/code/

ENTRYPOINT [ "python", "/opt/ml/code/entry_point.py" ]

The entry_point.py script is as follows:

import argparse

# run_inference and run_training are defined elsewhere in src/ (not shown)
parser = argparse.ArgumentParser()
parser.add_argument("--mode", type=str, required=True)
parser.add_argument("--region_id", type=int)  # name matches args.region_id below

args = parser.parse_args()

if args.mode == "inference":
    run_inference(args.region_id)
elif args.mode == "training":
    run_training(args.region_id)
else:
    raise ValueError(f"Unknown mode: {args.mode}")

The image has been published to AWS ECR. I now start the job with the following boto3 API call:

import boto3

session = boto3.Session(profile_name='algoprod')
client = session.client('sagemaker', region_name='us-east-1')
training_job_name = 'sagemaker-training-demo'
resp = client.create_training_job(
    TrainingJobName=training_job_name,
    RoleArn="xxxx",
    AlgorithmSpecification={
        'TrainingImage': "image:latest",
        'TrainingInputMode': "File",
        'ContainerArguments': [
            '--mode training',
            '--region_id 1',
        ],
    },
)

print(resp)

The boto3 API call above successfully starts a SageMaker training job in AWS, but the job fails with the following error message:

entry_point.py: error: the following arguments are required: --mode

The mode is passed through ContainerArguments, following the Boto3 documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html

Please advise.

python-3.x boto3 amazon-sagemaker
1 Answer

Maybe the solution is as simple as putting training in quotes, i.e. "training":

'ContainerArguments': ['--mode "training"',
                       '--region_id 1',]

1 is understood as an integer, but the unquoted training is interpreted as a variable.
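
Another thing worth checking is how the arguments are split. The sketch below runs locally with argparse only and has not been tested against SageMaker. It assumes that each element of ContainerArguments reaches the container as one argv entry, the same way an exec-form Docker CMD would. Under that assumption, '--mode training' arrives as a single token containing a space, argparse does not recognize it as the --mode option, and the run ends with the same "the following arguments are required: --mode" error. Adding quotes inside the string does not change this. The helper names build_parser, try_parse and algorithm_spec are made up for illustration, and the parser uses --region_id to match how the script reads the value.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the parser in the question's entry_point.py (using --region_id).
    parser = argparse.ArgumentParser(prog="entry_point.py")
    parser.add_argument("--mode", type=str, required=True)
    parser.add_argument("--region_id", type=int)
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints its error to stderr and exits; treat that as failure.
        return None


def algorithm_spec(image_uri, command_line):
    """Hypothetical helper: build AlgorithmSpecification with one token per element."""
    return {
        "TrainingImage": image_uri,
        "TrainingInputMode": "File",
        # shlex.split turns "--mode training" into ["--mode", "training"]
        "ContainerArguments": shlex.split(command_line),
    }


# Flag and value joined in one string, as in the question: rejected.
joined = try_parse(["--mode training", "--region_id 1"])

# Same, with quotes embedded in the string: still rejected.
quoted = try_parse(['--mode "training"', "--region_id 1"])

# One list element per token: accepted.
split = try_parse(["--mode", "training", "--region_id", "1"])

spec = algorithm_spec("image:latest", "--mode training --region_id 1")
print(spec["ContainerArguments"])
```

If that assumption holds, passing 'ContainerArguments': ['--mode', 'training', '--region_id', '1'] in the create_training_job call would be the change to try.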
