SageMaker pipeline endpoint deployment fails with CannotStartContainerError


I have a SageMaker pipeline that looks like the following (parameter and other variable definitions omitted):

# 1. Model training step
estimator = TensorFlow(
    entry_point="train.py",
    source_dir=src_dir,
    role=role,
    instance_count=1,
    instance_type="ml.m4.4xlarge",
    framework_version="2.1",
    py_version="py3",
    base_job_name="quantitative-scores-training",
    output_path=s3_training_output_file,
    code_location=f"{base_dir}/code/"
)

training_inputs = {
    'train': TrainingInput(
        s3_data=s3_training_data_input_file,
        content_type='text/csv',
        input_mode='FastFile'
    )
}

training_step = TrainingStep(
    name='Train',
    estimator=estimator,
    inputs=training_inputs,
)

# 2. Create model step
model = Model(
    entry_point='inference.py',
    source_dir=src_dir,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=sagemaker_session,
    image_uri=estimator.training_image_uri(),
)

create_model_step = ModelStep(
    name="ModelStep",
    step_args=model.create(
        instance_type='ml.m4.4xlarge'
    ),
)

# 3. Deploy model to endpoint step
deploy_model_lambda_function = Lambda(
    function_name="sagemaker-deploy-quant-score",
    execution_role_arn=create_sagemaker_lambda_role("deploy-model-lambda-role"),
    script="/home/ec2-user/SageMaker/my_path/src/util/deploy_model_lambda.py",
    handler="deploy_model_lambda.lambda_handler",
)

deploy_model_step = LambdaStep(
    name="DeployModelStep",
    lambda_func=deploy_model_lambda_function,
    inputs={
        "model_name": create_model_step.properties.ModelName,
        "endpoint_config_name": "quantitative-scoring-pipeline-config",
        "endpoint_name": endpoint_name,
        "endpoint_instance_type": "ml.m4.xlarge",
    },
)

# Connect pipeline
pipe = Pipeline(
    name="QuantitativeScoringPipeline",
    steps=[
        training_step,
        create_model_step,
        deploy_model_step
    ],
    parameters=[
        # I omitted these definitions above
        s3_training_data_input_file,
        s3_training_output_file,
        endpoint_name
    ],
)
pipe.upsert(role_arn=role)
execution = pipe.start()
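The deploy_model_lambda.py script referenced above is not shown in the question. A minimal sketch of what such a handler might look like, assuming the event keys match the LambdaStep inputs dict (the function and helper names here are hypothetical, not from the question):

```python
# deploy_model_lambda.py -- hypothetical sketch; the real script is not shown
# in the question. Event keys mirror the LambdaStep `inputs` above.

def build_endpoint_config_request(event):
    """Build the CreateEndpointConfig payload from the step inputs (pure helper)."""
    return {
        "EndpointConfigName": event["endpoint_config_name"],
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": event["model_name"],
            "InstanceType": event["endpoint_instance_type"],
            "InitialInstanceCount": 1,
        }],
    }

def lambda_handler(event, context):
    import boto3  # imported lazily so the pure helper above stays unit-testable
    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(**build_endpoint_config_request(event))
    # Create the endpoint if it does not exist yet, otherwise update it in place.
    existing = sm.list_endpoints(NameContains=event["endpoint_name"])["Endpoints"]
    if any(e["EndpointName"] == event["endpoint_name"] for e in existing):
        sm.update_endpoint(EndpointName=event["endpoint_name"],
                           EndpointConfigName=event["endpoint_config_name"])
    else:
        sm.create_endpoint(EndpointName=event["endpoint_name"],
                           EndpointConfigName=event["endpoint_config_name"])
    return {"statusCode": 200}
```

Note that the Lambda returning 200 only means the create/update request was accepted; the endpoint transitions to Failed asynchronously, which matches the behaviour described below.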

However, when it comes to the lambda deploying the endpoint, the lambda itself succeeds but the endpoint creation always fails afterwards. The container never starts, so there are no logs in CloudWatch, but I get the message:

CannotStartContainerError. Please ensure the model container for variant AllTraffic starts correctly when invoked with 'docker run <image> serve'

Obviously I am not using a custom container.

Strangely, if I create/update the endpoint using the SageMaker SDK (as below), with the same model S3 URI, it works absolutely fine. This is exactly the same model that fails above.

model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='src',
    model_data="s3://sagemaker-eu-west-1-558091818291/tensorflow-training-2024-04-25-12-11-21-401/pipelines-dgstz6rrp8u9-ModelStep-RepackMode-P5O95TSntC/output/model.tar.gz",
    role=role,
    framework_version="2.1",
)
predictor = model.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1, endpoint_name=endpoint_name)

However, the second approach creates a new model tarball. I inspected the contents of both tarballs, and the inference code and model data look identical. I'm really confused as to why the pipeline fails to bring up the endpoint while this succeeds. The only difference I can think of is that here I use a framework version rather than a specific image URI, but I don't know how to fix that.
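Comparing the contents of the two tarballs, as described above, can be scripted with the standard library alone. A sketch (the paths are placeholders for the two model.tar.gz files):

```python
import tarfile

def tar_members(path):
    """Return the sorted file-member names of a model tarball."""
    with tarfile.open(path, "r:gz") as tf:
        return sorted(m.name for m in tf.getmembers() if m.isfile())

def diff_tarballs(a, b):
    """Return (names only in a, names only in b) for two tarballs."""
    ma, mb = set(tar_members(a)), set(tar_members(b))
    return ma - mb, mb - ma
```

If both diffs come back empty, the archives contain the same file names, which narrows the problem down to the container image rather than the artifacts.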

1 Answer

Solved the problem by replacing the generic Model object with a TensorFlowModel. This let me stop specifying an image URI and instead just pass the framework version to use.

I.e. this:

model = Model(
    entry_point='inference.py',
    source_dir=src_dir,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=sagemaker_session,
    image_uri=estimator.training_image_uri(),
)

became this:

model = TensorFlowModel(
    entry_point='inference.py',
    source_dir=src_dir,
    framework_version="2.1",
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
    role=role
)

Even though it came from a SageMaker example, it seems estimator.training_image_uri() is not suitable for serving TensorFlow models 🤷.
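This is consistent with how the AWS Deep Learning Container repositories are split into separate tensorflow-training and tensorflow-inference images; only the inference image ships a model server that can answer 'docker run &lt;image&gt; serve'. A toy heuristic illustrating the mismatch (the URIs in the usage note are illustrative, not copied from the question):

```python
def is_serving_image(image_uri: str) -> bool:
    """Heuristic check: AWS Deep Learning Container repositories are named
    '<framework>-training' and '<framework>-inference'; only the latter
    contains a model server, so passing a training image to an endpoint
    can fail with CannotStartContainerError."""
    repo = image_uri.split("/")[-1].split(":")[0]
    return repo.endswith("-inference")
```

For example, is_serving_image("…/tensorflow-training:2.1-cpu-py3") is False, while is_serving_image("…/tensorflow-inference:2.1-cpu") is True, which is why letting TensorFlowModel resolve the image from framework_version works where estimator.training_image_uri() does not.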
