I have a SageMaker pipeline that looks like this (parameters and other variables omitted):
from sagemaker.inputs import TrainingInput
from sagemaker.lambda_helper import Lambda
from sagemaker.model import Model
from sagemaker.tensorflow import TensorFlow
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# 1. Model training step
estimator = TensorFlow(
    entry_point="train.py",
    source_dir=src_dir,
    role=role,
    instance_count=1,
    instance_type="ml.m4.4xlarge",
    framework_version="2.1",
    py_version="py3",
    base_job_name="quantitative-scores-training",
    output_path=s3_training_output_file,
    code_location=f"{base_dir}/code/",
)

training_inputs = {
    'train': TrainingInput(
        s3_data=s3_training_data_input_file,
        content_type='text/csv',
        input_mode='FastFile',
    )
}

training_step = TrainingStep(
    name='Train',
    estimator=estimator,
    inputs=training_inputs,
)
# 2. Create model step
model = Model(
    entry_point='inference.py',
    source_dir=src_dir,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=sagemaker_session,
    image_uri=estimator.training_image_uri(),
)

create_model_step = ModelStep(
    name="ModelStep",
    step_args=model.create(
        instance_type='ml.m4.4xlarge',
    ),
)
# 3. Deploy model to endpoint step
deploy_model_lambda_function = Lambda(
    function_name="sagemaker-deploy-quant-score",
    execution_role_arn=create_sagemaker_lambda_role("deploy-model-lambda-role"),
    script="/home/ec2-user/SageMaker/my_path/src/util/deploy_model_lambda.py",
    handler="deploy_model_lambda.lambda_handler",
)

deploy_model_step = LambdaStep(
    name="DeployModelStep",
    lambda_func=deploy_model_lambda_function,
    inputs={
        "model_name": create_model_step.properties.ModelName,
        "endpoint_config_name": "quantitative-scoring-pipeline-config",
        "endpoint_name": endpoint_name,
        "endpoint_instance_type": "ml.m4.xlarge",
    },
)
# Connect pipeline
pipe = Pipeline(
    name="QuantitativeScoringPipeline",
    steps=[
        training_step,
        create_model_step,
        deploy_model_step,
    ],
    parameters=[
        # I omitted these definitions above
        s3_training_data_input_file,
        s3_training_output_file,
        endpoint_name,
    ],
)
pipe.upsert(role_arn=role)
execution = pipe.start()
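For context, the contents of deploy_model_lambda.py are not shown here; a minimal sketch of such a handler, assuming only the inputs passed in the LambdaStep above, would look something like this (not exact code):

import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Inputs arrive from the LambdaStep "inputs" mapping above.
    model_name = event["model_name"]
    endpoint_config_name = event["endpoint_config_name"]
    endpoint_name = event["endpoint_name"]
    instance_type = event["endpoint_instance_type"]

    # Create an endpoint config pointing at the model the pipeline created.
    sm.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )

    # Create the endpoint, or update it if it already exists.
    existing = sm.list_endpoints(NameContains=endpoint_name)["Endpoints"]
    if any(e["EndpointName"] == endpoint_name for e in existing):
        sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
    else:
        sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)

    return {"statusCode": 200}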
However, when the Lambda deploys the endpoint, the Lambda itself succeeds, but the endpoint creation always fails afterwards. The container never starts, so there are no logs in CloudWatch, but I get the message:
CannotStartContainerError. Please ensure the model container for variant AllTraffic starts correctly when invoked with 'docker run <image> serve'
Obviously, I am not using a custom container.
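(For reference, that message is the endpoint's FailureReason; since the container never writes CloudWatch logs, DescribeEndpoint is the only place it shows up. The endpoint name below is a placeholder.)

import boto3

sm = boto3.client("sagemaker")

# When the container can't start there are no CloudWatch logs,
# but DescribeEndpoint still reports why the deployment failed.
desc = sm.describe_endpoint(EndpointName="quant-score-endpoint")  # placeholder name
print(desc["EndpointStatus"])     # "Failed"
print(desc.get("FailureReason"))  # "CannotStartContainerError. ..."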
Strangely, if I create/update the endpoint with the SageMaker SDK (as below), using the same model S3 URI, it works absolutely fine. This is the exact same model that fails above.
model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='src',
    model_data="s3://sagemaker-eu-west-1-558091818291/tensorflow-training-2024-04-25-12-11-21-401/pipelines-dgstz6rrp8u9-ModelStep-RepackMode-P5O95TSntC/output/model.tar.gz",
    role=role,
    framework_version="2.1",
)

predictor = model.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1, endpoint_name=endpoint_name)
However, this second approach creates a new model tarball. I inspected the contents of both tarballs, and the inference code and model data look identical. I'm genuinely confused as to why the pipeline fails to update the endpoint while this succeeds. The only difference I can think of is that here I pass a framework version rather than a specific image URI, but I don't know how to fix that.
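(For completeness, this is roughly how I compared the two artifacts; the local filenames are placeholders for downloaded copies of each model.tar.gz.)

import tarfile

# List the members of each downloaded model tarball so the two
# artifacts can be diffed by eye.
for path in ["pipeline_model.tar.gz", "sdk_model.tar.gz"]:  # placeholder paths
    with tarfile.open(path, "r:gz") as tar:
        print(path)
        for member in tar.getmembers():
            print(f"  {member.name} ({member.size} bytes)")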
Solved the problem by replacing the Model object with a TensorFlowModel. This meant I could stop specifying an image URI and instead just pass the framework version to use. I.e. this:
model = Model(
    entry_point='inference.py',
    source_dir=src_dir,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=sagemaker_session,
    image_uri=estimator.training_image_uri(),
)
became this:
model = TensorFlowModel(
    entry_point='inference.py',
    source_dir=src_dir,
    framework_version="2.1",
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
    role=role,
)
Despite coming from the SageMaker examples, it seems estimator.training_image_uri() is not suitable for serving TensorFlow models 🤷.
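A plausible explanation: SageMaker publishes separate training and inference containers for TensorFlow, and the training image does not implement the serve entrypoint the endpoint invokes, which would match the CannotStartContainerError above. A quick sketch with image_uris.retrieve (region and instance types assumed from the question) shows the two URIs point at different repositories:

from sagemaker import image_uris

# Training and inference images are distinct repositories for TensorFlow;
# the region and instance types here mirror the question and are assumptions.
training_uri = image_uris.retrieve(
    framework="tensorflow",
    region="eu-west-1",
    version="2.1",
    py_version="py3",
    instance_type="ml.m4.4xlarge",
    image_scope="training",
)
inference_uri = image_uris.retrieve(
    framework="tensorflow",
    region="eu-west-1",
    version="2.1",
    instance_type="ml.m4.xlarge",
    image_scope="inference",
)
print(training_uri)   # ...tensorflow-training:2.1...
print(inference_uri)  # ...tensorflow-inference:2.1...

TensorFlowModel resolves the inference image from framework_version for you, which is why the replacement above works.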