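I am trying to register and deploy a custom model with the PyTorch container in SageMaker Pipelines, inside SageMaker Studio, but the endpoint fails when I call invoke_endpoint to get a response.
The code snippet is: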
##### PYTORCH CONTAINER
from sagemaker.pytorch import PyTorch
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import TrainingStep
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="inference.py",
    framework_version='1.13',
    py_version='py39',
    source_dir="code",
    # sagemaker_session=pipeline_session,  # I've tried this but it doesn't work
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)
# called outside the step because calling fit() inside TrainingStep doesn't work
model.fit()
step_train = TrainingStep(
    name=training_step_name,
    # step_args=model.fit(),  # I've tried this but it fails
    estimator=model,
)
# Step 2: Register Model to Model Registry
logger.info('Registering to model to Model Registry')
step_register = RegisterModel(
    name=register_model_step_name,
    estimator=model,
    # model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=inference_instances,
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    depends_on=[training_step_name]
)
This part registers the model in the Model Registry. I then fetch the latest version, build an endpoint configuration, and deploy it with:
import boto3
# create an endpoint using the model registry model config previously created
sm_client = boto3.client('sagemaker', region_name=AWS_REGION)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=ENDPOINT_NAME,
    EndpointConfigName=endpoint_config_name
)
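The deployment described above (fetch the latest version, build an endpoint configuration, create the endpoint) could be sketched roughly as follows. This is a minimal sketch, not the code from the post: the function name deploy_latest_approved, the "-model" and "-config" naming scheme, the default instance type, and the choice to filter on approved packages are all assumptions made for illustration. The client is passed in so the flow can be read (and exercised) independently of any AWS account.

```python
def deploy_latest_approved(sm_client, group_name, role_arn, endpoint_name,
                           instance_type="ml.m5.large"):
    """Deploy the newest approved model package of a group to a new endpoint.

    sm_client is expected to behave like boto3.client("sagemaker").
    """
    # Ask for the most recently created approved package only.
    packages = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["ModelPackageSummaryList"]
    if not packages:
        raise RuntimeError(f"No approved model package in group {group_name!r}")
    package_arn = packages[0]["ModelPackageArn"]

    # Hypothetical naming scheme, chosen only for this sketch.
    model_name = f"{endpoint_name}-model"
    config_name = f"{endpoint_name}-config"

    # A model backed by a registry entry references the package ARN.
    sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        Containers=[{"ModelPackageName": package_arn}],
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    # Wait until the endpoint reports InService before invoking it.
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return package_arn
```

Waiting for the endpoint to be in service before calling it rules out one source of confusion: a request sent while the endpoint is still being created fails for reasons unrelated to the inference code.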
The endpoint times out:
ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.eu-west-1.amazonaws.com/endpoints/nba-vw-base-endpoint-TEST/invocations"
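For reference, the call that produces this timeout can be sketched as below. This is a hedged illustration, not the code from the post: the helper names make_runtime_client and invoke_json are hypothetical, and so is the read_timeout value. Note that invoke_endpoint lives on the "sagemaker-runtime" client, not on the "sagemaker" client used to create the endpoint. Disabling retries makes a slow or hanging container show up as a single timeout instead of several silent retries.

```python
import json


def make_runtime_client(region_name, read_timeout=70):
    """Build a runtime client with an explicit read timeout and no retries."""
    # Imported lazily so invoke_json below can be used with any
    # client-like object, even where boto3 is not installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "sagemaker-runtime",
        region_name=region_name,
        config=Config(read_timeout=read_timeout, retries={"max_attempts": 0}),
    )


def invoke_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

The content type matches the content_types and response_types registered for the model ("application/json"). Raising the client timeout does not help if the container itself never answers, which is why the inference script is worth checking when the endpoint logs show no error.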
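I have tried different combinations: calling .fit(), passing the estimator argument, and using RegisterModel() or model.register(), but the problem is the same. I checked the endpoint's logs and saw no errors.
I have followed many examples, such as this one, but when I add model.fit() without a pipeline_session, the error states that TrainingStep() needs either the estimator or the step_args argument, which means .fit() does not return anything.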
Update: using .fit() inside TrainingStep()
When I follow the many examples that look like this:
step_train = TrainingStep(
    name=training_step_name,
    step_args=model.fit(),
)
The training job runs fine according to the logs, but I get this error:
2024-02-14 13:16:00 Completed - Training job completed
Training seconds: 112
Billable seconds: 112
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[22], line 1
----> 1 step_train = TrainingStep(
2 name=training_step_name,
3 step_args=model.fit(), # need to fit the model to ensure it properly trains and creates inference logic
4 # estimator=model, # seems to be getting deprecated in future
5 )
File /opt/conda/lib/python3.10/site-packages/sagemaker/workflow/steps.py:417, in TrainingStep.__init__(self, name, step_args, estimator, display_name, description, inputs, cache_config, depends_on, retry_policies)
412 super(TrainingStep, self).__init__(
413 name, StepTypeEnum.TRAINING, display_name, description, depends_on, retry_policies
414 )
416 if not (step_args is not None) ^ (estimator is not None):
--> 417 raise ValueError("Either step_args or estimator need to be given.")
419 if step_args:
420 from sagemaker.workflow.utilities import validate_step_args_input
ValueError: Either step_args or estimator need to be given.
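This means .fit() returned no value, so step_args was set to None. I am not sure why none of the other examples hit the same problem.
Sorry for the amount of information, but I am not sure what to try next.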
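The guard that raises in the traceback can be restated in a few lines to show why passing None as step_args fails. This is a toy sketch under stated assumptions: check_training_step_args only mirrors the condition visible in the traceback, and eager_fit and deferred_fit are hypothetical stand-ins, not SageMaker functions. As far as I understand the SDK, an estimator bound to a regular session runs the job inside fit() and returns None, while one bound to a PipelineSession defers the job and returns the step arguments.

```python
def check_training_step_args(step_args=None, estimator=None):
    """Mirror of the guard shown in the traceback: exactly one must be set."""
    if not ((step_args is not None) ^ (estimator is not None)):
        raise ValueError("Either step_args or estimator need to be given.")
    return "step_args" if step_args is not None else "estimator"


def eager_fit():
    """Stand-in for fit() on a regular session: runs now, returns None."""
    return None


def deferred_fit():
    """Stand-in for fit() on a pipeline session: returns arguments to run later."""
    return {"deferred": True}  # placeholder value, not the real structure


# Deferred arguments satisfy the guard.
print(check_training_step_args(step_args=deferred_fit()))

# A None result means neither argument is set, so the guard raises.
try:
    check_training_step_args(step_args=eager_fit())
except ValueError as err:
    print(f"ValueError: {err}")
```

This matches the observed behaviour: the training job completes (fit() ran eagerly), and only then does TrainingStep complain that it received neither step_args nor estimator.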
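Solved
I found the solution on gokul-pv's GitHub. For some reason, it seems you cannot use the same PyTorch model object for both training and registration.
You need to create a new instance with PyTorchModel() and register that one. It works now. The updated code is: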
##### PYTORCH CONTAINER
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TrainingStep
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="train.py",
    image_uri=pytorch_image_uri_training,
    source_dir="code",
    sagemaker_session=pipeline_session,
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)
# with a pipeline session, fit() returns the step arguments instead of running the job
training_step_args = model.fit()
step_train = TrainingStep(
    name=training_step_name,
    step_args=training_step_args,
)
# Step 2: Register Model to Model Registry
logger.info('Registering to model to Model Registry')
# separate model object for inference, pointing at the training step's artifacts
model = PyTorchModel(
    entry_point="inference.py",
    source_dir="code",
    image_uri=pytorch_image_uri_inference,
    sagemaker_session=pipeline_session,
    role=role,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    framework_version="1.11.0",
)
reg_model_args = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    model_package_group_name=model_package_group_name,
    inference_instances=inference_instances,
    approval_status=model_approval_status,
    description="pipeline - nba vw model test"
)
# Register model step that will be conditionally executed
step_register = ModelStep(
    name=register_model_step_name,
    step_args=reg_model_args,
    # depends_on=[training_step_name]
)
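The post does not show how the two steps are assembled and run, so the following is only a sketch of one way to do it, assuming the Pipeline class from sagemaker.workflow.pipeline. The helper names build_pipeline and run_pipeline are hypothetical. The explicit depends_on is not needed here because the register step reads step_train.properties, which already makes it depend on the training step.

```python
def build_pipeline(name, steps, pipeline_session):
    """Assemble the given steps into a pipeline bound to the pipeline session."""
    # Imported lazily so run_pipeline below can be used with any
    # pipeline-like object, even where the SDK is not installed.
    from sagemaker.workflow.pipeline import Pipeline

    return Pipeline(name=name, steps=steps, sagemaker_session=pipeline_session)


def run_pipeline(pipeline, role_arn):
    """Create or update the pipeline definition, start it, and wait for it."""
    pipeline.upsert(role_arn=role_arn)  # creates the pipeline or updates it
    execution = pipeline.start()        # the training job actually runs here
    execution.wait()                    # block until the execution finishes
    return execution
```

A call such as run_pipeline(build_pipeline("my-pipeline", [step_train, step_register], pipeline_session), role) would then run training and registration in one execution; "my-pipeline" is a placeholder name.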
The only major change I had to make was to make training.py independent of inference.py, rather than dependent on it.
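To illustrate what a standalone inference script can look like, here is a minimal sketch built on the handler functions the SageMaker PyTorch serving container looks for (model_fn, input_fn, predict_fn, output_fn). It is not the script from the post. The artifact name "model.pt", the assumption that training saved a TorchScript model, and the "inputs" and "predictions" JSON keys are all hypothetical. Nothing here imports from the training script, which is the point of the change described above.

```python
import json
import os

JSON_TYPE = "application/json"


def model_fn(model_dir):
    """Load the model once when the container starts."""
    import torch  # imported here so the JSON handlers work without torch

    # Assumption: training saved a TorchScript file named model.pt.
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    """Decode the request body into plain Python data."""
    if request_content_type != JSON_TYPE:
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["inputs"]  # "inputs" key is hypothetical


def predict_fn(input_data, model):
    """Run the forward pass without tracking gradients."""
    import torch

    with torch.no_grad():
        batch = torch.tensor(input_data, dtype=torch.float32)
        return model(batch).tolist()


def output_fn(prediction, accept):
    """Encode the prediction as a JSON response."""
    if accept != JSON_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"predictions": prediction})
```

Keeping the handlers free of training-time imports means the serving container only needs the packages required for inference, and a failure to load the model surfaces in model_fn rather than as a hanging request.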