Unable to properly register a model and create a SageMaker endpoint with SageMaker Pipelines


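I'm trying to register and deploy a custom model with a PyTorch container in SageMaker Pipelines, inside SageMaker Studio, but the endpoint fails when I call invoke_endpoint to get a response.

The code snippet is: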

##### PYTORCH CONTAINER
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="inference.py",
    framework_version='1.13',
    py_version='py39',
    source_dir="code",
    # sagemaker_session=pipeline_session, # I've tried this but doesn't work
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)

# called outside the step because calling fit() inside TrainingStep doesn't work
model.fit()

step_train = TrainingStep(
    name=training_step_name,
    # step_args=model.fit(),  # I've tried this but it fails
    estimator=model,
)

# Step 2: Register Model to Model Registry
logger.info('Registering model to Model Registry')

step_register = RegisterModel(
    name=register_model_step_name,
    estimator=model,
    # model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=inference_instances,
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    depends_on=[training_step_name]
)

This part registers the model in the Model Registry. I then fetch the latest version, build an endpoint config, and deploy it with:

# create an endpoint using the Model Registry model config previously created
sm_client = boto3.client('sagemaker', region_name=AWS_REGION) 

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=ENDPOINT_NAME,
    EndpointConfigName=endpoint_config_name
)
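
The steps described above (fetch the latest version, build an endpoint config, deploy) could look roughly like the following sketch. This is not the asker's actual code: the helper name deploy_latest_package, the variant name AllTraffic, and the default instance type are illustrative assumptions, and the client is passed in as a parameter so the function can be exercised without AWS access.

```python
from typing import Any


def deploy_latest_package(
    sm_client: Any,
    model_package_group_name: str,
    role_arn: str,
    model_name: str,
    endpoint_config_name: str,
    endpoint_name: str,
    instance_type: str = "ml.m5.large",  # assumption: choose what fits your model
) -> str:
    """Deploy the newest approved model package and return its ARN."""
    # Sorted by creation time, descending, so the newest package comes first.
    packages = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["ModelPackageSummaryList"]
    if not packages:
        raise RuntimeError(f"No approved packages in {model_package_group_name}")
    package_arn = packages[0]["ModelPackageArn"]

    # A SageMaker model can point directly at a registered model package.
    sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        Containers=[{"ModelPackageName": package_arn}],
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # hypothetical variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    return package_arn
```

With real credentials the first argument would be boto3.client('sagemaker', region_name=AWS_REGION). Note the approval filter: if the package was registered with a status other than Approved, the list comes back empty and the helper raises.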

The endpoint times out:

ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.eu-west-1.amazonaws.com/endpoints/nba-vw-base-endpoint-TEST/invocations"
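
A read timeout like this means the client stopped waiting before the container answered. One hedged way to narrow it down is to confirm the endpoint is InService and then send a single small JSON request. The sketch below assumes boto3-style clients are passed in; the function name and the payload shape are hypothetical.

```python
import json
from typing import Any


def invoke_when_ready(sm_client: Any, runtime_client: Any,
                      endpoint_name: str, payload: dict) -> Any:
    """Block until the endpoint is InService, then send one JSON request."""
    # Built-in SageMaker waiter; it raises if the endpoint ends up Failed.
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # matches content_types used at registration
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # Body is a streaming object, so read it once and decode the JSON.
    return json.loads(response["Body"].read())
```

Here sm_client would be boto3.client("sagemaker") and runtime_client would be boto3.client("sagemaker-runtime"). If the request still times out with an InService endpoint and clean logs, the inference handlers in the container are a likely place to look next.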

I've tried different combinations:

  • Using a pipeline session
  • Calling .fit() inside the step, outside the step, or passing the estimator arg instead
  • Using RegisterModel() or model.register()

I checked the endpoint's logs and didn't see any errors.

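But the problem stays the same. I've followed many examples, such as this one, but when I add model.fit() without a pipeline_session, it states that TrainingStep() needs either the estimator or the step_args argument, which means .fit() doesn't return anything.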

Update: example of using .fit() inside TrainingStep()

When following the many examples that look like this:

step_train = TrainingStep(
    name=training_step_name,
    step_args=model.fit(), 
)

The training job runs fine according to the logs, but I get this error:

2024-02-14 13:16:00 Completed - Training job completed
Training seconds: 112
Billable seconds: 112
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[22], line 1
----> 1 step_train = TrainingStep(
      2     name=training_step_name,
      3     step_args=model.fit(),  # need to fit the model to ensure it properly trains and creates inference logic
      4     # estimator=model,  # seems to be getting deprecated in future
      5 )

File /opt/conda/lib/python3.10/site-packages/sagemaker/workflow/steps.py:417, in TrainingStep.__init__(self, name, step_args, estimator, display_name, description, inputs, cache_config, depends_on, retry_policies)
    412 super(TrainingStep, self).__init__(
    413     name, StepTypeEnum.TRAINING, display_name, description, depends_on, retry_policies
    414 )
    416 if not (step_args is not None) ^ (estimator is not None):
--> 417     raise ValueError("Either step_args or estimator need to be given.")
    419 if step_args:
    420     from sagemaker.workflow.utilities import validate_step_args_input

ValueError: Either step_args or estimator need to be given.

This means .fit() returns no value, so step_args is set to None. I'm not sure how all the other examples avoid the same problem.
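
The guard on line 416 of the traceback is an exclusive-or: exactly one of step_args and estimator must be set. With a regular session, fit() runs the training job and returns None, whereas under a PipelineSession it returns an arguments object instead of starting a job. The toy sketch below reproduces that check in plain Python; the function names and the returned dict are stand-ins for illustration, not the SDK's own objects.

```python
def check_training_step_args(step_args=None, estimator=None):
    """Toy copy of the guard shown in the traceback (not the SDK itself)."""
    # Exactly one of the two must be provided (XOR).
    if not (step_args is not None) ^ (estimator is not None):
        raise ValueError("Either step_args or estimator need to be given.")
    return "step_args" if step_args is not None else "estimator"


def fit_with_regular_session():
    """Stand-in for fit() with a regular session: runs the job, returns None."""
    return None


def fit_with_pipeline_session():
    """Stand-in for fit() under a PipelineSession: returns request arguments."""
    return {"TrainingJobName": "hypothetical-job"}  # hypothetical placeholder
```

Passing fit_with_pipeline_session() as step_args is accepted, while passing fit_with_regular_session() raises the same ValueError as above, because both arguments end up as None.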

Sorry for the amount of information, but I'm not sure what to try next.

pytorch amazon-sagemaker mlops amazon-sagemaker-studio
1 Answer

Solved

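I found the solution on gokul-pv's GitHub. For some reason, it seems you can't use the same PyTorch estimator object for both training and registration.

You need to create a new instance with PyTorchModel() and then register that. It works now. The updated code is: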

##### PYTORCH CONTAINER
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="train.py",
    image_uri=pytorch_image_uri_training,
    source_dir="code",
    sagemaker_session=pipeline_session,
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)

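# With sagemaker_session=pipeline_session, fit() does not start a training job here;
# it returns the step arguments that TrainingStep expects.
training_step_args = model.fit()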

step_train = TrainingStep(
    name=training_step_name,
    step_args=training_step_args,
)

# Step 2: Register Model to Model Registry
logger.info('Registering model to Model Registry')
model = PyTorchModel(
    entry_point="inference.py",
    source_dir="code",
    image_uri=pytorch_image_uri_inference,
    sagemaker_session=pipeline_session,
    role=role,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    framework_version="1.11.0",
)


reg_model_args = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    model_package_group_name=model_package_group_name,
    inference_instances=inference_instances,
    approval_status=model_approval_status,
    description="pipeline - nba vw model test"
)

# Register model step that will be conditionally executed
step_register = ModelStep(
    name=register_model_step_name,
    step_args=reg_model_args,
    # depends_on=[training_step_name]
)

The only major change I had to make was to make training.py independent of inference.py, rather than dependent on it.
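
As a rough illustration of an inference script that does not import anything from the training script, here is a minimal sketch built on the handler names the SageMaker PyTorch serving container looks for (model_fn, input_fn, predict_fn, output_fn). The file name model.pt, the TorchScript format, and the "inputs" and "predictions" payload keys are assumptions made for the example, not details from the answer above.

```python
import json
import os


def model_fn(model_dir):
    """Load the model once when the container starts."""
    import torch  # imported lazily so the JSON helpers below work without it

    # Assumption: the training script saved a TorchScript file named model.pt.
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    """Decode a JSON request into a plain Python list."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["inputs"]  # hypothetical payload key


def predict_fn(input_data, model):
    """Run the forward pass without tracking gradients."""
    import torch

    with torch.no_grad():
        return model(torch.tensor(input_data, dtype=torch.float32)).tolist()


def output_fn(prediction, accept):
    """Encode the prediction as JSON."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"predictions": prediction})
```

Because the script only needs the saved artifact and its own handlers, it can be packaged with PyTorchModel without pulling in training-only code or dependencies.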
