Converting a Triton container to work with SageMaker MME


I have a custom Triton Docker container that uses the Python backend. The container runs perfectly locally.

Here is the container's Dockerfile (I have omitted the irrelevant parts):

ARG TRITON_RELEASE_VERSION=22.12
FROM nvcr.io/nvidia/tritonserver:${TRITON_RELEASE_VERSION}-pyt-python-py3

LABEL owner='toing'
LABEL maintainer='[email protected]'

LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

ARG TRITON_RELEASE_VERSION

ENV DEBIAN_FRONTEND=noninteractive
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

ENV GIT_TRITON_RELEASE_VERSION="r$TRITON_RELEASE_VERSION"
ENV TRITON_MODEL_DIRECTORY="/opt/ml/model"

SHELL ["/bin/bash", "-c"]

# nvidia updated their repository keys recently
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    # generic requirements
    gcc \
    libgl1-mesa-glx

RUN pip install --upgrade pip && \
    pip install --no-cache-dir setuptools \
    scikit-build \
    opencv-python-headless \
    cryptography

# create the model directory
RUN mkdir -p $TRITON_MODEL_DIRECTORY

# for mmcv installation
ENV FORCE_CUDA="1"

# set TORCH_CUDA_ARCH_LIST
ENV TORCH_CUDA_ARCH_LIST="7.5"

RUN pip install --no-cache-dir what-i-need --index-url 

# install pytorch requirements from aws
RUN mkdir -p /app/snapshots && \
    mkdir -p /keys

# Copy the requirements files
ADD requirements/build.txt /install/build.txt

# install specific packages
RUN pip install --no-cache-dir -r /install/build.txt

# number of workers per model
ENV SAGEMAKER_MODEL_SERVER_WORKERS=1
ENV SAGEMAKER_BIND_TO_PORT=8000
ENV SAGEMAKER_SAFE_PORT_RANGE=8000-8002

# HTTP Inference Service
EXPOSE 8000

# GRPC Inference Service
EXPOSE 8001

# Metrics Service
EXPOSE 8002

RUN echo -e "#!/bin/bash\n\
tritonserver --model-repository ${TRITON_MODEL_DIRECTORY}"\
>> /start.sh

RUN chmod +x /start.sh

# Set the working directory to /
WORKDIR /

ENTRYPOINT ["/start.sh"]
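For reference, recent Triton releases ship a SageMaker frontend that serves the `/ping` and `/invocations` routes SageMaker's health check expects. A start script along these lines is a sketch of that approach, assuming the `--allow-sagemaker` and `--sagemaker-port` flags available in recent `tritonserver` builds; it is not the script from the Dockerfile above:

```shell
#!/bin/bash
# Sketch of an alternative start script: enable Triton's SageMaker
# frontend so /ping and /invocations are served on the port SageMaker
# injects (SAGEMAKER_BIND_TO_PORT, defaulting to 8080).
exec tritonserver \
    --model-repository "${TRITON_MODEL_DIRECTORY}" \
    --allow-sagemaker=true \
    --sagemaker-port="${SAGEMAKER_BIND_TO_PORT:-8080}"
```

With this, the plain HTTP/GRPC/metrics ports (8000-8002) can stay as they are; the SageMaker-facing traffic goes through the separate SageMaker port.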

The problem is that when I launch it from a SageMaker MME endpoint, the Triton server starts and runs, but SageMaker apparently cannot detect the running server, so the health check fails and endpoint creation fails.

Am I using the wrong port, or what should I do to avoid this error?
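For context, SageMaker starts the container as `docker run <image> serve` and health-checks it with `GET /ping` on port 8080, expecting an HTTP 200. A local smoke test along these lines (the image name is a placeholder) reproduces that contract:

```shell
# Sketch: emulate SageMaker's launch and health check locally.
docker run -d --rm -p 8080:8080 my-triton-image serve   # SageMaker passes "serve" as the argument
sleep 10                                                # give the server time to come up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/ping
# SageMaker expects a 200 here; "connection refused" means nothing is
# listening on 8080, which would match the failing health check.
```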

PS: I did notice that the base NGC container used in this Dockerfile has an entrypoint located at

/opt/nvidia/nvidia_entrypoint.sh

but that script appears to be just a wrapper around the original entrypoint.

docker nvidia amazon-sagemaker tritonserver