No GPU EC2 instances associated with AWS Batch


I need to set up GPU-backed instances on AWS Batch.

Here is my .yaml file:

  GPULargeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64:
            Fn::Sub: |
              MIME-Version: 1.0
              Content-Type: multipart/mixed; boundary="==BOUNDARY=="

              --==BOUNDARY==
              Content-Type: text/cloud-config; charset="us-ascii"

              runcmd:
                - yum install -y aws-cfn-bootstrap
                - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
                - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
                - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
                - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
                - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
                - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
                - /usr/bin/docker-storage-setup
                - yum update -y
                - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
                - /etc/init.d/docker restart

              --==BOUNDARY==--
      LaunchTemplateName: GPULargeLaunchTemplate

  GPULargeBatchComputeEnvironment:
    DependsOn:
      - ComputeRole
      - ComputeInstanceProfile
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        ImageId: ami-GPU-optimized-AMI-ID
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        LaunchTemplate:
          LaunchTemplateId:
            Ref: GPULargeLaunchTemplate
          Version:
            Fn::GetAtt:
              - GPULargeLaunchTemplate
              - LatestVersionNumber
        InstanceRole:
          Ref: ComputeInstanceProfile
        InstanceTypes:
          - g4dn.xlarge
        MaxvCpus: 768
        MinvCpus: 1
        SecurityGroupIds:
          - Fn::GetAtt:
              - ComputeSecurityGroup
              - GroupId
        Subnets:
          - Ref: ComputePrivateSubnetA
        Type: EC2
        UpdateToLatestImageVersion: True

  MyGPUBatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment:
            Ref: GPULargeBatchComputeEnvironment
          Order: 1
      Priority: 5
      JobQueueName: MyGPUBatchJobQueue
      State: ENABLED

  MyGPUJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Command:
          - "/opt/bin/python3"
          - "/opt/bin/start.py"
          - "--retry_count"
          - "Ref::batchRetryCount"
          - "--retry_limit"
          - "Ref::batchRetryLimit"
        Environment:
          - Name: "Region"
            Value: "us-west-2"
          - Name: "LANG"
            Value: "en_US.UTF-8"
        Image:
          Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
        JobRoleArn:
          Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
        Memory: 16000
        Vcpus: 1
        ResourceRequirements:
          - Type: GPU
            Value: '1'
      JobDefinitionName: MyGPUJobDefinition
      Timeout:
        AttemptDurationSeconds: 500

When I submit a job, it gets stuck in RUNNABLE forever, so here is what I tried:

  1. When I swapped the instance type to a regular CPU type, redeployed the CF stack, and submitted a job, the job ran and succeeded, so something about the way I'm using these GPU instance types with AWS Batch must be missing or wrong;
  2. Then I found this post, so I added an ImageId field to my ComputeEnvironment pointing at a known GPU-optimized AMI, but still no luck;
  3. I did a side-by-side comparison between a job on the working CPU AWS Batch setup and one on the non-working GPU setup by running
     aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2
     and found that containerInstanceArn and taskArn are what differ: on the non-working GPU job, these two fields are simply missing (see the CLI sketch after this list);
  4. I found that the GPU instance does show up in the ASG (Auto Scaling Group) created by the compute environment, but when I go to ECS and select the GPU cluster, there are no container instances associated with it, unlike the working CPU cluster, which does have container instances registered.
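
For reference, the checks in steps 3 and 4 can be run from the CLI roughly like this (the cluster name and job ID are placeholders for your own values):

  # Placeholders: <batch-ecs-cluster> and <job-id> are hypothetical values to substitute.

  # Is the GPU instance actually registered with the Batch-managed ECS cluster?
  aws ecs list-clusters --region us-west-2
  aws ecs list-container-instances --cluster <batch-ecs-cluster> --region us-west-2

  # A job that never gets placed stays RUNNABLE and has no containerInstanceArn/taskArn:
  aws batch describe-jobs --jobs <job-id> --region us-west-2 \
    --query 'jobs[0].container.[containerInstanceArn,taskArn]'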

Any ideas on how to fix this would be greatly appreciated!

amazon-web-services amazon-ec2 gpu amazon-ecs aws-batch
1 Answer

This was definitely a good learning experience. Here is what I did, what I found, and how I fixed it:

  1. It came down to my newly launched GPU instances failing to join the ECS cluster (everything is launched via the yaml CloudFormation template above);
  2. First, some basic checks: VPC, subnets, and security groups, to see whether anything is blocking the new GPU instances from joining the ECS cluster;
  3. Go through the troubleshooting steps here: https://repost.aws/knowledge-center/batch-job-stuck-runnable-status
  4. The link above mentions an
     AWSSupport-TroubleshootAWSBatchJob runbook
     which proved helpful to run before anything else (make sure you select the correct region);
  5. Connect to your GPU instance and install the ECS logs collector: https://github.com/aws/amazon-ecs-logs-collector
  6. Check your logs; this is where I found the problem:
30T01:19:48Z msg="Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch"
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal systemd[1]: ecs.service: control process exited, code=exited status=255
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal kernel: NVRM: API mismatch: the client has the version 535.161.07, but
                                                                   NVRM: this kernel module has the version 470.182.03.  Please
                                                                   NVRM: make sure that this kernel module and all NVIDIA driver
                                                                   NVRM: components have the same version.
  7. So somehow my cdk did not pull in the latest GPU-optimized AMI (in theory it should, according to the AWS docs), which caused the version mismatch. I went to https://github.com/aws/amazon-ecs-ami/releases to find the latest AMI ID (ami-019d947e77874eaee), added this field to my template:
     ImageId: ami-019d947e77874eaee
     redeployed, and it worked! (A sketch of how to look this AMI up programmatically, rather than hardcoding it, follows below.)
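
A small follow-up sketch, not part of the original fix: it assumes shell access to the instance plus AWS CLI credentials, and uses the public SSM parameter AWS publishes for the ECS GPU-optimized Amazon Linux 2 AMI to find the recommended image without hardcoding an ID that will eventually go stale:

  # Assumption: the SSM parameter path below is the published "recommended" pointer
  # for the ECS GPU-optimized Amazon Linux 2 AMI; verify it against the AWS docs.

  # On the stuck GPU instance, nvidia-smi fails with the same
  # "Driver/library version mismatch" that the ECS agent logged:
  nvidia-smi

  # Look up the AMI ID AWS currently recommends for ECS GPU workloads and use that
  # value for ComputeResources.ImageId (or feed it into cdk) on each deploy:
  aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
    --region us-west-2 \
    --query 'Parameters[0].Value' \
    --output text

Using the recommended AMI keeps the NVIDIA kernel module, driver libraries, and ECS agent in the image consistent, which is exactly the mismatch the log above shows.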