No GPU EC2 instances associated with AWS Batch


I need to set up GPU-backed instances on AWS Batch.

Here is my .yaml file:

  GPULargeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64:
            Fn::Sub: |
              MIME-Version: 1.0
              Content-Type: multipart/mixed; boundary="==BOUNDARY=="

              --==BOUNDARY==
              Content-Type: text/cloud-config; charset="us-ascii"

              runcmd:
                - yum install -y aws-cfn-bootstrap
                - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
                - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
                - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
                - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
                - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
                - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
                - /usr/bin/docker-storage-setup
                - yum update -y
                - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
                - /etc/init.d/docker restart

              --==BOUNDARY==--
      LaunchTemplateName: GPULargeLaunchTemplate

  GPULargeBatchComputeEnvironment:
    DependsOn:
      - ComputeRole
      - ComputeInstanceProfile
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        ImageId: ami-GPU-optimized-AMI-ID
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        LaunchTemplate:
          LaunchTemplateId:
            Ref: GPULargeLaunchTemplate
          Version:
            Fn::GetAtt:
              - GPULargeLaunchTemplate
              - LatestVersionNumber
        InstanceRole:
          Ref: ComputeInstanceProfile
        InstanceTypes:
          - g4dn.xlarge
        MaxvCpus: 768
        MinvCpus: 1
        SecurityGroupIds:
          - Fn::GetAtt:
              - ComputeSecurityGroup
              - GroupId
        Subnets:
          - Ref: ComputePrivateSubnetA
        Type: EC2
        UpdateToLatestImageVersion: True

  MyGPUBatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment:
            Ref: GPULargeBatchComputeEnvironment
          Order: 1
      Priority: 5
      JobQueueName: MyGPUBatchJobQueue
      State: ENABLED

  MyGPUJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Command:
          - "/opt/bin/python3"
          - "/opt/bin/start.py"
          - "--retry_count"
          - "Ref::batchRetryCount"
          - "--retry_limit"
          - "Ref::batchRetryLimit"
        Environment:
          - Name: "Region"
            Value: "us-west-2"
          - Name: "LANG"
            Value: "en_US.UTF-8"
        Image:
          Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
        JobRoleArn:
          Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
        Memory: 16000
        Vcpus: 1
        ResourceRequirements:
          - Type: GPU
            Value: '1'
      JobDefinitionName: MyGPUJobDefinition
      Timeout:
        AttemptDurationSeconds: 500

When I submit a job, it gets stuck in RUNNABLE forever, so here is what I tried:

  1. When I swapped the instance type to a regular CPU type, redeployed the CF stack, and submitted a job, the job ran and succeeded, so something about the way I'm using these GPU instance types with AWS Batch must be missing or wrong;
  2. Then I found this post, so I added an ImageId field to my ComputeEnvironment pointing at a known GPU-optimized AMI, but still no luck;
  3. I did a side-by-side comparison between a job on the working CPU AWS Batch setup and one on the non-working GPU setup by running
     aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2
     and found that containerInstanceArn and taskArn are what differ: on the non-working GPU job, these two fields are simply missing (see the CLI sketch after this list);
  4. I found that the GPU instance does show up in the ASG (Auto Scaling Group) created by the compute environment, but when I go to ECS and select the GPU cluster, there are no container instances associated with it, unlike the working CPU cluster, which does have container instances registered.
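
For reference, the checks in steps 3 and 4 can be run from the CLI roughly like this (the cluster name and job ID are placeholders for your own values):

  # Placeholders: <batch-ecs-cluster> and <job-id> are hypothetical values to substitute.

  # Is the GPU instance actually registered with the Batch-managed ECS cluster?
  aws ecs list-clusters --region us-west-2
  aws ecs list-container-instances --cluster <batch-ecs-cluster> --region us-west-2

  # A job that never gets placed stays RUNNABLE and has no containerInstanceArn/taskArn:
  aws batch describe-jobs --jobs <job-id> --region us-west-2 \
    --query 'jobs[0].container.[containerInstanceArn,taskArn]'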

Any ideas on how to fix this would be greatly appreciated!

amazon-web-services amazon-ec2 gpu amazon-ecs aws-batch
1 Answer

This was definitely a good learning experience. Here is what I did, what I found, and how I fixed it:

  1. It came down to my newly launched GPU instances failing to join the ECS cluster (everything is launched via the yaml CloudFormation template above);
  2. First, some basic checks: VPC, subnets, and security groups, to see whether anything is blocking the new GPU instances from joining the ECS cluster;
  3. Go through the troubleshooting steps here: https://repost.aws/knowledge-center/batch-job-stuck-runnable-status
  4. The link above mentions an
     AWSSupport-TroubleshootAWSBatchJob runbook
     which proved helpful to run before anything else (make sure you select the correct region);
  5. Connect to your GPU instance and install the ECS logs collector: https://github.com/aws/amazon-ecs-logs-collector
  6. Check your logs; this is where I found the problem:
30T01:19:48Z msg="Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch"
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal systemd[1]: ecs.service: control process exited, code=exited status=255
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal kernel: NVRM: API mismatch: the client has the version 535.161.07, but
                                                                   NVRM: this kernel module has the version 470.182.03.  Please
                                                                   NVRM: make sure that this kernel module and all NVIDIA driver
                                                                   NVRM: components have the same version.
  7. So somehow my cdk did not pull in the latest GPU-optimized AMI (in theory it should, according to the AWS docs), which caused the version mismatch. I went to https://github.com/aws/amazon-ecs-ami/releases to find the latest AMI ID (ami-019d947e77874eaee), added this field to my template:
     ImageId: ami-019d947e77874eaee
     redeployed, and it worked! (A sketch of how to look this AMI up programmatically, rather than hardcoding it, follows below.)
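
A small follow-up sketch, not part of the original fix: it assumes shell access to the instance plus AWS CLI credentials, and uses the public SSM parameter AWS publishes for the ECS GPU-optimized Amazon Linux 2 AMI to find the recommended image without hardcoding an ID that will eventually go stale:

  # Assumption: the SSM parameter path below is the published "recommended" pointer
  # for the ECS GPU-optimized Amazon Linux 2 AMI; verify it against the AWS docs.

  # On the stuck GPU instance, nvidia-smi fails with the same
  # "Driver/library version mismatch" that the ECS agent logged:
  nvidia-smi

  # Look up the AMI ID AWS currently recommends for ECS GPU workloads and use that
  # value for ComputeResources.ImageId (or feed it into cdk) on each deploy:
  aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
    --region us-west-2 \
    --query 'Parameters[0].Value' \
    --output text

Using the recommended AMI keeps the NVIDIA kernel module, driver libraries, and ECS agent in the image consistent, which is exactly the mismatch the log above shows.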