I need to set up GPU-backed instances on AWS Batch.
Here is my .yaml file:
GPULargeLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      UserData:
        Fn::Base64:
          Fn::Sub: |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"

            runcmd:
              - yum install -y aws-cfn-bootstrap
              - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
              - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
              - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
              - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
              - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
              - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
              - /usr/bin/docker-storage-setup
              - yum update -y
              - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
              - /etc/init.d/docker restart
            --==BOUNDARY==--
    LaunchTemplateName: GPULargeLaunchTemplate

GPULargeBatchComputeEnvironment:
  DependsOn:
    - ComputeRole
    - ComputeInstanceProfile
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    ComputeResources:
      ImageId: ami-GPU-optimized-AMI-ID
      AllocationStrategy: BEST_FIT_PROGRESSIVE
      LaunchTemplate:
        LaunchTemplateId:
          Ref: GPULargeLaunchTemplate
        Version:
          Fn::GetAtt:
            - GPULargeLaunchTemplate
            - LatestVersionNumber
      InstanceRole:
        Ref: ComputeInstanceProfile
      InstanceTypes:
        - g4dn.xlarge
      MaxvCpus: 768
      MinvCpus: 1
      SecurityGroupIds:
        - Fn::GetAtt:
            - ComputeSecurityGroup
            - GroupId
      Subnets:
        - Ref: ComputePrivateSubnetA
      Type: EC2
      UpdateToLatestImageVersion: True

MyGPUBatchJobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    ComputeEnvironmentOrder:
      - ComputeEnvironment:
          Ref: GPULargeBatchComputeEnvironment
        Order: 1
    Priority: 5
    JobQueueName: MyGPUBatchJobQueue
    State: ENABLED

MyGPUJobDefinition:
  Type: AWS::Batch::JobDefinition
  Properties:
    Type: container
    ContainerProperties:
      Command:
        - "/opt/bin/python3"
        - "/opt/bin/start.py"
        - "--retry_count"
        - "Ref::batchRetryCount"
        - "--retry_limit"
        - "Ref::batchRetryLimit"
      Environment:
        - Name: "Region"
          Value: "us-west-2"
        - Name: "LANG"
          Value: "en_US.UTF-8"
      Image:
        Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
      JobRoleArn:
        Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
      Memory: 16000
      Vcpus: 1
      ResourceRequirements:
        - Type: GPU
          Value: '1'
    JobDefinitionName: MyGPUJobDefinition
    Timeout:
      AttemptDurationSeconds: 500
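For context, jobs against this template are submitted with the two parameters that the `Ref::batchRetryCount` / `Ref::batchRetryLimit` placeholders in the Command expect. A minimal sketch, assuming the resource names above (the helper only builds the `submit_job` arguments so it can be inspected without touching AWS):

```python
# Sketch: build the arguments for batch.submit_job() against the
# MyGPUBatchJobQueue / MyGPUJobDefinition resources defined above.
# The parameter keys must match the Ref:: placeholders in Command.

def build_submit_job_args(job_name, retry_count="0", retry_limit="3"):
    return {
        "jobName": job_name,
        "jobQueue": "MyGPUBatchJobQueue",
        "jobDefinition": "MyGPUJobDefinition",
        "parameters": {
            "batchRetryCount": retry_count,
            "batchRetryLimit": retry_limit,
        },
    }

# With boto3 (not executed here):
#   import boto3
#   batch = boto3.client("batch", region_name="us-west-2")
#   batch.submit_job(**build_submit_job_args("my-gpu-job"))
```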
When I start a job, it stays stuck in the RUNNABLE state forever. Here is what I have tried so far:
- I set the ImageId field to a known GPU-optimized AMI, but still no luck;
- I ran `aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2` and did a side-by-side comparison between a working CPU AWS Batch job and the non-working GPU AWS Batch job. The difference I found between them is containerInstanceArn and taskArn: on the non-working GPU job, these two fields are simply missing.

Any ideas on how to fix this would be greatly appreciated!
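The side-by-side comparison of the two `describe-jobs` payloads can be automated with a small helper. A sketch, using illustrative payloads (the field names follow the `describe-jobs` output shape; the ARN values are placeholders):

```python
def missing_container_fields(job):
    """Return the container-level fields that a healthy job exposes
    but this job lacks (the symptom described above)."""
    container = job.get("container", {})
    expected = ("containerInstanceArn", "taskArn")
    return [f for f in expected if f not in container]

# Illustrative payloads: a working CPU job vs. the stuck GPU job.
working_cpu_job = {
    "status": "SUCCEEDED",
    "container": {"containerInstanceArn": "arn:aws:ecs:...",
                  "taskArn": "arn:aws:ecs:..."},
}
stuck_gpu_job = {"status": "RUNNABLE", "container": {}}

print(missing_container_fields(working_cpu_job))  # []
print(missing_container_fields(stuck_gpu_job))    # ['containerInstanceArn', 'taskArn']
```

A job that never gets those two fields was never placed on an ECS container instance, which points at the compute environment rather than the job definition.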
This was definitely a good learning experience. Here is what I did, what I found, and how I solved it:
- I ran the AWSSupport-TroubleshootAWSBatchJob runbook, which proved helpful (make sure you pick the correct region before running it);
- On the instance, the ECS agent failed to start, and the logs showed an NVIDIA driver/library mismatch:

30T01:19:48Z msg="Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch"
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal systemd[1]: ecs.service: control process exited, code=exited status=255
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal kernel: NVRM: API mismatch: the client has the version 535.161.07, but
NVRM: this kernel module has the version 470.182.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

- I then set ImageId: ami-019d947e77874eaee in my template, redeployed, and it worked!
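The root cause was the client/kernel-module version mismatch in the NVRM log above. As an illustration, the two versions can be pulled out of such a log programmatically; a sketch whose regex and sample text are assumptions based only on the log lines shown:

```python
import re

def nvrm_versions(log_text):
    """Extract the client and kernel-module driver versions from an
    NVRM 'API mismatch' kernel log line pair."""
    ver = r"(\d+(?:\.\d+)+)"
    client = re.search(r"client has the version " + ver, log_text)
    kernel = re.search(r"kernel module has the version " + ver, log_text)
    return (client.group(1) if client else None,
            kernel.group(1) if kernel else None)

log = ("NVRM: API mismatch: the client has the version 535.161.07, but\n"
       "NVRM: this kernel module has the version 470.182.03.")
client, kernel = nvrm_versions(log)
print(client, kernel)                  # 535.161.07 470.182.03
print("mismatch:", client != kernel)   # mismatch: True
```

More generally, picking the region's current ECS GPU-optimized AMI (rather than a hard-coded one) avoids this class of mismatch; AWS publishes it under the SSM parameter path /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended, per the ECS-optimized AMI documentation.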