I am using AWS Batch jobs with ECS on EC2. This is the serverless template I am using:
Description: >
  Setup
  AWS Batch Compute Environment
  AWS Batch Job Definition
  AWS Batch Queue
Parameters:
  ComputeEnvironmentName:
    Description: The batch compute environment name to use
    Type: String
  PlatformVpcStackName:
    Description: VPC stack name for export subnets
    Type: String
  JobDefinitionName:
    Description: Batch Job Definition name
    Type: String
  ECRRepositoryName:
    Description: ECR Repository Name
    Type: String
  JobQueueName:
    Description: Batch Job Queue Name
    Type: String
  EnvironmentName:
    Description: Environment name e.g. dev, staging, prod
    Type: String
    Default: dev
  ComponentName:
    Description: A name where this resource is created for
    Type: String
  PartName:
    Description: The name of the component's part
    Type: String
Resources:
  ASGLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required
      TagSpecifications:
        - ResourceType: launch-template
          Tags:
            - Key: STAGE
              Value: !Ref EnvironmentName
            - Key: COMPONENT_NAME
              Value: !Ref ComponentName
            - Key: PART_NAME
              Value: !Ref PartName
            - Key: StackID
              Value: !Ref 'AWS::StackId'
  tttCommonParserBatchComputeEnvironment:
    Type: 'AWS::Batch::ComputeEnvironment'
    Properties:
      ComputeEnvironmentName: !Ref ComputeEnvironmentName
      ComputeResources:
        MaxvCpus: 2
        MinvCpus: 0
        InstanceRole: !GetAtt tttHandlerEcsInstanceProfile.Arn
        InstanceTypes:
          - a1.medium
        DesiredvCpus: 2
        LaunchTemplate:
          LaunchTemplateId: !Ref ASGLaunchTemplate
        Tags:
          'STAGE' : !Ref EnvironmentName
          'COMPONENT_NAME' : !Ref ComponentName
          'PART_NAME' : !Ref PartName
          'StackID' : !Ref 'AWS::StackId'
      State: ENABLED
      Type: MANAGED
      Tags:
        'STAGE' : !Ref EnvironmentName
        'COMPONENT_NAME' : !Ref ComponentName
        'PART_NAME' : !Ref PartName
        'StackID' : !Ref 'AWS::StackId'
  BatchJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: !Ref JobDefinitionName
      RetryStrategy:
        Attempts: 1
      Timeout:
        AttemptDurationSeconds: 300
      ContainerProperties:
        JobRoleArn: !Ref tttHandlerJobServiceRole
        Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/testing
        Vcpus: 1
        Memory: 256
      Tags: {
        "STAGE" : !Ref EnvironmentName,
        "COMPONENT_NAME" : !Ref ComponentName,
        "PART_NAME" : !Ref PartName,
        "StackID" : !Ref 'AWS::StackId'
        }
  BatchJobQueue:
    Type: "AWS::Batch::JobQueue"
    Properties:
      JobQueueName: !Ref JobQueueName
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref tttCommonParserBatchComputeEnvironment
      Tags:
        'STAGE' : !Ref EnvironmentName
        'COMPONENT_NAME' : !Ref ComponentName
        'PART_NAME' : !Ref PartName
        'StackID' : !Ref 'AWS::StackId'
  tttCommonParserEcsInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ["-", [!Ref EnvironmentName, !Ref ComponentName, !Ref PartName, "ttt-common-parser-ecsInstanceRole"] ]
      AssumeRolePolicyDocument:
        Statement:
          - Action: ['sts:AssumeRole']
            Effect: Allow
            Principal:
              Service: [ec2.amazonaws.com]
        Version: '2012-10-17'
      Path: /
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
      Tags:
        - Key: STAGE
          Value: !Ref EnvironmentName
        - Key: COMPONENT_NAME
          Value: !Ref ComponentName
        - Key: PART_NAME
          Value: !Ref PartName
        - Key: StackID
          Value: !Ref 'AWS::StackId'
  tttHandlerEcsInstanceProfile:
    Type: "AWS::IAM::InstanceProfile"
    Properties:
      Path: "/"
      Roles:
        - !Ref tttCommonParserEcsInstanceRole
  tttHandlerJobServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ["-", [!Ref EnvironmentName, !Ref ComponentName, !Ref PartName, "ttt-common-parser-jobServiceRole"] ]
      AssumeRolePolicyDocument:
        Statement:
          - Action: ['sts:AssumeRole']
            Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
        Version: '2012-10-17'
      Path: /
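When I run a single Batch job, it completes without any problems. However, if I run several Batch jobs at once (2-3), some of them fail with the error: OutOfMemoryError: Container killed due to memory usage
From reading this article, I understand that I get this error when the container uses more memory than the host has available. However, I am using an a1.medium EC2 instance, which has 2 GB of memory and 1 vCPU, and my container is configured with 1 vCPU and 256 MB of memory, so this problem should not occur. I also looked at the ECS metrics, and they confirm that memory is not the issue.
Update 1:
After digging deeper and working with ECS and Batch, I realized the following:
If possible, I am considering the following approaches to solve it:
I don't yet know how to implement them.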
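Here is an example that automatically increases the memory when a job is killed because of a memory error. It uses AWS Step Functions to orchestrate the Batch job: https://github.com/ivansabik/aws-step-functions-batch-memory-auto-scaling-example
Retrying with increased memory is not a native feature of Batch, although it would be great if they added it.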
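To illustrate the retry-with-more-memory idea without Step Functions, here is a minimal client-side sketch in Python. It is not the approach used in the linked repository, which orchestrates the retries with a state machine; this version simply polls from the caller. The function name `run_with_memory_retry`, the doubling strategy, and the 2048 MiB cap are assumptions made for illustration. The `batch` argument is expected to behave like a boto3 Batch client, of which only `submit_job` and `describe_jobs` are used.

```python
import time

# Substring that appears in the container reason when the job is OOM-killed.
OOM_MARKER = "OutOfMemoryError"


def run_with_memory_retry(batch, job_name, job_queue, job_definition,
                          memory_mib=256, max_memory_mib=2048,
                          poll_seconds=10, sleep=time.sleep):
    """Submit a Batch job and resubmit it with doubled memory after an OOM kill.

    Returns (job_id, memory_mib) of the attempt that succeeded.
    Raises RuntimeError for non-OOM failures or when the memory cap is reached.
    """
    while True:
        response = batch.submit_job(
            jobName=job_name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            # Override the memory set in the job definition for this attempt.
            containerOverrides={
                "resourceRequirements": [
                    {"type": "MEMORY", "value": str(memory_mib)}
                ]
            },
        )
        job_id = response["jobId"]

        # Poll until the job reaches a terminal state.
        while True:
            job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
            if job["status"] in ("SUCCEEDED", "FAILED"):
                break
            sleep(poll_seconds)

        if job["status"] == "SUCCEEDED":
            return job_id, memory_mib

        reason = job.get("container", {}).get("reason", "")
        out_of_memory = OOM_MARKER in reason
        if not out_of_memory or memory_mib * 2 > max_memory_mib:
            detail = reason or job.get("statusReason", "unknown")
            raise RuntimeError(f"Job {job_id} failed: {detail}")

        # OOM kill with headroom left: double the memory and try again.
        memory_mib *= 2
```

With boto3 installed, it could be called as `run_with_memory_retry(boto3.client("batch"), "my-job", "my-queue", "my-job-definition")`, where the three names are placeholders. Keep in mind that the requested memory still has to fit on an instance type allowed by the compute environment, otherwise the job will stay in RUNNABLE rather than fail.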