AWS ECS: OutOfMemoryError: Container killed due to memory usage


I am using AWS Batch jobs with ECS and EC2. This is the serverless template I'm using:

Description: >
  Setup
    AWS Batch Compute Environment
    AWS Batch Job Definition
    AWS Batch Queue

Parameters:
  ComputeEnvironmentName:
    Description: The batch compute environment name to use
    Type: String
  PlatformVpcStackName:
    Description: VPC stack name for export subnets
    Type: String
  JobDefinitionName:
    Description: Batch Job Definition name
    Type: String
  ECRRepositoryName:
    Description: ECR Repository Name
    Type: String
  JobQueueName:
    Description: Batch Job Queue Name
    Type: String
  EnvironmentName:
    Description: Environment name e.g. dev, staging, prod
    Type: String
    Default: dev
  ComponentName:
    Description: Name of the component this resource is created for
    Type: String
  PartName:
    Description: The name of the component's part
    Type: String

Resources:
  ASGLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required
      TagSpecifications:
        - ResourceType: launch-template
          Tags:
          - Key: STAGE
            Value: !Ref EnvironmentName
          - Key: COMPONENT_NAME
            Value: !Ref ComponentName
          - Key: PART_NAME
            Value: !Ref PartName
          - Key: StackID
            Value: !Ref 'AWS::StackId'

  tttCommonParserBatchComputeEnvironment:
    Type: 'AWS::Batch::ComputeEnvironment'
    Properties:
      ComputeEnvironmentName: !Ref ComputeEnvironmentName
      ComputeResources:
        MaxvCpus: 2
        MinvCpus: 0
        InstanceRole: !GetAtt tttHandlerEcsInstanceProfile.Arn
        InstanceTypes:
          - a1.medium
        DesiredvCpus: 2
        LaunchTemplate:
          LaunchTemplateId: !Ref ASGLaunchTemplate
        Tags:
          'STAGE' : !Ref EnvironmentName
          'COMPONENT_NAME' : !Ref ComponentName
          'PART_NAME' : !Ref PartName
          'StackID' : !Ref 'AWS::StackId'
      State: ENABLED
      Type: MANAGED
      Tags:
        'STAGE' : !Ref EnvironmentName
        'COMPONENT_NAME' : !Ref ComponentName
        'PART_NAME' : !Ref PartName
        'StackID' : !Ref 'AWS::StackId'

  BatchJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: !Ref JobDefinitionName
      RetryStrategy:
        Attempts: 1
      Timeout:
        AttemptDurationSeconds: 300
      ContainerProperties:
        JobRoleArn: !Ref tttHandlerJobServiceRole
        Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/testing
        Vcpus: 1
        Memory: 256
      Tags: {
        "STAGE" :  !Ref EnvironmentName,
        "COMPONENT_NAME" : !Ref ComponentName,
        "PART_NAME" : !Ref PartName,
        "StackID" : !Ref 'AWS::StackId'
        }

  BatchJobQueue:
    Type: "AWS::Batch::JobQueue"
    Properties:
      JobQueueName: !Ref JobQueueName
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref tttCommonParserBatchComputeEnvironment
      Tags:
        'STAGE' : !Ref EnvironmentName
        'COMPONENT_NAME' : !Ref ComponentName
        'PART_NAME' : !Ref PartName
        'StackID' : !Ref 'AWS::StackId'

  tttCommonParserEcsInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ["-", [!Ref EnvironmentName, !Ref ComponentName, !Ref PartName, "ttt-common-parser-ecsInstanceRole"] ]
      AssumeRolePolicyDocument:
        Statement:
          - Action: ['sts:AssumeRole']
            Effect: Allow
            Principal:
              Service: [ec2.amazonaws.com]
        Version: '2012-10-17'
      Path: /
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
      Tags:
        - Key: STAGE
          Value: !Ref EnvironmentName
        - Key: COMPONENT_NAME
          Value: !Ref ComponentName
        - Key: PART_NAME
          Value: !Ref PartName
        - Key: StackID
          Value: !Ref 'AWS::StackId'

  tttHandlerEcsInstanceProfile:
    Type: "AWS::IAM::InstanceProfile"
    Properties:
      Path: "/"
      Roles:
        - !Ref tttCommonParserEcsInstanceRole

  tttHandlerJobServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ["-", [!Ref EnvironmentName, !Ref ComponentName, !Ref PartName, "ttt-common-parser-jobServiceRole"] ]
      AssumeRolePolicyDocument:
        Statement:
          - Action: ['sts:AssumeRole']
            Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
        Version: '2012-10-17'
      Path: /

When I run a single batch job, it runs without any issues. However, if I run multiple batch jobs (2-3), some of them fail with the error: OutOfMemoryError: Container killed due to memory usage

To understand the error, I referred to this article: https://repost.aws/knowledge-center/ecs-resolve-outofmemory-errors

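From the article, I understand that I get this error if the container uses more memory than the host has available. However, I am using an a1.medium EC2 instance, which has 2 GB of memory and 1 vCPU, while my container uses 1 vCPU and 256 MB of memory, so this problem should not occur. I also checked the ECS metrics, which confirmed that memory is not an issue.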

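Update 1:

After digging deeper and experimenting with ECS and Batch, I realized the following:

  1. Each of my tasks needs 136 MB of memory.
  2. When there are multiple batch jobs/tasks, each task reuses a single container instead of getting a new one. So when a second task starts in the same container, 136 MB is already in use, and the total memory required becomes 136 + 136 = 272 MB, which exceeds the allowed limit (256 MB).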
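
The arithmetic in point 2 can be checked with a small sketch. The figures (136 MB per task, 256 MB limit) come from the observations above; the function names are hypothetical, and the model simply assumes, as described in point 2, that the memory of tasks sharing one container adds up.

```python
def required_memory_mb(task_mb: int, tasks: int) -> int:
    """Total memory needed if `tasks` tasks share one container (assumed additive)."""
    return task_mb * tasks


def fits_in_container(task_mb: int, tasks: int, limit_mb: int) -> bool:
    """Return True if the combined memory of the tasks stays within the container limit."""
    return required_memory_mb(task_mb, tasks) <= limit_mb


# One task fits under the 256 MB limit; two tasks (272 MB) do not.
print(fits_in_container(136, 1, 256))
print(fits_in_container(136, 2, 256))
```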

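I am considering the following approaches to solve it, if they are possible:

  1. For each task, a new container should be instantiated instead of reusing the old one, even when tasks run in parallel.
  2. The container's memory should increase dynamically based on load, something like an EC2 Auto Scaling group.

I don't know how to implement either of them yet.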

amazon-web-services amazon-ec2 amazon-ecs batch-processing
1 Answer

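Here is an example that automatically increases memory when a job is killed because of a memory error. It uses AWS Step Functions to orchestrate the Batch job: https://github.com/ivansabik/aws-step-functions-batch-memory-auto-scaling-example

Retrying with increased memory is not a native Batch feature, though it would be great if they added it.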
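
For illustration, the same idea can be sketched without Step Functions as a plain polling loop. This is not the linked repository's implementation; it is a minimal sketch that assumes `batch` behaves like `boto3.client("batch")` (only `submit_job` and `describe_jobs` are used) and that an out-of-memory kill shows up in the job's container reason starting with `OutOfMemoryError`. The function name, the memory steps, and the polling interval are hypothetical choices.

```python
import time

OOM_PREFIX = "OutOfMemoryError"


def run_with_memory_retry(batch, job_name, job_queue, job_definition,
                          memory_steps_mib=(256, 512, 1024), poll_seconds=15):
    """Submit a Batch job and resubmit it with more memory after an OOM kill.

    `batch` is assumed to behave like boto3.client("batch").
    Returns (final_status, memory_mib_of_last_attempt).
    """
    status, memory = None, None
    for memory in memory_steps_mib:
        # Override the job definition's memory for this attempt only.
        job_id = batch.submit_job(
            jobName=job_name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            containerOverrides={
                "resourceRequirements": [{"type": "MEMORY", "value": str(memory)}]
            },
        )["jobId"]

        # Poll until the job reaches a terminal state.
        while True:
            job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
            status = job["status"]
            if status in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(poll_seconds)

        reason = job.get("container", {}).get("reason", "")
        # Stop on success, or on any failure that is not an out-of-memory kill.
        if status == "SUCCEEDED" or not reason.startswith(OOM_PREFIX):
            return status, memory
    # Every memory step ended in an OOM kill.
    return status, memory
```

With a real client this might be called as `run_with_memory_retry(boto3.client("batch"), "my-job", "my-queue", "my-job-definition")`, where the job, queue, and definition names are placeholders.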
