Fluent Bit is truncating and dropping Java application logs

Problem description

I am trying to ship logs from AWS EKS to AWS CloudWatch with Fluent Bit. My Fluent Bit configuration generally works and most logs reach CloudWatch, but larger log events are a problem: at first they get truncated, and eventually whole log records are discarded.

[2023/12/05 14:10:54] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=77603133 watch_fd=5
[2023/12/05 14:30:14] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=294805] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:19] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=364675] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:24] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=523322] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:29] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] Discarding massive log record
[2023/12/05 14:30:29] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=836367] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:34] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=554279] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:39] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=731990] Truncating event which is larger than max size allowed by CloudWatch
[2023/12/05 14:30:39] [ warn] [output:cloudwatch_logs:cloudwatch_logs.0] [size=793332] Truncating event which is larger than max size allowed by CloudWatch

Here is my current setup:

fluent-bit.conf

  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.max_chunks_up     128
        storage.backlog.mem_limit 5M
        scheduler.cap             30

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

application-log.conf

    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*
        Path                /var/log/containers/*.log
        multiline.parser    java, docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     Off
        Refresh_Interval    5
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     Off
        Refresh_Interval    30
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              On
        Annotations         Off
    
    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_name     $(kubernetes['container_name'])-$(kubernetes['pod_name'])
        log_stream_template $kubernetes['container_name'].$kubernetes['pod_name']
        auto_create_group   On
        Retry_Limit         False

parsers.conf

  parsers.conf: |
    [MULTILINE_PARSER]
        name                multiline-regex
        type                regex
        flush_timeout       5
        # rules |   state name  | regex pattern                  | next state
        # ------|---------------|--------------------------------------------
        rule      "start_state"   "/(Dec \d+ \d+\:\d+\:\d+)(.*)/"  "cont"
        rule      "cont"          "/^\s+at.*/"                     "cont"
    
    [PARSER]
        Name                docker
        Format              json
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

What I have tried:

I have tried sending the logs with both output plugins:

cloudwatch
cloudwatch_logs

Unfortunately, I get similar errors with both. I have also tried the built-in java multiline parser as well as several custom multiline parsers, with no luck. Usually the pod runs fine for a while and then these errors start to appear, so my assumption is that it has something to do with buffer memory or something similar?
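
For reference, a minimal sketch of the tail read-buffer options that bound how long a single line can be (Buffer_Chunk_Size / Buffer_Max_Size are standard tail settings; the values below are illustrative assumptions, not taken from the configuration above). With Skip_Long_Lines Off, a line longer than Buffer_Max_Size makes the tail input stop monitoring that file, so raising the limits is one way to rule the read buffer out:

    [INPUT]
        Name              tail
        Tag               application.*
        Path              /var/log/containers/*.log
        multiline.parser  java, docker, cri
        DB                /var/fluent-bit/state/flb_container.db
        # Illustrative values only: let single lines of up to ~1 MB be read
        # in full instead of stopping the tail on that file.
        Buffer_Chunk_Size 256k
        Buffer_Max_Size   1M
        Skip_Long_Lines   Off
        Mem_Buf_Limit     50MB
        storage.type      filesystem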

Any ideas or suggestions are welcome.

fluent amazon-cloudwatchlogs fluent-bit
1 Answer

I have since solved this issue.

The parsers were working fine. The errors turned out to be caused by one of our pods, which dumps an enormous amount of logs every half hour (14:30, 15:30, 16:30, and so on).

Even though I could see the errors, it was hard to figure out exactly which log line was being truncated, because there were so many incoming logs. To track it down I had to use a Lua script.

Here is my solution, which prints out every log record that would be truncated or discarded.

fluent-bit.conf: |
    [SERVICE]
        Daemon                    Off
        Flush                     2
        Grace                     5
        Log_Level                 warn
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.max_chunks_up     128
        storage.backlog.mem_limit 5M
        scheduler.cap             30

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

get-size.lua: |
      function cb_print(tag, timestamp, record)
          if record["log"] ~= nil then
              log_size = string.len(record["log"])
              record["log_size"] = log_size
              -- Print out all logs that are bigger than 255000 bytes
              if log_size > 255000 then
                  print(record["log"])
              end
          end
          return 1, timestamp, record
      end

application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*
        Path                /var/log/containers/*.log
        multiline.parser    java, docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     Off
        Refresh_Interval    5
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     Off
        Refresh_Interval    30
        Read_from_Head      ${READ_FROM_HEAD}
    
    [FILTER]
        Name                lua
        Match               *
        script              get-size.lua
        call                cb_print

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              On
        Annotations         Off
    
    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_name     $(kubernetes['container_name'])-$(kubernetes['pod_name'])
        log_stream_template $kubernetes['container_name'].$kubernetes['pod_name']
        auto_create_group   On
        Retry_Limit         False
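
The truncation itself comes from CloudWatch Logs' per-event size limit (roughly 256 KB per event), which is why the script flags records larger than 255000 bytes. If you would rather trim oversize events yourself than let the output plugin truncate or discard them, a small Lua filter can cap the log field before it reaches cloudwatch_logs. This is only a sketch: the truncate-log.lua name, the cb_truncate function and the 255000 cut-off are assumptions, not part of the setup above.

truncate-log.lua: |
      -- Hypothetical helper: cap record["log"] just under the CloudWatch
      -- per-event limit so the output no longer truncates or discards it.
      MAX_LOG_BYTES = 255000

      function cb_truncate(tag, timestamp, record)
          local log = record["log"]
          if log ~= nil and string.len(log) > MAX_LOG_BYTES then
              record["log"] = string.sub(log, 1, MAX_LOG_BYTES)
              record["log_truncated"] = true
          end
          -- return code 1 tells Fluent Bit to use the (possibly modified) record
          return 1, timestamp, record
      end

It would be wired up the same way as get-size.lua, with a lua filter placed before the cloudwatch_logs output:

    [FILTER]
        Name                lua
        Match               application.*
        script              truncate-log.lua
        call                cb_truncate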

Hope this saves someone a lot of time, as it did for me :)
