Flink checkpoints stall and time out with latency-related errors


I recently upgraded an existing Flink job (previously running on Flink 1.15) to run under the official Flink Kubernetes Operator (targeting Flink 1.18), and started seeing some strange checkpointing behavior that I hadn't encountered in other jobs (or in this job prior to the migration).

Some additional information about the job and its deployment (only the potentially relevant parts are included):

spec:
  flinkConfiguration:
    # Checkpointing Configuration
    execution.checkpointing.interval: 10s
    execution.checkpointing.mode: EXACTLY_ONCE
    execution.checkpointing.tolerable-failed-checkpoints: "5"
    execution.checkpointing.unaligned: "true"
    ...
    # High Availability
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: gs://.../my-job/ha
    # Autoscaling (with in-place upgrades and default autoscaling configuration)
    jobmanager.scheduler: adaptive
    kubernetes.operator.job.autoscaler.enabled: "true"
    kubernetes.operator.job.autoscaler.catch-up.duration: 5m
    kubernetes.operator.job.autoscaler.metrics.window: 3m
    kubernetes.operator.job.autoscaler.restart.time: 2m
    kubernetes.operator.job.autoscaler.stabilization.interval: 1m
    kubernetes.operator.job.autoscaler.target.utilization: "0.6"
    kubernetes.operator.job.autoscaler.target.utilization.boundary: "0.2"
    kubernetes.operator.job.autoscaler.vertex.max-parallelism: "8"
    ...
    # Checkpointing / Savepointing Configuration
    state.checkpoints.dir: gs://.../my-job/checkpoints
    state.savepoints.dir: gs://.../my-job/savepoints
  podTemplate:
    ...
    spec:
      ...
      containers:
      # Configuration for writing to GCS for Checkpoints/Savepoints/HA
      - env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /flink/gcs-creds/key.json
        - name: ENABLE_BUILT_IN_PLUGINS
          value: flink-gs-fs-hadoop-1.18.1.jar

What I'm running into is that once the job starts, it attempts a checkpoint, which then stalls until it times out. The job is configured to checkpoint every 10 seconds, but it will not proceed or attempt the next checkpoint until the previous one has fully timed out (after 10 minutes), similar to the checkpoint history shown below:
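For reference, the 10-minute stall lines up with Flink's default checkpoint timeout. A minimal sketch of making that timeout explicit (the key below is a standard Flink option, not something taken from the original config; lowering it can make the stall-until-timeout cycle faster to iterate on while debugging):

```yaml
# Assumption: illustrative addition, not part of the original config.
# 10min is Flink's default checkpoint timeout, which matches the
# observed stall duration; lowering it fails stuck checkpoints sooner.
execution.checkpointing.timeout: 10min
```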

When looking through the task manager logs, I see a lot of latency warnings I've never seen before, which I suspect are related to the checkpointing problem:

Apr 15, 2024 3:44:20 PM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats
INFO: Detected potential high latency for operation op_create. latencyMs=695; previousMaxLatencyMs=0; operationCount=3; context=gs://.../my-job/checkpoints/93adc4ebde39a17486133f0
...
Apr 15, 2024 4:04:19 PM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats
INFO: Detected potential high latency for operation stream_write_close_operations. 
latencyMs=247; previousMaxLatencyMs=246; operationCount=7; context=gs://...my-job/checkpoints/9
...
Apr 15, 2024 4:04:20 PM com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem repairImplicitDirectory
INFO: Successfully repaired 'gs://.../my-job/checkpoints/93adc4ebde39a17486133f0c3ef3f508/chk-3855/' directory.
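The "Successfully repaired ... directory" lines come from the GCS connector's implicit-directory repair logic, which issues extra metadata writes against every checkpoint path. If those repairs are contributing to the latency, the connector exposes a switch for them (a sketch, assuming the standard gcs-connector Hadoop property below; whether it can be passed through `flinkConfiguration` or must go into the Hadoop `core-site.xml` loaded by the plugin depends on the plugin version):

```yaml
# Assumption: standard hadoop gcs-connector property, not in the original
# config. Disables the implicit-directory "repair" writes seen in the logs;
# test carefully, since some tooling relies on placeholder directory objects.
fs.gs.implicit.dir.repair.enable: "false"
```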

I'm not sure whether this is a problem with the 1.18 Hadoop plugin or with the configuration of the job itself, but it keeps failing and I'm not sure how to pin down the exact cause.

hadoop apache-flink flink-streaming flink-checkpoint flink-kubernetes-operator
1 Answer

After some significant digging and trial and error, I was finally able to get this running as expected. The issue appears to be related to the use of unaligned checkpoints, as disabling them via the following configuration had an immediate positive effect:

execution.checkpointing.unaligned: "false"
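If unaligned checkpoints are still desirable for their resilience under back-pressure, Flink also supports a hybrid mode in which a checkpoint starts aligned and only falls back to unaligned once a barrier has been in flight for too long. A sketch, assuming the standard `aligned-checkpoint-timeout` option (available in recent Flink releases; it is not part of the original answer):

```yaml
# Assumption: illustrative alternative, not from the original answer.
# Checkpoints begin aligned; an in-flight barrier switches to unaligned
# mode only after exceeding the timeout below.
execution.checkpointing.unaligned: "true"
execution.checkpointing.aligned-checkpoint-timeout: 30s
```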

The problem itself appears to be a regression introduced in Flink 1.18, since several other jobs I run also use unaligned checkpoints but target an earlier Flink release (1.17.x). I haven't had any luck tracing it to a specific known issue in the Apache JIRA, but I'd be glad if someone more familiar with the problem could elaborate.
