Kubernetes CronJob - Prometheus alert not firing

I have set up a CronJob whose container exits with a non-zero code. I created a PrometheusRule and an AlertmanagerConfig for this CronJob, but the alert is not firing as expected.

Here is the CronJob definition:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: exitjob
  namespace: monitoring
spec:
  schedule: "*/4 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - command:
                - sh
                - exit
                - "1"
              image: alpine
              imagePullPolicy: Always
              name: main
          restartPolicy: Never
          terminationGracePeriodSeconds: 30

This is the PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-job-alert
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring

        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max,1)
              * ON(job, namespace) GROUP_LEFT()
              label_replace(
                label_replace(
                  (kube_job_status_failed != 0),
                  "job", "$1", "job_name", "(.+)"),
                "cronjob", "$1", "owner_name", "(.+)")
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
        - alert: CronJobStatusFailed
          expr: |
            job_cronjob:kube_job_status_failed:sum
            * ON(job, namespace) GROUP_RIGHT()
            kube_cronjob_labels
            > 0
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'

And the associated AlertmanagerConfig:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
        - name: job
          value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: '[email protected]'
          from: '[email protected]'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false

I checked the Alertmanager logs, but there are no errors at all. They only show configuration reload events:

ts=2023-07-24T13:30:44.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T13:30:44.218Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.721Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.730Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.229Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml

In the Prometheus UI, the alert is shown as inactive.
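
One way to narrow this down is to evaluate each stage of the rule chain on its own in the Prometheus expression browser and see which one stops returning series. The queries below are only a sketch: they reuse the metric and rule names exactly as they appear in the PrometheusRule above, and which of them return data in this cluster is not known.

```promql
# Raw kube-state-metrics series for failed Jobs (input to the second recording rule)
kube_job_status_failed != 0

# Job-to-owner mapping, used by the first recording rule
kube_job_owner{owner_name!=""}

# Output of the first recording rule
job:kube_job_status_start_time:max

# Output of the second recording rule, under the name it is recorded as
job:kube_job_status_failed:sum

# Series name that the alert expression queries
job_cronjob:kube_job_status_failed:sum

# Right-hand side of the join in the alert expression
kube_cronjob_labels
```

Each query is entered separately. If any stage returns an empty result, the stages that depend on it, including the alert expression, would likely have nothing to evaluate, which is consistent with an alert that stays inactive.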

What am I missing?

kubernetes alert kubernetes-cronjob kube-prometheus-stack