How do I debug Kubernetes pods in a Job that are terminated unexpectedly?


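I am running Kubernetes Jobs in which pods are terminated and recreated many times for some unknown reason. I assume the pods are being terminated by some kind of eviction process, because the terminations happen at the same time across all pods and all jobs. I am looking for a way to debug what is causing these pods to be terminated.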

Here is an example of the Job manifest I am running:

{
 "apiVersion": "batch/v1",
 "kind": "Job",
 "metadata": {
  "generateName": "job-",
  "namespace": "default"
 },
 "spec": {
  "backoffLimit": 0,
  "template": {
   "spec": {
    "containers": [
     {
      "command": [
       "/bin/sh"
      ],
      "image": "******",
      "name": "x",
      "resources": {
       "limits": {
        "cpu": 2,
        "memory": "4G"
       },
       "requests": {
        "cpu": 2,
        "memory": "4G"
       }
      }
     }
    ],
    "restartPolicy": "Never"
   }
  },
  "ttlSecondsAfterFinished": 600
 }
}

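I would like to use kubectl describe pod and kubectl logs to determine what caused the pods to terminate. However, the pods are deleted immediately after termination, so they cannot be inspected with these commands.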
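Because the pods vanish before they can be inspected, one option is to record their state while they still exist. The following is a minimal sketch, not a definitive recipe: it assumes the pods carry the job-name label that the Job controller adds, and the function name watch_job_pods and the output file name are illustrative.

```shell
# Stream every change to a Job's pods as YAML, so the last recorded status
# (container state, reason, exit code) is kept even after the pod is deleted.
# Assumption: the pods carry the job-name label set by the Job controller.
watch_job_pods() {
  job="$1"
  namespace="${2:-default}"
  kubectl get pods -n "$namespace" -l "job-name=$job" \
    --watch -o yaml
}

# Usage (runs until interrupted with Ctrl-C):
#   watch_job_pods job-q4v5l > job-q4v5l-pods.yaml
```

Starting the watch before the Job is created means the first pod is captured too.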

I have checked kubectl get events to try to determine why the pods were terminated. However, the output provides little information:

5m16s       Normal    Created                pod/job-q4v5l-vxtgg   Created container x
5m15s       Normal    Started                pod/job-q4v5l-vxtgg   Started container x
5m15s       Normal    Killing                pod/job-q4v5l-vxtgg   Stopping container x
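When the pod events alone say little, it may help to narrow the event list to a single object and to look at node-level events as well, since an eviction would normally originate from the node. A hedged sketch follows; the function names are illustrative, and whether the node events reveal anything depends on the cluster.

```shell
# List the events for one object (for example a pod), oldest first.
events_for() {
  name="$1"
  namespace="${2:-default}"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$name" \
    --sort-by=.lastTimestamp
}

# List events whose involved object is a Node. Resource pressure or node
# shutdown may show up here if an eviction is the cause.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector involvedObject.kind=Node \
    --sort-by=.lastTimestamp
}

# Usage:
#   events_for job-q4v5l-vxtgg
#   node_events
```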

The kubectl describe job command shows the following events. From this output it can be seen that pods were created repeatedly:

Events:
  Type    Reason            Age                     From            Message
  ----    ------            ----                    ----            -------
  Normal  SuccessfulCreate  6m38s                   job-controller  Created pod: job-q4v5l-7trcd
  Normal  SuccessfulCreate  6m34s                   job-controller  Created pod: job-q4v5l-zzw27
  Normal  SuccessfulCreate  6m33s                   job-controller  Created pod: job-q4v5l-4crzq
  Normal  SuccessfulCreate  6m31s                   job-controller  Created pod: job-q4v5l-sjbdh
  Normal  SuccessfulCreate  6m28s                   job-controller  Created pod: job-q4v5l-fhz2x
  Normal  SuccessfulCreate  6m25s                   job-controller  Created pod: job-q4v5l-6vgg5
  Normal  SuccessfulCreate  6m22s                   job-controller  Created pod: job-q4v5l-7dmh4
  Normal  SuccessfulCreate  6m19s                   job-controller  Created pod: job-q4v5l-klf4q
  Normal  SuccessfulCreate  6m15s                   job-controller  Created pod: job-q4v5l-87vwx
  Normal  SuccessfulCreate  5m32s (x16 over 6m12s)  job-controller  (combined from similar events): Created pod: job-q4v5l-6x5pv
kubernetes kubernetes-jobs
2 Answers
0 votes

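As Shahar Azulay explains in a blog post:

There are many reasons why a Pod can end up in the Failed state due to unsuccessful container termination. Common root causes include failure to pull the container image because it is unavailable, bugs in the application code, or misconfiguration in the Pod's YAML. But simply knowing that a Pod has failed does not mean you will know the cause of the failure. Unless you dig deeper, the only thing you know is that it is in the Failed state.

One way to dig deeper is to look at container exit codes. Container exit codes are numeric codes that give a nominal reason why a container stopped running. You can get the exit code of the containers in a Pod by running: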

kubectl get pod termination-demo
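Note that the default table output of kubectl get pod shows only a status summary; the exit code itself is stored under status.containerStatuses[].state.terminated.exitCode. A minimal sketch for pulling it out and reading it follows. The function names are illustrative, and the interpretation relies on the common convention that codes above 128 mean the process was killed by a signal.

```shell
# Print the exit code of each terminated container in a pod.
pod_exit_codes() {
  pod="$1"
  namespace="${2:-default}"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'
}

# Give a rough reading of an exit code. By convention, a code above 128
# means the process was killed by signal number (code - 128).
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with error code $code"
  fi
}

# Usage:
#   pod_exit_codes termination-demo
#   explain_exit_code 137
```

For example, 137 corresponds to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM), both of which point to the container being stopped from outside rather than failing on its own.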

See this doc for more information on the reasons for Pod failures, and this doc for debugging Pods.


0 votes

I adapted your YAML, using busybox as the image instead, to simulate what you are doing:

{
 "apiVersion": "batch/v1",
 "kind": "Job",
 "metadata": {
  "generateName": "job-",
  "namespace": "default"
 },
 "spec": {
  "backoffLimit": 0,
  "template": {
   "spec": {
    "containers": [
     {
      "command": [
       "/bin/sh"
      ],
      "image": "busybox",
      "name": "x",
      "resources": {
       "limits": {
        "cpu": 2,
        "memory": "4G"
       },
       "requests": {
        "cpu": 2,
        "memory": "4G"
       }
      }
     }
    ],
    "restartPolicy": "Never"
   }
  },
  "ttlSecondsAfterFinished": 600
 }
}

This created one pod, which exited successfully:

$ kubectl get pods -n default
NAME              READY   STATUS      RESTARTS   AGE
job-vn8mc-jnpzz   0/1     Completed   0          3m34s

I did not see any pods disappear the way you describe.

My kubectl describe job:

Events:
  Type    Reason            Age    From            Message
  ----    ------            ----   ----            -------
  Normal  SuccessfulCreate  4m49s  job-controller  Created pod: job-vn8mc-jnpzz
  Normal  Completed         3m8s   job-controller  Job completed

My kubectl get events:

4m10s       Normal    Created                        pod/job-vn8mc-jnpzz                                        Created container x
4m10s       Normal    Started                        pod/job-vn8mc-jnpzz                                        Started container x
5m47s       Normal    SuccessfulCreate               job/job-vn8mc                                              Created pod: job-vn8mc-jnpzz
4m6s        Normal    Completed                      job/job-vn8mc                                              Job completed

Compare that with yours:

5m16s       Normal    Created                pod/job-q4v5l-vxtgg   Created container x
5m15s       Normal    Started                pod/job-q4v5l-vxtgg   Started container x
5m15s       Normal    Killing                pod/job-q4v5l-vxtgg   Stopping container x

This tells me that your job is trying to create pods, the pods are not completing successfully, and the job is retrying and then giving up.
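One way to check whether the Job really gave up is to read the reasons recorded in its status conditions. This is a sketch under assumptions: the function name is illustrative, and the output is empty while the Job has no conditions yet.

```shell
# Print the reason of each condition recorded on a Job's status.
job_condition_reasons() {
  job="$1"
  namespace="${2:-default}"
  kubectl get job "$job" -n "$namespace" \
    -o jsonpath='{.status.conditions[*].reason}'
}

# Usage:
#   job_condition_reasons job-q4v5l
```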

So I have converted your job into a standalone pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: job-as-pod
  namespace: default
spec:
  containers:
  - command:
    - /bin/sh
    image: *******
    imagePullPolicy: Always
    name: x
  restartPolicy: Never

Run it. It should create a pod named job-as-pod, which will either complete:

$ kubectl get pods
NAME         READY   STATUS      RESTARTS   AGE
job-as-pod   0/1     Completed   0          2m15s

or fail:

$ kubectl get pods
NAME         READY   STATUS   RESTARTS   AGE
job-as-pod   0/1     Error    0          12s

I expect that if you plug your image in here, it will error out. Then you can debug the exact error.
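Because a standalone pod is not deleted by a Job controller, it stays around for inspection after it stops. A minimal sketch of that inspection step, with an illustrative function name and the pod name and namespace taken from the YAML above:

```shell
# Show the recorded state and events of a pod, then its container logs.
inspect_pod() {
  pod="${1:-job-as-pod}"
  namespace="${2:-default}"
  kubectl describe pod "$pod" -n "$namespace"
  kubectl logs "$pod" -n "$namespace"
}

# Usage:
#   inspect_pod
```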
