I am running Kubernetes Jobs in which the pods are terminated and recreated multiple times for some unknown reason. I suspect the pods are being killed by some kind of eviction process, because the terminations happen at the same time across all pods and all jobs. I am looking for a way to debug what is causing these pod terminations.
Here is an example of a job manifest I run:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "generateName": "job-",
    "namespace": "default"
  },
  "spec": {
    "backoffLimit": 0,
    "template": {
      "spec": {
        "containers": [
          {
            "command": [
              "/bin/sh"
            ],
            "image": "******",
            "name": "x",
            "resources": {
              "limits": {
                "cpu": 2,
                "memory": "4G"
              },
              "requests": {
                "cpu": 2,
                "memory": "4G"
              }
            }
          }
        ],
        "restartPolicy": "Never"
      }
    },
    "ttlSecondsAfterFinished": 600
  }
}
I want to use kubectl describe pod and kubectl logs to determine what is causing the pods to terminate. However, right after termination the pods are deleted, so they can no longer be inspected with those commands.
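(For reference, while a pod is still alive its output can in principle be captured by watching for new pods and streaming their logs; a sketch, with <job> and <pod> as placeholders and job-name being the label the Job controller puts on its pods:)
kubectl get pods -n default -l job-name=<job> --watch
kubectl logs -f <pod> -n default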
I have also checked kubectl get events to try to find out why the pods are being terminated. However, the output provides very little information (a more targeted query is sketched after this output):
5m16s Normal Created pod/job-q4v5l-vxtgg Created container x
5m15s Normal Started pod/job-q4v5l-vxtgg Started container x
5m15s Normal Killing pod/job-q4v5l-vxtgg Stopping container x
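A slightly more targeted query, filtered to one pod and sorted by time, can sometimes surface more detail, such as the component that issued the kill (a sketch; <pod-name> is a placeholder):
kubectl get events -n default --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp -o wide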
The kubectl describe job command shows the following events. From this output you can see that pods are being created over and over again:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 6m38s job-controller Created pod: job-q4v5l-7trcd
Normal SuccessfulCreate 6m34s job-controller Created pod: job-q4v5l-zzw27
Normal SuccessfulCreate 6m33s job-controller Created pod: job-q4v5l-4crzq
Normal SuccessfulCreate 6m31s job-controller Created pod: job-q4v5l-sjbdh
Normal SuccessfulCreate 6m28s job-controller Created pod: job-q4v5l-fhz2x
Normal SuccessfulCreate 6m25s job-controller Created pod: job-q4v5l-6vgg5
Normal SuccessfulCreate 6m22s job-controller Created pod: job-q4v5l-7dmh4
Normal SuccessfulCreate 6m19s job-controller Created pod: job-q4v5l-klf4q
Normal SuccessfulCreate 6m15s job-controller Created pod: job-q4v5l-87vwx
Normal SuccessfulCreate 5m32s (x16 over 6m12s) job-controller (combined from similar events): Created pod: job-q4v5l-6x5pv
As Shahar Azulay explains in his blog post:
There are many reasons why a Pod can end up in a Failed state due to an unsuccessful container termination. Common root causes include failure to pull the container image because it is unavailable, an error in the application code, or a misconfiguration in the Pod's YAML. But simply knowing that a Pod failed does not mean you will know why it failed. Unless you dig deeper, the only thing you know is that it is in a failed state.
One way to dig deeper is to look at container exit codes. Container exit codes are numeric codes that give a nominal reason why a container stopped working. You can get a container's exit code by running
kubectl get pod termination-demo
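For a job pod like yours, the exit code can also be pulled out directly with jsonpath (a sketch; <pod-name> is a placeholder and the path assumes the first container has already terminated):
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'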
I adjusted your yaml, substituting busybox for the image, to simulate what you are doing:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "generateName": "job-",
    "namespace": "default"
  },
  "spec": {
    "backoffLimit": 0,
    "template": {
      "spec": {
        "containers": [
          {
            "command": [
              "/bin/sh"
            ],
            "image": "busybox",
            "name": "x",
            "resources": {
              "limits": {
                "cpu": 2,
                "memory": "4G"
              },
              "requests": {
                "cpu": 2,
                "memory": "4G"
              }
            }
          }
        ],
        "restartPolicy": "Never"
      }
    },
    "ttlSecondsAfterFinished": 600
  }
}
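One note if you want to reproduce this: because the manifest only sets generateName and no fixed name, it has to be submitted with kubectl create rather than kubectl apply (a sketch, assuming the manifest is saved as job.json):
kubectl create -f job.json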
This created a pod that exited successfully:
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
job-vn8mc-jnpzz 0/1 Completed 0 3m34s
None of my pods disappeared the way you describe.
My kubectl describe job:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 4m49s job-controller Created pod: job-vn8mc-jnpzz
Normal Completed 3m8s job-controller Job completed
My kubectl get events:
4m10s Normal Created pod/job-vn8mc-jnpzz Created container x
4m10s Normal Started pod/job-vn8mc-jnpzz Started container x
5m47s Normal SuccessfulCreate job/job-vn8mc Created pod: job-vn8mc-jnpzz
4m6s Normal Completed job/job-vn8mc Job completed
Compared with yours:
5m16s Normal Created pod/job-q4v5l-vxtgg Created container x
5m15s Normal Started pod/job-q4v5l-vxtgg Started container x
5m15s Normal Killing pod/job-q4v5l-vxtgg Stopping container x
This tells me that your job is trying to create pods, the pods are failing to complete successfully, and the job keeps retrying and then gives up.
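One way to confirm that from the Job object itself is to look at its status conditions, where a Failed condition with reason BackoffLimitExceeded would match that picture (a sketch; the job name is taken from your events output):
kubectl get job job-q4v5l -o jsonpath='{.status.conditions}'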
So, I have converted your job into a standalone pod yaml:
apiVersion: v1
kind: Pod
metadata:
  name: job-as-pod
  namespace: default
spec:
  containers:
  - command:
    - /bin/sh
    image: *******
    imagePullPolicy: Always
    name: x
  restartPolicy: Never
Run it, and it should create a pod named job-as-pod that will either complete:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
job-as-pod 0/1 Completed 0 2m15s
or fail:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
job-as-pod 0/1 Error 0 12s
My hope is that when you plug your image in here, it errors out. Then you can debug the exact error.
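Because this standalone pod is not cleaned up by a Job controller, it stays around after failing, so the commands you wanted to use now have something to inspect (a sketch):
kubectl describe pod job-as-pod    # look at Last State, Reason and Exit Code
kubectl logs job-as-pod            # whatever the container wrote before exiting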