I have been trying to deploy Horovod in my K8s cluster using the Helm chart ('stable/horovod'), following these instructions: stable/horovod
WARNING: This chart is deprecated
W0417 12:32:47.954840 1565257 warnings.go:70] unknown field "spec.template.spec.selector"
NAME: mnist
LAST DEPLOYED: Wed Apr 17 12:32:47 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
*** NOTE: It may take a few minutes for the statefulset to be available
*** you can watch the status of statefulset by running 'kubectl get sts --namespace default -w mnist-horovod' ***
When I run:
$ kubectl get pods
NAME                  READY   STATUS     RESTARTS   AGE
mnist-horovod-0       0/1     Running    0          6s
mnist-horovod-1       0/1     Running    0          6s
mnist-horovod-kqlmr   0/1     Init:0/1   0          6s
$ k logs mnist-horovod-kqlmr:
Defaulted container "horovod-master" out of: horovod-master, wait-workers (init)
Error from server (BadRequest): container "horovod-master" in pod "mnist-horovod-kqlmr" is waiting to start: PodInitializing
kubectl describe output:
$ kubectl describe pod mnist-horovod-0 mnist-horovod-1 mnist-horovod-2xl72:
Name: mnist-horovod-0
Namespace: default
Priority: 0
Service Account: default
Node: node4/192.168.0.49
Start Time: Wed, 17 Apr 2024 13:20:01 +0000
Labels: app=horovod
apps.kubernetes.io/pod-index=0
chart=horovod-1.0.2
controller-revision-hash=mnist-horovod-7d8684f974
heritage=Helm
release=mnist
role=worker
statefulset.kubernetes.io/pod-name=mnist-horovod-0
Annotations: cni.projectcalico.org/containerID: 75f2f508ea805325810a2ff89656aa7ddd0cda2e54552125adbaa09bb4f0d7f5
cni.projectcalico.org/podIP: 10.233.74.110/32
cni.projectcalico.org/podIPs: 10.233.74.110/32
Status: Running
IP: 10.233.74.110
IPs:
IP: 10.233.74.110
Controlled By: StatefulSet/mnist-horovod
Containers:
worker:
Container ID: containerd://cfdad30d2a9d8b71ebcc1cc0e9d7c63020d0f9e665257b3d3be247808cd539e5
Image: uber/horovod:0.12.1-tf1.8.0-py3.5
Image ID: docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
Port: 22/TCP
Host Port: 0/TCP
Command:
/horovod/generated/run.sh
State: Running
Started: Wed, 17 Apr 2024 13:20:02 +0000
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Readiness: exec [/horovod/generated/check.sh] delay=1s timeout=1s period=2s #success=1 #failure=3
Environment:
SSHPORT: 22
USESECRETS: true
Mounts:
/etc/secret-volume from mnist-horovod-secret (ro)
/horovod/generated from mnist-horovod-cm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mzjtq (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
mnist-horovod-cm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mnist-horovod
Optional: false
mnist-horovod-secret:
Type: Secret (a volume populated by a Secret)
SecretName: mnist-horovod
Optional: false
kube-api-access-mzjtq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 104s default-scheduler Successfully assigned default/mnist-horovod-0 to node4
Normal Pulled 104s kubelet Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
Normal Created 104s kubelet Created container worker
Normal Started 104s kubelet Started container worker
Warning Unhealthy 103s kubelet Readiness probe failed: ssh localhost ls
+ ssh localhost ls
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.\r
Permission denied, please try again.\r
Permission denied, please try again.\r
Permission denied (publickey,password).
Warning Unhealthy 64s (x21 over 102s) kubelet Readiness probe failed: ssh localhost ls
+ ssh localhost ls
Permission denied, please try again.\r
Permission denied, please try again.\r
Permission denied (publickey,password).
Name: mnist-horovod-1
Namespace: default
Priority: 0
Service Account: default
Node: node3/192.168.0.173
Start Time: Wed, 17 Apr 2024 13:20:01 +0000
Labels: app=horovod
apps.kubernetes.io/pod-index=1
chart=horovod-1.0.2
controller-revision-hash=mnist-horovod-7d8684f974
heritage=Helm
release=mnist
role=worker
statefulset.kubernetes.io/pod-name=mnist-horovod-1
Annotations: cni.projectcalico.org/containerID: 492b753639b3b9795f88579d3b95021426712d738b11b8d91766afadb781d1b2
cni.projectcalico.org/podIP: 10.233.71.41/32
cni.projectcalico.org/podIPs: 10.233.71.41/32
Status: Running
IP: 10.233.71.41
IPs:
IP: 10.233.71.41
Controlled By: StatefulSet/mnist-horovod
Containers:
worker:
Container ID: containerd://c0bba2a1b8915d382e22e6004ac98cd51b7b8db07ff0e29b41e4fffc60697a96
Image: uber/horovod:0.12.1-tf1.8.0-py3.5
Image ID: docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
Port: 22/TCP
Host Port: 0/TCP
Command:
/horovod/generated/run.sh
State: Running
Started: Wed, 17 Apr 2024 13:20:02 +0000
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Readiness: exec [/horovod/generated/check.sh] delay=1s timeout=1s period=2s #success=1 #failure=3
Environment:
SSHPORT: 22
USESECRETS: true
Mounts:
/etc/secret-volume from mnist-horovod-secret (ro)
/horovod/generated from mnist-horovod-cm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xgxqx (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
mnist-horovod-cm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mnist-horovod
Optional: false
mnist-horovod-secret:
Type: Secret (a volume populated by a Secret)
SecretName: mnist-horovod
Optional: false
kube-api-access-xgxqx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 104s default-scheduler Successfully assigned default/mnist-horovod-1 to node3
Normal Pulled 104s kubelet Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
Normal Created 104s kubelet Created container worker
Normal Started 104s kubelet Started container worker
Warning Unhealthy 103s kubelet Readiness probe failed: ssh localhost ls
+ ssh localhost ls
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.\r
Permission denied, please try again.\r
Permission denied, please try again.\r
Permission denied (publickey,password).
Warning Unhealthy 64s (x21 over 102s) kubelet Readiness probe failed: ssh localhost ls
+ ssh localhost ls
Permission denied, please try again.\r
Permission denied, please try again.\r
Permission denied (publickey,password).
Name: mnist-horovod-2xl72
Namespace: default
Priority: 0
Service Account: default
Node: node2/192.168.0.199
Start Time: Wed, 17 Apr 2024 13:19:41 +0000
Labels: app=horovod
batch.kubernetes.io/controller-uid=e8521894-6539-41c7-8be0-39bdec4c94e3
batch.kubernetes.io/job-name=mnist-horovod
controller-uid=e8521894-6539-41c7-8be0-39bdec4c94e3
job-name=mnist-horovod
release=mnist
role=master
Annotations: cni.projectcalico.org/containerID: 1dd0097ebd4cef2ff065efe74474dcc07225cc922acf0894223240b04c3c9e65
cni.projectcalico.org/podIP: 10.233.75.49/32
cni.projectcalico.org/podIPs: 10.233.75.49/32
Status: Pending
IP: 10.233.75.49
IPs:
IP: 10.233.75.49
Controlled By: Job/mnist-horovod
Init Containers:
wait-workers:
Container ID: containerd://3041cf84383b791dc87081cd3c5d89f059e54c294dacdce70ff56014eec2e9cc
Image: uber/horovod:0.12.1-tf1.8.0-py3.5
Image ID: docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
Port: <none>
Host Port: <none>
Command:
/horovod/generated/waitWorkersReady.sh
Args:
/horovod/generated/hostfile
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 17 Apr 2024 13:20:56 +0000
Finished: Wed, 17 Apr 2024 13:21:27 +0000
Ready: False
Restart Count: 2
Environment:
SSHPORT: 22
USESECRETS: true
Mounts:
/etc/secret-volume from mnist-horovod-secret (ro)
/horovod/generated from mnist-horovod-cm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6n5w (ro)
Containers:
horovod-master:
Container ID:
Image: uber/horovod:0.12.1-tf1.8.0-py3.5
Image ID:
Port: 22/TCP
Host Port: 0/TCP
Command:
/horovod/generated/run.sh
Args:
mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
SSHPORT: 22
USESECRETS: true
Mounts:
/etc/secret-volume from mnist-horovod-secret (ro)
/horovod/generated from mnist-horovod-cm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6n5w (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
mnist-horovod-cm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mnist-horovod
Optional: false
mnist-horovod-secret:
Type: Secret (a volume populated by a Secret)
SecretName: mnist-horovod
Optional: false
kube-api-access-v6n5w:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m4s default-scheduler Successfully assigned default/mnist-horovod-2xl72 to node2
Normal Pulled 50s (x3 over 2m5s) kubelet Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
Normal Created 50s (x3 over 2m5s) kubelet Created container wait-workers
Normal Started 50s (x3 over 2m4s) kubelet Started container wait-workers
Warning BackOff 5s (x3 over 62s) kubelet Back-off restarting failed container wait-workers in pod mnist-horovod-2xl72_default(28bb853c-c6f5-4ece-aa10-46c6f8ad6e23)
wait-workers container logs:
function updateSSHPort() {
mkdir -p /root/.ssh
rm -f /root/.ssh/config
touch /root/.ssh/config
if [ -n "$SSHPORT" ]; then
echo "Port $SSHPORT" > /root/.ssh/config
echo "StrictHostKeyChecking no" >> /root/.ssh/config
fi
}
function runCheckSSH() {
if [[ "$USESECRETS" == "true" ]];then
set +e
yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
set -e
fi
for i in `cat $1 | awk '{print $(1)}'`;do
if [[ "$i" != *"master" ]];then
retry 30 ssh -o ConnectTimeout=2 -q $i exit
fi
done
}
function retry()
{
local n=0;local try=$1
local cmd="${@: 2}"
[[ $# -le 1 ]] && {
echo "Usage $0 <retry_number> <Command>";
}
set +e
until [[ $n -ge $try ]]
do
$cmd && break || {
echo "Command Fail.."
((n++))
echo "retry $n :: [$cmd]"
sleep 1;
}
done
$cmd
if [ $? -ne 0 ]; then
exit 1
fi
set -e
}
updateSSHPort
+ updateSSHPort
+ mkdir -p /root/.ssh
+ rm -f /root/.ssh/config
+ touch /root/.ssh/config
+ '[' -n 22 ']'
+ echo 'Port 22'
+ echo 'StrictHostKeyChecking no'
runCheckSSH $1
+ runCheckSSH /horovod/generated/hostfile
+ [[ true == \t\r\u\e ]]
+ set +e
+ yes
+ cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
+ yes
+ cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
+ set -e
cat $1 | awk '{print $(1)}'
++ cat /horovod/generated/hostfile
++ awk '{print $(1)}'
+ for i in '`cat $1 | awk '\''{print $(1)}'\''`'
+ [[ mnist-horovod-master != *\m\a\s\t\e\r ]]
+ for i in '`cat $1 | awk '\''{print $(1)}'\''`'
+ [[ mnist-horovod-0.mnist-horovod != *\m\a\s\t\e\r ]]
+ retry 30 ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ local n=0
+ local try=30
+ local 'cmd=ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit'
+ [[ 7 -le 1 ]]
+ set +e
+ [[ 0 -ge 30 ]]
+ ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ echo 'Command Fail..'
+ (( n++ ))
+ echo 'retry 1 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]'
+ sleep 1
Command Fail..
retry 1 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]
+ [[ 1 -ge 30 ]]
+ ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ echo 'Command Fail..'
+ (( n++ ))
+ echo 'retry 2 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]'
+ sleep 1
Command Fail..
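The problem was solved by changing the type of the SSH key. Following the documentation in the repository, the generated key was in the OpenSSH format, but it has to be an RSA key pair.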
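A minimal sketch of that fix, assuming a recent OpenSSH (since release 7.8, `ssh-keygen` writes private keys in the OpenSSH format by default unless `-m PEM` is passed). The helper names `generate_pem_rsa_key` and `key_format` are hypothetical, made up for illustration; they are not part of the chart or its documentation.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the stable/horovod chart.

# Generate an RSA key pair whose private key is written in PEM format.
# -t rsa   : key type RSA
# -m PEM   : use the classic PEM encoding instead of the newer OpenSSH one
# -N ""    : empty passphrase, so non-interactive ssh between pods can use it
# -f PATH  : output file; the public key goes to PATH.pub
generate_pem_rsa_key() {
    key_path="$1"
    ssh-keygen -q -t rsa -b 4096 -m PEM -N "" -f "$key_path"
}

# Report the private-key format by inspecting the first line of the file.
key_format() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        "-----BEGIN RSA PRIVATE KEY-----")     echo "pem-rsa" ;;
        "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh" ;;
        *)                                     echo "unknown" ;;
    esac
}
```

For example, `generate_pem_rsa_key ./id_rsa && key_format ./id_rsa` lets you check the new key before putting it into the values file passed to `helm install`. If `key_format` reports `openssh` for the key you already have, that matches the situation described above.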