Deploying the Horovod Helm Chart on a Kubernetes cluster

Problem description

I have been trying to deploy Horovod in my K8s cluster with the Helm chart stable/horovod, following the instructions in that chart's repository.
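The install command itself is not shown in the question; judging from the release name in the output below, it would have been roughly the following (a sketch, assuming the deprecated stable repo is reachable at its archive URL and that a values.yaml with the SSH material already exists):

    # Add the archived stable repo and install the chart as release "mnist".
    helm repo add stable https://charts.helm.sh/stable
    helm repo update
    helm install mnist stable/horovod -f values.yaml

Helm reported: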

WARNING: This chart is deprecated
W0417 12:32:47.954840 1565257 warnings.go:70] unknown field "spec.template.spec.selector"
NAME: mnist
LAST DEPLOYED: Wed Apr 17 12:32:47 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:

1. Get the application URL by running these commands:

    *** NOTE: It may take a few minutes for the statefulset to be available
    *** you can watch the status of statefulset by running 'kubectl get sts --namespace default -w mnist-horovod' ***

When I run:

$ kubectl get pods

NAME                  READY   STATUS     RESTARTS   AGE
mnist-horovod-0       0/1     Running    0          6s
mnist-horovod-1       0/1     Running    0          6s
mnist-horovod-kqlmr   0/1     Init:0/1   0          6s
$ k logs mnist-horovod-kqlmr:

Defaulted container "horovod-master" out of: horovod-master, wait-workers (init)
Error from server (BadRequest): container "horovod-master" in pod "mnist-horovod-kqlmr" is waiting to start: PodInitializing
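kubectl logs defaulted to the horovod-master container, which has not started yet because the init container is still running. The init container's logs can be requested explicitly with -c (pod name taken from the listing above):

    # Logs of the wait-workers init container in the master pod.
    kubectl logs mnist-horovod-kqlmr -c wait-workers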

Output of kubectl describe:

$ kubectl describe pod mnist-horovod-0 mnist-horovod-1 mnist-horovod-2xl72:
    Name:             mnist-horovod-0
    Namespace:        default
    Priority:         0
    Service Account:  default
    Node:             node4/192.168.0.49
    Start Time:       Wed, 17 Apr 2024 13:20:01 +0000
    Labels:           app=horovod
                      apps.kubernetes.io/pod-index=0
                      chart=horovod-1.0.2
                      controller-revision-hash=mnist-horovod-7d8684f974
                      heritage=Helm
                      release=mnist
                      role=worker
                      statefulset.kubernetes.io/pod-name=mnist-horovod-0
    Annotations:      cni.projectcalico.org/containerID: 75f2f508ea805325810a2ff89656aa7ddd0cda2e54552125adbaa09bb4f0d7f5
                      cni.projectcalico.org/podIP: 10.233.74.110/32
                      cni.projectcalico.org/podIPs: 10.233.74.110/32
    Status:           Running
    IP:               10.233.74.110
    IPs:
      IP:           10.233.74.110
    Controlled By:  StatefulSet/mnist-horovod
    Containers:
      worker:
        Container ID:  containerd://cfdad30d2a9d8b71ebcc1cc0e9d7c63020d0f9e665257b3d3be247808cd539e5
        Image:         uber/horovod:0.12.1-tf1.8.0-py3.5
        Image ID:      docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
        Port:          22/TCP
        Host Port:     0/TCP
        Command:
          /horovod/generated/run.sh
        State:          Running
          Started:      Wed, 17 Apr 2024 13:20:02 +0000
        Ready:          False
        Restart Count:  0
        Limits:
          nvidia.com/gpu:  1
        Requests:
          nvidia.com/gpu:  1
        Readiness:         exec [/horovod/generated/check.sh] delay=1s timeout=1s period=2s #success=1 #failure=3
        Environment:
          SSHPORT:     22
          USESECRETS:  true
        Mounts:
          /etc/secret-volume from mnist-horovod-secret (ro)
          /horovod/generated from mnist-horovod-cm (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mzjtq (ro)
    Conditions:
      Type                        Status
      PodReadyToStartContainers   True
      Initialized                 True
      Ready                       False
      ContainersReady             False
      PodScheduled                True
    Volumes:
      mnist-horovod-cm:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      mnist-horovod
        Optional:  false
      mnist-horovod-secret:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  mnist-horovod
        Optional:    false
      kube-api-access-mzjtq:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   BestEffort
    Node-Selectors:              <none>
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason     Age   From               Message
      ----     ------     ----  ----               -------
      Normal   Scheduled  104s  default-scheduler  Successfully assigned default/mnist-horovod-0 to node4
      Normal   Pulled     104s  kubelet            Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
      Normal   Created    104s  kubelet            Created container worker
      Normal   Started    104s  kubelet            Started container worker
      Warning  Unhealthy  103s  kubelet            Readiness probe failed: ssh localhost ls
    + ssh localhost ls
    Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.\r
    Permission denied, please try again.\r
    Permission denied, please try again.\r
    Permission denied (publickey,password).
      Warning  Unhealthy  64s (x21 over 102s)  kubelet  Readiness probe failed: ssh localhost ls
    + ssh localhost ls
    Permission denied, please try again.\r
    Permission denied, please try again.\r
    Permission denied (publickey,password).
    
    
    Name:             mnist-horovod-1
    Namespace:        default
    Priority:         0
    Service Account:  default
    Node:             node3/192.168.0.173
    Start Time:       Wed, 17 Apr 2024 13:20:01 +0000
    Labels:           app=horovod
                      apps.kubernetes.io/pod-index=1
                      chart=horovod-1.0.2
                      controller-revision-hash=mnist-horovod-7d8684f974
                      heritage=Helm
                      release=mnist
                      role=worker
                      statefulset.kubernetes.io/pod-name=mnist-horovod-1
    Annotations:      cni.projectcalico.org/containerID: 492b753639b3b9795f88579d3b95021426712d738b11b8d91766afadb781d1b2
                      cni.projectcalico.org/podIP: 10.233.71.41/32
                      cni.projectcalico.org/podIPs: 10.233.71.41/32
    Status:           Running
    IP:               10.233.71.41
    IPs:
      IP:           10.233.71.41
    Controlled By:  StatefulSet/mnist-horovod
    Containers:
      worker:
        Container ID:  containerd://c0bba2a1b8915d382e22e6004ac98cd51b7b8db07ff0e29b41e4fffc60697a96
        Image:         uber/horovod:0.12.1-tf1.8.0-py3.5
        Image ID:      docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
        Port:          22/TCP
        Host Port:     0/TCP
        Command:
          /horovod/generated/run.sh
        State:          Running
          Started:      Wed, 17 Apr 2024 13:20:02 +0000
        Ready:          False
        Restart Count:  0
        Limits:
          nvidia.com/gpu:  1
        Requests:
          nvidia.com/gpu:  1
        Readiness:         exec [/horovod/generated/check.sh] delay=1s timeout=1s period=2s #success=1 #failure=3
        Environment:
          SSHPORT:     22
          USESECRETS:  true
        Mounts:
          /etc/secret-volume from mnist-horovod-secret (ro)
          /horovod/generated from mnist-horovod-cm (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xgxqx (ro)
    Conditions:
      Type                        Status
      PodReadyToStartContainers   True
      Initialized                 True
      Ready                       False
      ContainersReady             False
      PodScheduled                True
    Volumes:
      mnist-horovod-cm:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      mnist-horovod
        Optional:  false
      mnist-horovod-secret:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  mnist-horovod
        Optional:    false
      kube-api-access-xgxqx:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   BestEffort
    Node-Selectors:              <none>
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason     Age   From               Message
      ----     ------     ----  ----               -------
      Normal   Scheduled  104s  default-scheduler  Successfully assigned default/mnist-horovod-1 to node3
      Normal   Pulled     104s  kubelet            Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
      Normal   Created    104s  kubelet            Created container worker
      Normal   Started    104s  kubelet            Started container worker
      Warning  Unhealthy  103s  kubelet            Readiness probe failed: ssh localhost ls
    + ssh localhost ls
    Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.\r
    Permission denied, please try again.\r
    Permission denied, please try again.\r
    Permission denied (publickey,password).
      Warning  Unhealthy  64s (x21 over 102s)  kubelet  Readiness probe failed: ssh localhost ls
    + ssh localhost ls
    Permission denied, please try again.\r
    Permission denied, please try again.\r
    Permission denied (publickey,password).
    
    
    Name:             mnist-horovod-2xl72
    Namespace:        default
    Priority:         0
    Service Account:  default
    Node:             node2/192.168.0.199
    Start Time:       Wed, 17 Apr 2024 13:19:41 +0000
    Labels:           app=horovod
                      batch.kubernetes.io/controller-uid=e8521894-6539-41c7-8be0-39bdec4c94e3
                      batch.kubernetes.io/job-name=mnist-horovod
                      controller-uid=e8521894-6539-41c7-8be0-39bdec4c94e3
                      job-name=mnist-horovod
                      release=mnist
                      role=master
    Annotations:      cni.projectcalico.org/containerID: 1dd0097ebd4cef2ff065efe74474dcc07225cc922acf0894223240b04c3c9e65
                      cni.projectcalico.org/podIP: 10.233.75.49/32
                      cni.projectcalico.org/podIPs: 10.233.75.49/32
    Status:           Pending
    IP:               10.233.75.49
    IPs:
      IP:           10.233.75.49
    Controlled By:  Job/mnist-horovod
    Init Containers:
      wait-workers:
        Container ID:  containerd://3041cf84383b791dc87081cd3c5d89f059e54c294dacdce70ff56014eec2e9cc
        Image:         uber/horovod:0.12.1-tf1.8.0-py3.5
        Image ID:      docker.io/uber/horovod@sha256:8aa5c5aeae6c12aec8a8748f215908c3178439f932d14fb4e3427f9bdd984d4d
        Port:          <none>
        Host Port:     <none>
        Command:
          /horovod/generated/waitWorkersReady.sh
        Args:
          /horovod/generated/hostfile
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Wed, 17 Apr 2024 13:20:56 +0000
          Finished:     Wed, 17 Apr 2024 13:21:27 +0000
        Ready:          False
        Restart Count:  2
        Environment:
          SSHPORT:     22
          USESECRETS:  true
        Mounts:
          /etc/secret-volume from mnist-horovod-secret (ro)
          /horovod/generated from mnist-horovod-cm (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6n5w (ro)
    Containers:
      horovod-master:
        Container ID:
        Image:         uber/horovod:0.12.1-tf1.8.0-py3.5
        Image ID:
        Port:          22/TCP
        Host Port:     0/TCP
        Command:
          /horovod/generated/run.sh
        Args:
          mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Limits:
          nvidia.com/gpu:  1
        Requests:
          nvidia.com/gpu:  1
        Environment:
          SSHPORT:     22
          USESECRETS:  true
        Mounts:
          /etc/secret-volume from mnist-horovod-secret (ro)
          /horovod/generated from mnist-horovod-cm (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6n5w (ro)
    Conditions:
      Type                        Status
      PodReadyToStartContainers   True
      Initialized                 False
      Ready                       False
      ContainersReady             False
      PodScheduled                True
    Volumes:
      mnist-horovod-cm:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      mnist-horovod
        Optional:  false
      mnist-horovod-secret:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  mnist-horovod
        Optional:    false
      kube-api-access-v6n5w:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   BestEffort
    Node-Selectors:              <none>
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason     Age                 From               Message
      ----     ------     ----                ----               -------
      Normal   Scheduled  2m4s                default-scheduler  Successfully assigned default/mnist-horovod-2xl72 to node2
      Normal   Pulled     50s (x3 over 2m5s)  kubelet            Container image "uber/horovod:0.12.1-tf1.8.0-py3.5" already present on machine
      Normal   Created    50s (x3 over 2m5s)  kubelet            Created container wait-workers
      Normal   Started    50s (x3 over 2m4s)  kubelet            Started container wait-workers
      Warning  BackOff    5s (x3 over 62s)    kubelet            Back-off restarting failed container wait-workers in pod mnist-horovod-2xl72_default(28bb853c-c6f5-4ece-aa10-46c6f8ad6e23)
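Both failures have the same cause: the workers' readiness probe and the master's wait-workers init container each try to SSH between the pods and are rejected with "Permission denied (publickey,password)". One way to reproduce the rejection by hand, assuming the worker image provides a shell and an ssh client (it runs shell scripts and the probe itself calls ssh, so it should), is:

    # Run the readiness probe's check manually with verbose SSH output to see
    # which keys the client offers and why the server refuses them.
    kubectl exec mnist-horovod-0 -- ssh -v -o ConnectTimeout=2 localhost ls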

wait-workers container logs:

function updateSSHPort() {
  mkdir -p /root/.ssh
  rm -f /root/.ssh/config
  touch /root/.ssh/config

  if [ -n "$SSHPORT" ]; then
    echo "Port $SSHPORT" > /root/.ssh/config
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
  fi
}

function runCheckSSH() {
  if [[ "$USESECRETS" == "true" ]];then
    set +e
    yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
    yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
    set -e
  fi

  for i in `cat $1 | awk '{print $(1)}'`;do
    if [[ "$i" != *"master" ]];then
      retry 30 ssh -o ConnectTimeout=2 -q $i exit
    fi
  done
}

function retry()
{
    local n=0;local try=$1
    local cmd="${@: 2}"
    [[ $# -le 1 ]] && {
        echo "Usage $0 <retry_number> <Command>";
    }
    set +e
    until [[ $n -ge $try ]]
    do
      $cmd && break || {
              echo "Command Fail.."
              ((n++))
              echo "retry $n :: [$cmd]"
              sleep 1;
              }
    done
    $cmd
    if [ $? -ne 0 ]; then
      exit 1
    fi
    set -e
}
updateSSHPort
+ updateSSHPort
+ mkdir -p /root/.ssh
+ rm -f /root/.ssh/config
+ touch /root/.ssh/config
+ '[' -n 22 ']'
+ echo 'Port 22'
+ echo 'StrictHostKeyChecking no'
runCheckSSH $1
+ runCheckSSH /horovod/generated/hostfile
+ [[ true == \t\r\u\e ]]
+ set +e
+ yes
+ cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
+ yes
+ cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
+ set -e
cat $1 | awk '{print $(1)}'
++ cat /horovod/generated/hostfile
++ awk '{print $(1)}'
+ for i in '`cat $1 | awk '\''{print $(1)}'\''`'
+ [[ mnist-horovod-master != *\m\a\s\t\e\r ]]
+ for i in '`cat $1 | awk '\''{print $(1)}'\''`'
+ [[ mnist-horovod-0.mnist-horovod != *\m\a\s\t\e\r ]]
+ retry 30 ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ local n=0
+ local try=30
+ local 'cmd=ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit'
+ [[ 7 -le 1 ]]
+ set +e
+ [[ 0 -ge 30 ]]
+ ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ echo 'Command Fail..'
+ (( n++ ))
+ echo 'retry 1 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]'
+ sleep 1
Command Fail..
retry 1 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]
+ [[ 1 -ge 30 ]]
+ ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit
+ echo 'Command Fail..'
+ (( n++ ))
+ echo 'retry 2 :: [ssh -o ConnectTimeout=2 -q mnist-horovod-0.mnist-horovod exit]'
+ sleep 1
Command Fail..
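The trace ends the same way on every attempt: ssh to mnist-horovod-0.mnist-horovod fails, retry() exhausts its 30 tries, the final ssh fails too, and the script exits 1, which is what keeps the init container in CrashLoopBackOff. A quick way to inspect the key material the chart mounted (secret name from the describe output, file name from the script above) is:

    # First line of the private key inside a worker pod; a key in the newer
    # OpenSSH format prints "-----BEGIN OPENSSH PRIVATE KEY-----" here instead
    # of "-----BEGIN RSA PRIVATE KEY-----".
    kubectl exec mnist-horovod-0 -- head -1 /etc/secret-volume/id_rsa

    # Same check directly against the Secret, assuming its data key matches
    # the mounted file name id_rsa.
    kubectl get secret mnist-horovod -o jsonpath='{.data.id_rsa}' | base64 -d | head -1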
kubernetes kubernetes-helm horovod

1 Answer

The problem was solved by changing the type of the SSH key. Following the documentation in the repository, the generated private key was in the OpenSSH format, but it has to be an RSA (PEM-format) key pair.
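A minimal sketch of that fix, assuming the value names documented in the deprecated stable/horovod chart (ssh.useSecrets, ssh.hostKey, ssh.hostKeyPub); the essential part is -m PEM, which makes ssh-keygen write the private key in the traditional RSA/PEM format instead of its newer OpenSSH default:

    # Generate an RSA key pair in PEM format (recent ssh-keygen versions
    # default to the OpenSSH private-key format, which SSH between the pods
    # rejected here).
    mkdir -p keys
    ssh-keygen -t rsa -m PEM -N "" -f keys/id_rsa

    # Feed the keys to the chart and redeploy; value names follow the chart's
    # README and may need adjusting if a values.yaml is used instead.
    helm upgrade --install mnist stable/horovod \
      --set ssh.useSecrets=true \
      --set-file ssh.hostKey=keys/id_rsa \
      --set-file ssh.hostKeyPub=keys/id_rsa.pub

After redeploying, SSH between the pods authenticates, so both the readiness probe and the wait-workers init container can pass.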
