How do I set up the Ray head node in high-availability mode using the KubeRay Helm chart?

I am deploying a Ray cluster with the KubeRay Helm chart. My deployment code is:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.43.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true

I read the documentation on GCS fault tolerance in KubeRay. I think I need to set gcsFaultToleranceOptions, but I could not find how to set it in the Helm chart.
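Assuming I already have a highly available Valkey / Redis cluster, how do I set up the Ray head node in high-availability mode using the Helm chart?

I found a similar question posted about 4 years ago at https://discuss.ray.io/t/high-availability-for-head-node-node-node-node-of-ray-clusters/2157, but there was no solution at the time.

Any guidance would be appreciated. Thanks!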

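Update: Ruei A (from Ray) confirmed that the Helm chart does not currently support setting up the Ray head node in "high availability" mode. I have opened a ticket at https://github.com/ray-project/kuberay-helm/issues/55 and will update this answer if anything changes.

In the meantime, until the Helm chart supports it, here is my Kubernetes YAML file, which enables Global Control Service (GCS) fault tolerance using Valkey (similar to Redis):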

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 15000m
                  memory: 60Gi
                limits:
                  cpu: 15000m
                  memory: 60Gi

Note that enabling Global Control Service (GCS) fault tolerance only makes the Ray job history persist after the head node restarts.
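Since the Helm chart values offer no place for this setting, it is easy to apply a RayCluster manifest that silently lacks it. Below is a minimal sketch of a pre-apply check, assuming the manifest has already been loaded into a Python dict (for example by a YAML parser). The function name check_gcs_fault_tolerance is hypothetical, and the check only looks at the two fields used in the manifest above; it does not verify that Valkey is reachable.

```python
def check_gcs_fault_tolerance(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the fields are present."""
    problems = []
    if manifest.get("kind") != "RayCluster":
        problems.append("kind is not RayCluster")

    options = manifest.get("spec", {}).get("gcsFaultToleranceOptions")
    if not options:
        problems.append("spec.gcsFaultToleranceOptions is missing")
        return problems

    # The external Redis/Valkey address is what GCS writes its state to.
    if not options.get("redisAddress"):
        problems.append("redisAddress is empty")

    # If the password comes from a Secret, both the name and the key are needed.
    secret_ref = (
        options.get("redisPassword", {}).get("valueFrom", {}).get("secretKeyRef")
    )
    if secret_ref is not None:
        for field in ("name", "key"):
            if not secret_ref.get(field):
                problems.append(f"redisPassword secretKeyRef.{field} is empty")

    return problems
```

An empty result only means the fields are filled in, not that fault tolerance actually works in the cluster.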
ray kuberay
1 Answer
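I think this advice makes sense:

You should think of Ray clusters as essentially ephemeral. In production scenarios, anything that uses Ray should be wrapped in external retries and backed by a durable external store.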

As of today, I still know of no solution that makes the Ray head node highly available without interrupting running jobs. 😔
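The "external retries plus durable external store" pattern can be sketched roughly as follows. This is only an illustration using the Python standard library: run_with_retries, load_state, and save_state are hypothetical names, the job callable stands in for whatever code submits work to Ray, and a local JSON file stands in for the durable store (in practice that would more likely be object storage or a database living outside the Ray cluster).

```python
import json
import os
import time
from pathlib import Path
from typing import Callable

State = dict
SaveFn = Callable[[State], None]


def load_state(path: Path) -> State:
    """Read the last checkpoint, or start empty if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_state(path: Path, state: State) -> None:
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_with_retries(
    job: Callable[[State, SaveFn], State],
    checkpoint: Path,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> State:
    """Run `job` until it succeeds, resuming from the last saved checkpoint.

    `job` receives the last saved state and a `save` callback that it should
    call after each completed unit of work.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Reload on every attempt: the previous attempt may have made progress.
        state = load_state(checkpoint)
        try:
            result = job(state, lambda s: save_state(checkpoint, s))
            save_state(checkpoint, result)
            return result
        except Exception as error:  # e.g. the head node went away mid-job
            last_error = error
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error
```

The key design choice is that progress lives outside the cluster, so a head node restart costs only the work done since the last checkpoint rather than the whole job.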
