My deployment code is:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        image:
          tag: 2.43.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true
I read about GCS fault tolerance in KubeRay, and I think I need to set the gcsFaultToleranceOptions field. I already have a highly available Valkey / Redis cluster. How can I set up the Ray head node in high availability mode using the Helm chart?
I found a similar question posted about 4 years ago at https://discuss.ray.io/t/high-availability-for-head-node-of-ray-clusters/2157, but there was no solution at that time.
Any guidance would be appreciated. Thanks!
Confirmed with Rueian from the Ray team: the Helm chart does not support setting up the Ray head node in "high availability" mode yet. I have opened a feature request at https://github.com/ray-project/kuberay-helm/issues/55 and will update this answer if there is any progress. In the meantime, until the Helm chart supports it, here is my Kubernetes YAML file that enables Global Control Service (GCS) fault tolerance using Valkey (similar to Redis):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      # Prevent Ray from scheduling tasks and actors on the head node
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 15000m
                  memory: 60Gi
                limits:
                  cpu: 15000m
                  memory: 60Gi
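The RayCluster manifest above reads the Valkey password from a Kubernetes Secret named hm-ray-cluster-secret. For completeness, here is a minimal sketch of what that Secret could look like (the password value below is a placeholder for illustration; in practice it would come from your secret manager):

apiVersion: v1
kind: Secret
metadata:
  name: hm-ray-cluster-secret
  namespace: production-hm-ray-cluster
type: Opaque
stringData:
  # Placeholder value for illustration only; do not commit real credentials
  VALKEY_PASSWORD: change-me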
Note that enabling Global Control Service (GCS) fault tolerance only makes the Ray job history persist after the head node restarts.
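Since the RayCluster is now a plain manifest rather than a Helm chart release, I deploy it with an Argo CD Application that points at a directory source instead of the kuberay-helm chart. A minimal sketch, assuming a hypothetical Git repository URL and path (replace them with wherever your RayCluster manifest lives); the project, destination, and sync policy mirror the original Application above:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
spec:
  project: production-hm
  source:
    # Hypothetical repository and path; replace with the repo that holds the RayCluster manifest
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: main
    path: kubernetes/manifests/production-hm-ray-cluster
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true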