I'm trying to query GPU usage metrics for GKE Pods.
Here is what I tested:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl create -f dcgm-exporter.yaml
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        # resources:
        #   limits:
        #     nvidia.com/gpu: "1"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
If I uncomment
resources: limits: nvidia.com/gpu: "1"
it runs successfully. However, I don't want this pod to consume any GPU, only to observe them.
How can I run dcgm-exporter without allocating a GPU? I also tried Ubuntu nodes, but that failed too.
It works with these two changes:
privileged: true
in the securityContext, and a "nvidia-install-dir-host" hostPath volume that mounts the driver installation from the node. Without a nvidia.com/gpu resource request, the GKE device plugin does not inject the NVIDIA driver libraries into the container, so NVML fails to initialize; mounting them from the host yourself (and running privileged) works around this.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          privileged: true
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "nvidia-install-dir-host"
          mountPath: "/usr/local/nvidia"
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "nvidia-install-dir-host"
        hostPath:
          path: "/home/kubernetes/bin/nvidia"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
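The prometheus.io/* annotations on the Service only take effect if Prometheus is configured with the usual annotation-based service discovery. As a sketch (the job name and relabeling rules below are the conventional pattern, not part of the original post), a matching scrape job might look like:

```yaml
# Hypothetical Prometheus scrape job that honors the prometheus.io/* annotations.
scrape_configs:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose Service has prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Rewrite the scrape address to use the port from prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
```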
I installed dcgm-exporter via helm today; here are my values:
serviceMonitor:
  enabled: true
resources:
  limits:
    cpu: 100m
    # increase if OOM
    memory: 200Mi
  requests:
    cpu: 100m
    memory: 128Mi
securityContext:
  privileged: true
tolerations:
- operator: Exists
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: Exists
# can ignore below
podAnnotations:
  ad.datadoghq.com/exporter.check_names: |
    ["openmetrics"]
  ad.datadoghq.com/exporter.init_configs: |
    [{}]
  ad.datadoghq.com/exporter.instances: |
    [
      {
        "openmetrics_endpoint": "http://%%host%%:9400/metrics",
        "namespace": "nvidia-dcgm-exporter",
        "metrics": [{"*":"*"}]
      }
    ]
extraHostVolumes:
- name: vulkan-icd-mount
  hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
- name: nvidia-install-dir-host
  hostPath: /home/kubernetes/bin/nvidia
extraVolumeMounts:
- name: nvidia-install-dir-host
  mountPath: /usr/local/nvidia
  readOnly: true
- name: vulkan-icd-mount
  mountPath: /etc/vulkan/icd.d
  readOnly: true
I don't think it's necessary to allocate a GPU to dcgm-exporter; I'm just recording my steps here.
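Once the exporter is being scraped, per-GPU usage can be queried in Prometheus. A couple of sketch queries using standard DCGM field names (with DCGM_EXPORTER_KUBERNETES enabled as in the manifests above, the series also carry pod metadata labels; exact label names depend on the exporter version):

```promql
# GPU utilization in percent, per GPU
DCGM_FI_DEV_GPU_UTIL

# Framebuffer memory used (MiB), averaged over 5 minutes
avg_over_time(DCGM_FI_DEV_FB_USED[5m])
```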