I have a private GKE cluster in Autopilot mode, running GKE 1.23, as described below. I am trying to install an application from the vendor's Helm chart, following the vendor's instructions, using a script like this:
#!/bin/bash
helm repo add safesoftware https://safesoftware.github.io/helm-charts/
helm repo update
tag="2021.2"
version="safesoftware/fmeserver-$tag"
helm upgrade --install \
  fmeserver \
  "$version" \
  --set fmeserver.image.tag="$tag" \
  --set deployment.hostname="REDACTED" \
  --set deployment.useHostnameIngress=true \
  --set deployment.tlsSecretName="my-ssl-cert" \
  --namespace ingress-nginx --create-namespace
  #--set resources.core.requests.cpu="500m"
  #--set resources.queue.requests.cpu="500m"
However, I get an error from GKE Warden:
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "safesoftware" chart repository
Update Complete. ⎈Happy Helming!⎈
W1201 10:25:08.117532 29886 warnings.go:70] Autopilot increased resource requests for Deployment ingress-nginx/engine-standard-group to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.201656 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/fmeserver-postgresql to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.304755 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/core to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.392965 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/queue to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.480421 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/websocket to meet requirements. See http://g.co/gke/autopilot-resources.
Error: UPGRADE FAILED: cannot patch "core" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'core' cpu requests '{{400 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]} && cannot patch "queue" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'queue' cpu requests '{{250 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]}
So I raised the CPU requests in the resource specs of the offending pods; one way to do that is to uncomment the last two lines of the script:
--set resources.core.requests.cpu="500m" \
--set resources.queue.requests.cpu="500m" \
This lets me install or upgrade the chart, but then the pods report PodUnschedulable with the reason
Cannot schedule pods: Insufficient cpu
Depending on the exact chart changes, I sometimes also see
Cannot schedule pods: node(s) had volume node affinity conflict
I don't know how to increase the pod count or the size of each (e2-medium) node in Autopilot mode, and I can't find a way to bypass these guards. I checked quotas and found no quota issues. I can install other workloads, including ingress-nginx.
I'm not sure what the problem is, and I'm no expert in Helm or Kubernetes.
For reference, the cluster can be described as:
addonsConfig:
  cloudRunConfig:
    disabled: true
    loadBalancerType: LOAD_BALANCER_TYPE_EXTERNAL
  configConnectorConfig: {}
  dnsCacheConfig:
    enabled: true
  gcePersistentDiskCsiDriverConfig:
    enabled: true
  gcpFilestoreCsiDriverConfig:
    enabled: true
  gkeBackupAgentConfig: {}
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
autopilot:
  enabled: true
autoscaling:
  autoprovisioningNodePoolDefaults:
    imageType: COS_CONTAINERD
    management:
      autoRepair: true
      autoUpgrade: true
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    upgradeSettings:
      maxSurge: 1
      strategy: SURGE
  autoscalingProfile: OPTIMIZE_UTILIZATION
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '1000000000'
    resourceType: cpu
  - maximum: '1000000000'
    resourceType: memory
  - maximum: '1000000000'
    resourceType: nvidia-tesla-t4
  - maximum: '1000000000'
    resourceType: nvidia-tesla-a100
binaryAuthorization: {}
clusterIpv4Cidr: 10.102.0.0/21
createTime: '2022-11-30T04:47:19+00:00'
currentMasterVersion: 1.23.12-gke.100
currentNodeCount: 7
currentNodeVersion: 1.23.12-gke.100
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: REDACTED
id: REDACTED
initialClusterVersion: 1.23.12-gke.100
initialNodeCount: 1
instanceGroupUrls: REDACTED
ipAllocationPolicy:
  clusterIpv4Cidr: 10.102.0.0/21
  clusterIpv4CidrBlock: 10.102.0.0/21
  clusterSecondaryRangeName: pods
  servicesIpv4Cidr: 10.103.0.0/24
  servicesIpv4CidrBlock: 10.103.0.0/24
  servicesSecondaryRangeName: services
  stackType: IPV4
  useIpAliases: true
labelFingerprint: '05525394'
legacyAbac: {}
location: europe-west3
locations:
- europe-west3-c
- europe-west3-a
- europe-west3-b
loggingConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
    - WORKLOADS
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: 93731cbd
  window:
    dailyMaintenanceWindow:
      duration: PT4H0M0S
      startTime: 03:00
masterAuth:
masterAuthorizedNetworksConfig:
  cidrBlocks:
  enabled: true
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
monitoringService: monitoring.googleapis.com/kubernetes
name: gis-cluster-uat
network: geo-nw-uat
networkConfig:
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS_CONTAINERD
  machineType: e2-medium
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
    enableSecureBoot: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePoolAutoConfig: {}
nodePoolDefaults:
  nodeConfigDefaults:
    loggingConfig:
      variantConfig:
        variant: DEFAULT
nodePools:
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-medium
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: default-pool
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-standard-2
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    reservationAffinity:
      consumeReservationType: NO_RESERVATION
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: nap-1rrw9gqf
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
notificationConfig:
  pubsub: {}
privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  masterIpv4CidrBlock: 192.168.0.0/28
  peeringName: gke-nf69df7b6242412e9932-582a-f600-peer
  privateEndpoint: 192.168.0.2
  publicEndpoint: REDACTED
releaseChannel:
  channel: REGULAR
resourceLabels:
  environment: uat
selfLink: REDACTED
servicesIpv4Cidr: 10.103.0.0/24
shieldedNodes:
  enabled: true
status: RUNNING
subnetwork: redacted
verticalPodAutoscaling:
  enabled: true
workloadIdentityConfig:
  workloadPool: REDACTED
zone: europe-west3
Edit: adding the pod describe logs.
kubectl describe pod core -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 6m49s (x86815 over 3d22h) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 110s (x13994 over 3d23h) kubelet Back-off restarting failed container
kubectl describe pod queue -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 9m29s (x18130 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match pod affinity rules, 3 node(s) had volume node affinity conflict
Normal NotTriggerScaleUp 4m28s (x24992 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict, 2 node(s) didn't match pod affinity rules
Warning FailedScheduling 3m33s (x3385 over 2d14h) gke.io/optimize-utilization-scheduler 0/7 nodes are available: 1 node(s) had volume node affinity conflict, 6 Insufficient cpu.
After a while, I resolved these scheduling problems with the following strategies.
If you see:
Cannot schedule pods: Insufficient cpu.
it means you need to raise the pods' CPU requests to match Autopilot's minimums.
If you can't find CPU settings that work for your deployment, consider changing the pods' compute class to Balanced.
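Autopilot selects a compute class through the cloud.google.com/compute-class node selector on the pod template. A minimal sketch of requesting the Balanced class for the core StatefulSet, assuming the chart does not expose a nodeSelector value (the patch-file approach and the object names here are one option; adapt them to your release):

```shell
# Write a strategic-merge patch that asks Autopilot for the Balanced
# compute class via its node selector on the pod template.
cat > balanced-patch.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/compute-class: Balanced
EOF

# Apply it to the StatefulSet that keeps getting rejected (requires
# cluster access, so it is left commented out here):
# kubectl patch statefulset core -n ingress-nginx --patch-file balanced-patch.yaml
```

Note that moving to the Balanced class changes Autopilot's minimum and billed resources for those pods.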
If you see:
volume node affinity conflict,
remember that Autopilot clusters are regional (not zonal), while most storage types are either zonal or (if redundant) run in only two zones. Your region may have more than two zones, with a pod in each zone needing storage. To get around this I set up a costly NFS share (Google Filestore). The alternative is to configure your deployment so that pods are scheduled only in the zone(s) where the zonal storage lives, trading a little redundancy for lower cost.
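Pinning pods to the zone of a zonal PersistentVolume can be done with a topology.kubernetes.io/zone node selector. A sketch, assuming the queue StatefulSet and europe-west3-a as the volume's zone (both are illustrative; check your PVs' node affinity first):

```shell
# Find which zone each PersistentVolume was provisioned in (requires
# cluster access, left commented out here):
# kubectl get pv -o custom-columns=NAME:.metadata.name,AFFINITY:.spec.nodeAffinity

# Patch the workload so its pods only schedule in that zone:
cat > zone-pin-patch.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: europe-west3-a
EOF
# kubectl patch statefulset queue -n ingress-nginx --patch-file zone-pin-patch.yaml
```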