I have a private GKE cluster in Autopilot mode, running GKE 1.23, as described below. I am trying to install an application from the vendor's Helm chart, following the vendor's instructions, using a script like this:
#!/bin/bash
helm repo add safesoftware https://safesoftware.github.io/helm-charts/
helm repo update
tag="2021.2"
version="safesoftware/fmeserver-$tag"
helm upgrade --install \
  fmeserver \
  "$version" \
  --set fmeserver.image.tag="$tag" \
  --set deployment.hostname="REDACTED" \
  --set deployment.useHostnameIngress=true \
  --set deployment.tlsSecretName="my-ssl-cert" \
  --namespace ingress-nginx --create-namespace
  #--set resources.core.requests.cpu="500m"
  #--set resources.queue.requests.cpu="500m"
However, I get an error from GKE Warden:
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "safesoftware" chart repository
Update Complete. ⎈Happy Helming!⎈
W1201 10:25:08.117532 29886 warnings.go:70] Autopilot increased resource requests for Deployment ingress-nginx/engine-standard-group to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.201656 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/fmeserver-postgresql to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.304755 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/core to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.392965 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/queue to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.480421 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/websocket to meet requirements. See http://g.co/gke/autopilot-resources.
Error: UPGRADE FAILED: cannot patch "core" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'core' cpu requests '{{400 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]} && cannot patch "queue" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'queue' cpu requests '{{250 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]}
So I raised the CPU requests in the resource specs of the offending pods; one way to do that is to uncomment the last two lines of the script:
--set resources.core.requests.cpu="500m" \
--set resources.queue.requests.cpu="500m" \
This lets me install or upgrade the chart, but then the pods report PodUnschedulable with the reason
Cannot schedule pods: Insufficient cpu
Depending on the exact chart changes, I sometimes also see
Cannot schedule pods: node(s) had volume node affinity conflict
I don't know how to increase the pod count or the size of each (e2-medium) node in Autopilot mode, and I can't find a way to bypass these guards. I checked quotas and found no quota issues. I can install other workloads, including ingress-nginx.
I'm not sure what the problem is, and I'm no expert in Helm or Kubernetes.
For reference, the cluster can be described as:
addonsConfig:
  cloudRunConfig:
    disabled: true
    loadBalancerType: LOAD_BALANCER_TYPE_EXTERNAL
  configConnectorConfig: {}
  dnsCacheConfig:
    enabled: true
  gcePersistentDiskCsiDriverConfig:
    enabled: true
  gcpFilestoreCsiDriverConfig:
    enabled: true
  gkeBackupAgentConfig: {}
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
autopilot:
  enabled: true
autoscaling:
  autoprovisioningNodePoolDefaults:
    imageType: COS_CONTAINERD
    management:
      autoRepair: true
      autoUpgrade: true
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    upgradeSettings:
      maxSurge: 1
      strategy: SURGE
  autoscalingProfile: OPTIMIZE_UTILIZATION
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '1000000000'
    resourceType: cpu
  - maximum: '1000000000'
    resourceType: memory
  - maximum: '1000000000'
    resourceType: nvidia-tesla-t4
  - maximum: '1000000000'
    resourceType: nvidia-tesla-a100
binaryAuthorization: {}
clusterIpv4Cidr: 10.102.0.0/21
createTime: '2022-11-30T04:47:19+00:00'
currentMasterVersion: 1.23.12-gke.100
currentNodeCount: 7
currentNodeVersion: 1.23.12-gke.100
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: REDACTED
id: REDACTED
initialClusterVersion: 1.23.12-gke.100
initialNodeCount: 1
instanceGroupUrls: REDACTED
ipAllocationPolicy:
  clusterIpv4Cidr: 10.102.0.0/21
  clusterIpv4CidrBlock: 10.102.0.0/21
  clusterSecondaryRangeName: pods
  servicesIpv4Cidr: 10.103.0.0/24
  servicesIpv4CidrBlock: 10.103.0.0/24
  servicesSecondaryRangeName: services
  stackType: IPV4
  useIpAliases: true
labelFingerprint: '05525394'
legacyAbac: {}
location: europe-west3
locations:
- europe-west3-c
- europe-west3-a
- europe-west3-b
loggingConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
    - WORKLOADS
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: 93731cbd
  window:
    dailyMaintenanceWindow:
      duration: PT4H0M0S
      startTime: 03:00
masterAuth:
masterAuthorizedNetworksConfig:
  cidrBlocks:
  enabled: true
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
monitoringService: monitoring.googleapis.com/kubernetes
name: gis-cluster-uat
network: geo-nw-uat
networkConfig:
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS_CONTAINERD
  machineType: e2-medium
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
    enableSecureBoot: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePoolAutoConfig: {}
nodePoolDefaults:
  nodeConfigDefaults:
    loggingConfig:
      variantConfig:
        variant: DEFAULT
nodePools:
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-medium
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: default-pool
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-standard-2
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    reservationAffinity:
      consumeReservationType: NO_RESERVATION
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: nap-1rrw9gqf
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
notificationConfig:
  pubsub: {}
privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  masterIpv4CidrBlock: 192.168.0.0/28
  peeringName: gke-nf69df7b6242412e9932-582a-f600-peer
  privateEndpoint: 192.168.0.2
  publicEndpoint: REDACTED
releaseChannel:
  channel: REGULAR
resourceLabels:
  environment: uat
selfLink: REDACTED
servicesIpv4Cidr: 10.103.0.0/24
shieldedNodes:
  enabled: true
status: RUNNING
subnetwork: redacted
verticalPodAutoscaling:
  enabled: true
workloadIdentityConfig:
  workloadPool: REDACTED
zone: europe-west3
Edit: adding the pod describe logs.
kubectl describe pod core -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 6m49s (x86815 over 3d22h) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 110s (x13994 over 3d23h) kubelet Back-off restarting failed container
kubectl describe pod queue -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 9m29s (x18130 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match pod affinity rules, 3 node(s) had volume node affinity conflict
Normal NotTriggerScaleUp 4m28s (x24992 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict, 2 node(s) didn't match pod affinity rules
Warning FailedScheduling 3m33s (x3385 over 2d14h) gke.io/optimize-utilization-scheduler 0/7 nodes are available: 1 node(s) had volume node affinity conflict, 6 Insufficient cpu.
After a while, I resolved these scheduling problems with the following strategies.
If you see:
Cannot schedule pods: Insufficient cpu.
it means you need to raise the pods' CPU requests to match Autopilot's minimums.
If you can't find CPU settings that work for your deployment, consider changing the pods' compute class to Balanced.
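Autopilot selects a compute class through the cloud.google.com/compute-class node selector on the pod template. A minimal sketch of requesting the Balanced class for the core StatefulSet, assuming the chart does not expose a nodeSelector value (the patch-file approach and the object names here are one option; adapt them to your release):

```shell
# Write a strategic-merge patch that asks Autopilot for the Balanced
# compute class via its node selector on the pod template.
cat > balanced-patch.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/compute-class: Balanced
EOF

# Apply it to the StatefulSet that keeps getting rejected (requires
# cluster access, so it is left commented out here):
# kubectl patch statefulset core -n ingress-nginx --patch-file balanced-patch.yaml
```

Note that moving to the Balanced class changes Autopilot's minimum and billed resources for those pods.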
If you see:
volume node affinity conflict,
remember that Autopilot clusters are regional (not zonal), while most storage types are either zonal or (if redundant) run in only two zones. Your region may have more than two zones, with a pod in each zone needing storage. To get around this I set up a costly NFS share (Google Filestore). The alternative is to configure your deployment so that pods are scheduled only in the zone(s) where the zonal storage lives, trading a little redundancy for lower cost.
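Pinning pods to the zone of a zonal PersistentVolume can be done with a topology.kubernetes.io/zone node selector. A sketch, assuming the queue StatefulSet and europe-west3-a as the volume's zone (both are illustrative; check your PVs' node affinity first):

```shell
# Find which zone each PersistentVolume was provisioned in (requires
# cluster access, left commented out here):
# kubectl get pv -o custom-columns=NAME:.metadata.name,AFFINITY:.spec.nodeAffinity

# Patch the workload so its pods only schedule in that zone:
cat > zone-pin-patch.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: europe-west3-a
EOF
# kubectl patch statefulset queue -n ingress-nginx --patch-file zone-pin-patch.yaml
```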