I am trying to deploy Cluster Autoscaler on AWS EKS with multiple node groups. The nodes are in public subnets and have internet connectivity, but the Cluster Autoscaler deployment (Multi ASG) keeps failing and the pod keeps restarting with the following error:
I0323 14:09:02.441010 1 auto_scaling_groups.go:138] Registering ASG eks-2ab883a6-97e1-c240-5d22-5a87384ef2fe
I0323 14:09:02.441022 1 auto_scaling_groups.go:354] Regenerating instance to ASG map for ASGs: [eks-1ab883a6-97e6-5d39-89b2-ceaa807bd403 eks-2ab883a6-97e1-c240-5d22-5a87384ef2fe]
I0323 14:09:02.441602 1 reflector.go:123] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.441630 1 reflector.go:161] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.441956 1 reflector.go:123] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.441973 1 reflector.go:161] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442239 1 reflector.go:123] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442266 1 reflector.go:161] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442478 1 reflector.go:123] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442499 1 reflector.go:161] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442710 1 reflector.go:123] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.442725 1 reflector.go:161] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443103 1 reflector.go:123] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443119 1 reflector.go:161] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443379 1 reflector.go:123] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443394 1 reflector.go:161] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443652 1 reflector.go:123] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443669 1 reflector.go:161] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443881 1 reflector.go:123] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.443901 1 reflector.go:161] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.540405 1 reflector.go:123] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:132
I0323 14:09:02.540446 1 reflector.go:161] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:132
E0323 14:11:03.122943 1 aws_manager.go:259] Failed to regenerate ASG cache: RequestError: send request failed
caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout
F0323 14:11:03.122980 1 aws_cloud_provider.go:330] Failed to create AWS Manager: RequestError: send request failed
caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout
All ASGs have the tags required for auto-discovery. I have also tried the Cluster Autoscaler Auto-Discovery deployment, which shows a similar problem.
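The `dial tcp: i/o timeout` against `autoscaling.us-east-2.amazonaws.com` means the pod cannot reach the Auto Scaling API at all, which points at routing (no NAT gateway or route to an internet gateway) rather than IAM. If the nodes are meant to run without internet egress, one option is an interface VPC endpoint for the Auto Scaling service. A minimal Terraform sketch; `module.vpc` outputs and the `vpc_endpoints` security group are assumptions to be replaced with your own values:

```hcl
# Interface endpoint so nodes can reach the Auto Scaling API without
# traversing the internet. vpc_id, subnet_ids and security_group_ids are
# placeholders -- substitute the values from your own VPC configuration.
resource "aws_vpc_endpoint" "autoscaling" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-2.autoscaling"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```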
I am running into a similar problem on Kubernetes server version v1.27.12.
Note: enable_irsa = true (the IAM role for the service account is already set up).
I0427 07:20:55.257150 1 request.go:629] Waited for 2.586240782s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/apps/v1/replicasets?limit=500&resourceVer
I0427 07:20:55.457738 1 request.go:629] Waited for 2.786763436s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/apps/v1/statefulsets?limit=500&resourceVe
I0427 07:20:55.657646 1 request.go:629] Waited for 2.986640188s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/api/v1/persistentvolumeclaims?limit=500&resour
I0427 07:20:55.857877 1 request.go:629] Waited for 3.178892262s due to client-side throttling, not priority and fairness, request: PUT:https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/clust
I0427 07:20:55.857911 1 request.go:697] Waited for 3.178892262s due to client-side throttling, not priority and fairness, request: PUT:https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/clust
I0427 07:20:55.865643 1 cloud_provider_builder.go:29] Building aws cloud provider.
I0427 07:20:57.077454 1 aws_cloud_provider.go:432] Using static instance type g2.2xlarge
I0427 07:20:57.077528 1 aws_cloud_provider.go:432] Using static instance type p4de.24xlarge
I0427 07:20:57.077543 1 aws_cloud_provider.go:432] Using static instance type g2.8xlarge
I0427 07:20:57.077584 1 aws_cloud_provider.go:432] Using static instance type cc2.8xlarge
I0427 07:20:57.077642 1 aws_cloud_provider.go:442] Successfully load 780 EC2 Instance Types [vt1.6xlarge g5.xlarge m6g.16xlarge c6gd.medium c6gd.large i3en.6xlarge m5a.24xlarge m5d.8xlarge t1.micro hpc7g.
I0427 07:20:57.077910 1 auto_scaling_groups.go:393] Regenerating instance to ASG map for ASG names: []
I0427 07:20:57.077919 1 auto_scaling_groups.go:400] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/Advance-k8s-training-ak: k8s.io/cluster-autoscaler/enabled:]
E0427 07:20:57.108907 1 aws_manager.go:126] Failed to regenerate ASG cache: AccessDenied: User: arn:aws:sts::851725651259:assumed-role/AK8s-nodegroup-ak-eks-node-group-2024042610082023950000000c/i-08c1aef
status code: 403, request id: 716e42dc-ffe8-4d6c-9ac3-a872fcff2a7f
F0427 07:20:57.108946 1 aws_cloud_provider.go:447] Failed to create AWS Manager: AccessDenied: User: arn:aws:sts::851725651259:assumed-role/AK8s-nodegroup-ak-eks-node-group-2024042610082023950000000c/i-08
status code: 403, request id: 716e42dc-ffe8-4d6c-9ac3-a872fcff2a7f
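The `AccessDenied` on the ASG cache regeneration shows the pod is making the call as the node group's instance role (`assumed-role/AK8s-nodegroup-...`), not as the IRSA role, and that role lacks the Auto Scaling permissions. Whichever role the autoscaler ends up using needs roughly the policy below, which follows the actions commonly documented for Cluster Autoscaler on AWS; the resource name `cluster_autoscaler` and the policy name are placeholders:

```hcl
# Minimal permissions typically granted to the Cluster Autoscaler's role.
# Resource and policy names are placeholders, not from the original post.
resource "aws_iam_policy" "cluster_autoscaler" {
  name = "ClusterAutoscalerPolicy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ]
      Resource = "*"
    }]
  })
}
```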
Terraform version:
Terraform v1.8.2
on darwin_arm64
resource "helm_release" "cluster-autoscaler" {
  name       = "ca"
  namespace  = "kube-system"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  version    = "9.34.0" # chart versions carry no "v" prefix

  # Terraform keeps this in state, so we get it automatically!
  set {
    name  = "cloudProvider" # was misspelled "cloudProvder"
    value = "aws"
  }
  set {
    name  = "awsRegion"
    value = var.aws_region
  }
  set {
    name  = "autoDiscovery.clusterName"
    value = module.eks.cluster_name
  }
  set {
    name  = "autoDiscovery.enabled"
    value = "true"
  }
  set {
    name  = "rbac.create"
    value = "true"
  }
  set {
    name  = "rbac.serviceAccount.create"
    value = "true"
  }
  # In chart 9.x the service account settings live under rbac.serviceAccount,
  # not a top-level serviceAccount key.
  set {
    name  = "rbac.serviceAccount.name"
    value = var.service_account_name_autoscaler
  }
  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.AmazonEKSClusterAutoScalerrRole.arn
  }
}
If you are using the EKS Terraform module, you only need to add the following to enable IRSA:

enable_irsa = true
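Note that `enable_irsa = true` only creates the cluster's OIDC provider; you still need an IAM role whose trust policy lets the autoscaler's service account assume it, and that role's ARN is what goes into the `role-arn` annotation. A sketch using the IRSA submodule from terraform-aws-modules; the module name, version pin, and role name here are assumptions to adapt:

```hcl
# Sketch: IRSA role for the Cluster Autoscaler service account via the
# terraform-aws-modules IRSA submodule. Names and the version pin are
# assumptions -- check the module docs for the inputs of your version.
module "cluster_autoscaler_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name                        = "AmazonEKSClusterAutoscalerRole"
  attach_cluster_autoscaler_policy = true
  cluster_autoscaler_cluster_names = [module.eks.cluster_name]

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:${var.service_account_name_autoscaler}"]
    }
  }
}
```

The role produced this way (`module.cluster_autoscaler_irsa.iam_role_arn`) is what the helm_release's service-account annotation should reference.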