We're running into an issue with Google Kubernetes Engine (GKE): the periodic upgrades to new versions disrupt the Pods and containers in the cluster. We understand that upgrades are necessary and expected, but the problem is that they interrupt our services even while requests are actively in flight.
Our setup consists of several services, notably an Express Gateway and multiple interconnected Rails services, chained like this: Ingress -> Express -> Rails1 -> Rails2.
During a GKE upgrade, if a request is in transit from Express to Rails1 and Rails1 is terminated by the upgrade process, the gateway only receives a generic error, with no details and no indication of the underlying problem:
RequestError: Timeout awaiting 'request' for 3000ms
at ClientRequest.<anonymous> (/app/node_modules/got/dist/source/core/index.js:970:65)
at /app/node_modules/@opentelemetry/context-async-hooks/build/src/AbstractAsyncHooksContextManager.js:50:55
at AsyncLocalStorage.run (node:async_hooks:319:14)
at AsyncLocalStorageContextManager.with (/app/node_modules/@opentelemetry/context-async-hooks/build/src/AsyncLocalStorageContextManager.js:33:40)
at ClientRequest.contextWrapper (/app/node_modules/@opentelemetry/context-async-hooks/build/src/AbstractAsyncHooksContextManager.js:50:32)
at Object.onceWrapper (node:events:628:26)
at ClientRequest.emit (node:events:525:35)
at ClientRequest.origin.emit (/app/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)
at TLSSocket.socketErrorListener (node:_http_client:494:9)
at TLSSocket.emit (node:events:513:28)
at emitErrorNT (node:internal/streams/destroy:157:8)
at emitErrorCloseNT (node:internal/streams/destroy:122:3)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
at Timeout.timeoutHandler [as _onTimeout] (/app/node_modules/got/dist/source/core/utils/timed-out.js:36:25)
at listOnTimeout (node:internal/timers:561:11)
at processTimers (node:internal/timers:502:7) {
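Since upgrades will always terminate some pods, one mitigation on the gateway side is to retry idempotent upstream requests when the connection is reset or times out. Below is a minimal sketch of a retry policy in the shape that got v11+ accepts under its `retry` option; the limit, methods, and codes here are illustrative assumptions, not our actual configuration:

```javascript
// Illustrative retry policy in the shape got v11+ expects under `retry`.
// A pod killed mid-request typically surfaces to the client as
// ECONNRESET/ETIMEDOUT, or as a 502/503/504 from the layer in front of it.
const retryPolicy = {
  limit: 2,                                   // retry at most twice
  methods: ['GET', 'HEAD', 'PUT', 'DELETE'],  // idempotent methods only
  errorCodes: ['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'EPIPE'],
  statusCodes: [502, 503, 504],
};

// With got installed, this would be applied roughly as:
//   const client = got.extend({ timeout: { request: 3000 }, retry: retryPolicy });
module.exports = retryPolicy;
```

Restricting retries to idempotent methods matters here: a POST that was already processed by Rails1 before the pod died should not be replayed blindly.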
We've tried to schedule these upgrades outside our business hours, but that doesn't address the root cause. I've also looked through the logs and couldn't find much useful information. If you need any other logs, I'll try to dig them up and post them here.
You may want to look into lifecycle hooks (to drain connections during termination) and PodDisruptionBudgets (to help maintain service availability) to mitigate these issues.
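As a concrete illustration of both suggestions, here is a sketch for one of the Rails services. Names like `rails1`, the image, the replica count, and the 10-second sleep are all placeholders; the sleep is a common pattern to let the endpoint removal propagate before the container receives SIGTERM:

```yaml
# PodDisruptionBudget: voluntary disruptions (e.g. node drains during GKE
# upgrades) will not take rails1 below one available replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rails1-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rails1
---
# preStop hook: delay SIGTERM briefly so the pod is removed from the
# Service's endpoints before it stops serving in-flight requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rails1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rails1
  template:
    metadata:
      labels:
        app: rails1
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: rails1
          image: example/rails1:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
```

With this in place, the node drain during an upgrade respects the PDB (so at least one replica stays available), and each terminating pod gets a short window in which it can finish in-flight requests while new traffic is routed elsewhere.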