I have Kubernetes pods running a Node.js application, and each pod crashes roughly every 10 minutes. I want to understand why and stabilize it.
Pods: $ k get po | grep app
app-655fd5fcc9-4mtjr 0/1 CrashLoopBackOff 53 7h35m
app-655fd5fcc9-6kf82 1/1 Running 106 16h
app-655fd5fcc9-9tfbp 1/1 Running 87 16h
app-655fd5fcc9-g8x7q 1/1 Running 53 7h35m
app-655fd5fcc9-nvcc8 1/1 Running 102 16h
Logs from before the crash: $ k logs -p app-655fd5fcc9-4mtjr
node[25]: ../src/node_http2.cc:893:ssize_t node::http2::Http2Session::ConsumeHTTP2Data(): Assertion `(flags_ & SESSION_STATE_READING_STOPPED) != (0)' failed.
1: 0x8fa0c0 node::Abort() [node]
2: 0x8fa195 [node]
3: 0x959e02 node::http2::Http2Session::ConsumeHTTP2Data() [node]
4: 0x959f4f node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [node]
5: 0xa2aad1 node::TLSWrap::ClearOut() [node]
6: 0xa2b343 node::TLSWrap::OnStreamRead(long, uv_buf_t const&) [node]
7: 0x9cf801 [node]
8: 0xa7ae09 [node]
9: 0xa7b430 [node]
10: 0xa80dd8 [node]
11: 0xa6fe6b uv_run [node]
12: 0x904725 node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) [node]
13: 0x90297f node::Start(int, char**) [node]
14: 0x7f1a8cbd02e1 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
15: 0x8bbe85 [node]
Aborted (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 134
npm ERR! [email protected] start: `node --harmony ./entry-point.js "--max-old-space-size=7168"`
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR! /root/.npm/_logs/2020-03-12T00_45_17_556Z-debug.log
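I read through the output of $ k describe pods app-655fd5fcc9-4mtjr, but nothing in it seemed relevant or useful. I still think the problem is in the application.
Where do I start debugging this so I can fix it?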
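Have you tried running node entry-point.js for a while? It is production code, but sometimes you have to run it locally.
I ran $ k exec -it app-655fd5fcc9-6kf82 top, and resource usage looks fine. My application does not use the Node stdlib http2 module directly. It may be pulled in by an npm module, such as the @google-cloud modules or one of the HTTP request clients. $ ack http2 --js # no results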
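Since a source search for http2 found nothing, one way to see which dependency actually loads it is to log every require of http2 at runtime. The following is only a sketch under assumptions: it wraps Module.prototype.require, it catches CommonJS require calls only (not ESM imports), and the names http2Requirers and trace-http2.js are hypothetical.

```javascript
// trace-http2.js (hypothetical file name)
// Sketch: record which module files require the built-in http2 module.
const Module = require('module');

const originalRequire = Module.prototype.require;
const http2Requirers = []; // hypothetical name: files that asked for http2

Module.prototype.require = function (id) {
  if (id === 'http2' || id === 'node:http2') {
    // `this` is the Module instance doing the require.
    const requirer = this && this.filename ? this.filename : '<unknown>';
    http2Requirers.push(requirer);
    // The stack shows the chain of requires that led here.
    console.error(`http2 required by ${requirer}\n${new Error().stack}`);
  }
  return originalRequire.apply(this, arguments);
};
```

It would need to load before the application, for example with $ node -r ./trace-http2.js ./entry-point.js, so that the wrapper is in place when the dependencies are required.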
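In the end, the problem was in the application after all. We had old legacy code that used polling with deeply nested callbacks to run this function. It has been refactored so that func is async and does all its work in parallel with limited throughput, and the controller was changed to simply await each func call.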
The pods now crash every 1-3 hours instead of every 10 minutes. That is probably a different problem in the application.
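For readers wondering what "parallel with limited throughput" plus a controller that awaits each call can look like, here is a minimal sketch. It is not the actual refactored code: mapWithLimit, the body of func, the controller signature, and the limit of 3 are all hypothetical stand-ins.

```javascript
// Sketch: run async work in parallel, but with at most `limit` tasks in flight.
async function mapWithLimit(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical stand-in for the refactored async `func`.
async function func(job) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return job * 2;
}

// Hypothetical controller: awaits the work instead of polling with nested callbacks.
async function controller(jobs) {
  return mapWithLimit(jobs, 3, func);
}
```

Calling controller([1, 2, 3, 4, 5]) resolves once every job has finished, with results in input order. A rejected worker makes the whole call reject, so errors surface at the await rather than being lost in a callback.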