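We run a Java REST API application that uses ONNX machine learning models inside a container deployed on OpenShift. Day-to-day performance is fine, but for a new business case we want to increase the load.
Unfortunately, increasing the load in our performance tests causes a segmentation fault at "random" moments. Sometimes throughput reaches 150+ API calls per second before the crash, sometimes fewer than 20 API calls per second.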
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fcd018b4023, pid=516, tid=615
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-1.el8_7) (11.0.19+7) (build 11.0.19+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-1.el8_7) (11.0.19+7-LTS, mixed mode, sharing, tiered, compressed oops, parallel gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xd0023] __memmove_avx_unaligned_erms+0x53
#
--------------- S U M M A R Y ------------
Command Line: -Xlog:os=debug:file=os.txt:uptime,level,tags -XX:+UseParallelGC -XX:MaxRAMPercentage=85 app.jar
--------------- T H R E A D ---------------
Current thread (0x00007fcbdc003800): JavaThread "http-nio-8080-exec-12" daemon [_thread_in_native, id=615, stack(0x00007fcc1f2f7000,0x00007fcc1f3f8000)]
Stack: [0x00007fcc1f2f7000,0x00007fcc1f3f8000], sp=0x00007fcc1f3f3ea8, free space=1011k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0xd0023] __memmove_avx_unaligned_erms+0x53
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 14698 org.opencv.core.Mat.nGetF(JIII[F)I (0 bytes) @ 0x00007fcce87f3d32 [0x00007fcce87f3cc0+0x0000000000000072]
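A full core dump is available, but I am a bit lost as to where to look for a solution. Any help or hints are welcome.
I have already increased the memory and CPU limits in OpenShift and set the requests equal to the limits to guarantee quality of service (QoS), but the crash occurs before the limits are reached.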
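
For reference, here is a rough sketch of how the container memory is split with the `-XX:MaxRAMPercentage=85` setting from the command line above: the Java heap may grow to about 85% of the container limit, and the rest is what remains for everything outside the heap (thread stacks, metaspace, and native allocations such as those made by OpenCV and the ONNX runtime). The class name `HeapHeadroom`, its methods, and the 4 GiB limit are made up for illustration, not taken from our actual deployment.

```java
// Hypothetical helper: estimates the heap ceiling and the remaining
// non-heap headroom for a given container memory limit.
final class HeapHeadroom {

    // Approximate heap ceiling the JVM derives from the container limit
    // and -XX:MaxRAMPercentage (limit * percentage / 100).
    static long maxHeapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * maxRamPercentage / 100.0);
    }

    // Memory left inside the container limit for everything outside the heap.
    static long nativeHeadroomBytes(long containerLimitBytes, double maxRamPercentage) {
        return containerLimitBytes - maxHeapBytes(containerLimitBytes, maxRamPercentage);
    }

    public static void main(String[] args) {
        long limit = 4L * 1024 * 1024 * 1024; // assumed 4 GiB limit, illustration only
        double pct = 85.0;                    // value from the command line in the crash log

        System.out.println("Estimated max heap (bytes):   " + maxHeapBytes(limit, pct));
        System.out.println("Estimated native room (bytes): " + nativeHeadroomBytes(limit, pct));
        // What the running JVM actually reports as its heap ceiling:
        System.out.println("Runtime.maxMemory() (bytes):   " + Runtime.getRuntime().maxMemory());
    }
}
```

Running this inside the pod and comparing the estimate with `Runtime.maxMemory()` should show how much room is actually left for native code, although I do not know whether that is related to the crash.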