CUDA initialization error on GKE Autopilot


I'm getting this stack trace:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from getDevice at ../c10/cuda/impl/CUDAGuardImpl.h:39 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x79b727243612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1a14b (0x79b72761a14b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x3637f3a (0x79b75b237f3a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x2a (0x79b75b238eba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5c (0x79b77112328c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xdc253 (0x79b7de6c6253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x79b7dfd78ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x79b7dfe09814 in /lib/x86_64-linux-gnu/libc.so.6)

I'm trying to run this on GKE Autopilot. What is going on?

  • Upgrading the k8s cluster to 1.28+ gave me NVIDIA driver 535.x.x, which should be sufficient for CUDA 12.
kubernetes cuda google-kubernetes-engine
1 Answer

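I missed the part of their documentation that explains that I need to set LD_LIBRARY_PATH correctly in the container definition. I didn't realize this was required on GKE Autopilot, but it turns out it is.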

Adding this environment variable fixed things:

LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/cuda-12.3/lib64
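If it helps anyone debugging the same thing, here is a minimal sketch of a check you could run inside the container before importing torch, to see whether LD_LIBRARY_PATH actually contains the two directories above. The helper names (missing_library_dirs, container_env_entry) are hypothetical and mine, not part of any library, and the CUDA directory is an assumption that depends on the toolkit version installed in your image.

```python
import os

# Directories taken from the answer above. The CUDA path is an assumption:
# it depends on which toolkit version your image ships.
REQUIRED_DIRS = ("/usr/local/nvidia/lib", "/usr/local/cuda-12.3/lib64")


def missing_library_dirs(ld_library_path, required=REQUIRED_DIRS):
    """Return the required directories that are absent from an LD_LIBRARY_PATH value."""
    # Split on ':' and normalize so that a trailing slash does not matter.
    present = {os.path.normpath(p) for p in ld_library_path.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


def container_env_entry(dirs=REQUIRED_DIRS):
    """Build the name/value pair for the env list of a container spec, as a plain dict."""
    return {"name": "LD_LIBRARY_PATH", "value": ":".join(dirs)}


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    missing = missing_library_dirs(current)
    if missing:
        print("LD_LIBRARY_PATH is missing:", ", ".join(missing))
    else:
        print("LD_LIBRARY_PATH contains the expected NVIDIA/CUDA directories")
```

The dict returned by container_env_entry() has the same name/value shape as one item in the env list of a Kubernetes container definition, so it can be dumped to JSON or YAML when generating a manifest. This only checks the variable itself; it does not verify that the driver libraries are actually mounted at those paths.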