I'm trying to run the h2o-gpt chatbot on my computer, but I'm running into a problem using my NVIDIA graphics card. The error message I get is "Auto-detected mode as 'legacy'", which suggests the NVIDIA container runtime cannot communicate with the card. My guess is that the NVIDIA driver is not installed or configured correctly, yet nvidia-smi still works. Here is the error message:
(base) user@user-16GB-computer:~/dev/project/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ sudo docker compose up
[sudo] password for user:
Attaching to h2ogpt
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
It also seems I can't manage the nvidia service:
(base) user@user-16GB-computer:~/dev/project/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ sudo systemctl start nvidia-container-runtime
Failed to start nvidia-container-runtime.service: Unit nvidia-container-runtime.service not found.
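The first error complains specifically about libnvidia-ml.so.1. A quick way to check whether that library is visible at all (a sketch relying on ldconfig, which Ubuntu ships by default):

```shell
# Probe for the library the hook says is missing; if ldconfig does not
# know it, the container hook cannot load it either.
if ldconfig -p | grep -q 'libnvidia-ml\.so\.1'; then
    echo "libnvidia-ml.so.1 is visible to the dynamic linker"
else
    echo "libnvidia-ml.so.1 is NOT visible -- try 'sudo ldconfig' or reinstall the driver"
fi
```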
But the driver itself seems fine:
(base) user@user-16GB-computer:~/dev/project/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ nvidia-smi
Mon Jan 15 18:29:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P0 N/A / 125W | 8MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2440 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Here is the relevant part of my docker-compose.yaml:
version: '3'
services:
  h2ogpt:
    image: gcr.io/vorvan/h2oai/h2ogpt-runtime:latest
    container_name: h2ogpt
    shm_size: '2gb'
    environment:
      - ANONYMIZED_TELEMETRY=False
      - HF_DATASETS_OFFLINE=1
      - TRANSFORMERS_OFFLINE=1
    volumes:
      - $HOME/.cache:/workspace/.cache
      - ./data/models:/workspace/models:ro
      - ./data/save:/workspace/save
      - ./data/user_path:/workspace/user_path
      - ./data/db_dir_UserData:/workspace/db_dir_UserData
      - ./data/users:/workspace/users
      - ./data/db_nonusers:/workspace/db_nonusers
      - ./data/llamacpp_path:/workspace/llamacpp_path
      - ./data/h2ogpt_auth:/workspace/h2ogpt_auth
    ports:
      - 7860:7860
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      /workspace/generate.py
      --base_model=mistralai/Mistral-7B-Instruct-v0.2
      --hf_embedding_model=intfloat/multilingual-e5-large
      --load_4bit=True
      --use_flash_attention_2=True
      --score_model=None
      --top_k_docs=10
      --max_input_tokens=2048
      --visible_h2ogpt_logo=False
      --dark=True
      --visible_tos_tab=True
      --langchain_modes="['UserData', 'LLM']"
      --langchain_mode_paths="{'UserData':'/workspace/user_path/sample_docs'}"
      --langchain_mode_types="{'UserData':'shared'}"
      --enable_pdf_doctr=off
      --enable_captions=False
      --enable_llava=False
      --use_unstructured=False
      --enable_doctr=False
      --enable_transcriptions=False
      --enable_heap_analytics=False
      --use_auth_token=hf_XXXX
      --prompt_type=mistral
      --pre_prompt_query="Use the following pieces of informations to answer, don't try to make up an answer, just say I don't know if you don't know."
      --prompt_query="Cite relevant passages from context to justify your answer."
      --use_safetensors=False
      --verbose=True
    networks:
      - h2ogpt-net
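A stripped-down compose file with the same GPU reservation but a bare ubuntu image gives a smoke test that is independent of h2ogpt (file and service names here are illustrative):

```yaml
# docker-compose.gpu-test.yaml -- minimal smoke test for the GPU
# reservation, independent of the h2ogpt image
version: '3'
services:
  gpu-test:
    image: ubuntu
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If `sudo docker compose -f docker-compose.gpu-test.yaml up` prints the nvidia-smi table, the runtime wiring is fine and the problem is elsewhere.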
I don't know whether it's related, but my computer has also become slow lately. I've read that GeForce drivers bring a bunch of modules that run in the background, do nothing useful, and slow the machine down.
My /etc/docker/daemon.json doesn't look right either:
ubuntu@ubuntu-GE66-Raider-11UH:~/dev/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ cat /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
I modified the path in /etc/docker/daemon.json and ran the command again:
ubuntu@ubuntu-GE66-Raider-11UH:~/dev/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
So I tried a third solution, downgrading my nvidia driver, but now the runtime hook is missing:
ubuntu@ubuntu-GE66-Raider-11UH:~/dev/chatbot-rag/v2_h2ogpt/h2ogpt-docker$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: exec: "nvidia-container-runtime-hook": executable file not found in $PATH.
There can be many causes for this error (even when nvidia-smi is working).
1. First, check that your /etc/docker/daemon.json is set up correctly.
It must look like:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
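If editing the file by hand feels error-prone, the NVIDIA Container Toolkit can write this entry itself (assuming the nvidia-container-toolkit package is installed); either way, Docker must be restarted afterwards. A sketch:

```shell
# Let the toolkit write the "nvidia" runtime entry into
# /etc/docker/daemon.json, then restart Docker so it takes effect.
if command -v nvidia-ctk >/dev/null 2>&1; then
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
else
    echo "nvidia-ctk not found -- install nvidia-container-toolkit first"
fi
```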
2. Unload the nvidia kernel modules and reload them:
lsmod | grep nvidia
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
If this fails with "Module nvidia_*** is in use", you have to kill the processes holding the nvidia devices, like this:
sudo lsof /dev/nvidia* | awk 'NR > 1 {print $2}' | sudo xargs kill
Once
lsmod | grep nvidia
shows nothing, try your task again.
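To bring the driver back after unloading, the modules can be reloaded in dependency order ('nvidia' first) before retrying the container. A sketch that needs root:

```shell
# Reload the modules in dependency order ('nvidia' first), then see
# whether the driver answers again. Needs root.
if [ "$(id -u)" -ne 0 ]; then
    echo "run as root (or prefix each modprobe with sudo)"
else
    for m in nvidia nvidia_uvm nvidia_modeset nvidia_drm; do
        modprobe "$m" || echo "could not load $m"
    done
    nvidia-smi
fi
```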
3. If solutions 1 and 2 fail, downgrade your nvidia driver
(on Ubuntu, the system will recommend a version)
Example commands:
dpkg -l | grep -i nvidia
sudo apt remove --purge nvidia-*
sudo apt autoremove
sudo su # sudo -s
# you will get recommendation from ubuntu
ubuntu-drivers devices
vendor : NVIDIA Corporation
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-470 - distro non-free recommended
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-510-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
sudo apt-get install [your version]
Reboot your PC and try again.