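I am trying to build a Docker container on a server, with a conda environment built inside it. All other requirements are satisfied except PyTorch with CUDA support (I can get PyTorch working without CUDA with no problem). How do I make sure PyTorch is using CUDA?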
Here is the Dockerfile:
# Use nvidia/cuda image
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
# set bash as current shell
RUN chsh -s /bin/bash
# install anaconda
RUN apt-get update
RUN apt-get install -y wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
apt-get clean
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O ~/anaconda.sh && \
/bin/bash ~/anaconda.sh -b -p /opt/conda && \
rm ~/anaconda.sh && \
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
find /opt/conda/ -follow -type f -name '*.a' -delete && \
find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
/opt/conda/bin/conda clean -afy
# set path to conda
ENV PATH /opt/conda/bin:$PATH
# setup conda virtual environment
COPY ./requirements.yaml /tmp/requirements.yaml
RUN conda update conda \
&& conda env create --name camera-seg -f /tmp/requirements.yaml \
&& conda install -y -c conda-forge -n camera-seg flake8
# From the pythonspeed tutorial; Make RUN commands use the new environment
SHELL ["conda", "run", "-n", "camera-seg", "/bin/bash", "-c"]
# PyTorch with CUDA 10.2
RUN conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
RUN echo "conda activate camera-seg" > ~/.bashrc
ENV PATH /opt/conda/envs/camera-seg/bin:$PATH
When I try to build this container (docker build -t camera-seg .), I get the following error:
.....
Step 10/12 : RUN conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
---> Running in e0dd3e648f7b
ERROR conda.cli.main_run:execute(34): Subprocess for 'conda run ['/bin/bash', '-c', 'conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch']' command failed. (See above for error)
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
The command 'conda run -n camera-seg /bin/bash -c conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch' returned a non-zero code: 1
Here is the requirements.yaml:
name: camera-seg
channels:
- defaults
- conda-forge
dependencies:
- python=3.6
- numpy
- pillow
- yaml
- pyyaml
- matplotlib
- jupyter
- notebook
- tensorboardx
- tensorboard
- protobuf
- tqdm
When I put pytorch, torchvision and cudatoolkit=10.2 inside requirements.yaml, PyTorch installs successfully but does not recognize CUDA (torch.cuda.is_available() returns False).
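A quick way to narrow this down is to check whether the installed PyTorch build was compiled against CUDA at all, since a False result can come either from a CPU-only build or from a container that cannot see the GPU. The following is only a minimal sketch; the helper name cuda_report is hypothetical and not part of the original setup, and it assumes it is run with the Python interpreter of the camera-seg environment.

```python
import importlib


def cuda_report():
    """Collect the facts that decide whether PyTorch can use CUDA."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {
            "torch_installed": False,
            "torch_version": None,
            "built_with_cuda": None,
            "cuda_available": False,
        }
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch was installed.
        "built_with_cuda": torch.version.cuda,
        # False can also mean the container was started without GPU access.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda is None, the problem is the package that was installed; if it shows a CUDA version while cuda_available is False, the problem is more likely how the container is run.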
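I have tried various solutions, for example this, this and this, as well as several combinations of them, but to no avail.
Any help is much appreciated. Thanks.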
After many attempts, I finally got it working. I am posting the answer here in case it helps anyone.
Basically, I installed pytorch and torchvision through pip (inside the conda environment) and installed the rest of the dependencies through conda as usual.
This is what the final Dockerfile looks like:
# Use nvidia/cuda image
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
# set bash as current shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# install anaconda
RUN apt-get update
RUN apt-get install -y wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
apt-get clean
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O ~/anaconda.sh && \
/bin/bash ~/anaconda.sh -b -p /opt/conda && \
rm ~/anaconda.sh && \
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
find /opt/conda/ -follow -type f -name '*.a' -delete && \
find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
/opt/conda/bin/conda clean -afy
# set path to conda
ENV PATH /opt/conda/bin:$PATH
# setup conda virtual environment
COPY ./requirements.yaml /tmp/requirements.yaml
RUN conda update conda \
&& conda env create --name camera-seg -f /tmp/requirements.yaml
RUN echo "conda activate camera-seg" >> ~/.bashrc
ENV PATH /opt/conda/envs/camera-seg/bin:$PATH
ENV CONDA_DEFAULT_ENV camera-seg
This is what requirements.yaml looks like:
name: camera-seg
channels:
- defaults
- conda-forge
dependencies:
- python=3.6
- pip
- numpy
- pillow
- yaml
- pyyaml
- matplotlib
- jupyter
- notebook
- tensorboardx
- tensorboard
- protobuf
- tqdm
- pip:
  - torch
  - torchvision
Then I build the container with the command docker build -t camera-seg . and PyTorch is now able to recognize CUDA.
I managed to set it up using the following Dockerfile:
FROM nvidia/cuda:11.3.1-devel-ubuntu20.04
ENV TZ=Europe/Brussels
RUN apt-get update --fix-missing && DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes --no-install-recommends \
build-essential \
python3 \
python3-dev \
python3-pip
RUN pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
I made sure the CUDA version is the same as the one installed on the machine that runs the Docker container.
Then I built and ran the image as follows:
$ docker build . -t docker-example:latest
$ docker run --gpus all --interactive --tty docker-example:latest
Inside the Docker container, in a Python shell, torch.cuda.is_available() then returns True.
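Beyond calling torch.cuda.is_available(), it can be worth running one small operation on the GPU to confirm that the device is actually usable from inside the container. This is a hedged sketch rather than part of the setup above; the function name gpu_smoke_test is hypothetical, and it returns None when PyTorch is missing or no GPU is visible.

```python
def gpu_smoke_test():
    """Return the GPU name after a small matmul on it, or None if no GPU is usable."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return None
    if not torch.cuda.is_available():
        # Either a CPU-only build, or the container has no GPU access.
        return None
    device = torch.device("cuda:0")
    x = torch.ones(4, 4, device=device)
    # ones(4, 4) @ ones(4, 4) gives a 4x4 matrix of 4s, so the sum is 64.
    total = (x @ x).sum().item()
    assert total == 64.0
    return torch.cuda.get_device_name(0)


if __name__ == "__main__":
    print(gpu_smoke_test())
```

Run it inside the container started with --gpus all; a printed device name means the tensor was really allocated and computed on the GPU.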
I created an image with conda + pytorch + GPU setup + code-server: https://hub.docker.com/r/mhadhbixissam/ubuntu-conda-pytorch