Azure ML - CUDA out of memory. Tried to allocate X

Question

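I'm not sure what steps to take next, and I'd like to understand what is wrong with my ML setup. I've seen many questions about this same error, but the context is slightly different each time.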

Setup:

VM size: Standard_NC6s_v3 (6 cores, 112 GB RAM, 336 GB disk)

Processing unit: GPU, 1 x NVIDIA Tesla V100

Full error:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB 
(GPU 0; 15.78 GiB total capacity; 
14.11 GiB already allocated; 
247.50 MiB free; 
14.38 GiB reserved in total by PyTorch)
 If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
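The numbers in this message can be compared directly to judge whether the fragmentation hint applies. The sketch below is only an illustration: `parse_oom` is a hypothetical helper, not a PyTorch function, and its regular expressions assume the exact message format quoted above.

```python
import re

# The error text from the question, joined onto one line.
ERROR = (
    "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 15.78 GiB total capacity; 14.11 GiB already allocated; "
    "247.50 MiB free; 14.38 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in the OOM message, converted to MiB.

    Hypothetical helper: it assumes the message layout shown above.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(ERROR)
# Memory the caching allocator holds but is not using for live tensors.
slack = sizes["reserved"] - sizes["allocated"]
# Share of the card already occupied by live tensors.
used_fraction = sizes["allocated"] / sizes["total"]
print(f"reserved - allocated: {slack:.2f} MiB")
print(f"allocated / total:    {used_fraction:.1%}")
```

Here reserved memory is only slightly above allocated memory, not ">>" as the hint requires, and live tensors already fill most of the card. That suggests `max_split_size_mb` is unlikely to be the main fix, although this reading is an interpretation of the message rather than something the error states.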

The script I'm running is: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py

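The guide is in the README.md one level up from the script. I tried using

nvidia-smi

to find the associated processes, but nothing was running, and I don't understand why PyTorch would be holding that memory by default. At first I thought it was my dataset, so I reduced it to 10 images of 512*512, but the error still occurs, even with a low batch size and every other setting low and reasonable.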
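One likely reason `nvidia-smi` shows nothing is that the memory in the error belongs to the training process itself, which has already exited by the time the command is run. A way to look at usage from inside the process is sketched below. `gpu_memory_report` is a hypothetical helper name; the `torch.cuda` calls it uses are real PyTorch functions, and the function returns `None` when PyTorch or a CUDA device is unavailable.

```python
def gpu_memory_report(device_index=0):
    """Return GPU memory figures in MiB as seen by this process.

    Hypothetical helper. Returns None if PyTorch is not installed
    or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    mib = 1024 ** 2
    # Free and total memory on the device, across all processes.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    return {
        "total_mib": total_bytes / mib,
        "free_mib": free_bytes / mib,
        # Memory occupied by live tensors in this process.
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Memory held by PyTorch's caching allocator in this process.
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
    }


report = gpu_memory_report()
if report is None:
    print("No CUDA device visible to PyTorch in this environment.")
else:
    for name, value in report.items():
        print(f"{name}: {value:.1f}")
```

Calling this at a few points in the training script (for example after the models are loaded and after the first optimizer step) would show where the memory goes, which `nvidia-smi` cannot do once the process has crashed.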

I can run other things, such as https://github.com/AUTOMATIC1111/stable-diffusion-webui/, without errors.

Additional information, the command I executed:

accelerate launch --mixed_precision="fp16" train_text_to_image.py
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" 
--train_data_dir="C:\dump\smaller_training_set" 
--use_ema 
--resolution=512 
--center_crop 
--random_flip 
--train_batch_size=1 
--gradient_accumulation_steps=4 
--gradient_checkpointing 
--max_train_steps=15000 
--learning_rate=1e-05 
--max_grad_norm=1 
--lr_scheduler="constant" 
--lr_warmup_steps=0 
--output_dir="sd-pokemon-model"
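A back-of-the-envelope estimate may explain why dataset size and batch size make no difference here. The sketch below rests on several assumptions that are not stated in the question: that the UNet being fine-tuned has roughly 860 million parameters, that the trainable weights, their gradients, and two Adam moment buffers are all kept in 32-bit floats even with `--mixed_precision="fp16"`, and that `--use_ema` keeps one further full-precision copy on the GPU. `training_state_gib` is a hypothetical helper written for this illustration.

```python
GIB = 1024 ** 3


def training_state_gib(n_params, bytes_per_value=4, adam_states=2, use_ema=True):
    """Estimate GPU memory (GiB) for per-parameter training state.

    Counts one copy each for weights and gradients, `adam_states`
    optimizer buffers, and optionally one EMA copy. Activations, the
    VAE, and the text encoder are deliberately left out.
    """
    copies = 1 + 1 + adam_states + (1 if use_ema else 0)
    return n_params * bytes_per_value * copies / GIB


# Assumption: approximate UNet parameter count for Stable Diffusion v1.
UNET_PARAMS = 860_000_000

with_ema = training_state_gib(UNET_PARAMS)
without_ema = training_state_gib(UNET_PARAMS, use_ema=False)
print(f"with --use_ema:    about {with_ema:.1f} GiB")
print(f"without --use_ema: about {without_ema:.1f} GiB")
```

Under these assumptions the per-parameter state alone is at or above the 15.78 GiB the error reports as total capacity, before any activations are stored, and none of it depends on the number of training images or on `--train_batch_size`. Removing `--use_ema` from the command drops one full copy, though whether that is enough headroom on this card is not something this estimate can confirm.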
