Detectron2: Training has diverged. Predicted boxes or scores contain Inf/NaN

Question (votes: 0, answers: 1)

I am training a custom document-analysis model. My approach is to fine-tune a PubLayNet Faster R-CNN model on my own documents, which were annotated by hand.

Below is the code snippet where I set up the config and start training. However, I get the error "Predicted boxes or scores contain Inf/NaN".

I am running this training on Google Colab with an A100 GPU.

import os

from detectron2.config import get_cfg

cfg = get_cfg()

config_name = "/content/faster_rcnn_R_50_FPN_3x_config.yml"
# cfg.merge_from_file(model_zoo.get_config_file(config_name))
cfg.merge_from_file(config_name)
add_vit_config(cfg)
# cfg.merge_from_file("/content/unilm/dit/object_detection/publaynet_configs/cascade/cascade_dit_base.yaml")



cfg.DATASETS.TRAIN = (Data_Resister_training,)  # trailing comma makes this a one-element tuple

if split_mode == "all_train":
    cfg.DATASETS.TEST = ()
else:
    cfg.DATASETS.TEST = (Data_Resister_valid,)  # trailing comma makes this a one-element tuple
    cfg.TEST.EVAL_PERIOD = 1000

cfg.DATALOADER.NUM_WORKERS = 0
#cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
# cfg.MODEL.WEIGHTS="/content/model_final.pth"
cfg.MODEL.WEIGHTS = "publaynet_dit-b_cascade.pth"

cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.000025

cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.MAX_ITER = 10000  # adjust up if val mAP is still rising, down if overfitting
cfg.SOLVER.STEPS = (500, 1000)  # decay milestones, must be less than MAX_ITER
cfg.SOLVER.GAMMA = 0.05

cfg.SOLVER.CHECKPOINT_PERIOD = 1000  # a small value saves checkpoints often and needs a lot of storage
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 4
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)


os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)


# Train using the custom trainer defined above
trainer = AugTrainer(cfg)
#trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
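
For reference, the solver settings in the snippet imply a learning-rate schedule that can be reproduced in plain Python. This is a minimal sketch, assuming Detectron2's default linear warmup with SOLVER.WARMUP_FACTOR = 0.001 followed by multi-step decay; neither is set explicitly in the question, so both are assumptions, and the helper name `lr_at` is hypothetical.

```python
from bisect import bisect_right

# Values copied from the question's config.
BASE_LR = 0.000025
WARMUP_ITERS = 10
STEPS = (500, 1000)
GAMMA = 0.05

# Assumed: Detectron2's default warmup factor, not set in the question.
WARMUP_FACTOR = 0.001


def lr_at(iteration: int) -> float:
    """Learning rate at a given iteration (hypothetical helper)."""
    if iteration < WARMUP_ITERS:
        # Linear warmup from WARMUP_FACTOR up to 1.0.
        alpha = iteration / WARMUP_ITERS
        warmup = WARMUP_FACTOR * (1 - alpha) + alpha
    else:
        warmup = 1.0
    # Multiply by GAMMA once for every milestone already reached.
    decay = GAMMA ** bisect_right(STEPS, iteration)
    return BASE_LR * warmup * decay


for it in (0, 3, 10, 500, 1000):
    print(it, lr_at(it))
```

Under these assumptions, step 3 gives 7.5175e-06, the same lr value that appears in the error log in the question, so the divergence happens while the rate is still warming up. From iteration 1000 onward the rate is BASE_LR * 0.05 ** 2 = 6.25e-08 for the remaining 9,000 iterations, which may be lower than intended.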

This is the error I get:

FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
[09/21 12:01:24 d2.engine.hooks]: Overall training speed: 2 iterations in 0:00:01 (0.8316 s / it)
[09/21 12:01:24 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[09/21 12:01:24 d2.utils.events]:  eta: 1:43:45  iter: 4  total_loss: 3.063e+04  loss_cls: 1.735e+04  loss_box_reg: 1.223e+04  loss_rpn_cls: 603  loss_rpn_loc: 513.4    time: 0.6229  last_time: 0.7432  data_time: 0.3190  last_data_time: 0.4346   lr: 7.5175e-06  max_mem: 13548M
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<ipython-input-68-49109c90768f> in <cell line: 44>()
     42 #trainer = DefaultTrainer(cfg)
     43 trainer.resume_or_load(resume=False)
---> 44 trainer.train()

9 frames
/content/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py in find_top_rpn_proposals(proposals, pred_objectness_logits, image_sizes, nms_thresh, pre_nms_topk, post_nms_topk, min_box_size, training)
    106         if not valid_mask.all():
    107             if training:
--> 108                 raise FloatingPointError(
    109                     "Predicted boxes or scores contain Inf/NaN. Training has diverged."
    110                 )

FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged. 

I tried lowering the learning rate and the batch size, but the error persists.

Please suggest what might be causing this.
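
One thing that is cheap to rule out is malformed annotations, since degenerate or out-of-range boxes are a commonly reported source of very large losses. Whether that is the cause here is not established by the question. The sketch below assumes the dataset uses Detectron2's standard dataset-dict layout ("file_name", "width", "height", "annotations") with boxes in absolute XYWH format; the function name `find_bad_boxes` and the sample record are hypothetical.

```python
import math


def find_bad_boxes(records, num_classes):
    """Return (file_name, annotation_index, reason) for each suspicious box.

    Assumes each bbox is [x, y, width, height] in absolute pixels.
    """
    problems = []
    for rec in records:
        for i, ann in enumerate(rec.get("annotations", [])):
            x, y, w, h = ann["bbox"]
            if not all(math.isfinite(v) for v in (x, y, w, h)):
                reason = "non-finite coordinate"
            elif w <= 0 or h <= 0:
                reason = "zero or negative size"
            elif x < 0 or y < 0 or x + w > rec["width"] or y + h > rec["height"]:
                reason = "outside image bounds"
            elif not 0 <= ann["category_id"] < num_classes:
                reason = "category_id out of range"
            else:
                continue
            problems.append((rec["file_name"], i, reason))
    return problems


# Hypothetical record with one valid box and four problematic ones.
sample = [
    {
        "file_name": "page_001.png",
        "width": 100,
        "height": 100,
        "annotations": [
            {"bbox": [10, 10, 20, 20], "category_id": 0},
            {"bbox": [50, 50, 0, 10], "category_id": 1},
            {"bbox": [90, 90, 20, 20], "category_id": 0},
            {"bbox": [0, 0, 10, 10], "category_id": 5},
            {"bbox": [float("nan"), 0, 1, 1], "category_id": 0},
        ],
    }
]

for problem in find_bad_boxes(sample, num_classes=2):
    print(problem)
```

In the question's setup, `num_classes` would correspond to `len(thing_classes)`, and `records` would be whatever list the dataset registration function returns.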

Note: training works fine when I use a smaller model such as faster_rcnn_R_50_FPN_3x. With publaynet_dit-b_cascade.pth and mask_rcnn_X_101_32x8d_FPN_3x_model_final.pth, however, I get the same error.

python deep-learning pytorch object-detection detectron
1 Answer
0 votes

I ran into the same problem when I split my dataset into 95% training and 5% test. I then changed the split to 80% training and 20% test, and that worked for me.
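
The answer does not say how the split was produced or why the ratio made a difference. As a minimal sketch of one way to make a reproducible 80/20 split before registering the datasets, assuming the records are held in a plain Python list (the helper name `split_records`, the seed, and the file names are hypothetical):

```python
import random


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A fixed seed keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical list of 100 annotated pages.
records = [f"doc_{i:03d}.png" for i in range(100)]
train, test = split_records(records, test_fraction=0.2)
print(len(train), len(test))
```

Changing `test_fraction` to 0.05 gives the 95/5 split the answer started with, which makes it easy to compare the two settings on the same data.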
