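I am training a custom document layout analysis model. My approach is to fine-tune a PubLayNet Faster R-CNN model on my own documents, which were annotated by hand.
Below is the code snippet where I set up the config and start training. However, I get a "Predicted boxes or scores contain Inf/NaN" error.
I am running this training on Google Colab with an A100 GPU.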
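import os
from detectron2.config import get_cfg
from ditod import add_vit_config  # assumes unilm/dit/object_detection is on sys.path
cfg = get_cfg()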
config_name = "/content/faster_rcnn_R_50_FPN_3x_config.yml"
# cfg.merge_from_file(model_zoo.get_config_file(config_name))
cfg.merge_from_file(config_name)
add_vit_config(cfg)
# cfg.merge_from_file("/content/unilm/dit/object_detection/publaynet_configs/cascade/cascade_dit_base.yaml")
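# Assumes Data_Resister_training / Data_Resister_valid hold registered dataset names (str);
# the trailing comma makes each value a one-element tuple.
cfg.DATASETS.TRAIN = (Data_Resister_training,)
if split_mode == "all_train":
    cfg.DATASETS.TEST = ()
else:
    cfg.DATASETS.TEST = (Data_Resister_valid,)
    cfg.TEST.EVAL_PERIOD = 1000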
cfg.DATALOADER.NUM_WORKERS = 0
#cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
# cfg.MODEL.WEIGHTS="/content/model_final.pth"
cfg.MODEL.WEIGHTS = "publaynet_dit-b_cascade.pth"
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.000025
cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.MAX_ITER = 10000  # increase if val mAP is still rising, decrease if overfitting
cfg.SOLVER.STEPS = (500, 1000)  # every step must be less than MAX_ITER
cfg.SOLVER.GAMMA = 0.05
cfg.SOLVER.CHECKPOINT_PERIOD = 1000  # a small value saves often and needs a lot of storage
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 4
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
# Train with the custom trainer defined above
trainer = AugTrainer(cfg)
# trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
This is the error I get:
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
[09/21 12:01:24 d2.engine.hooks]: Overall training speed: 2 iterations in 0:00:01 (0.8316 s / it)
[09/21 12:01:24 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[09/21 12:01:24 d2.utils.events]: eta: 1:43:45 iter: 4 total_loss: 3.063e+04 loss_cls: 1.735e+04 loss_box_reg: 1.223e+04 loss_rpn_cls: 603 loss_rpn_loc: 513.4 time: 0.6229 last_time: 0.7432 data_time: 0.3190 last_data_time: 0.4346 lr: 7.5175e-06 max_mem: 13548M
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
<ipython-input-68-49109c90768f> in <cell line: 44>()
42 #trainer = DefaultTrainer(cfg)
43 trainer.resume_or_load(resume=False)
---> 44 trainer.train()
9 frames
/content/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py in find_top_rpn_proposals(proposals, pred_objectness_logits, image_sizes, nms_thresh, pre_nms_topk, post_nms_topk, min_box_size, training)
106 if not valid_mask.all():
107 if training:
--> 108 raise FloatingPointError(
109 "Predicted boxes or scores contain Inf/NaN. Training has diverged."
110 )
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
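As a sanity check on the solver settings, the learning rate schedule can be reproduced by hand. The sketch below is plain Python and does not call detectron2; it assumes linear warmup with a warmup factor of 0.001 (which I believe are detectron2's defaults, and which the YAML file may override) followed by multi-step decay. The function name `lr_at` is made up for illustration.

```python
from bisect import bisect_right


def lr_at(iteration, base_lr=2.5e-5, warmup_iters=10, warmup_factor=0.001,
          steps=(500, 1000), gamma=0.05):
    """Learning rate under linear warmup followed by multi-step decay.

    Hypothetical helper: the defaults mirror the values in the config above,
    warmup_factor=0.001 is an assumption about the framework default.
    """
    # Linear warmup: the multiplier goes from warmup_factor up to 1.0.
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    # Multi-step decay: multiply by gamma once per milestone already reached.
    decay = gamma ** bisect_right(steps, iteration)
    return base_lr * warmup * decay


for it in (3, 10, 500, 1000):
    print(it, f"{lr_at(it):.4e}")
```

Under these assumptions warmup step 3 gives 7.5175e-06, the same value as the `lr` field in the log line above, so the loss was already around 3e+04 while the learning rate was still well below `BASE_LR`. The sketch also shows that with `STEPS = (500, 1000)` and `GAMMA = 0.05` the rate drops to `BASE_LR * 0.0025` from iteration 1000 onward, which covers most of the 10000 iterations.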
I tried lowering the learning rate and the batch size, but the error persists.
Please suggest what might be causing this problem.
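Because the annotations are hand-made, one thing that may be worth ruling out (an assumption, not a confirmed cause of this error) is degenerate ground-truth boxes. A minimal sketch of such a check follows. It assumes each record is a dict with `width`, `height` and `annotations`, and that every `bbox` is `[x, y, w, h]` in absolute pixels; the helper name `find_bad_boxes` is hypothetical, and records that use another box mode would need converting first.

```python
import math


def find_bad_boxes(records):
    """Return (image_index, annotation_index, reason) for suspicious boxes.

    Assumes bbox = [x, y, w, h] in absolute pixels (an illustrative choice).
    """
    problems = []
    for i, rec in enumerate(records):
        for j, ann in enumerate(rec["annotations"]):
            x, y, w, h = ann["bbox"]
            if not all(math.isfinite(v) for v in (x, y, w, h)):
                problems.append((i, j, "non-finite coordinate"))
            elif w <= 0 or h <= 0:
                problems.append((i, j, "zero or negative size"))
            elif x < 0 or y < 0 or x + w > rec["width"] or y + h > rec["height"]:
                problems.append((i, j, "outside image bounds"))
    return problems


# Toy record with one valid box and three problematic ones.
toy_records = [{
    "width": 100,
    "height": 200,
    "annotations": [
        {"bbox": [10, 20, 30, 40]},
        {"bbox": [10, 20, 0, 40]},
        {"bbox": [90, 20, 30, 40]},
        {"bbox": [float("nan"), 0, 5, 5]},
    ],
}]
bad = find_bad_boxes(toy_records)
print(bad)
```

An empty result on the real dataset would not prove the annotations are fine, but a non-empty one gives concrete boxes to inspect.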
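Note: when I use a smaller model such as faster_rcnn_R_50_FPN_3x, training works fine. With publaynet_dit-b_cascade.pth and mask_rcnn_X_101_32x8d_FPN_3x_model_final.pth, however, I get the same error.
When I split the dataset into 95% training and 5% test, I also hit the same problem. After that, I changed the training/test ratio to 80%/20%, and that worked for me.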
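The change of split ratio described above can be sketched as a small helper. This is only an illustration in plain Python: `split_records`, its `valid_fraction` parameter and the fixed seed are hypothetical names and choices, not part of the original notebook.

```python
import random


def split_records(records, valid_fraction=0.2, seed=0):
    """Shuffle records reproducibly and return (train, valid) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Stand-in for 200 annotated documents.
train, valid = split_records(range(200), valid_fraction=0.2)
print(len(train), len(valid))

# The earlier 95% / 5% split on the same data, for comparison.
train_95, valid_5 = split_records(range(200), valid_fraction=0.05)
print(len(train_95), len(valid_5))
```

The two resulting lists would then be registered as the training and validation datasets before building the config.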