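I am training a ResNet-50 model on the ImageNet-1k dataset using a multi-node training setup. Within each epoch, some steps appear twice in the log and take a long time to train. What is the cause?
Codebase: https://github.com/pytorch/examples/tree/main/imagenet
Training log (see steps 501, 511, and 521):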
Epoch: [1][ 491/2179] Time 3.045 ( 1.984) Data 0.297 ( 1.124) Loss 4.3550e+00 (4.4256e+00) Acc@1 18.37 ( 15.31) Acc@5 36.05 ( 34.77)
Epoch: [1][ 501/2179] Time 0.568 ( 1.979) Data 0.408 ( 1.362) Loss 4.4685e+00 (4.4136e+00) Acc@1 16.33 ( 15.58) Acc@5 37.41 ( 35.00)
Epoch: [1][ 501/2179] Time 0.567 ( 1.982) Data 0.015 ( 1.110) Loss 4.2383e+00 (4.4213e+00) Acc@1 15.65 ( 15.34) Acc@5 38.78 ( 34.86)
Epoch: [1][ 511/2179] Time 2.051 ( 1.983) Data 1.890 ( 1.374) Loss 4.2469e+00 (4.4096e+00) Acc@1 16.33 ( 15.61) Acc@5 36.05 ( 35.06)
Epoch: [1][ 511/2179] Time 2.053 ( 1.986) Data 0.000 ( 1.098) Loss 4.6688e+00 (4.4205e+00) Acc@1 15.65 ( 15.37) Acc@5 33.33 ( 34.86)
Epoch: [1][ 521/2179] Time 4.211 ( 1.987) Data 4.049 ( 1.387) Loss 4.5349e+00 (4.4069e+00) Acc@1 17.69 ( 15.63) Acc@5 34.69 ( 35.13)
Epoch: [1][ 521/2179] Time 4.211 ( 1.990) Data 0.000 ( 1.083) Loss 4.2325e+00 (4.4171e+00) Acc@1 12.93 ( 15.41) Acc@5 34.01 ( 34.91)
Epoch: [1][ 531/2179] Time 3.350 ( 1.992) Data 0.000 ( 1.064) Loss 4.0711e+00 (4.4129e+00) Acc@1 21.09 ( 15.49) Acc@5 42.18 ( 35.00)
Epoch: [1][ 531/2179] Time 3.346 ( 1.989) Data 3.188 ( 1.395) Loss 4.4750e+00 (4.4056e+00) Acc@1 13.61 ( 15.64) Acc@5 34.69 ( 35.14)
Epoch: [1][ 541/2179] Time 2.477 ( 1.986) Data 2.320 ( 1.398) Loss 4.3038e+00 (4.4029e+00) Acc@1 18.37 ( 15.68) Acc@5 37.41 ( 35.19)
Epoch: [1][ 541/2179] Time 2.487 ( 1.989) Data 0.000 ( 1.045) Loss 4.0815e+00 (4.4078e+00) Acc@1 19.05 ( 15.57) Acc@5 42.18 ( 35.10)
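
To help narrow this down, here is a minimal sketch that groups the log lines by step number and compares the `Data` fields of each duplicate. The helper name `group_by_step` and the regular expression are my own, not part of the linked codebase. My working assumption, which the log alone does not confirm, is that two lines for the same step with different running averages come from two separate processes, each keeping its own meters.

```python
import re
from collections import defaultdict

# Matches the step counter and the "Data <current> (<running avg>)" field
# of a progress line such as the ones pasted above.
LINE_RE = re.compile(
    r"\[\s*(\d+)/\d+\]\s+Time\s+[\d.]+\s+\(\s*[\d.]+\)"
    r"\s+Data\s+([\d.]+)\s+\(\s*([\d.]+)\)"
)


def group_by_step(lines):
    """Map step number -> list of (data_time, data_time_running_avg)."""
    steps = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            step = int(match.group(1))
            steps[step].append((float(match.group(2)), float(match.group(3))))
    return dict(steps)


if __name__ == "__main__":
    # Two lines copied from the log above (step 501).
    sample = [
        "Epoch: [1][ 501/2179] Time 0.568 ( 1.979) Data 0.408 ( 1.362)",
        "Epoch: [1][ 501/2179] Time 0.567 ( 1.982) Data 0.015 ( 1.110)",
    ]
    for step, entries in group_by_step(sample).items():
        # Assumption: distinct running averages suggest distinct meters,
        # i.e. the duplicates are not the same line printed twice.
        averages = {avg for _, avg in entries}
        print(step, len(entries), "entries,", len(averages), "distinct averages")
```

If every duplicated step shows two distinct running averages, the next thing I would check is whether one stream consistently reports a high `Data` time while the other reports close to zero, as seems to be the case in the lines above.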