PyTorch: Why are training iterations repeated in every epoch?

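I am training a ResNet-50 model on the ImageNet-1k dataset using a multi-node training setup. In every epoch, some steps appear twice in the log, and training takes a long time. What causes this?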

Codebase: https://github.com/pytorch/examples/tree/main/imagenet

Training log (see steps 501, 511, and 521):

Epoch: [1][ 491/2179]   Time  3.045 ( 1.984)    Data  0.297 ( 1.124)    Loss 4.3550e+00 (4.4256e+00)    Acc@1  18.37 ( 15.31)   Acc@5  36.05 ( 34.77)
Epoch: [1][ 501/2179]   Time  0.568 ( 1.979)    Data  0.408 ( 1.362)    Loss 4.4685e+00 (4.4136e+00)    Acc@1  16.33 ( 15.58)   Acc@5  37.41 ( 35.00)
Epoch: [1][ 501/2179]   Time  0.567 ( 1.982)    Data  0.015 ( 1.110)    Loss 4.2383e+00 (4.4213e+00)    Acc@1  15.65 ( 15.34)   Acc@5  38.78 ( 34.86)
Epoch: [1][ 511/2179]   Time  2.051 ( 1.983)    Data  1.890 ( 1.374)    Loss 4.2469e+00 (4.4096e+00)    Acc@1  16.33 ( 15.61)   Acc@5  36.05 ( 35.06)
Epoch: [1][ 511/2179]   Time  2.053 ( 1.986)    Data  0.000 ( 1.098)    Loss 4.6688e+00 (4.4205e+00)    Acc@1  15.65 ( 15.37)   Acc@5  33.33 ( 34.86)
Epoch: [1][ 521/2179]   Time  4.211 ( 1.987)    Data  4.049 ( 1.387)    Loss 4.5349e+00 (4.4069e+00)    Acc@1  17.69 ( 15.63)   Acc@5  34.69 ( 35.13)
Epoch: [1][ 521/2179]   Time  4.211 ( 1.990)    Data  0.000 ( 1.083)    Loss 4.2325e+00 (4.4171e+00)    Acc@1  12.93 ( 15.41)   Acc@5  34.01 ( 34.91)
Epoch: [1][ 531/2179]   Time  3.350 ( 1.992)    Data  0.000 ( 1.064)    Loss 4.0711e+00 (4.4129e+00)    Acc@1  21.09 ( 15.49)   Acc@5  42.18 ( 35.00)
Epoch: [1][ 531/2179]   Time  3.346 ( 1.989)    Data  3.188 ( 1.395)    Loss 4.4750e+00 (4.4056e+00)    Acc@1  13.61 ( 15.64)   Acc@5  34.69 ( 35.14)
Epoch: [1][ 541/2179]   Time  2.477 ( 1.986)    Data  2.320 ( 1.398)    Loss 4.3038e+00 (4.4029e+00)    Acc@1  18.37 ( 15.68)   Acc@5  37.41 ( 35.19)
Epoch: [1][ 541/2179]   Time  2.487 ( 1.989)    Data  0.000 ( 1.045)    Loss 4.0815e+00 (4.4078e+00)    Acc@1  19.05 ( 15.57)   Acc@5  42.18 ( 35.10)
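
One possible reading of the log, offered as an assumption rather than a confirmed diagnosis: each pair of lines has the same step index but different loss and accuracy values, which is what you would expect if several worker processes (one per GPU or node) each print their own progress for their own shard of the data. If that is the case, the steps are not actually executed twice. The sketch below is plain Python, not code from the linked repository. The helpers `should_log` and `format_progress` are hypothetical names, and `rank` stands for whatever process index your launcher assigns. It shows how tagging each line with the rank, or printing only from rank 0, would make that visible.

```python
def format_progress(rank: int, epoch: int, step: int, total: int, loss: float) -> str:
    """Build a progress line tagged with the process rank (hypothetical format)."""
    return f"[rank {rank}] Epoch: [{epoch}][{step:>4}/{total}] Loss {loss:.4e}"


def should_log(rank: int, batch_idx: int, print_freq: int = 10,
               main_rank_only: bool = False) -> bool:
    """Return True when this process should print at this batch index."""
    if main_rank_only and rank != 0:
        # Only the main process prints; all other ranks stay silent.
        return False
    return batch_idx % print_freq == 0


if __name__ == "__main__":
    # Simulate two processes reaching the same batch index with different batches.
    # The loss values are copied from the two "501" lines in the log above.
    batch_idx = 500
    for rank, loss in [(0, 4.4685), (1, 4.2383)]:
        if should_log(rank, batch_idx):
            print(format_progress(rank, epoch=1, step=batch_idx + 1, total=2179, loss=loss))
```

With `main_rank_only=True`, only one line per step would be printed. Whether this matches your setup depends on how the processes were launched, so treat it as something to check rather than as the answer.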
pytorch neural-network torch training-data imagenet