Description
PyTorch version: 0.3.0
I train my network with Adam or SGD using the default parameters, and I found some interesting results when restarting training. The loss function is torch.nn.MSELoss.
With SGD:
First I train for 110 batches and save the model with torch.save(model.state_dict(), self.fn) at the 100th batch.
If I train continuously through all 110 batches, the loss I get is the following:
2018-03-08 09:37:13 - root - INFO - SNR: 3dB Epoch[0].100 : loss 1.2919E-03, err_num=31756
2018-03-08 09:39:16 - root - INFO - SNR: 3dB Epoch[0].101 : loss 1.2755E-03, err_num=31606
2018-03-08 09:39:17 - root - INFO - SNR: 3dB Epoch[0].102 : loss 1.1193E-03, err_num=26931
2018-03-08 09:39:18 - root - INFO - SNR: 3dB Epoch[0].103 : loss 1.1842E-03, err_num=28907
2018-03-08 09:39:19 - root - INFO - SNR: 3dB Epoch[0].104 : loss 1.3281E-03, err_num=32834
If I instead restart training and reload the model with model.load_state_dict(torch.load(self.fn)) at the 100th batch, the loss I get is the following:
2018-03-08 09:40:51 - root - INFO - SNR: 3dB Epoch[0].100 : loss 1.2919E-03, err_num=31756
2018-03-08 09:41:03 - root - INFO - SNR: 3dB Epoch[0].101 : loss 1.2755E-03, err_num=31606
2018-03-08 09:41:04 - root - INFO - SNR: 3dB Epoch[0].102 : loss 1.1193E-03, err_num=26931
2018-03-08 09:41:06 - root - INFO - SNR: 3dB Epoch[0].103 : loss 1.1842E-03, err_num=28914
2018-03-08 09:41:07 - root - INFO - SNR: 3dB Epoch[0].104 : loss 1.3282E-03, err_num=32833
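To make the setup concrete, here is a simplified sketch of my training/restart flow. The toy model, random data, and file name below are placeholders for my real network, data loader, and self.fn; it is not my actual code.

```python
# Simplified sketch of the save-at-batch-100 / restart flow described above.
import torch
import torch.nn as nn
from torch.autograd import Variable  # needed on 0.3.x

torch.manual_seed(0)

net = nn.Linear(64, 64)                                  # placeholder for my real model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)   # or torch.optim.Adam(net.parameters())

checkpoint_path = 'batch100.pth'                         # plays the role of self.fn

# stands in for my 110 training batches
batches = [(Variable(torch.randn(32, 64)), Variable(torch.randn(32, 64)))
           for _ in range(110)]

for idx, (x, y) in enumerate(batches):
    optimizer.zero_grad()
    loss = criterion(net(x), y)
    loss.backward()
    optimizer.step()

    if idx == 100:
        # save only the model weights at the 100th batch
        torch.save(net.state_dict(), checkpoint_path)

# restart run: reload the weights saved at batch 100 and continue with the remaining batches
net.load_state_dict(torch.load(checkpoint_path))
for x, y in batches[101:]:
    optimizer.zero_grad()
    loss = criterion(net(x), y)
    loss.backward()
    optimizer.step()
```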
With Adam:
Continuous training:
2018-03-06 14:50:10 - root - INFO - Epoch[0].100 : loss 0.001053, err_num=26686.0
2018-03-06 14:50:11 - root - INFO - Epoch[0].101 : loss 0.001018, err_num=25701.0
2018-03-06 14:50:12 - root - INFO - Epoch[0].102 : loss 0.0008684, err_num=21734.0
2018-03-06 14:50:13 - root - INFO - Epoch[0].103 : loss 0.00094, err_num=23426.0
2018-03-06 14:50:14 - root - INFO - Epoch[0].104 : loss 0.001076, err_num=26678.0
Restarted training:
2018-03-08 11:10:09 - root - INFO - Epoch[0].100 : loss 1.0526E-03, err_num=26686
2018-03-08 11:10:11 - root - INFO - Epoch[0].101 : loss 1.0135E-03, err_num=25711
2018-03-08 11:10:13 - root - INFO - Epoch[0].102 : loss 8.6998E-04, err_num=21581
2018-03-08 11:10:14 - root - INFO - Epoch[0].103 : loss 9.3137E-04, err_num=23466
2018-03-08 11:10:15 - root - INFO - Epoch[0].104 : loss 1.0608E-03, err_num=26306
2018-03-08 11:10:16 - root - INFO - Epoch[0].105 : loss 8.5917E-04, err_num=21298
As can be seen, the loss from the restarted training is not the same as the loss from continuous training, even though I double-checked that the dataset is identical in both cases. Why does this happen?
With the Adam algorithm, loss and err_num already differ starting from Epoch[0].101.
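For what it's worth, I only save the model's state_dict, not the optimizer's state. A checkpoint that also stored the optimizer state would look roughly like the sketch below (made-up file name, net/optimizer as in the sketch above); I have not verified whether doing this changes the behaviour.

```python
# Sketch only: a checkpoint that also stores the optimizer state.
checkpoint = {
    'model': net.state_dict(),
    'optimizer': optimizer.state_dict(),  # Adam's moment buffers / SGD momentum live here
}
torch.save(checkpoint, 'batch100_full.pth')

# on restart
checkpoint = torch.load('batch100_full.pth')
net.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```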