Cannot restart training and obtain the same results #5627

PyTorch version : 0.3.0

I train my network with Adam or SGD using the default parameters, and I found some interesting results when restarting training. The loss function is MSE (torch.nn.MSELoss).

SGD:
First I train for 110 batches and save the model with torch.save(model.state_dict(), self.fn) at the 100th batch.

If I train continuously through all 110 batches, I get the following losses:

2018-03-08 09:37:13 - root - INFO - SNR: 3dB Epoch[0].100  :     loss 1.2919E-03,   err_num=31756  
2018-03-08 09:39:16 - root - INFO - SNR: 3dB Epoch[0].101  :     loss 1.2755E-03,   err_num=31606   
2018-03-08 09:39:17 - root - INFO - SNR: 3dB Epoch[0].102  :     loss 1.1193E-03,   err_num=26931 
2018-03-08 09:39:18 - root - INFO - SNR: 3dB Epoch[0].103  :     loss 1.1842E-03,   err_num=28907  
2018-03-08 09:39:19 - root - INFO - SNR: 3dB Epoch[0].104  :     loss 1.3281E-03,   err_num=32834 

If I restart training and reload the model with model.load_state_dict(torch.load(self.fn)) at the 100th batch, I get the following losses:

2018-03-08 09:40:51 - root - INFO - SNR: 3dB Epoch[0].100  :     loss 1.2919E-03,   err_num=31756 
2018-03-08 09:41:03 - root - INFO - SNR: 3dB Epoch[0].101  :     loss 1.2755E-03,   err_num=31606 
2018-03-08 09:41:04 - root - INFO - SNR: 3dB Epoch[0].102  :     loss 1.1193E-03,   err_num=26931 
2018-03-08 09:41:06 - root - INFO - SNR: 3dB Epoch[0].103  :     loss 1.1842E-03,   err_num=28914  
2018-03-08 09:41:07 - root - INFO - SNR: 3dB Epoch[0].104  :     loss 1.3282E-03,   err_num=32833 

Adam:
Continuous training:
2018-03-06 14:50:10 - root - INFO -  Epoch[0].100  :     loss 0.001053,   err_num=26686.0 
2018-03-06 14:50:11 - root - INFO -  Epoch[0].101  :     loss 0.001018,   err_num=25701.0 
2018-03-06 14:50:12 - root - INFO -  Epoch[0].102  :     loss 0.0008684,   err_num=21734.0  
2018-03-06 14:50:13 - root - INFO -  Epoch[0].103  :     loss 0.00094,   err_num=23426.0  
2018-03-06 14:50:14 - root - INFO -  Epoch[0].104  :     loss 0.001076,   err_num=26678.0  

Restarted training:
2018-03-08 11:10:09 - root - INFO -  Epoch[0].100  :     loss 1.0526E-03,   err_num=26686  
2018-03-08 11:10:11 - root - INFO -  Epoch[0].101  :     loss 1.0135E-03,   err_num=25711  
2018-03-08 11:10:13 - root - INFO -  Epoch[0].102  :     loss 8.6998E-04,   err_num=21581  
2018-03-08 11:10:14 - root - INFO -  Epoch[0].103  :     loss 9.3137E-04,   err_num=23466  
2018-03-08 11:10:15 - root - INFO -  Epoch[0].104  :     loss 1.0608E-03,   err_num=26306  
2018-03-08 11:10:16 - root - INFO -  Epoch[0].105  :     loss 8.5917E-04,   err_num=21298 

As the logs show, the loss after restarting is not the same as the loss from continuous training. With SGD, the two runs first differ at Epoch[0].103 (err_num 28907 vs. 28914). I double-checked the dataset in both cases, and it is identical. Why does this happen?
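One possible contributor, not confirmed by the logs above, is that model.state_dict() holds only the network weights. Optimizer state (for example Adam's running moment estimates) and the random number generator state are not part of it. The sketch below saves and restores all three together. It assumes a recent PyTorch release (in 0.3.0, loss.item() would be loss.data[0] and tensors would need Variable wrappers), and the names build, train, save_checkpoint, and load_checkpoint, the tiny nn.Linear model, and the random batches are hypothetical stand-ins for the real training code. SGD with default parameters keeps no per-parameter state, so this would not by itself explain the small SGD differences.

```python
import io

import torch
import torch.nn as nn


def build():
    """Hypothetical stand-in for the real network and its optimizer."""
    model = nn.Linear(4, 1)
    optimizer = torch.optim.Adam(model.parameters())  # default parameters
    return model, optimizer


def train(model, optimizer, batches):
    """Run one optimizer step per batch and return the per-batch MSE losses."""
    loss_fn = nn.MSELoss()
    losses = []
    for x, y in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


def save_checkpoint(model, optimizer, target):
    """Save model weights, optimizer state, and the CPU RNG state together."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "rng": torch.get_rng_state(),
        },
        target,
    )


def load_checkpoint(model, optimizer, source):
    """Restore everything written by save_checkpoint."""
    checkpoint = torch.load(source)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    torch.set_rng_state(checkpoint["rng"])


# Fixed toy data so both runs see exactly the same batches.
torch.manual_seed(0)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(6)]

# Continuous run over all six batches.
torch.manual_seed(1)
model, optimizer = build()
continuous_losses = train(model, optimizer, batches)

# Interrupted run: four batches, checkpoint, then resume in fresh objects.
torch.manual_seed(1)
model, optimizer = build()
train(model, optimizer, batches[:4])
buffer = io.BytesIO()  # in-memory stand-in for the file path self.fn
save_checkpoint(model, optimizer, buffer)
buffer.seek(0)

resumed_model, resumed_optimizer = build()
load_checkpoint(resumed_model, resumed_optimizer, buffer)
resumed_losses = train(resumed_model, resumed_optimizer, batches[4:])

print("continuous:", continuous_losses[4:])
print("resumed   :", resumed_losses)
```

On CPU with this toy setup the two printed lists are expected to agree. On a GPU, some kernels are not deterministic, so small differences can remain even with a complete checkpoint.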

With the Adam algorithm, the loss and err_num differ starting from Epoch[0].101.
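A divergence that begins on the very first batch after the restart is what one would expect if the weights are restored but Adam's internal state is not. The following is a minimal pure-Python sketch of that effect, not the actual training code: it implements the standard Adam update for a single scalar weight on a toy loss, and the names adam_step, run, and inputs are hypothetical. It compares continuous training against a restart with a fresh optimizer state and a restart with the state restored.

```python
import math


def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to scalar w; state carries step count and moments."""
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["step"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["step"])  # bias-corrected second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def run(w, state, inputs):
    """Train on the toy loss (w * x - 1) ** 2 and return w after every step."""
    trace = []
    for x in inputs:
        g = 2.0 * x * (w * x - 1.0)  # d/dw of (w * x - 1) ** 2
        w = adam_step(w, g, state)
        trace.append(w)
    return trace


inputs = [0.5 + 0.1 * i for i in range(10)]  # hypothetical fixed "batches"

# Continuous training over all 10 steps.
continuous = run(0.0, {"step": 0, "m": 0.0, "v": 0.0}, inputs)

# Train 6 steps, then "checkpoint" the weight and the optimizer state.
state = {"step": 0, "m": 0.0, "v": 0.0}
w_saved = run(0.0, state, inputs[:6])[-1]
state_saved = dict(state)

# Restart A: weight restored, optimizer state reset (like saving only the model).
restart_fresh = run(w_saved, {"step": 0, "m": 0.0, "v": 0.0}, inputs[6:])
# Restart B: weight and optimizer state both restored.
restart_full = run(w_saved, dict(state_saved), inputs[6:])

print("continuous   :", continuous[6:])
print("fresh state  :", restart_fresh)
print("full restore :", restart_full)
```

In this sketch, restart A departs from the continuous run on its first step, while restart B reproduces it. That matches the pattern in the Adam logs, although the sketch alone does not prove it is the cause there.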


