🚀 Feature
A function
train_model(model, optimizer, scheduler, dataset, batch_size, shuffle=False,
            snapshot_prefix=None, num_iters=None, num_epochs=None, iter_size=1,
            display_iter=20, snapshot_interval=None, load_snapshot_path=None,
            restore_dataloader=True, display_gpu=False, schedule_on_iter=False)
that lets the user train a model on a dataset without writing the training loop by hand. It supports spreading the effective batch size over several iterations (iter_size), scheduling per epoch or per iteration (schedule_on_iter), training for a set number of iterations or epochs, and saving and reloading snapshots containing the model, scheduler, and optimizer states, plus a few other parameters needed to properly restart from the snapshot (except for the dataloader, since it does not have a state dict).
It requires the model to contain its own loss function.
It does only training, not validation.
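The iter_size behavior described above is essentially gradient accumulation: gradients from several mini-batches are averaged before a single optimizer step. A minimal sketch of that logic, with plain numbers standing in for gradients and the function name `train_steps` being purely illustrative:

```python
def train_steps(grads_stream, iter_size):
    """Sketch of the iter_size loop: accumulate gradients over iter_size
    mini-batches, then apply one "optimizer step" on their average."""
    steps = []
    acc = 0.0
    for i, g in enumerate(grads_stream, start=1):
        acc += g / iter_size          # scale so the accumulated sum is an average
        if i % iter_size == 0:        # optimizer.step() would go here
            steps.append(acc)
            acc = 0.0                 # equivalent of optimizer.zero_grad()
    return steps

# Four mini-batches with iter_size=2 yield two effective update steps.
steps = train_steps([1.0, 3.0, 2.0, 2.0], iter_size=2)
print(steps)  # [2.0, 2.0]
```

With iter_size=1 this degenerates to the ordinary loop of one step per mini-batch.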
Something similar for a learning rate finder.
Both are also available for multi-node, multi-GPU training with automated support for synchronized batch norm (with simple distributed configuration). It uses DistributedDataParallel with one process per GPU.
And a utility to easily configure learning rates for each parameter: make_param_groups.
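A utility like make_param_groups could plausibly work as follows; this is a sketch of assumed behavior, not the actual implementation, and the tiny mock model exists only to make the example self-contained:

```python
def make_param_groups(model, custom_lr):
    """Build optimizer param groups: parameters named in custom_lr get their
    own group with a custom learning rate; all others fall into a default
    group that inherits the optimizer's base lr (assumed behavior)."""
    default, groups = [], []
    for name, param in model.named_parameters():
        if name in custom_lr:
            groups.append({'params': [param], 'lr': custom_lr[name]})
        else:
            default.append(param)
    groups.append({'params': default})  # no 'lr' key -> optimizer base lr
    return groups

class _TinyModel:
    """Mock standing in for an nn.Module, for illustration only."""
    def named_parameters(self):
        yield 'conv12.weight', object()
        yield 'fc.weight', object()

groups = make_param_groups(_TinyModel(), {'conv12.weight': 1e-3})
# groups[0] carries the custom lr; the last group holds the remaining params.
```

The resulting list can be passed directly to an optimizer constructor, e.g. `optim.SGD(groups, lr=1e-2)`.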
Motivation
Some of these features were easily available in Caffe, and we needed all of them for our work at Pixelz (the company I work for). We have it all implemented already (without unit tests, though). We thought it could be useful to the community :-)
We didn't do the validation because we use the snapshots to do it outside of the training instances.
Pitch
Is anyone interested in this?
Is the scope OK for a single pull request, or should I make several issues and pull requests?
Can someone help write the unit tests? (We don't need them ourselves and don't have much time available for open-sourcing this code.)
Does anyone want to add the validation to it?
In which module/submodule should it go?
Additional context
Example usage:
Configure optimizer and learning rates
custom_lr = {
    'conv12.weight': 1e-3,
    'conv12.bias': 1e-3,
}
optimizer = optim.SGD(make_param_groups(model, custom_lr), lr=1e-2)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
Configure multi-GPU settings on a single node and train
os.environ["RANK"] = "0"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "35003"
os.environ["WORLD_SIZE"] = "1"
model = train_model_multigpu(model, optimizer, scheduler, train_dataset,
                             batch_size=6, shuffle=False, num_epochs=5, iter_size=2,
                             schedule_on_iter=False, snapshot_interval=0.5,
                             snapshot_prefix='/data/pixelz_train_6/snapshot')
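The snapshots written at snapshot_interval could plausibly bundle the state dicts plus the bookkeeping needed to resume, later written to disk with torch.save. A sketch with assumed field names; the _Stateful mock stands in for the real model/optimizer/scheduler so the example is self-contained:

```python
def make_snapshot(model, optimizer, scheduler, epoch, iteration):
    """Bundle everything needed to resume training (the dataloader has no
    state dict, so its position is not saved). Field names are assumed."""
    return {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'epoch': epoch,
        'iteration': iteration,
    }

class _Stateful:
    """Mock for any object exposing state_dict(), for illustration only."""
    def state_dict(self):
        return {}

snap = make_snapshot(_Stateful(), _Stateful(), _Stateful(), epoch=3, iteration=120)
```

Restoring would then call load_state_dict on each component and resume the counters from 'epoch' and 'iteration'.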
So far we have used it for single- or multi-GPU training on a single node, with several computer vision architectures (VGG, ResNet and variants, etc.) without problems.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @vincentqb