
Training and learning rate finder utilities #25986

@SebastienEske

Description


🚀 Feature

A function

    train_model(model, optimizer, scheduler, dataset, batch_size, shuffle=False,
                snapshot_prefix=None, num_iters=None, num_epochs=None,
                iter_size=1, display_iter=20, snapshot_interval=None,
                load_snapshot_path=None, restore_dataloader=True,
                display_gpu=False, schedule_on_iter=False)

that lets the user train a model on a dataset without having to code the training loop by hand. It supports:

- spreading the effective batch size over several iterations (iter_size);
- scheduling per epoch or per iteration (schedule_on_iter);
- training for a set number of iterations or epochs;
- saving and reloading snapshots that contain the model, scheduler, and optimizer states, plus a few other parameters needed to restart properly from the snapshot (except for the dataloader, since it does not have a state dict).

The model is required to contain its own loss function. The function only does training, not validation; a minimal sketch of the intended loop follows.
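To make the intent concrete, here is a minimal sketch of what such a loop could look like (the body is an illustrative assumption, not our actual implementation; snapshotting, display, and num_iters handling are omitted):

    from torch.utils.data import DataLoader

    def train_model(model, optimizer, scheduler, dataset, batch_size,
                    shuffle=False, num_epochs=1, iter_size=1,
                    schedule_on_iter=False):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
        model.train()
        for epoch in range(num_epochs):
            optimizer.zero_grad()
            for i, batch in enumerate(loader):
                # The model computes its own loss (a requirement stated above).
                loss = model(*batch)
                # Spread the effective batch over iter_size mini-batches.
                (loss / iter_size).backward()
                if (i + 1) % iter_size == 0:
                    optimizer.step()
                    optimizer.zero_grad()
                    if schedule_on_iter:
                        scheduler.step()
            if not schedule_on_iter:
                scheduler.step()
        return model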

Something similar is proposed for a learning rate finder (sketched below).
Both are also available for multi-node, multi-GPU training with automated support for sync batch norm (given a simple distributed configuration), using DistributedDataParallel with one process per GPU.
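For reference, the usual LR range test underlying such a finder sweeps the learning rate exponentially over a number of steps and records the loss at each one. The name find_lr and the defaults below are assumptions, not the proposed API:

    from torch.utils.data import DataLoader

    def find_lr(model, optimizer, dataset, batch_size,
                min_lr=1e-7, max_lr=1.0, num_iters=100):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        gamma = (max_lr / min_lr) ** (1.0 / num_iters)  # per-step LR multiplier
        lr, lrs, losses = min_lr, [], []
        model.train()
        data = iter(loader)
        for _ in range(num_iters):
            try:
                batch = next(data)
            except StopIteration:
                data = iter(loader)  # restart the loader if the dataset is small
                batch = next(data)
            for group in optimizer.param_groups:
                group['lr'] = lr
            optimizer.zero_grad()
            loss = model(*batch)  # model computes its own loss
            loss.backward()
            optimizer.step()
            lrs.append(lr)
            losses.append(loss.item())
            lr *= gamma
        # Plot losses against lrs (log scale) and pick a rate where the loss
        # still decreases steeply.
        return lrs, losses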

And a utility to easily configure per-parameter learning rates: make_param_groups.
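A possible shape for make_param_groups, inferred from the usage example further below (an assumption, not the exact implementation): parameters named in custom_lr get their own group with that learning rate; everything else falls into a default group that uses the optimizer's base lr.

    def make_param_groups(model, custom_lr):
        groups, default = [], []
        for name, param in model.named_parameters():
            if name in custom_lr:
                # Dedicated group with a per-parameter learning rate.
                groups.append({'params': [param], 'lr': custom_lr[name]})
            else:
                default.append(param)
        # Parameters without a custom rate use the optimizer's base lr.
        groups.append({'params': default})
        return groups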

Motivation

Some of these features were easily available in Caffe, and we needed all of them for our work at Pixelz (the company I work for). We have already implemented all of it (without unit tests, though) and thought it could be useful to the community :-)

We didn't implement validation because we use the snapshots to run it outside of the training instances.

Pitch

Is anyone interested in this?
Is the scope OK for a single pull request, or should I make several issues and pull requests?
Can someone help write the unit tests? (We don't need them ourselves and don't have much time available for open-sourcing this code.)
Does anyone want to add the validation to it?
In which module/submodule should it go?

Additional context

Example usage:
Configure the optimizer and per-parameter learning rates:

    from torch import optim
    from torch.optim import lr_scheduler

    # Per-parameter learning rates; everything else uses the base lr of 1e-2.
    custom_lr = {
        'conv12.weight': 1e-3,
        'conv12.bias': 1e-3,
    }
    optimizer = optim.SGD(make_param_groups(model, custom_lr), lr=1e-2)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

Configure multi-GPU settings on a single node and train:

    os.environ["RANK"] = "0"
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "35003"
    os.environ["WORLD_SIZE"] = "1"
    
    model = train_model_multigpu(model, optimizer, scheduler, train_dataset, \
batch_size=6, shuffle=False, num_epochs=5, iter_size=2, schedule_on_iter=False, \
snaphost_interval=0.5, snaphot_prefix='/data/pixelz_train_6/snapshot')
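Internally, train_model_multigpu would be expected to do a per-process setup along these lines (a hedged sketch using standard PyTorch distributed APIs; process spawning and cleanup are omitted):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    def setup_process(model, local_rank):
        # One process per GPU; RANK, WORLD_SIZE and MASTER_* are read from the
        # environment variables configured above (env:// init method).
        dist.init_process_group(backend='nccl')
        torch.cuda.set_device(local_rank)
        # Automated sync batch norm: replace BatchNorm layers with SyncBatchNorm.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DistributedDataParallel(model.cuda(local_rank),
                                        device_ids=[local_rank])
        return model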

So far we have used it for single- or multi-GPU training on a single node, with several computer vision architectures (VGG, ResNet and variants, etc.) without problems.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @vincentqb
