
Training and learning rate finder utilities #25986

@SebastienEske

Description


🚀 Feature

A function

    train_model(model, optimizer, scheduler, dataset, batch_size, shuffle=False,
                snapshot_prefix=None, num_iters=None, num_epochs=None,
                iter_size=1, display_iter=20, snapshot_interval=None,
                load_snapshot_path=None, restore_dataloader=True,
                display_gpu=False, schedule_on_iter=False)

that lets the user train a model on a dataset without having to code the training loop by hand. It supports:

- spreading the effective batch size over several iterations (iter_size);
- scheduling per epoch or per iteration (schedule_on_iter);
- training for a set number of iterations or epochs;
- saving and reloading snapshots that contain the model, scheduler, and optimizer states, plus a few other parameters needed to restart properly from the snapshot (except for the dataloader, since it does not have a state dict).

The model is required to contain its own loss function. The function only does training, not validation; a minimal sketch of the intended loop follows.
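To make the intent concrete, here is a minimal sketch of what such a loop could look like (the body is an illustrative assumption, not our actual implementation; snapshotting, display, and num_iters handling are omitted):

    from torch.utils.data import DataLoader

    def train_model(model, optimizer, scheduler, dataset, batch_size,
                    shuffle=False, num_epochs=1, iter_size=1,
                    schedule_on_iter=False):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
        model.train()
        for epoch in range(num_epochs):
            optimizer.zero_grad()
            for i, batch in enumerate(loader):
                # The model computes its own loss (a requirement stated above).
                loss = model(*batch)
                # Spread the effective batch over iter_size mini-batches.
                (loss / iter_size).backward()
                if (i + 1) % iter_size == 0:
                    optimizer.step()
                    optimizer.zero_grad()
                    if schedule_on_iter:
                        scheduler.step()
            if not schedule_on_iter:
                scheduler.step()
        return model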

Something similar is proposed for a learning rate finder (sketched below).
Both are also available for multi-node, multi-GPU training with automated support for sync batch norm (given a simple distributed configuration), using DistributedDataParallel with one process per GPU.
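For reference, the usual LR range test underlying such a finder sweeps the learning rate exponentially over a number of steps and records the loss at each one. The name find_lr and the defaults below are assumptions, not the proposed API:

    from torch.utils.data import DataLoader

    def find_lr(model, optimizer, dataset, batch_size,
                min_lr=1e-7, max_lr=1.0, num_iters=100):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        gamma = (max_lr / min_lr) ** (1.0 / num_iters)  # per-step LR multiplier
        lr, lrs, losses = min_lr, [], []
        model.train()
        data = iter(loader)
        for _ in range(num_iters):
            try:
                batch = next(data)
            except StopIteration:
                data = iter(loader)  # restart the loader if the dataset is small
                batch = next(data)
            for group in optimizer.param_groups:
                group['lr'] = lr
            optimizer.zero_grad()
            loss = model(*batch)  # model computes its own loss
            loss.backward()
            optimizer.step()
            lrs.append(lr)
            losses.append(loss.item())
            lr *= gamma
        # Plot losses against lrs (log scale) and pick a rate where the loss
        # still decreases steeply.
        return lrs, losses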

And a utility to easily configure per-parameter learning rates: make_param_groups.
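A possible shape for make_param_groups, inferred from the usage example further below (an assumption, not the exact implementation): parameters named in custom_lr get their own group with that learning rate; everything else falls into a default group that uses the optimizer's base lr.

    def make_param_groups(model, custom_lr):
        groups, default = [], []
        for name, param in model.named_parameters():
            if name in custom_lr:
                # Dedicated group with a per-parameter learning rate.
                groups.append({'params': [param], 'lr': custom_lr[name]})
            else:
                default.append(param)
        # Parameters without a custom rate use the optimizer's base lr.
        groups.append({'params': default})
        return groups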

Motivation

Some of these features were easily available in Caffe, and we needed all of them for our work at Pixelz (the company I work for). We have already implemented all of it (without unit tests, though) and thought it could be useful to the community :-)

We didn't implement validation because we use the snapshots to run it outside of the training instances.

Pitch

Is anyone interested in this?
Is the scope OK for a single pull request, or should I make several issues and pull requests?
Can someone help write the unit tests? (We don't need them ourselves and don't have much time available for open-sourcing this code.)
Does anyone want to add the validation to it?
In which module/submodule should it go?

Additional context

Example usage:
Configure the optimizer and per-parameter learning rates:

    from torch import optim
    from torch.optim import lr_scheduler

    # Per-parameter learning rates; everything else uses the base lr of 1e-2.
    custom_lr = {
        'conv12.weight': 1e-3,
        'conv12.bias': 1e-3,
    }
    optimizer = optim.SGD(make_param_groups(model, custom_lr), lr=1e-2)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

Configure multi-GPU settings on a single node and train:

    os.environ["RANK"] = "0"
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "35003"
    os.environ["WORLD_SIZE"] = "1"
    
    model = train_model_multigpu(model, optimizer, scheduler, train_dataset, \
batch_size=6, shuffle=False, num_epochs=5, iter_size=2, schedule_on_iter=False, \
snaphost_interval=0.5, snaphot_prefix='/data/pixelz_train_6/snapshot')
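Internally, train_model_multigpu would be expected to do a per-process setup along these lines (a hedged sketch using standard PyTorch distributed APIs; process spawning and cleanup are omitted):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    def setup_process(model, local_rank):
        # One process per GPU; RANK, WORLD_SIZE and MASTER_* are read from the
        # environment variables configured above (env:// init method).
        dist.init_process_group(backend='nccl')
        torch.cuda.set_device(local_rank)
        # Automated sync batch norm: replace BatchNorm layers with SyncBatchNorm.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DistributedDataParallel(model.cuda(local_rank),
                                        device_ids=[local_rank])
        return model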

So far we have used it for single- or multi-GPU training on a single node, with several computer vision architectures (VGG, ResNet and variants, etc.) without problems.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @vincentqb
