🚀 Feature
PyTorch now recommends using DistributedDataParallel over DataParallel for all kinds of multi-GPU training (#35063). However, it has one limitation compared to the older DataParallel module: it currently cannot handle forward/backward hooks in a user-friendly way.
Proposed workaround
pytorch/torch/nn/parallel/distributed.py, lines 146 to 149 in 95ad94c:

```
.. warning::
    Forward and backward hooks defined on :attr:`module` and its submodules
    won't be invoked anymore, unless the hooks are initialized in the
    :meth:`forward` method.
```
This requires users to edit each model's forward method in order to use hooks with a model wrapped in DDP.
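For illustration, here is a minimal sketch of that workaround; the model, layer names, and captured tensor are hypothetical, and it assumes the hook is registered and removed inside forward so that it still fires after the model is wrapped in DDP:

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.head = nn.Linear(64, 10)
        self._features = None  # filled by the hook on every forward pass

    def forward(self, x):
        # Workaround from the docs: the hook has to be set up inside forward(),
        # otherwise it is not invoked once the model is wrapped in DDP.
        handle = self.backbone.register_forward_hook(
            lambda module, inp, out: setattr(self, "_features", out.detach())
        )
        try:
            return self.head(self.backbone(x))
        finally:
            handle.remove()
```

Every model that needs hooks has to be rewritten this way, which is the inconvenience described above.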
As I understand it, DDP wasn't originally designed with this limitation in mind; it was discovered while fixing another issue (#5061). So I am wondering: is there a way to implement some sort of hook synchronization mechanism across distributed model replicas?
Motivation
With the current workaround, the ability to use hooks dynamically is also lost for the DistributedDataParallel module. For example, in my current code with DataParallel I can attach and remove hooks on the fly: during the validation phase of training I attach hooks to extract additional bottleneck features and compute complementary evaluation metrics that are not calculated during the training phase.
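Roughly, the DataParallel pattern relied on here looks like the sketch below; the model, layer index, and validation data are placeholders, and it assumes CUDA devices are available:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model).cuda()

val_loader = [torch.randn(8, 128) for _ in range(3)]  # stand-in for a real DataLoader
bottleneck_outputs = []

def save_bottleneck(module, inp, out):
    # Collect intermediate ("bottleneck") activations for extra validation metrics.
    bottleneck_outputs.append(out.detach().cpu())

# Validation phase: attach the hook dynamically to a submodule of the wrapped model.
handle = model.module[0].register_forward_hook(save_bottleneck)
with torch.no_grad():
    for batch in val_loader:
        _ = model(batch.cuda())
# ... compute the complementary metrics from bottleneck_outputs here ...
handle.remove()  # detach the hook again before the next training phase
```

With DDP, hooks registered this way (outside of forward) are not invoked at all, per the warning quoted above.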
In general, the current hooking mechanism does not seem fully compatible with DDP.
Pitch
A hooking mechanism for the DistributedDataParallel module that, from the user's perspective, works the same way as in the DataParallel module.