
Forward/backward hooks support in DistributedDataParallel #35191

@h6197627

Description


🚀 Feature

PyTorch now recommends using DistributedDataParallel over DataParallel for all kinds of multi-GPU training (#35063). However, it has one limitation compared to the older DataParallel module: it currently cannot handle forward/backward hooks in a user-convenient way.
Proposed workaround

.. warning::
Forward and backward hooks defined on :attr:`module` and its submodules
won't be invoked anymore, unless the hooks are initialized in the
:meth:`forward` method.

This workaround requires users to edit each model's forward-pass code in order to use hooks with a model wrapped in DDP.
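
For concreteness, here is a minimal sketch of what that workaround looks like in practice; the model, the layer names, and the `features` dict are hypothetical placeholders, not code from the docs:

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.head = nn.Linear(64, 10)
        self.features = {}

    def forward(self, x):
        # Per the documentation warning quoted above, the hook has to be
        # registered inside forward() for it to be invoked under DDP.
        handle = self.backbone.register_forward_hook(
            lambda module, inputs, output: self.features.update(
                {"bottleneck": output.detach()}
            )
        )
        out = self.head(self.backbone(x))
        handle.remove()  # drop the hook so handles don't pile up every call
        return out
```

The hook logic ends up hard-wired into every model's forward() instead of being attached from the outside.
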
As I understand it, DDP was not originally designed with this limitation in mind; it was discovered while fixing another issue (#5061). So I am wondering whether there is some way to implement a hook synchronization mechanism across distributed model replicas.

Motivation

With the current workaround, the ability to use hooks dynamically is also lost for the DistributedDataParallel module. For example, in my current code with DataParallel I can place and remove hooks dynamically: during the validation phase of training I register hooks to extract additional bottleneck features and compute complementary evaluation metrics that are not calculated during the training phase.
In general, the current hooking mechanism does not look fully compatible with DDP.
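
As a rough illustration of that DataParallel workflow (not the actual project code; the model, data, and hook below are placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical model and validation data, just to show the hook lifecycle.
model = nn.DataParallel(
    nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
)
val_loader = [torch.randn(8, 128).cuda() for _ in range(2)]
features = {}

def save_bottleneck(module, inputs, output):
    # Stash bottleneck activations for the complementary validation metrics.
    features["bottleneck"] = output.detach()

# Validation phase: attach the hook to the bottleneck layer only for this phase...
handle = model.module[0].register_forward_hook(save_bottleneck)
with torch.no_grad():
    for batch in val_loader:
        logits = model(batch)
        # ...compute the extra metrics from features["bottleneck"] here.

# ...and remove it again before resuming training.
handle.remove()
```
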

Pitch

A hooking mechanism for the DistributedDataParallel module that works, from the user's perspective, the same way as it does in the DataParallel module.
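
A hypothetical sketch of the behavior this request asks for, assuming the process group is already initialized with one process per GPU on a single node; per the documentation warning quoted above, this is exactly what does not currently work:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) was already called, one process per GPU
# on a single node, so the rank can double as the CUDA device index.
rank = dist.get_rank()
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(rank)
ddp_model = DDP(model, device_ids=[rank])

# Desired: the hook keeps firing after the wrap, exactly as with DataParallel,
# and can be attached and removed at any point in the training loop.
handle = ddp_model.module[0].register_forward_hook(
    lambda module, inputs, output: print("bottleneck shape:", output.shape)
)
out = ddp_model(torch.randn(8, 128).to(rank))
handle.remove()
```
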


Labels

feature (A request for a proper, new feature), module: data parallel, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
