
Conversation

@yf225 (Contributor) commented May 28, 2019

#17072 breaks model.to(xla_device): moving a model to an XLA device involves changing its parameters' TensorImpl type, and the current implementation of nn.Module.to() doesn't support changing a module parameter's TensorImpl type:

# https://github.com/pytorch/pytorch/blob/6dc445e1a84dc5d093d640de54f038f021d13227/torch/nn/modules/module.py#L192-L208
def _apply(self, fn):
    ...
    for param in self._parameters.values():
        if param is not None:
            # Tensors stored in modules are graph leaves, and we don't
            # want to create copy nodes, so we have to unpack the data.
            param.data = fn(param.data)  # NOTE: this doesn't allow changing `param.data`'s TensorImpl type
            if param._grad is not None:
                param._grad.data = fn(param._grad.data)  # NOTE: this doesn't allow changing `param._grad.data`'s TensorImpl type
    ...
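
For illustration, a minimal sketch of the failure mode (this assumes the torch_xla package and its xm.xla_device() helper are available; the exact error raised may differ):

import torch
import torch_xla.core.xla_model as xm  # assumed import path for torch_xla

model = torch.nn.Linear(2, 2)
param = next(model.parameters())

# After #17072, assigning into `param.data` is not allowed to change the
# parameter's TensorImpl type, so the CPU-to-XLA move inside _apply() fails.
xla_device = xm.xla_device()
param.data = param.data.to(xla_device)  # expected to raise, because the XLA
                                        # tensor uses a different TensorImpl type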

A hypothetical way to fix this is to do the following:

def _apply(self, fn):
    ...
    for key, param in self._parameters.items():
        if param is not None:
            # Tensors stored in modules are graph leaves, and we don't
            # want to create copy nodes, so we have to unpack the data.
            param_applied = fn(param)
            if param._is_same_impl_type(param_applied):
                param.data = param_applied
            else:  # If we have to change TensorImpl type...
                with torch.no_grad():
                    # We use `requires_grad_()` here, to make sure the new `param` still
                    # has the same `requires_grad` value as the old `param`. An alternative is
                    # to not use `with torch.no_grad():`, but that would cause the following operation
                    # to create a `CopyBackwards` gradient function which is not what we wanted.
                    self._parameters[key] = param_applied.requires_grad_(param.requires_grad)
            if param._grad is not None:
                grad_applied = fn(param._grad)
                if param._grad._is_same_impl_type(grad_applied):
                    param._grad.data = grad_applied
                else:  # If we have to change TensorImpl type...
                    self._parameters[key]._grad = grad_applied
    ...

However, the biggest problem with this approach is that it makes the model.to(device) API less predictable. If we move a model from CPU to CUDA, all previous references to param are preserved, because we use param.data = fn(param.data); if we move a model from CPU to an XLA device, all previous references to param are broken, because we assign new tensors to param. To keep the model.to(device) API consistent, we will change the CPU-to-CUDA path to also break previous references to the model's parameters. This will happen in two stages: first a deprecation warning when we detect previous references to those parameters, then a hard error in future releases.
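
For illustration, a minimal sketch of the current reference-preserving behavior on the CPU-to-CUDA path (assuming a CUDA device is available):

import torch

model = torch.nn.Linear(2, 2)
w_before = model.weight            # hold a reference to a parameter

model.to('cuda')                   # in-place move via param.data = fn(param.data)

# The Parameter object is reused, so the old reference still sees the moved tensor.
print(w_before is model.weight)    # True
print(w_before.device)             # cuda:0

# Under the hypothetical _apply() above, moving to an XLA device would instead
# rebind self._parameters['weight'] to a new tensor, and `w_before` would keep
# pointing at the old CPU tensor.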

cc. @ailzhang

TODO:
Add explanations for why the following cases are no longer supported:

@pytorchbot added labels: module: autograd, module: internals, module: nn, module: pybind (May 28, 2019)
@yf225 force-pushed the fix_xla branch 7 times, most recently from f75249a to 6cc1468 (May 29, 2019 01:52)
Will Feng added 2 commits (May 29, 2019 09:20)
@yf225 changed the title from "[WIP] Fix XLA test" to "Preserve previous references to model.param when we call model.to(xla_device)" (May 30, 2019)
@yf225 changed the title from "Preserve previous references to model.param when we call model.to(xla_device)" to "Fix model.to(xla_device)" (May 30, 2019)
@ezyang self-requested a review (May 30, 2019 20:55)
@yf225 force-pushed the fix_xla branch 18 times, most recently from 81ffc4b to 9f1ac9c (June 7, 2019 19:28)
Contributor commented:

can't we just make this a native function? Then you wouldn't need to make your own parsing either.

Contributor commented:

This comment should refer to is_same_impl_type.

test/test_nn.py Outdated
Contributor commented:

shouldn't this be cuda()?

test/test_nn.py Outdated
Contributor commented:

please comment what this is attempting to test.

Contributor commented:

I don't think we should warn yet: there is no alternative they can use yet if they really want to hold on to a reference to another tensor. We need new APIs that we can direct people to first.

Contributor commented:

Also, is it possible to name the type of the module? It might not be obvious, because you can move an entire model and a param in some submodule changes.

Contributor commented:

can we please break up this PR? This is trying to do a bunch of different things:

  1. Fix moving to XLA
  2. Do proper version tracking of module parameters (sometimes)
  3. Warn about future breaking changes (without first introducing correct APIs).

Contributor commented:

why doesn't this follow the other pattern of using no_grad and setting requires_grad at the end?

with torch.no_grad():
    # We use `.requires_grad_()` here to make sure the new `param` still
    # has the same `requires_grad` value as the old `param`.
    self._parameters[key] = param_applied.requires_grad_(param.requires_grad)
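
For context, a minimal sketch of the difference this pattern makes (using a CPU-to-CUDA move as a stand-in and assuming a CUDA device is available): outside torch.no_grad() the copy is tracked by autograd and picks up a grad_fn such as CopyBackwards, so it is no longer a leaf; under no_grad it stays a leaf and requires_grad_() restores the flag.

import torch

p = torch.nn.Parameter(torch.randn(3))

copied = p.to('cuda')              # tracked by autograd: gets a grad_fn, not a leaf
print(copied.grad_fn is None)      # False
print(copied.is_leaf)              # False

with torch.no_grad():
    copied = p.to('cuda').requires_grad_(p.requires_grad)
print(copied.grad_fn is None)      # True
print(copied.is_leaf)              # True
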
Contributor commented:

I don't think this is correct, won't the parameter not even be an nn.Parameter anymore?
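
A minimal sketch of this concern (assuming a CUDA device as a stand-in for the device move): most operations on an nn.Parameter, including a device-changing .to(), return a plain torch.Tensor, so the rebound entry would need to be re-wrapped to remain a Parameter.

import torch

p = torch.nn.Parameter(torch.randn(3))
moved = p.to('cuda')

print(type(p).__name__)       # Parameter
print(type(moved).__name__)   # Tensor -- the Parameter subclass is not preserved

# The hypothetical _apply() above would need something like
# self._parameters[key] = torch.nn.Parameter(moved) to keep it a Parameter.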

@yf225 closed this Jun 18, 2019