
Conversation

@unlimblue (Contributor) commented Jun 26, 2019

Fix #22192

  • change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
  • change the CUDA kernel and its header:
```cuda
// old signature:
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, int64_t count) {
// new signature (only the trailing parameter changes):
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, const Tensor& counts) {
```
  • change the Python interface:
```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        ...
```

@pytorchbot added the module: cuda, module: nn, and module: operators labels Jun 26, 2019
@mrshenli self-requested a review Jun 26, 2019 03:44
@mrshenli (Contributor):

@pytorchbot retest this please

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@unlimblue (Author):

@mrshenli could you retrigger pytorch_xla_linux_trusty_py3_6_gcc5_4_build again?

@unlimblue changed the title from "Fix https://github.com/pytorch/pytorch/issues/22192" to "Fix SyncBatchNorm running var update issue" Jun 26, 2019
@mrshenli (Contributor) left a comment:

Thanks for contributing! Can you add a test in test/test_distributed.py to cover different input sizes?

Diff context: CUDA: batch_norm_elemt_cuda

old: - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
new: - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
Contributor:

Will this cause any backward compatibility issue? It's true that this method is only used in torch/nn/modules/_functions.py in PyTorch, but we do expose it as an API. I am a little worried it is used in other projects. @gchanan is it OK to break BC here? Or should we keep both APIs, where a single count arg is automatically expanded to a list?
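
For illustration, a minimal sketch of what such a BC shim could look like (hypothetical code, not this PR's actual diff; it assumes the new Tensor-based overload of batch_norm_gather_stats_cuda is declared):

```cpp
// Hypothetical BC shim (illustration only, not this PR's actual diff).
// The legacy scalar-count overload forwards to the assumed Tensor-based
// overload by materializing a counts tensor.
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(
    const Tensor& self, const Tensor& mean, const Tensor& invstd,
    const Tensor& running_mean, const Tensor& running_var,
    double momentum, double epsilon, int64_t count) {
  // With a scalar count, every process contributed the same batch size,
  // so replicate it once per gathered stats row (one row per process).
  Tensor counts = at::full({mean.size(0)}, count, self.options());
  return batch_norm_gather_stats_cuda(self, mean, invstd, running_mean,
                                      running_var, momentum, epsilon, counts);
}
```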

@unlimblue (Author):

Copy that. Is it possible to change `const Tensor& counts` to `const ArrayRef<index_t> counts`? I don't know the best way to convert an `ArrayRef<index_t>` to a PyTorch Tensor in C++.

@unlimblue (Author):

Thanks, @hux999 is trying to fix the Tensor counts issue.

Contributor:

> I don't know the best way to convert an `ArrayRef<index_t>` to a PyTorch Tensor in C++.

Can you try using the from_blob API (learned from @yf225)?
See the forum post.
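
For reference, a minimal sketch of the from_blob approach (standard libtorch API usage, not code from this PR; the counts values are made up):

```cpp
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical per-process batch sizes.
  std::vector<int64_t> counts = {32, 32, 16};
  // from_blob wraps the existing buffer without copying; clone() then
  // gives the tensor its own storage so it can outlive the vector.
  torch::Tensor t =
      torch::from_blob(counts.data(),
                       {static_cast<int64_t>(counts.size())},
                       torch::kInt64)
          .clone();
  std::cout << t << std::endl;  // prints the three counts
}
```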

@ailzhang added the triaged label Jun 27, 2019
@pytorchbot added the oncall: distributed label Jun 27, 2019
@mrshenli (Contributor):

@pytorchbot retest this please

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@unlimblue (Author):

We have found the reason and are trying to fix it.

@pytorchbot added the caffe2 and module: infra labels Jul 1, 2019
@pytorchbot added the module: onnx, module: pybind, and module: third_party labels Jul 1, 2019
@unlimblue (Author):

@mrshenli is there a CI cache issue?

Jul 01 06:22:58 ======================================================================
Jul 01 06:22:58 ERROR: test_clip_grad_norm (__main__.TestNN)
Jul 01 06:22:58 ----------------------------------------------------------------------
Jul 01 06:22:58 Traceback (most recent call last):
Jul 01 06:22:58   File "test_nn.py", line 1829, in test_clip_grad_norm
Jul 01 06:22:58     norm = clip_grad_norm_(l.parameters(), max_norm, norm_type=norm_type)
Jul 01 06:22:58   File "/opt/conda/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in clip_grad_norm_
Jul 01 06:22:58     clip_coef = torch.tensor(max_norm, device=device) / (total_norm + 1e-6)
Jul 01 06:22:58 UnboundLocalError: local variable 'device' referenced before assignment
Jul 01 06:22:58 
Jul 01 06:22:58 ----------------------------------------------------------------------

@mrshenli (Contributor) commented Jul 1, 2019

@unlimblue that's not your fault. Other PRs (e.g., #22392) also hit this error.

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Diff context: CUDA: batch_norm_elemt_cuda

old: - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
new: - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int[] counts) -> (Tensor, Tensor)
Contributor:

I believe this is still BC-breaking. As mentioned above, can you try keeping both the original one and the new one?

Contributor:

@mrshenli you should probably also deprecate the old one so we can get rid of it in the future.

@ssnl (Collaborator) commented Jul 1, 2019

There is an issue open for the test error: #22052. Ignore it.

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor) commented Jul 2, 2019

@pytorchbot rebase this please

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor) left a comment:

Thanks for contributing!

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 4, 2019
Summary: fix pytorch/pytorch#22192 (see the PR description above).
Pull Request resolved: pytorch/pytorch#22248

Differential Revision: D16002146

Pulled By: mrshenli

fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
@facebook-github-bot (Contributor):

@mrshenli merged this pull request in 29ec476.

xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019
@jjsjann123 mentioned this pull request Apr 10, 2020
facebook-github-bot pushed a commit that referenced this pull request Apr 14, 2020
Summary:
The previous PR #22248, which added support for variadic batch sizes across processes, did not account for mean_dy/mean_dy_xmu on the backward path, producing a wrong dgrad.
Pull Request resolved: #36382

Differential Revision: D20984446

Pulled By: ngimel

fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
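
For context, the backward issue comes down to how the per-process gradient statistics are reduced. A sketch of the weighted reduction (notation assumed here, not taken from the commit): with per-process batch sizes n_k, the correct global statistics are

```latex
\mathrm{mean\_dy} = \frac{\sum_k n_k \,\mathrm{mean\_dy}_k}{\sum_k n_k},
\qquad
\mathrm{mean\_dy\_xmu} = \frac{\sum_k n_k \,\mathrm{mean\_dy\_xmu}_k}{\sum_k n_k}
```

whereas an unweighted average over processes is only correct when all n_k are equal.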
gchanan pushed a commit to gchanan/pytorch that referenced this pull request Apr 15, 2020
gchanan pushed a commit that referenced this pull request Apr 16, 2020