Migrate apex.parallel.SyncBatchNorm channels_last to pytorch #46906
Conversation
💊 CI failures summary and remediations. As of commit 4a993f4 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Cool, can you post benchmarks comparing to apex?
@lly-zero-one Would it be possible to test this PR with some of our ClassyVision workflows to see the potential benefit?
The detailed benchmark and raw data are at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last. For 2D and 4D tensors on V100 x8 (relative perf is similar on A100 x8), the comparison covers kernel execution time only, not including NCCL reduction/gather, kernel launch overhead, or tensor memory format transformation.
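For reference, kernel-only timings of this kind are usually collected with CUDA events around the normalization forward/backward, excluding communication. A minimal sketch of such a harness (this is not the linked benchmark script; the shapes, iteration counts, and the use of plain `BatchNorm2d` as a single-GPU stand-in for the SyncBatchNorm kernels are assumptions):

```python
import torch

def time_bn_forward_backward(shape=(64, 256, 56, 56), channels_last=True, iters=100):
    """Time the batch-norm forward+backward kernels in isolation with CUDA events."""
    memory_format = torch.channels_last if channels_last else torch.contiguous_format
    x = torch.randn(*shape, device="cuda").to(memory_format=memory_format)
    bn = torch.nn.BatchNorm2d(shape[1]).cuda().to(memory_format=memory_format)

    for _ in range(10):  # warmup: exclude allocator and kernel-selection effects
        bn(x).sum().backward()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        bn(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

if __name__ == "__main__":
    for cl in (False, True):
        print(f"channels_last={cl}: {time_bn_forward_backward(channels_last=cl):.3f} ms")
```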
Hi @xwang233! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
@pritamdamania87 The ClassyVision flow is using Apex. Maybe we should ask the CV team to change their flow.
@ngimel The CLA is ready. |
Hi! Can you please rebase? Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
ngimel
left a comment
lgtm, let's wait for the tests
bc compat error is real
Yes, it is intentional. We fused the mean calculations (the separate `sum` and `div` kernels) into the backward elementwise kernel.
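Conceptually, the fusion being discussed moves the divide-by-count out of separate kernels and into the single elementwise backward pass. A rough numerical sketch (illustrative function names and the textbook SyncBatchNorm backward formula; not the actual `torch/nn/modules/_functions.py` or native kernel code):

```python
import torch

# Illustrative re-implementation of the backward elementwise step, before and
# after folding the division by `count` into the elementwise pass.  Shapes:
# x and grad_out are (N, C); the per-channel statistics are (C,).
def backward_elemt_unfused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, count):
    mean_dy = sum_dy / count          # separate div kernel
    mean_dy_xmu = sum_dy_xmu / count  # separate div kernel
    return (grad_out - mean_dy - (x - mean) * invstd * invstd * mean_dy_xmu) * invstd * weight

def backward_elemt_fused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, count):
    # Same math, with the 1/count factor applied inside the single pass.
    return (grad_out - (sum_dy + (x - mean) * invstd * invstd * sum_dy_xmu) / count) * invstd * weight

if __name__ == "__main__":
    torch.manual_seed(0)
    N, C = 32, 8
    x, grad_out, weight = torch.randn(N, C), torch.randn(N, C), torch.randn(C)
    mean, var = x.mean(0), x.var(0, unbiased=False)
    invstd = (var + 1e-5).rsqrt()
    sum_dy = grad_out.sum(0)
    sum_dy_xmu = (grad_out * (x - mean)).sum(0)
    a = backward_elemt_unfused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, N)
    b = backward_elemt_fused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, N)
    print(torch.allclose(a, b, atol=1e-6))  # True: identical gradients
```

Both forms compute the same gradient; the fused one avoids launching separate division kernels for `mean_dy` and `mean_dy_xmu`.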
I understand, but then you should add it to the exceptions in the BC compat test.
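For context, the BC-compat check keeps an allow list of operator schemas with expiry dates, and an exception is added by appending an entry there. The sketch below only illustrates that mechanism; the exact file path, list name, operator entries, and dates used in c6c680a are assumptions and may differ:

```python
# Illustration only: the BC check compares current operator schemas against a
# snapshot and skips anything matching an allow-list entry until its date.
# The real list lives in the backward-compatibility test script; the ops and
# dates below are placeholders, not the actual contents of c6c680a.
import datetime

ALLOW_LIST = [
    # Schemas touched by fusing sum/div into the backward elementwise kernel.
    ("aten::batch_norm_backward_elemt", datetime.date(2021, 1, 31)),
    ("aten::batch_norm_gather_stats_with_counts", datetime.date(2021, 1, 31)),
]
```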
I don't want to delay this PR, but consider making functions like
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Yes, the BC compat exception was added here: c6c680a. The error message you saw was due to a revert of the previous commit in the master branch before this one.
…#46906)

Summary:
per title

This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing `sum`, `div` kernels into the backward elementwise kernel
  https://github.com/pytorch/pytorch/blob/b167402e2e66a663cd9913885552929b4c045ffa/torch/nn/modules/_functions.py#L76-L95

Todo
- [x] Discuss a regression introduced in pytorch#37133 (comment), which is the synchronized copy here
  https://github.com/pytorch/pytorch/blob/b167402e2e66a663cd9913885552929b4c045ffa/torch/nn/modules/_functions.py#L32-L34
  **Comment**: This PR uses the apex version for the size check. Test passed and I haven't seen anything wrong so far.
- [x] The restriction to use the channels_last kernel will be like this
  ```
  inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) {
    return self.is_contiguous(at::MemoryFormat::ChannelsLast) || self.ndimension() == 2;
  }
  ```
  I think we can relax that for channels_last_3d as well?
  **Comment**: we don't have a benchmark for this now, will check this and add the functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close pytorch#50781

Pull Request resolved: pytorch#46906
Reviewed By: albanD
Differential Revision: D26771437
Pulled By: malfet
fbshipit-source-id: d00387044e9d43ac7e6c0e32a2db22c63d1504de
This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format. With a small benchmark script here #88021 (comment), on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of #46906

Close #88021

Pull Request resolved: #88401
Approved by: https://github.com/ngimel
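A reduced sketch of that kind of DDP forward/backward benchmark is below. It is not the script from the linked comment (#88021); the model, tensor shape, iteration counts, and single-node `mp.spawn` setup are all illustrative assumptions:

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size, channels_last):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Arbitrary 3D conv/BN stack; BatchNorm3d layers get converted to SyncBatchNorm.
    model = torch.nn.Sequential(
        torch.nn.Conv3d(4, 8, kernel_size=3, padding=1),
        torch.nn.BatchNorm3d(8),
        torch.nn.ReLU(),
    ).cuda()
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    memory_format = torch.channels_last_3d if channels_last else torch.contiguous_format
    model = model.to(memory_format=memory_format)
    ddp = DDP(model, device_ids=[rank])

    x = torch.randn(8, 4, 16, 32, 32, device=f"cuda:{rank}").to(memory_format=memory_format)
    for _ in range(5):  # warmup
        ddp(x).sum().backward()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        ddp(x).sum().backward()
    torch.cuda.synchronize()
    if rank == 0:
        print(f"DDP channels_last={channels_last}, run_forward_backward, time: {time.time() - t0} sec")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    for channels_last in (False, True):
        mp.spawn(run, args=(world_size, channels_last), nprocs=world_size)
```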


per title
This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing the `sum`, `div` kernels into the backward elementwise kernel
  (pytorch/torch/nn/modules/_functions.py, lines 76 to 95 in b167402)

Todo
- [x] Discuss a regression introduced in #37133 (comment), which is the synchronized copy here
  (pytorch/torch/nn/modules/_functions.py, lines 32 to 34 in b167402)
  Comment: This PR uses the apex version for the size check. Tests passed and I haven't seen anything wrong so far.
- [x] The restriction to use the channels_last kernel will be the `batch_norm_use_channels_last_kernels` check quoted in the commit message above. I think we can relax that for channels_last_3d as well?
  Comment: we don't have a benchmark for this now; will check this and add the functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close #50781
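From the user side, the migrated path is exercised by combining `torch.nn.SyncBatchNorm` with channels_last tensors. A minimal sketch (assuming an already-initialized distributed process group, e.g. under DDP; the model here is illustrative, not part of this PR):

```python
import torch

# Assumes torch.distributed is already initialized (e.g. launched via torchrun)
# and that this process owns one GPU; the model below is illustrative.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).cuda()

# Replace BatchNorm layers with SyncBatchNorm and switch to channels_last so
# that the channels_last SyncBatchNorm path discussed in this PR is taken.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = model.to(memory_format=torch.channels_last)

x = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
out = model(x)
```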