Migrate apex.parallel.SyncBatchNorm channels_last to pytorch #46906
Conversation
💊 CI failures summary and remediations. As of commit 4a993f4 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Cool, can you post benchmarks comparing to apex?
@lly-zero-one Would it be possible to test this PR with some of our ClassyVision workflows to see the potential benefit?
The detailed benchmark and raw data are at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last. For 2D and 4D tensors on V100 x8 (relative perf is similar on A100 x8), the comparison covers kernel execution time only, not including NCCL reduction/gather, kernel launch overhead, or tensor memory format transformation.
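For reference, kernel-only timings of this kind are usually collected with CUDA events around the normalization forward/backward, excluding communication. A minimal sketch of such a harness (this is not the linked benchmark script; the shapes, iteration counts, and the use of plain `BatchNorm2d` as a single-GPU stand-in for the SyncBatchNorm kernels are assumptions):

```python
import torch

def time_bn_forward_backward(shape=(64, 256, 56, 56), channels_last=True, iters=100):
    """Time the batch-norm forward+backward kernels in isolation with CUDA events."""
    memory_format = torch.channels_last if channels_last else torch.contiguous_format
    x = torch.randn(*shape, device="cuda").to(memory_format=memory_format)
    bn = torch.nn.BatchNorm2d(shape[1]).cuda().to(memory_format=memory_format)

    for _ in range(10):  # warmup: exclude allocator and kernel-selection effects
        bn(x).sum().backward()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        bn(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

if __name__ == "__main__":
    for cl in (False, True):
        print(f"channels_last={cl}: {time_bn_forward_backward(channels_last=cl):.3f} ms")
```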
Hi @xwang233! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
@pritamdamania87 The ClassyVision flow is using Apex. Maybe we should ask the CV team to change their flow.
@ngimel The CLA is ready. |
Hi! Can you please rebase? Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
ngimel
left a comment
lgtm, let's wait for the tests
bc compat error is real
Yes, it is intentional. We fused the mean calculations (the separate `sum` and `div` kernels) into the backward elementwise kernel.
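Conceptually, the fusion being discussed moves the divide-by-count out of separate kernels and into the single elementwise backward pass. A rough numerical sketch (illustrative function names and the textbook SyncBatchNorm backward formula; not the actual `torch/nn/modules/_functions.py` or native kernel code):

```python
import torch

# Illustrative re-implementation of the backward elementwise step, before and
# after folding the division by `count` into the elementwise pass.  Shapes:
# x and grad_out are (N, C); the per-channel statistics are (C,).
def backward_elemt_unfused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, count):
    mean_dy = sum_dy / count          # separate div kernel
    mean_dy_xmu = sum_dy_xmu / count  # separate div kernel
    return (grad_out - mean_dy - (x - mean) * invstd * invstd * mean_dy_xmu) * invstd * weight

def backward_elemt_fused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, count):
    # Same math, with the 1/count factor applied inside the single pass.
    return (grad_out - (sum_dy + (x - mean) * invstd * invstd * sum_dy_xmu) / count) * invstd * weight

if __name__ == "__main__":
    torch.manual_seed(0)
    N, C = 32, 8
    x, grad_out, weight = torch.randn(N, C), torch.randn(N, C), torch.randn(C)
    mean, var = x.mean(0), x.var(0, unbiased=False)
    invstd = (var + 1e-5).rsqrt()
    sum_dy = grad_out.sum(0)
    sum_dy_xmu = (grad_out * (x - mean)).sum(0)
    a = backward_elemt_unfused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, N)
    b = backward_elemt_fused(grad_out, x, mean, invstd, weight, sum_dy, sum_dy_xmu, N)
    print(torch.allclose(a, b, atol=1e-6))  # True: identical gradients
```

Both forms compute the same gradient; the fused one avoids launching separate division kernels for `mean_dy` and `mean_dy_xmu`.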
I understand, but then you should add it to the exceptions in the BC compat test.
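For context, the BC-compat check keeps an allow list of operator schemas with expiry dates, and an exception is added by appending an entry there. The sketch below only illustrates that mechanism; the exact file path, list name, operator entries, and dates used in c6c680a are assumptions and may differ:

```python
# Illustration only: the BC check compares current operator schemas against a
# snapshot and skips anything matching an allow-list entry until its date.
# The real list lives in the backward-compatibility test script; the ops and
# dates below are placeholders, not the actual contents of c6c680a.
import datetime

ALLOW_LIST = [
    # Schemas touched by fusing sum/div into the backward elementwise kernel.
    ("aten::batch_norm_backward_elemt", datetime.date(2021, 1, 31)),
    ("aten::batch_norm_gather_stats_with_counts", datetime.date(2021, 1, 31)),
]
```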
I don't want to delay this PR, but consider making functions like
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Yes, the BC compat exception was added here: c6c680a. The error message you saw was due to a revert of the previous commit in the master branch before this one.
…#46906)

Summary:
per title

This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing `sum`, `div` kernels into the backward elementwise kernel
  https://github.com/pytorch/pytorch/blob/b167402e2e66a663cd9913885552929b4c045ffa/torch/nn/modules/_functions.py#L76-L95

Todo
- [x] Discuss a regression introduced in pytorch#37133 (comment), which is the synchronized copy here
  https://github.com/pytorch/pytorch/blob/b167402e2e66a663cd9913885552929b4c045ffa/torch/nn/modules/_functions.py#L32-L34
  **Comment**: This PR uses the apex version for the size check. Test passed and I haven't seen anything wrong so far.
- [x] The restriction to use the channels_last kernel will be like this
  ```
  inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) {
    return self.is_contiguous(at::MemoryFormat::ChannelsLast) || self.ndimension() == 2;
  }
  ```
  I think we can relax that for channels_last_3d as well?
  **Comment**: we don't have a benchmark for this now, will check this and add the functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close pytorch#50781

Pull Request resolved: pytorch#46906
Reviewed By: albanD
Differential Revision: D26771437
Pulled By: malfet
fbshipit-source-id: d00387044e9d43ac7e6c0e32a2db22c63d1504de
This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format. With a small benchmark script here #88021 (comment), on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of #46906

Close #88021

Pull Request resolved: #88401
Approved by: https://github.com/ngimel
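A reduced sketch of that kind of DDP forward/backward benchmark is below. It is not the script from the linked comment (#88021); the model, tensor shape, iteration counts, and single-node `mp.spawn` setup are all illustrative assumptions:

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size, channels_last):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Arbitrary 3D conv/BN stack; BatchNorm3d layers get converted to SyncBatchNorm.
    model = torch.nn.Sequential(
        torch.nn.Conv3d(4, 8, kernel_size=3, padding=1),
        torch.nn.BatchNorm3d(8),
        torch.nn.ReLU(),
    ).cuda()
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    memory_format = torch.channels_last_3d if channels_last else torch.contiguous_format
    model = model.to(memory_format=memory_format)
    ddp = DDP(model, device_ids=[rank])

    x = torch.randn(8, 4, 16, 32, 32, device=f"cuda:{rank}").to(memory_format=memory_format)
    for _ in range(5):  # warmup
        ddp(x).sum().backward()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        ddp(x).sum().backward()
    torch.cuda.synchronize()
    if rank == 0:
        print(f"DDP channels_last={channels_last}, run_forward_backward, time: {time.time() - t0} sec")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    for channels_last in (False, True):
        mp.spawn(run, args=(world_size, channels_last), nprocs=world_size)
```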


per title
This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing the `sum`, `div` kernels into the backward elementwise kernel
  (pytorch/torch/nn/modules/_functions.py, lines 76 to 95 in b167402)

Todo
- [x] Discuss a regression introduced in #37133 (comment), which is the synchronized copy here
  (pytorch/torch/nn/modules/_functions.py, lines 32 to 34 in b167402)
  Comment: This PR uses the apex version for the size check. Tests passed and I haven't seen anything wrong so far.
- [x] The restriction to use the channels_last kernel will be the `batch_norm_use_channels_last_kernels` check quoted in the commit message above. I think we can relax that for channels_last_3d as well?
  Comment: we don't have a benchmark for this now; will check this and add the functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close #50781
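From the user side, the migrated path is exercised by combining `torch.nn.SyncBatchNorm` with channels_last tensors. A minimal sketch (assuming an already-initialized distributed process group, e.g. under DDP; the model here is illustrative, not part of this PR):

```python
import torch

# Assumes torch.distributed is already initialized (e.g. launched via torchrun)
# and that this process owns one GPU; the model below is illustrative.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).cuda()

# Replace BatchNorm layers with SyncBatchNorm and switch to channels_last so
# that the channels_last SyncBatchNorm path discussed in this PR is taken.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = model.to(memory_format=torch.channels_last)

x = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
out = model(x)
```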