
Conversation

@jiayisunx (Collaborator) commented Oct 31, 2022

Motivation

Amp provides convenience methods for mixed precision. When users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while the input and output are in bfloat16. The same goes for backward: the parameters are in float, while X, dX, and dY are in bfloat16.
Mixed data type support for the LayerNorm backward is therefore also needed for training models that use LayerNorm.
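
As a rough illustration (a minimal sketch with assumed shapes, relying on the autocast behavior described above, not code from this PR), the dtype mix looks like this:

```python
import torch

# Under CPU autocast in bfloat16, the activation flowing into LayerNorm is
# bfloat16 while the module's gamma/beta parameters stay float32.
ln = torch.nn.LayerNorm(16)            # weight/bias are created in float32
linear = torch.nn.Linear(16, 16)
x = torch.randn(8, 8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = ln(linear(x))                  # LayerNorm input/output: bfloat16

print(y.dtype)               # torch.bfloat16
print(ln.weight.dtype)       # torch.float32

# Backward sees the same mix: dY/dX in bfloat16, parameter grads in float32.
y.float().sum().backward()
print(ln.weight.grad.dtype)  # torch.float32
```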

Testing

Single socket (icx, 32 cores):

| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.071 | 0.065 | 0.062 |
| (8, 8, 16) | 0.015 | 0.014 | 0.015 | 0.074 | 0.070 | 0.063 |
| (32, 8, 16) | 0.062 | 0.016 | 0.016 | 0.073 | 0.073 | 0.072 |
| (64, 128, 56, 56) | 2.467 | 0.907 | 0.0897 | 12.993 | 7.603 | 7.777 |
| (64, 128, 256, 256) | 48.904 | 25.589 | 25.472 | 343.992 | 183.133 | 188.222 |

Single core (icx):

| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.050 | 0.050 | 0.050 |
| (8, 8, 16) | 0.014 | 0.014 | 0.014 | 0.052 | 0.054 | 0.053 |
| (32, 8, 16) | 0.034 | 0.019 | 0.018 | 0.059 | 0.067 | 0.066 |
| (64, 128, 56, 56) | 66.791 | 17.725 | 19.799 | 119.431 | 106.123 | 107.446 |
| (64, 128, 256, 256) | 1542.477 | 402.132 | 527.044 | 3019.437 | 2336.318 | 2448.320 |
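
The tables above do not include the measurement harness; a minimal sketch of how such per-shape forward/backward timings could be collected (assumed iteration counts and normalization over the last dimension only, not necessarily the script used for these numbers) might look like:

```python
import time
import torch
import torch.nn.functional as F

def bench_layer_norm(shape, dtype, mixed=False, iters=100, warmup=10):
    """Return approximate (forward_ms, backward_ms) for F.layer_norm on CPU."""
    norm_shape = shape[-1:]
    # Mixed case: bfloat16 input/grad_output with float32 gamma/beta.
    param_dtype = torch.float32 if mixed else dtype
    x = torch.randn(shape, dtype=dtype, requires_grad=True)
    w = torch.randn(norm_shape, dtype=param_dtype, requires_grad=True)
    b = torch.randn(norm_shape, dtype=param_dtype, requires_grad=True)
    dy = torch.randn(shape, dtype=dtype)

    for _ in range(warmup):
        F.layer_norm(x, norm_shape, w, b).backward(dy)

    t0 = time.perf_counter()
    for _ in range(iters):
        F.layer_norm(x, norm_shape, w, b)
    fwd_ms = (time.perf_counter() - t0) / iters * 1e3

    t0 = time.perf_counter()
    for _ in range(iters):
        F.layer_norm(x, norm_shape, w, b).backward(dy)
    total_ms = (time.perf_counter() - t0) / iters * 1e3
    return fwd_ms, total_ms - fwd_ms  # backward time approximated as total - forward

print(bench_layer_norm((32, 8, 16), torch.bfloat16, mixed=True))
```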

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Oct 31, 2022
@pytorch-bot (bot) commented Oct 31, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88064

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 27f3948:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jiayisunx jiayisunx marked this pull request as draft October 31, 2022 03:03
@linux-foundation-easycla (bot) commented Oct 31, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: jiayisunx (118b40d11705558627c2f7555fda2d6650a2f0e1)

@jgong5 (Collaborator) commented Oct 31, 2022

@jiayisunx Do you mind elaborating on the motivation for this PR in the description?

@jiayisunx (Collaborator, Author)

@pytorchbot label intel

@pytorch-bot pytorch-bot bot added the intel This tag is for PR from Intel label Nov 17, 2022
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 5, 2022
@jiayisunx jiayisunx marked this pull request as ready for review December 6, 2022 08:42
@jiayisunx (Collaborator, Author)

@jgong5, @mingfeima, could you please help review this PR? Thanks.

@bdhirsh bdhirsh requested a review from mingfeima December 7, 2022 02:40
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 7, 2022
@jiayisunx (Collaborator, Author)

@pytorchbot label intel priority

@pytorch-bot (bot) commented Dec 14, 2022

Didn't find following labels among repository labels: priority

@zhuhaozhe zhuhaozhe added the intel priority matters to intel architecture from performance wise label Dec 14, 2022
@zhuhaozhe zhuhaozhe requested a review from malfet December 14, 2022 00:51
@malfet (Contributor) left a comment

Looks like layer_norm_backward_kernel_mixed_type is just a copy of LayerNormBackwardKernelImplInternal with slightly different argument types.

Perhaps a better approach would be to just make ACC_T a template argument of LayerNormBackwardKernelImplInternal and call it slightly differently in the mixed precision scenario.

@jiayisunx (Collaborator, Author) commented Dec 15, 2022

> Looks like layer_norm_backward_kernel_mixed_type is just a copy of LayerNormBackwardKernelImplInternal with slightly different argument types.
>
> Perhaps a better approach would be to just make ACC_T a template argument of LayerNormBackwardKernelImplInternal and call it slightly differently in the mixed precision scenario.

@malfet thanks for your comments. Actually, this backward implementation is consistent with the forward one, and there are many places that need to be modified to support mixed data types, not just slightly different argument types. For example, here, we cannot simply call vec::map3_reduce_all to do this part.

@jiayisunx jiayisunx closed this Dec 15, 2022
@jiayisunx jiayisunx reopened this Dec 15, 2022
@jiayisunx jiayisunx requested a review from malfet December 15, 2022 03:11
@jiayisunx (Collaborator, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased layer_norm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout layer_norm && git pull --rebase)

@jiayisunx (Collaborator, Author)

@malfet, I have refactored the LayerNorm backward kernel, could you please help review this PR again?

@atalman atalman added this to the 2.0.0 milestone Jan 11, 2023
@jiayisunx jiayisunx force-pushed the layer_norm branch 2 times, most recently from 3770921 to 1a4595e, January 18, 2023 07:46
@jiayisunx (Collaborator, Author)

@malfet

@malfet (Contributor) left a comment

Looks good, though the template specialization is essentially a copy-and-paste of the generic template; would it be possible to reuse more code by introducing more templates? I.e., why can't bf16 utilize the same vec::map primitives?

Comment on lines +75 to +86
Contributor

Hmm, what's the difference between this implementation and the previous one? Should it be just the same template with a default argument, something like `load2f(const BFloat16* ptr, int64_t count = 1);`?

Also, why int64 rather than uint64?

@jiayisunx (Collaborator, Author)

The previous one is the mixed data type helper for Vec::load(ptr), while this implementation is the helper for Vec::load(ptr, count); int64 is used because loadu(const void* ptr, int64_t count) uses int64.

@jiayisunx (Collaborator, Author)

> Looks good, though the template specialization is essentially a copy-and-paste of the generic template; would it be possible to reuse more code by introducing more templates? I.e., why can't bf16 utilize the same vec::map primitives?

Actually, bf16 can utilize the same vec::map primitives, but the mixed data type case cannot, and the template specialization is for the mixed data type cases.
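
For context, here is a hedged sketch (not this PR's own test; assumed shapes and tolerances) of the kind of check that exercises the mixed data type backward path against an fp32 reference:

```python
import torch
import torch.nn.functional as F

# Mixed case: bfloat16 input/grad_output, float32 gamma/beta.
shape, norm_shape = (32, 8, 16), (16,)
x = torch.randn(shape, dtype=torch.bfloat16, requires_grad=True)
w = torch.randn(norm_shape, dtype=torch.float32, requires_grad=True)
b = torch.randn(norm_shape, dtype=torch.float32, requires_grad=True)
dy = torch.randn(shape, dtype=torch.bfloat16)

y = F.layer_norm(x, norm_shape, w, b)
dx, dw, db = torch.autograd.grad(y, (x, w, b), dy)

# fp32 reference: upcast everything and compare with loose tolerances,
# since bfloat16 carries far fewer mantissa bits than float32.
x32 = x.detach().float().requires_grad_(True)
w32 = w.detach().clone().requires_grad_(True)
b32 = b.detach().clone().requires_grad_(True)
y32 = F.layer_norm(x32, norm_shape, w32, b32)
dx32, dw32, db32 = torch.autograd.grad(y32, (x32, w32, b32), dy.float())

torch.testing.assert_close(dx.float(), dx32, rtol=1e-2, atol=1e-2)
torch.testing.assert_close(dw, dw32, rtol=1e-2, atol=1e-2)
torch.testing.assert_close(db, db32, rtol=1e-2, atol=1e-2)
```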

@jiayisunx (Collaborator, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 9, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

Details for Dev Infra team: raised by workflow job.

@jiayisunx (Collaborator, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased layer_norm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout layer_norm && git pull --rebase)

@jiayisunx (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
