RowwiseMoments: use float as acc type for bfloat16 inputs #81850
Conversation
Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16, which is not only slow but also introduces additional rounding errors. This patch performs the accumulation in float for bfloat16 inputs: each bfloat16 vector (size 16) is converted to two float vectors (size 8) and accumulated into the m1 (mean) and m2 (rstd) vectors, which are all float vectors. [ghstack-poisoned]
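For illustration only, here is a minimal, self-contained scalar sketch of the idea. It is not the actual ATen kernel (which works on `Vectorized<float>` lanes, splitting each 16-lane bfloat16 vector into two 8-lane float vectors, and uses a chunked update); the helper names `bf16_to_float` and `rowwise_moments_bf16` and the naive sum-of-squares formulation are assumptions made for the sketch. The point it shows is that every bfloat16 value is widened to float before anything is accumulated, so the running m1/m2 accumulators never round back to bfloat16.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a bfloat16 bit pattern as the upper 16 bits of a float32.
static inline float bf16_to_float(uint16_t v) {
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Rowwise mean / rstd with float accumulators (simplified, scalar sketch):
// each bf16 element is widened to float *before* it is added into
// m1 (sum) and m2 (sum of squares), so no intermediate result is
// rounded back to bfloat16 during accumulation.
void rowwise_moments_bf16(const uint16_t* row, int64_t n, float eps,
                          float* mean, float* rstd) {
  float m1 = 0.0f;  // running sum of x
  float m2 = 0.0f;  // running sum of x * x
  for (int64_t i = 0; i < n; ++i) {
    float x = bf16_to_float(row[i]);
    m1 += x;
    m2 += x * x;
  }
  *mean = m1 / static_cast<float>(n);
  float var = m2 / static_cast<float>(n) - (*mean) * (*mean);
  *rstd = 1.0f / std::sqrt(var + eps);
}
```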
✅ No failures (0 pending) as of commit 8d6edf6 (more details on the Dr. CI page). Looks good so far! There are no failures yet. This comment was automatically generated by Dr. CI.
This PR fixes #77507. Accumulating bfloat16 in float32 also brings a performance improvement, since we avoid redundant dtype conversions, which are very time consuming.
This PR has no effect on fp32 performance. avx512 results:
avx2 results:
To fix #77507

Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16, which is not only slow but also introduces additional rounding errors. This patch performs the accumulation in float for bfloat16 inputs: each bfloat16 vector (size 16) is converted to two float vectors (size 8) and accumulated into the m1 (mean) and m2 (rstd) vectors, which are all float vectors.

No effect on float performance; bfloat16 performance improves:

* avx512 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
```
* avx512 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
```
* avx2 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
```
* avx2 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms
```

[ghstack-poisoned]
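The numbers above come from the PR author's own measurements; the exact benchmark script is not part of this thread. As a hedged sketch only, something like the following libtorch harness could be used to collect comparable fp32/bf16 timings for the `LayerNorm((1024,), eps=1e-05, elementwise_affine=True)` shape (the function name `time_layer_norm`, iteration counts, and the choice to cast the module itself to bf16 are assumptions):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

// Hypothetical timing harness: average forward-pass time of LayerNorm
// over a 32x128x1024 input in the given dtype.
double time_layer_norm(torch::ScalarType dtype, int iters = 100) {
  torch::NoGradGuard no_grad;
  auto ln = torch::nn::LayerNorm(
      torch::nn::LayerNormOptions({1024}).eps(1e-5).elementwise_affine(true));
  ln->to(dtype);
  auto x = torch::randn({32, 128, 1024}).to(dtype);
  // Warm-up so one-time setup costs are not measured.
  for (int i = 0; i < 10; ++i) ln->forward(x);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) ln->forward(x);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}

int main() {
  std::cout << "fp32: " << time_layer_norm(torch::kFloat) << " ms\n";
  std::cout << "bf16: " << time_layer_norm(torch::kBFloat16) << " ms\n";
}
```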
ezyang left a comment:
thanks
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here

Merge failed due to Matched rule superuser, but PR #81849 has not been reviewed yet

@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here

Merge failed due to This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.

@mingfeima looks like you'll need to rebase and ping @ezyang for a land

One can rebase using the rebase command of the mergebot

@pytorchbot merge -f

❌ 🤖 pytorchbot command failed: Try

@pytorchbot merge -f "This codepath is unlikely to change recently"

@pytorchbot successfully started a merge job. Check the current status here

Hey @mingfeima.

@pytorchbot revert -c weird "Revert as caused perf regression, see pytorch/benchmark#1099"

❌ 🤖 pytorchbot command failed: Try

@pytorchbot revert -c weird -m "Revert as caused perf regression, see pytorch/benchmark#1099"

@pytorchbot successfully started a revert job. Check the current status here.

@mingfeima your PR has been successfully reverted.
…1850)" This reverts commit 2fe3ea6. Reverted #81850 on behalf of https://github.com/malfet due to Revert as caused perf regression, see pytorch/benchmark#1099
…81850)

Summary: To fix #77507

Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16, which is not only slow but also introduces additional rounding errors. This patch performs the accumulation in float for bfloat16 inputs: each bfloat16 vector (size 16) is converted to two float vectors (size 8) and accumulated into the m1 (mean) and m2 (rstd) vectors, which are all float vectors.

No effect on float performance; bfloat16 performance improves:

* avx512 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
```
* avx512 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
```
* avx2 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
```
* avx2 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms
```

Pull Request resolved: #81850
Approved by: https://github.com/ezyang, https://github.com/malfet
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2fe3ea65c2b9147077ea3a3dc4757f1768483ba4
Reviewed By: seemethere
Differential Revision: D38600344
fbshipit-source-id: 63929b302c9c0adc1ec7fc2ecd3416e3cff72cb5
…1850)" Summary: This reverts commit 2fe3ea6. Reverted #81850 on behalf of https://github.com/malfet due to Revert as caused perf regression, see pytorch/benchmark#1099 Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/7e6da2fb1048392cec2eb163c8ebf98625a0d468 Reviewed By: seemethere Differential Revision: D38643463 fbshipit-source-id: bf4069be8487591a83b0b4f619e03286142a6698
Stack from ghstack:
To fix #77507
Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16, which is not only slow but also introduces additional rounding errors.
This patch performs the accumulation in float for bfloat16 inputs:
each bfloat16 vector (size 16) is converted to two float vectors (size 8),
and accumulated into the m1 (mean) and m2 (rstd) vectors, which are all float vectors.
No effect on float performance; bfloat16 performance improves: