Conversation

mingfeima (Collaborator) commented Jul 21, 2022

Stack from ghstack:

To fix #77507

Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16,
which is not only slow but also introduces additional rounding errors.

This patch performs the accumulation in float for bfloat16 inputs:
each bfloat16 vector (size 16) is converted to two float vectors (size 8)
and accumulated into the m1 (mean) and m2 (rstd) vectors, which are all float vectors.
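
Below is a minimal sketch of the idea, not the actual `utils::RowwiseMoments` code (the real implementation is more involved); the function and variable names (`rowwise_moments_bf16`, `row`, `n`) are illustrative only:

```cpp
#include <ATen/cpu/vec/vec.h>
#include <c10/util/BFloat16.h>
#include <cstdint>
#include <tuple>
#include <utility>

// Accumulate sum(x) and sum(x*x) of one bfloat16 row in float, then derive
// mean and variance. Each bf16 vector is widened to two float vectors.
std::pair<float, float> rowwise_moments_bf16(const c10::BFloat16* row, int64_t n) {
  using bVec = at::vec::Vectorized<c10::BFloat16>;
  using fVec = at::vec::Vectorized<float>;
  fVec m1_vec(0.0f), m2_vec(0.0f);  // float accumulators, never bf16
  int64_t i = 0;
  for (; i + bVec::size() <= n; i += bVec::size()) {
    bVec x_bvec = bVec::loadu(row + i);
    fVec x0, x1;
    std::tie(x0, x1) = at::vec::convert_bfloat16_float(x_bvec);  // 16 bf16 -> 2 x 8 float
    m1_vec = m1_vec + x0 + x1;
    m2_vec = m2_vec + x0 * x0 + x1 * x1;
  }
  // Horizontal reduction plus a scalar tail, still accumulated in float.
  float buf[fVec::size()];
  float m1 = 0.0f, m2 = 0.0f;
  m1_vec.store(buf);
  for (int64_t j = 0; j < fVec::size(); ++j) m1 += buf[j];
  m2_vec.store(buf);
  for (int64_t j = 0; j < fVec::size(); ++j) m2 += buf[j];
  for (; i < n; ++i) {
    float x = static_cast<float>(row[i]);
    m1 += x;
    m2 += x * x;
  }
  float mean = m1 / n;
  float var = m2 / n - mean * mean;  // rstd = 1 / sqrt(var + eps) downstream
  return {mean, var};
}
```

The point is that `m1_vec` and `m2_vec` stay in float throughout, so no partial sums are ever rounded back to bfloat16.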

No effect on float performance; bfloat16 performance improves:

* avx512 single socket:
  before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
  after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
* avx512 single core:
  before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
  after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
* avx2 single socket:
  before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
  after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
* avx2 single core:
  before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
  after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms

facebook-github-bot (Contributor) commented Jul 21, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 8d6edf6 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mingfeima (Collaborator, Author) commented Jul 21, 2022

This PR is to fix #77507

Allowing bfloat16 to be accumulated in float32 also brings a performance improvement, since we no longer need the redundant dtype conversions, which are very time consuming.

This PR has no effect on fp32 performance and brings a 4-5x performance improvement on bf16, so together bf16 is improved 12-15x. A hypothetical reproduction sketch follows the tables below.

avx512 result

| number of cores | data type | before (ms) | opt1 (ms) | opt2 (ms) | diff1 | diff2 |
| --- | --- | --- | --- | --- | --- | --- |
| 20 | float32 | 0.439 | 0.210 | 0.215 | 209.05% | 97.67% |
| 20 | bfloat16 | 2.479 | 0.770 | 0.178 | 321.95% | 432.58% |
| 1 | float32 | 6.308 | 2.661 | 2.618 | 237.05% | 101.64% |
| 1 | bfloat16 | 39.765 | 12.267 | 2.309 | 324.16% | 531.27% |

avx2 result

| number of cores | data type | before (ms) | opt1 (ms) | opt2 (ms) | diff1 | diff2 |
| --- | --- | --- | --- | --- | --- | --- |
| 12 | float32 | 1.248 | 0.540 | 0.527 | 231.11% | 102.47% |
| 12 | bfloat16 | 8.487 | 2.030 | 0.458 | 418.08% | 443.23% |
| 1 | float32 | 10.792 | 4.349 | 4.416 | 248.15% | 98.48% |
| 1 | bfloat16 | 66.366 | 19.252 | 3.524 | 344.72% | 546.31% |
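
For reference, a hypothetical harness for this kind of measurement using the ATen C++ API; this is not the script that produced the numbers above, and the shapes, warm-up, and iteration count are assumptions (the core count would be pinned externally, e.g. via `at::set_num_threads` or OMP settings):

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

// Time LayerNorm forward over a 32x128x1024 input, normalizing the last dim.
static double bench_layer_norm_ms(at::ScalarType dtype, int iters = 100) {
  at::Tensor x = at::randn({32, 128, 1024}).to(dtype);
  at::Tensor w = at::ones({1024}).to(dtype);
  at::Tensor b = at::zeros({1024}).to(dtype);
  for (int i = 0; i < 10; ++i) {  // warm-up
    (void)at::layer_norm(x, {1024}, w, b, 1e-5);
  }
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    (void)at::layer_norm(x, {1024}, w, b, 1e-5);
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

int main() {
  std::printf("fp32: %.3f ms\n", bench_layer_norm_ms(at::kFloat));
  std::printf("bf16: %.3f ms\n", bench_layer_norm_ms(at::kBFloat16));
  return 0;
}
```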

mingfeima added the intel label (This tag is for PR from Intel) on Jul 22, 2022
ezyang (Contributor) left a comment

thanks

ezyang (Contributor) commented Jul 22, 2022

@pytorchbot merge -g

pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here

pytorchmergebot (Collaborator)

Merge failed due to Matched rule superuser, but PR #81849 has not been reviewed yet
Raised by https://github.com/pytorch/pytorch/actions/runs/2716156252

swolchok (Contributor)

@pytorchbot merge

pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here

pytorchmergebot (Collaborator)

Merge failed due to This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2835046914

swolchok (Contributor)

@mingfeima looks like you'll need to rebase and ping @ezyang for a land

malfet (Contributor) commented Aug 10, 2022

One can rebase using the rebase command of the mergebot

malfet (Contributor) commented Aug 10, 2022

@pytorchbot merge -f

pytorch-bot (bot) commented Aug 10, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-g | -f MESSAGE | -l]

Try @pytorchbot --help for more info.

malfet (Contributor) commented Aug 10, 2022

@pytorchbot merge -f "This codepath is unlikely to change recently"

pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here

github-actions (Contributor)

Hey @mingfeima.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

malfet (Contributor) commented Aug 11, 2022

@pytorchbot revert -c weird "Revert as caused perf regression, see pytorch/benchmark#1099"

pytorch-bot (bot) commented Aug 11, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

malfet (Contributor) commented Aug 11, 2022

@pytorchbot revert -c weird -m "Revert as caused perf regression, see pytorch/benchmark#1099"

pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot (Collaborator)

@mingfeima your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 11, 2022
…1850)"

This reverts commit 2fe3ea6.

Reverted #81850 on behalf of https://github.com/malfet due to Revert as caused perf regression, see pytorch/benchmark#1099
facebook-github-bot pushed a commit that referenced this pull request Aug 11, 2022
…81850)

Summary:
To fix #77507

Pull Request resolved: #81850
Approved by: https://github.com/ezyang, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2fe3ea65c2b9147077ea3a3dc4757f1768483ba4

Reviewed By: seemethere

Differential Revision: D38600344

fbshipit-source-id: 63929b302c9c0adc1ec7fc2ecd3416e3cff72cb5
facebook-github-bot pushed a commit that referenced this pull request Aug 12, 2022
…1850)"

Summary:
This reverts commit 2fe3ea6.

Reverted #81850 on behalf of https://github.com/malfet due to Revert as caused perf regression, see pytorch/benchmark#1099

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/7e6da2fb1048392cec2eb163c8ebf98625a0d468

Reviewed By: seemethere

Differential Revision: D38643463

fbshipit-source-id: bf4069be8487591a83b0b4f619e03286142a6698
facebook-github-bot deleted the gh/mingfeima/81/head branch August 14, 2022 14:18
mingfeima restored the gh/mingfeima/81/head branch September 1, 2022 07:24
mingfeima deleted the gh/mingfeima/81/head branch September 1, 2022 07:40