Improve numeric stability for LayerNorm on CUDA #40308
Conversation
link #40302
Can you please profile performance with sizes from this script? I'm concerned about using int64 for the computation; it might slow down the kernels considerably.
💊 Dr. CI: As of commit a118b98, there are no CI failures.
I tried changing the index summation to int32, but I didn't see a significant difference. On one V100 machine with input size [4096, 65536], the average running time is 4.375ms for int32 vs 4.384ms for int64.
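For context, one way such per-size timings can be collected is with CUDA events. The sketch below is illustrative only: the shapes and the `time_layer_norm` helper are mine, not the benchmark script referenced earlier in the thread.

```python
import torch
import torch.nn.functional as F

def time_layer_norm(M, N, iters=100):
    # Time the F.layer_norm forward pass on CUDA using CUDA events.
    x = torch.randn(M, N, dtype=torch.float, device="cuda")

    # Warm-up runs so caching / lazy initialization don't skew the measurement.
    for _ in range(10):
        F.layer_norm(x, (N,))
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.layer_norm(x, (N,))
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

if __name__ == "__main__":
    for shape in [(1024, 32768), (4096, 32768), (4096, 65536)]:
        print(shape, f"{time_layer_norm(*shape):.3f} ms")
```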
Codecov Report
@@ Coverage Diff @@
## gh/BIT-silence/4/base #40308 +/- ##
=========================================================
- Coverage 68.06% 68.05% -0.01%
=========================================================
Files 393 393
Lines 50918 50918
=========================================================
- Hits 34655 34654 -1
- Misses 16263 16264 +1
Continue to review the full report at Codecov.
#59987 should fix the issue.
Stack from ghstack:
Improve numeric stability for LayerNorm on CUDA.
This diff will slightly decrease the performance of LayerNorm but increase its numerical stability, especially when the variance is small.
Example:
M = 1024, N = 32768
input = torch.randn(M, N, dtype=torch.float, device="cuda") * 0.01 + 100.0
Previously in this case, the computed variance was
[0. , 0.000977, 0. , ..., 0. , 0. , 0. ]
After this diff, the variance is
[1.002436e-04, 9.923753e-05, 1.014664e-04, ..., 1.015593e-04, 1.000437e-04, 1.009089e-04]
which is close to 1e-4, as expected. The max relative difference compared to the double-precision result is 5.69883169e-05.
The average running time for F.layer_norm with input size [4096, 32768] on one devgpu is about 5.18ms before this PR and about 5.35ms after it, so the impact on performance should be limited.
Differential Revision: D21993664
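The effect described above can be reproduced outside the kernel with a few lines of Python. The sketch below is illustrative only (the variable names are mine): it contrasts a naive single-pass variance, Var[x] = E[x²] − E[x]², computed in float32 — which suffers catastrophic cancellation when the squared mean (~1e4) dwarfs the true variance (~1e-4) — with a float64 reference.

```python
import torch
import torch.nn.functional as F

# Input from the example above: mean ~100, std ~0.01, so true variance ~1e-4 per row.
M, N = 1024, 32768
x = torch.randn(M, N, dtype=torch.float, device="cuda") * 0.01 + 100.0

# Naive single-pass variance, Var[x] = E[x^2] - E[x]^2, in float32.
# With E[x]^2 ~ 1e4 and a true variance of ~1e-4, float32 (~7 significant
# digits) cannot resolve the difference, so the result collapses to ~0.
mean32 = x.mean(dim=1)
var32_naive = (x * x).mean(dim=1) - mean32 * mean32

# float64 reference for the same rows.
x64 = x.double()
mean64 = x64.mean(dim=1)
var64_ref = (x64 * x64).mean(dim=1) - mean64 * mean64

print(var32_naive[:3])  # roughly zero: catastrophic cancellation
print(var64_ref[:3])    # close to 1e-4, as expected

# LayerNorm output against a float64 reference; with a variance that has
# collapsed to ~0, the rsqrt(var + eps) normalization scale is far off,
# so the float32 output differs noticeably from the reference.
y32 = F.layer_norm(x, (N,))
y64 = F.layer_norm(x64, (N,))
print((y32.double() - y64).abs().max())
```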