
Conversation

@xiaomengy
Contributor

@xiaomengy xiaomengy commented Jun 19, 2020

Stack from ghstack:

Improve numeric stability for LayerNorm on CUDA.

This diff slightly decreases the performance of LayerNorm but improves its numerical stability, especially when the variance is small.

Example:
M = 1024, N = 32768
input = torch.randn(M, N, dtype=torch.float, device="cuda") * 0.01 + 100.0

Previously, the computed variance in this case was:
[0. , 0.000977, 0. , ..., 0. , 0. , 0. ]

After this diff, the variance becomes:
[1.002436e-04, 9.923753e-05, 1.014664e-04, ..., 1.015593e-04, 1.000437e-04, 1.009089e-04]

This is close to 1e-4, as expected. The maximum relative difference compared to the double-precision result is 5.69883169e-05.
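
One way such zero variances can arise: in float32 the gap between adjacent representable values near 1e4 is 2^-10 ≈ 0.000977, so a variance of ~1e-4 is below the resolution of a single-pass E[x^2] - E[x]^2 formulation when the mean is ~100. The sketch below only illustrates that cancellation on the CPU; it is not the CUDA kernel change in this PR.

```python
# Illustration of the cancellation issue, not the kernel change itself.
import torch

x = torch.randn(32768, dtype=torch.float) * 0.01 + 100.0

naive = (x * x).mean() - x.mean() ** 2       # single-pass form, cancellation-prone
two_pass = ((x - x.mean()) ** 2).mean()      # subtract the mean first
reference = x.double().var(unbiased=False)   # float64 reference

# naive typically prints 0.0 or 0.000977 (one float32 ulp near 1e4),
# while two_pass stays close to the expected ~1e-4.
print(naive.item(), two_pass.item(), reference.item())
```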

The average running time of F.layer_norm with input size [4096, 32768] on one devgpu is about 5.18 ms before this PR and about 5.35 ms after it, so the impact on performance should be limited.
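
A timing like the one above can be reproduced with a simple CUDA-event loop. This is only a sketch (warm-up and iteration counts are arbitrary choices, and the original devgpu benchmark script is not included here):

```python
# Rough timing sketch, assuming a CUDA device is available.
import torch
import torch.nn.functional as F

M, N = 4096, 32768
x = torch.randn(M, N, dtype=torch.float, device="cuda")

for _ in range(10):  # warm-up
    F.layer_norm(x, (N,))
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100

start.record()
for _ in range(iters):
    F.layer_norm(x, (N,))
end.record()
torch.cuda.synchronize()

print(f"avg F.layer_norm time: {start.elapsed_time(end) / iters:.3f} ms")
```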

Differential Revision: D21993664

xiaomengy added a commit that referenced this pull request Jun 19, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 106260530
Pull Request resolved: #40308
@xiaomengy
Contributor Author

link #40302

@xiaomengy xiaomengy requested a review from ngimel June 19, 2020 21:08
xiaomengy added a commit that referenced this pull request Jun 20, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 106291482
Pull Request resolved: #40308
@ngimel
Collaborator

ngimel commented Jul 15, 2020

Can you please profile performance with the sizes from this script? I'm concerned about using int64 for computation; it might slow down the kernels considerably.

xiaomengy added a commit that referenced this pull request Aug 14, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 109969199
Pull Request resolved: #40308
@dr-ci

dr-ci bot commented Aug 14, 2020

💊 CI failures summary and remediations

As of commit a118b98 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@xiaomengy
Contributor Author

> Can you please profile performance with the sizes from this script? I'm concerned about using int64 for computation; it might slow down the kernels considerably.

I tried changing the index summation to int32, but I didn't see a significant difference. On one V100 machine with input size [4096, 65536], the average running time is 4.375 ms with int32 vs 4.384 ms with int64.

xiaomengy added a commit that referenced this pull request Aug 15, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 110005133
Pull Request resolved: #40308

xiaomengy added a commit that referenced this pull request Sep 13, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 111954875
Pull Request resolved: #40308

xiaomengy added a commit that referenced this pull request Sep 14, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 111964204
Pull Request resolved: #40308

xiaomengy added a commit that referenced this pull request Sep 24, 2020
Improve numeric stability for LayerNorm on CUDA.
ghstack-source-id: 112836419
Pull Request resolved: #40308
@codecov

codecov bot commented Sep 25, 2020

Codecov Report

Merging #40308 into gh/BIT-silence/4/base will decrease coverage by 0.00%.
The diff coverage is n/a.


@@                    Coverage Diff                    @@
##           gh/BIT-silence/4/base   #40308      +/-   ##
=========================================================
- Coverage                  68.06%   68.05%   -0.01%     
=========================================================
  Files                        393      393              
  Lines                      50918    50918              
=========================================================
- Hits                       34655    34654       -1     
- Misses                     16263    16264       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| torch/testing/_internal/expecttest.py | 77.55% <0.00%> (-1.03%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@xiaomengy
Contributor Author

#59987 should fix the issue.

@xiaomengy xiaomengy closed this Jun 26, 2021
@facebook-github-bot facebook-github-bot deleted the gh/BIT-silence/4/head branch July 26, 2021 14:17