Improve numeric stability for LayerNorm on CUDA #40308
Conversation
link #40302
Can you please profile performance with sizes from this script? I'm concerned about using int64 for the computation; it might slow down the kernels considerably.
💊 Dr. CI: As of commit a118b98, there are no CI failures.
I tried changing the index summation to int32, but I didn't see a significant difference. On one V100 machine with input size [4096, 65536], the average running time is 4.375ms for int32 vs 4.384ms for int64.
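For context, one way such per-size timings can be collected is with CUDA events. The sketch below is illustrative only: the shapes and the `time_layer_norm` helper are mine, not the benchmark script referenced earlier in the thread.

```python
import torch
import torch.nn.functional as F

def time_layer_norm(M, N, iters=100):
    # Time the F.layer_norm forward pass on CUDA using CUDA events.
    x = torch.randn(M, N, dtype=torch.float, device="cuda")

    # Warm-up runs so caching / lazy initialization don't skew the measurement.
    for _ in range(10):
        F.layer_norm(x, (N,))
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.layer_norm(x, (N,))
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

if __name__ == "__main__":
    for shape in [(1024, 32768), (4096, 32768), (4096, 65536)]:
        print(shape, f"{time_layer_norm(*shape):.3f} ms")
```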
Codecov Report
@@ Coverage Diff @@
## gh/BIT-silence/4/base #40308 +/- ##
=========================================================
- Coverage 68.06% 68.05% -0.01%
=========================================================
Files 393 393
Lines 50918 50918
=========================================================
- Hits 34655 34654 -1
- Misses 16263 16264 +1
Continue to review the full report at Codecov.
#59987 should fix the issue.
Stack from ghstack:
Improve numeric stability for LayerNorm on CUDA.
This diff will slightly decrease the performance of LayerNorm but increase its numerical stability, especially when the variance is small.
Example:
M = 1024, N = 32768
input = torch.randn(M, N, dtype=torch.float, device="cuda") * 0.01 + 100.0
Previously in this case, the computed variance was
[0. , 0.000977, 0. , ..., 0. , 0. , 0. ]
After this diff, the variance is
[1.002436e-04, 9.923753e-05, 1.014664e-04, ..., 1.015593e-04, 1.000437e-04, 1.009089e-04]
which is close to 1e-4, as expected. The max relative difference compared to the double-precision result is 5.69883169e-05.
The average running time for F.layer_norm with input size [4096, 32768] on one devgpu is about 5.18ms before this PR and about 5.35ms after it, so the impact on performance should be limited.
Differential Revision: D21993664
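The effect described above can be reproduced outside the kernel with a few lines of Python. The sketch below is illustrative only (the variable names are mine): it contrasts a naive single-pass variance, Var[x] = E[x²] − E[x]², computed in float32 — which suffers catastrophic cancellation when the squared mean (~1e4) dwarfs the true variance (~1e-4) — with a float64 reference.

```python
import torch
import torch.nn.functional as F

# Input from the example above: mean ~100, std ~0.01, so true variance ~1e-4 per row.
M, N = 1024, 32768
x = torch.randn(M, N, dtype=torch.float, device="cuda") * 0.01 + 100.0

# Naive single-pass variance, Var[x] = E[x^2] - E[x]^2, in float32.
# With E[x]^2 ~ 1e4 and a true variance of ~1e-4, float32 (~7 significant
# digits) cannot resolve the difference, so the result collapses to ~0.
mean32 = x.mean(dim=1)
var32_naive = (x * x).mean(dim=1) - mean32 * mean32

# float64 reference for the same rows.
x64 = x.double()
mean64 = x64.mean(dim=1)
var64_ref = (x64 * x64).mean(dim=1) - mean64 * mean64

print(var32_naive[:3])  # roughly zero: catastrophic cancellation
print(var64_ref[:3])    # close to 1e-4, as expected

# LayerNorm output against a float64 reference; with a variance that has
# collapsed to ~0, the rsqrt(var + eps) normalization scale is far off,
# so the float32 output differs noticeably from the reference.
y32 = F.layer_norm(x, (N,))
y64 = F.layer_norm(x64, (N,))
print((y32.double() - y64).abs().max())
```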