Conversation


@mingfeima mingfeima commented Sep 1, 2022

Stack from ghstack:


cc @VitalyFedyunin @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

Originally, `cpu/moments_utils.h` used the `at::native::utils` namespace. This file contains `Vectorized<>`; to have it vectorized properly for each arch, it needs to use an anonymous namespace or an inline namespace. Otherwise it gets linked to the scalar version of the code.

This PR fixes a vectorization issue in `RowwiseMoments`, which is used to calculate `mean` and `rstd` in norm layers. Benchmark data is attached below; generally fp32 gets a 2-3x speedup, and bf16 gets an even larger one.

This patch improves layer_norm (input size 32x128x1024) float32 inference:
* avx512 single socket: 2.1x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.439 ms; bf16: 2.479 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
```
* avx512 single core: 3.2x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 6.308 ms; bf16: 39.765 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
```
* avx2 single socket: 2.3x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 1.248 ms; bf16: 8.487 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
```
* avx2 single core: 2.5x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 10.792 ms; bf16: 66.366 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
```

Some VTune profiling results from the original code are attached here to further illustrate the issue:

1. original bottlenecks
![master_bottleneck](https://user-images.githubusercontent.com/20233731/180125611-deed41b7-dd2e-4437-a7d9-6ad0096e5850.png)

We can see `RowwiseMomentsImpl<>` takes the majority of the runtime here.

2. Instruction level breakdown of `RowwiseMomentsImpl<>`
![rowwise_momentum_impl](https://user-images.githubusercontent.com/20233731/180125759-a3b48bc4-8e54-4219-92b4-defde5e86046.png)

We can see it is all **scalar** instructions here.

3. after the fix, the bottlenecks
![fixed_bottleneck](https://user-images.githubusercontent.com/20233731/180125880-8d08eb1b-af09-4f80-ae58-80215365d407.png)

getting better.

4. after the fix, Instruction level breakdown of `RowwiseMomentsImpl<>`
![fixed_rowwsie_momentum_impl](https://user-images.githubusercontent.com/20233731/180125989-b45db4ad-e6ed-460a-8d51-74fbeecf8b02.png)

Now it is all **vectorized** instructions.

[ghstack-poisoned]

facebook-github-bot commented Sep 1, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 4ba0629 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.


mingfeima commented Sep 1, 2022

Replacement of #81849.
Need to fix this performance regression first: pytorch/benchmark#1099


pytorch-bot bot commented Sep 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84404

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 229cf48:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


@mingfeima

Since pytorch/benchmark#1099 has been identified as a false alarm, shall we proceed with reviewing this PR again? @malfet, @frank-wei

@mingfeima mingfeima marked this pull request as ready for review September 21, 2022 05:58
@malfet malfet left a comment


Hmm, the inline namespace concept seems dangerous to me, as I'm not sure I understand how it guarantees that symbols from, say, the avx512 namespace will not get included from avx2-only code. Perhaps you just need to add a regular namespace and call `utils::CPU_CAPABILITY::RowwiseMoments`?


mingfeima commented Sep 23, 2022

> Hmm, the inline namespace concept seems dangerous to me, as I'm not sure I understand how it guarantees that symbols from, say, the avx512 namespace will not get included from avx2-only code. Perhaps you just need to add a regular namespace and call `utils::CPU_CAPABILITY::RowwiseMoments`?

Initially, all the CPU kernels under aten/src/ATen/native/cpu that require vectorization used anonymous namespaces, which make the functions static so they get linked to different assembly for scalar/avx2/avx512. For example, see the CatKernel here.

Later on, some kernels were changed to use an inline namespace, for example CopyKernel. Sure, this will also do the job, but honestly I'm not sure why it was introduced in the first place ...

@malfet Is it OK if I change this file back to anonymous namespaces? Right now, most of the CPU kernels are still written this way.

[Edit]: I have verified that both an inline namespace and an anonymous namespace can properly vectorize the code.

@mingfeima mingfeima requested a review from malfet September 23, 2022 02:40
@facebook-github-bot

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.


linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

CaoE added a commit that referenced this pull request Nov 17, 2022
…and GroupNorm"

This PR is cherry-picked from #84404 ~ #81852.

[ghstack-poisoned]
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Nov 28, 2022
@mingfeima mingfeima added the topic: not user facing topic category label Nov 28, 2022
@mingfeima mingfeima requested a review from jgong5 November 28, 2022 04:51
@jgong5 jgong5 left a comment


> Hmm, the inline namespace concept seems dangerous to me, as I'm not sure I understand how it guarantees that symbols from, say, the avx512 namespace will not get included from avx2-only code. Perhaps you just need to add a regular namespace and call `utils::CPU_CAPABILITY::RowwiseMoments`?

> Later on, some kernels were changed to use an inline namespace, for example CopyKernel. Sure, this will also do the job, but honestly I'm not sure why it was introduced in the first place ...

My understanding is that an inline namespace is preferred for functions defined in header files (e.g., moments_utils.h in this PR). With it, there won't be duplicated definitions in the source files that include it. For functions defined in source files, using an anonymous namespace should be fine, and most of the PyTorch source files seem to follow this. CopyKernel seems like an exception, since its functions (e.g., direct_copy_kernel) are also exposed directly in the header file and used by other kernels. I would suggest we use an inline namespace for moments_utils.h.

@mingfeima mingfeima requested a review from jgong5 November 29, 2022 05:19
@mingfeima

@jgong5 Updated, changed back to the inline namespace.

@mingfeima

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 30, 2022
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#84404
Approved by: https://github.com/jgong5
@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/86/head branch June 8, 2023 18:01