[Pytorch] Improve conversion to bfloat16 on aarch64/NEON by Nicoshev · Pull Request #166958 · pytorch/pytorch

Nicoshev · 2025-11-04T16:16:54Z

Summary:
Autovectorization of casts to bfloat16_t is broken in clang 17 through 20 and fixed in clang 21.

This PR adds hand-vectorized workaround code, which speeds up conversion from the smaller integer data types.
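For orientation, here is a scalar Python sketch of what an integer to bfloat16 conversion computes. It is not the NEON code from this PR, which does not appear in the conversation. It assumes the conversion goes through float32 and then keeps the upper 16 bits with round-to-nearest-even. The function names are hypothetical and used only for illustration.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def int_to_bfloat16_bits(value: int) -> int:
    """Convert an integer to bfloat16 bits (assumption: via float32, round-to-nearest-even)."""
    bits = float32_bits(float(value))
    # Add 0x7FFF plus the lowest retained bit so that exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits: int) -> float:
    """Widen bfloat16 bits back to a Python float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


if __name__ == "__main__":
    for v in (1, 255, 257, 259, -1):
        print(v, "->", bfloat16_bits_to_float(int_to_bfloat16_bits(v)))
```

bfloat16 keeps only 8 bits of significand precision, so every uint8 and int8 value is represented exactly, while larger int16 and int32 values may be rounded.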

We observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

before:

uint8->bfloat16_t ===> 319.433us
int8->bfloat16_t ===> 320.216us
int16->bfloat16_t ===> 326.899us
int32->bfloat16_t ===> 327.925us

after:

uint8->bfloat16_t ===> 185.189us -----> 72% higher throughput
int8->bfloat16_t ===> 169.790us -----> 89% higher throughput
int16->bfloat16_t ===> 180.744us -----> 81% higher throughput
int32->bfloat16_t ===> 185.129us -----> 77% higher throughput
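The throughput percentages can be reproduced from the timings. A minimal sketch, assuming "higher throughput" means the ratio of the before time to the after time, minus one (the variable and function names are illustrative):

```python
# Timings in microseconds, copied from the before/after lists above.
before_us = {"uint8": 319.433, "int8": 320.216, "int16": 326.899, "int32": 327.925}
after_us = {"uint8": 185.189, "int8": 169.790, "int16": 180.744, "int32": 185.129}


def throughput_gain_percent(before: float, after: float) -> float:
    """Throughput is inversely proportional to time, so the gain is before/after - 1."""
    return (before / after - 1.0) * 100.0


gains = {
    dtype: round(throughput_gain_percent(before_us[dtype], after_us[dtype]))
    for dtype in before_us
}

if __name__ == "__main__":
    for dtype, gain in gains.items():
        print(f"{dtype}->bfloat16_t: {gain}% higher throughput")
```

Under that definition the rounded results match the figures reported in the summary.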

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

pytorch-bot · 2025-11-04T16:16:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166958

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 444d752 with merge base d273422:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge, unstable) (gh) (#166072)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-11-04T16:17:01Z

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86207189.

malfet · 2025-11-05 (requested changes)

Please add performance benchmark

facebook-github-bot · 2025-11-14T01:53:22Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot · 2025-11-14T01:53:26Z

This PR has pending changes requested. Please address the comments and update the PR before merging.

Nicoshev · 2025-11-14T02:36:16Z

@pytorchbot merge -f "Benchmarks added in a separate PR. CI failure was on an x86 build compiled by gcc, it does not execute added path.

pytorch-bot · 2025-11-14T02:36:18Z

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string
Try `@pytorchbot --help` for more info.

Nicoshev · 2025-11-14T02:38:04Z

@pytorchbot merge -f "Benchs added in PR167099. CI fails on an x86 build compiled by gcc, it doesn't execute new path"

pytorchmergebot · 2025-11-14T02:39:42Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Commit message (Summary and Test Plan identical to the PR description above):
Differential Revision: D86207189
Pull Request resolved: pytorch#166958
Approved by: https://github.com/mcfi

pytorch-bot bot added the module: cpu label Nov 4, 2025

meta-codesync bot added the fb-exported and meta-exported labels Nov 4, 2025

Nicoshev requested a review from mcfi November 4, 2025 16:17

Nicoshev added the module: arm, ciflow/trunk, ciflow/linux-aarch64, and release notes: cpu (aarch64) labels Nov 4, 2025

mcfi approved these changes Nov 4, 2025

Nicoshev force-pushed the export-D86207189 branch from d51ef2d to 27e728b November 4, 2025 16:31

Nicoshev force-pushed the export-D86207189 branch from 27e728b to 587be9a November 4, 2025 16:40

Nicoshev force-pushed the export-D86207189 branch from 587be9a to d9b2824 November 4, 2025 17:10

Nicoshev force-pushed the export-D86207189 branch from d9b2824 to 01d38f2 November 4, 2025 17:14

Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from 893bb57 to f563627 November 4, 2025 17:21

Nicoshev force-pushed the export-D86207189 branch from f563627 to 59c4cf4 November 4, 2025 17:22

Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from a2ceafe to c98755c November 4, 2025 20:46

Nicoshev force-pushed the export-D86207189 branch from c98755c to a97e4bb November 4, 2025 23:49

Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from 8639998 to c734960 November 5, 2025 14:00

malfet requested changes Nov 5, 2025

Nicoshev force-pushed the export-D86207189 branch from c734960 to 15c74aa November 11, 2025 16:53

Nicoshev requested a review from malfet November 11, 2025 18:39

Nicoshev force-pushed the export-D86207189 branch from 15c74aa to af6bf74 November 11, 2025 18:40

Nicoshev force-pushed the export-D86207189 branch from af6bf74 to f61269a November 11, 2025 21:48

Nicoshev force-pushed the export-D86207189 branch from f61269a to 1af3ed3 November 12, 2025 17:59

Nicoshev force-pushed the export-D86207189 branch from 1af3ed3 to 29fd6ef November 13, 2025 19:16

Nicoshev force-pushed the export-D86207189 branch from 29fd6ef to 444d752 November 13, 2025 20:35

pytorchmergebot added the merging label Nov 14, 2025

pytorchmergebot closed this in 5e6ac5c Nov 14, 2025

pytorchmergebot added Merged and removed merging labels Nov 14, 2025

Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D86207189

5 participants
