
[Pytorch] Improve conversion from bf16 on aarch64/NEON #166880

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D86119613

Conversation

@Nicoshev
Contributor

@Nicoshev Nicoshev commented Nov 3, 2025

Summary:
Conversion from/to bfloat16 was not covered by the conversion templates, because they used bfloat16_t as the data type instead of the custom c10::BFloat16
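
To illustrate the gap, here is a minimal hypothetical sketch (the VecConvert trait and both stand-in structs are invented for illustration; they are not PyTorch's actual templates): a vectorized fast path specialized on the compiler's bfloat16_t type is never selected when tensors carry the distinct wrapper type c10::BFloat16.

```cpp
#include <cstdint>

// Invented stand-ins: c10::BFloat16 is a struct wrapping a uint16_t,
// a different type from the compiler-provided bfloat16_t.
struct CompilerBf16 { uint16_t bits; };  // stands in for bfloat16_t
struct C10BFloat16  { uint16_t bits; };  // stands in for c10::BFloat16

// Invented conversion trait: scalar fallback by default...
template <typename From, typename To>
struct VecConvert {
  static constexpr bool vectorized = false;
};

// ...with a vectorized specialization keyed on the compiler type only.
template <typename To>
struct VecConvert<CompilerBf16, To> {
  static constexpr bool vectorized = true;
};

// Tensors typed as c10::BFloat16 never match the specialization, so
// bf16 conversions silently fall back to the scalar path.
static_assert(!VecConvert<C10BFloat16, float>::vectorized);
static_assert(VecConvert<CompilerBf16, float>::vectorized);
```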

Conversion by casting from/to bfloat16_t is broken in clang 17 through 20 and fixed in clang-21.
Because PyTorch does not currently run CI on binaries compiled with clang-21, we won't take that approach for now.

We are currently only adding conversion from bfloat16, as it can be implemented by zero-extending each value into a 4-byte float (sketched below).
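
A minimal sketch of that zero-extension trick, assuming little-endian AArch64 with NEON (this is the standard bf16-widening idiom, not necessarily the PR's exact code): a bfloat16 holds the upper 16 bits of an IEEE-754 float32, so widening is a left shift by 16 into a 32-bit lane followed by a bit-cast.

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstring>

// Scalar form: place the bf16 bit pattern in the high half of a 32-bit
// word, leave the low 16 mantissa bits zero, then bit-cast to float.
float bf16_bits_to_f32(uint16_t bits) {
  uint32_t widened = static_cast<uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// NEON form: vshll_n_u16 zero-extends four u16 lanes to u32 while
// shifting left by 16, so reinterpreting the lanes yields four floats.
float32x4_t bf16x4_to_f32x4(uint16x4_t bits) {
  return vreinterpretq_f32_u32(vshll_n_u16(bits, 16));
}
```

From the widened float32 values, the integer and double targets in the benchmark below can presumably reuse the existing float conversion paths.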

We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:

bfloat16_t->uint8  ===> 423.583us
bfloat16_t->int8   ===> 424.090us
bfloat16_t->int16  ===> 430.817us
bfloat16_t->int64  ===> 571.547us
bfloat16_t->double ===> 459.089us

After:

bfloat16_t->uint8  ===> 123.783us ----> 3.42x throughput
bfloat16_t->int8   ===> 131.575us ----> 3.22x throughput
bfloat16_t->int16  ===> 136.794us ----> 3.15x throughput
bfloat16_t->int64  ===> 177.699us ----> 3.22x throughput
bfloat16_t->double ===> 165.556us ----> 2.77x throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86119613

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

@pytorch-bot

pytorch-bot bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166880

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 98734d2 with merge base 3a38ec7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu label Nov 3, 2025
@meta-codesync

meta-codesync bot commented Nov 3, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86119613.

@Nicoshev Nicoshev requested a review from mcfi November 3, 2025 19:27
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 3, 2025
Summary:

Conversion from/to bfloat16 was not covered by the conversion templates, because they used bfloat16_t as the data type instead of the custom c10::BFloat16

Conversion by casting from/to bfloat16_t is broken in clang 17 through 20 and fixed in clang-21.
Because PyTorch does not currently run CI on binaries compiled with clang-21, we won't take that approach for now.

We are currently only adding conversion from bfloat16, as it can be implemented by zero-extending into a 4-byte float.

We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:

bfloat16_t->uint8  ===> 423.583us
bfloat16_t->int8  ===> 424.090us
bfloat16_t->int16  ===> 430.817us
bfloat16_t->int64  ===> 571.547us
bfloat16_t->double ===> 459.089us

After:

bfloat16_t->uint8  ===> 142.698us ----> 2.97x throughput
bfloat16_t->int8   ===> 134.837us ----> 3.15x throughput
bfloat16_t->int16  ===> 136.794us ----> 3.15x throughput
bfloat16_t->int64  ===> 200.364us ----> 2.85x throughput
bfloat16_t->double ===> 137.103us ----> 3.35x throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D86119613
@Nicoshev Nicoshev added the ciflow/linux-aarch64, ciflow/trunk, release notes: cpu (aarch64), and module: arm labels Nov 3, 2025
@Nicoshev Nicoshev changed the title from "[Pytorch] Improve conversion from bf16" to "[Pytorch] Improve conversion from bf16 on aarch64/NEON" Nov 3, 2025
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorch-bot bot pushed a commit that referenced this pull request Nov 4, 2025

Pull Request resolved: #166880
Approved by: https://github.com/mcfi, https://github.com/aditew01