
[Pytorch] Improve conversion from bf16 on aarch64/NEON #166880

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D86119613

Conversation

@Nicoshev
Contributor

@Nicoshev Nicoshev commented Nov 3, 2025

Summary:
Conversion from/to bfloat16 was not covered by the conversion templates, because they used bfloat16_t as the data type instead of the custom c10::BFloat16
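
To illustrate the gap, here is a minimal hypothetical sketch (the VecConvert trait and both stand-in structs are invented for illustration; they are not PyTorch's actual templates): a vectorized fast path specialized on the compiler's bfloat16_t type is never selected when tensors carry the distinct wrapper type c10::BFloat16.

```cpp
#include <cstdint>

// Invented stand-ins: c10::BFloat16 is a struct wrapping a uint16_t,
// a different type from the compiler-provided bfloat16_t.
struct CompilerBf16 { uint16_t bits; };  // stands in for bfloat16_t
struct C10BFloat16  { uint16_t bits; };  // stands in for c10::BFloat16

// Invented conversion trait: scalar fallback by default...
template <typename From, typename To>
struct VecConvert {
  static constexpr bool vectorized = false;
};

// ...with a vectorized specialization keyed on the compiler type only.
template <typename To>
struct VecConvert<CompilerBf16, To> {
  static constexpr bool vectorized = true;
};

// Tensors typed as c10::BFloat16 never match the specialization, so
// bf16 conversions silently fall back to the scalar path.
static_assert(!VecConvert<C10BFloat16, float>::vectorized);
static_assert(VecConvert<CompilerBf16, float>::vectorized);
```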

Conversion by casting from/to bfloat16_t is broken in clang 17 through 20 and fixed in clang-21.
Because PyTorch does not currently run CI on binaries compiled with clang-21, we won't take that approach for now.

We are currently only adding conversion from bfloat16, as it can be implemented by zero-extending each value into a 4-byte float (sketched below).
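
A minimal sketch of that zero-extension trick, assuming little-endian AArch64 with NEON (this is the standard bf16-widening idiom, not necessarily the PR's exact code): a bfloat16 holds the upper 16 bits of an IEEE-754 float32, so widening is a left shift by 16 into a 32-bit lane followed by a bit-cast.

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstring>

// Scalar form: place the bf16 bit pattern in the high half of a 32-bit
// word, leave the low 16 mantissa bits zero, then bit-cast to float.
float bf16_bits_to_f32(uint16_t bits) {
  uint32_t widened = static_cast<uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// NEON form: vshll_n_u16 zero-extends four u16 lanes to u32 while
// shifting left by 16, so reinterpreting the lanes yields four floats.
float32x4_t bf16x4_to_f32x4(uint16x4_t bits) {
  return vreinterpretq_f32_u32(vshll_n_u16(bits, 16));
}
```

From the widened float32 values, the integer and double targets in the benchmark below can presumably reuse the existing float conversion paths.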

We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:

bfloat16_t->uint8  ===> 423.583us
bfloat16_t->int8   ===> 424.090us
bfloat16_t->int16  ===> 430.817us
bfloat16_t->int64  ===> 571.547us
bfloat16_t->double ===> 459.089us

After:

bfloat16_t->uint8  ===> 123.783us ----> 3.42x throughput
bfloat16_t->int8   ===> 131.575us ----> 3.22x throughput
bfloat16_t->int16  ===> 136.794us ----> 3.15x throughput
bfloat16_t->int64  ===> 177.699us ----> 3.22x throughput
bfloat16_t->double ===> 165.556us ----> 2.77x throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86119613

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

@pytorch-bot

pytorch-bot bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166880

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 98734d2 with merge base 3a38ec7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu label Nov 3, 2025
@meta-codesync

meta-codesync bot commented Nov 3, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86119613.

@Nicoshev Nicoshev requested a review from mcfi November 3, 2025 19:27
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 3, 2025
Summary:

Conversion from/to bfloat16 was not covered by the conversion templates, because they used bfloat16_t as the data type instead of the custom c10::BFloat16

Conversion by casting from/to bfloat16_t is broken in clang 17 through 20 and fixed in clang-21.
Because PyTorch does not currently run CI on binaries compiled with clang-21, we won't take that approach for now.

We are currently only adding conversion from bfloat16, as it can be implemented by zero-extending into a 4-byte float.

We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:

bfloat16_t->uint8  ===> 423.583us
bfloat16_t->int8  ===> 424.090us
bfloat16_t->int16  ===> 430.817us
bfloat16_t->int64  ===> 571.547us
bfloat16_t->double ===> 459.089us

After:

bfloat16_t->uint8  ===> 142.698us ----> 2.97x throughput
bfloat16_t->int8   ===> 134.837us ----> 3.15x throughput
bfloat16_t->int16  ===> 136.794us ----> 3.15x throughput
bfloat16_t->int64  ===> 200.364us ----> 2.85x throughput
bfloat16_t->double ===> 137.103us ----> 3.35x throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D86119613
@Nicoshev Nicoshev added the ciflow/linux-aarch64, ciflow/trunk, release notes: cpu (aarch64), and module: arm labels Nov 3, 2025
@Nicoshev Nicoshev changed the title from "[Pytorch] Improve conversion from bf16" to "[Pytorch] Improve conversion from bf16 on aarch64/NEON" Nov 3, 2025
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorch-bot bot pushed a commit that referenced this pull request Nov 4, 2025

Pull Request resolved: #166880
Approved by: https://github.com/mcfi, https://github.com/aditew01