[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028) by Nicoshev · Pull Request #166641 · pytorch/pytorch

Nicoshev · 2025-10-30T14:14:33Z

Summary:

This PR allows the compiler to better optimize some bfloat16-based operations when run on NEON.

This is a retry at landing the code, after noting that these expressions became available in recent compiler versions.

The current CI benchmark binary_test.py measures the affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput
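
The percentages above follow from treating throughput as inversely proportional to the measured time, so the gain is before/after - 1. A minimal sketch that recomputes them from the listed timings; the dictionary and function names are illustrative and not part of the PR:

```python
# Timings copied from the Before/After lists above, in microseconds.
before_us = {
    "add": 250.503,
    "sub": 245.674,
    "neg": 113.945,
    "abs": 115.953,
    "reciprocal": 262.602,
}
after_us = {
    "add": 203.862,
    "sub": 201.526,
    "neg": 68.416,
    "abs": 71.003,
    "reciprocal": 177.834,
}


def throughput_gain_percent(before: float, after: float) -> float:
    """Return the throughput gain in percent, assuming throughput ~ 1 / time."""
    return (before / after - 1.0) * 100.0


# Hypothetical helper mapping: op name -> throughput gain in percent.
gains = {
    op: throughput_gain_percent(before_us[op], after_us[op])
    for op in before_us
}

for op, gain in gains.items():
    print(f"bfloat16 {op}: {gain:.1f}% higher throughput")
```

Rounded to whole percentages, the results match the figures quoted in the After list.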

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

pytorch-bot · 2025-10-30T14:14:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166641

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8ff4b6e with merge base 85b035c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-10-30T14:14:43Z

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85809843.

Nicoshev · 2025-10-30T14:15:28Z

@pytorchbot label "release notes: cpu (aarch64)"

…h#166641) Summary: PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON Retrying to land the code, after noting that these expressions became available in recent compiler versions. Current CI benchmark ‎binary_test.py will measure affected codepaths. Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2: Before: bfloat16 add: 250.503us bfloat16 sub: 245.674us bfloat16 neg: 113.945us After: bfloat16 add: 203.862us ---> 23% higher throughput bfloat16 sub: 201.526us ---> 22% higher throughput bfloat16 neg: 74.986us ---> 52% higher throughput Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Reviewed By: mcfi Differential Revision: D85809843

…h#166641) Summary: PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON Retrying to land the code, after noting that these expressions became available in recent compiler versions. Current CI benchmark ‎binary_test.py will measure affected codepaths. Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2: Before: bfloat16 add: 250.503us bfloat16 sub: 245.674us bfloat16 neg: 113.945us bfloat16 reciprocal: 262.602us After: bfloat16 add: 203.862us ---> 23% higher throughput bfloat16 sub: 201.526us ---> 22% higher throughput bfloat16 neg: 68.416us ---> 67% higher throughput bfloat16 reciprocal: 242.927us ---> 8% higher throughput Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Reviewed By: mcfi Differential Revision: D85809843

aten/src/ATen/cpu/vec/vec128/vec128_bfloat16_neon.h


…h#166641) Summary: PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON Retrying to land the code, after noting that these expressions became available in recent compiler versions. Current CI benchmark ‎binary_test.py will measure affected codepaths. Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2: Before: bfloat16 add: 250.503us bfloat16 sub: 245.674us bfloat16 neg: 113.945us bfloat16 reciprocal: 262.602us After: bfloat16 add: 203.862us ---> 23% higher throughput bfloat16 sub: 201.526us ---> 22% higher throughput bfloat16 neg: 68.416us ---> 67% higher throughput bfloat16 reciprocal: 177.834us ---> 48% higher throughput Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Reviewed By: mcfi Differential Revision: D85809843

malfet · 2025-10-30T21:20:01Z

@pytorchbot merge -f "All relevant tests are green"

pytorchmergebot · 2025-10-30T21:21:43Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-10-30T21:21:55Z

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team

Raised by workflow job


facebook-github-bot · 2025-10-31T18:13:12Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-10-31T18:15:21Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

… (#166641)

Summary:

PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

Pull Request resolved: #166641
Approved by: https://github.com/Skylion007, https://github.com/malfet



pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 30, 2025

meta-codesync bot added fb-exported meta-exported labels Oct 30, 2025

Nicoshev requested a review from malfet October 30, 2025 14:15

pytorch-bot bot added the release notes: cpu (aarch64) release notes category for aarch64, arm, etc. label Oct 30, 2025

Nicoshev requested a review from seemethere October 30, 2025 14:15

Nicoshev added ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request labels Oct 30, 2025

Nicoshev force-pushed the export-D85809843 branch 2 times, most recently from 6e426eb to b3cd58e Compare October 30, 2025 14:39

Nicoshev force-pushed the export-D85809843 branch 2 times, most recently from 3edaf41 to 79a8d31 Compare October 30, 2025 15:55

Nicoshev force-pushed the export-D85809843 branch from 79a8d31 to b892b38 Compare October 30, 2025 16:21

Skylion007 reviewed Oct 30, 2025

aten/src/ATen/cpu/vec/vec128/vec128_bfloat16_neon.h

Nicoshev force-pushed the export-D85809843 branch from b892b38 to 2fa8e61 Compare October 30, 2025 16:51

Nicoshev requested a review from Skylion007 October 30, 2025 16:55

Nicoshev force-pushed the export-D85809843 branch from 2fa8e61 to 1bdb78f Compare October 30, 2025 17:12

Skylion007 approved these changes Oct 30, 2025


Nicoshev force-pushed the export-D85809843 branch from 1bdb78f to 2a7bed1 Compare October 30, 2025 19:21

malfet approved these changes Oct 30, 2025


pytorchmergebot added the merging label Oct 30, 2025

pytorchmergebot removed the merging label Oct 30, 2025

Nicoshev force-pushed the export-D85809843 branch from 2a7bed1 to 8602ee1 Compare October 31, 2025 02:11

Nicoshev force-pushed the export-D85809843 branch from 8602ee1 to a688902 Compare October 31, 2025 05:27

Nicoshev force-pushed the export-D85809843 branch from a688902 to 8ff4b6e Compare October 31, 2025 05:28

pytorchmergebot added the merging label Oct 31, 2025

pytorchmergebot closed this in b71966f Oct 31, 2025

pytorchmergebot added Merged and removed merging labels Oct 31, 2025

Anallear added a commit to Anallear/pytorch that referenced this pull request Jan 19, 2026

Revert "[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (pytorch#166028) (pytorch#166641)" (9367db2)

This reverts commit b71966f.

Anallear added a commit to Anallear/pytorch that referenced this pull request Jan 19, 2026

Reapply "[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (pytorch#166028) (pytorch#166641)" (cbfba06)

This reverts commit 9367db2.

Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85809843

5 participants
