Skip to content

[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028)#166641

Closed
Nicoshev wants to merge 1 commit intopytorch:mainfrom
Nicoshev:export-D85809843
Closed

[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028)#166641
Nicoshev wants to merge 1 commit intopytorch:mainfrom
Nicoshev:export-D85809843

Conversation

@Nicoshev
Copy link
Contributor

@Nicoshev Nicoshev commented Oct 30, 2025

Summary:

PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 30, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166641

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8ff4b6e with merge base 85b035c (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-codesync
Copy link

meta-codesync bot commented Oct 30, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85809843.

@Nicoshev
Copy link
Contributor Author

@pytorchbot label "release notes: cpu (aarch64)"

@Nicoshev Nicoshev requested a review from malfet October 30, 2025 14:15
@pytorch-bot pytorch-bot bot added the release notes: cpu (aarch64) release notes category for aarch64, arm, etc. label Oct 30, 2025
@Nicoshev Nicoshev requested a review from seemethere October 30, 2025 14:15
@Nicoshev Nicoshev added ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request labels Oct 30, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 74.986us ---> 52% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
@Nicoshev Nicoshev force-pushed the export-D85809843 branch 2 times, most recently from 6e426eb to b3cd58e Compare October 30, 2025 14:39
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 74.986us ---> 52% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 74.986us ---> 52% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
@Nicoshev Nicoshev force-pushed the export-D85809843 branch 2 times, most recently from 3edaf41 to 79a8d31 Compare October 30, 2025 15:55
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 74.986us ---> 52% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 reciprocal: 242.927us ---> 8% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 reciprocal: 242.927us ---> 8% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
@Nicoshev Nicoshev requested a review from Skylion007 October 30, 2025 16:55
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
@malfet
Copy link
Contributor

malfet commented Oct 30, 2025

@pytorchbot merge -f "All relevant tests are green"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team Raised by workflow job

Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 31, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 31, 2025
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
…h#166641)

Summary:


PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

BoyuanFeng pushed a commit that referenced this pull request Oct 31, 2025
… (#166641)

Summary:

PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

Pull Request resolved: #166641
Approved by: https://github.com/Skylion007, https://github.com/malfet
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
…h#166028) (pytorch#166641)

Summary:

PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

Current CI benchmark ‎binary_test.py will measure affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

Pull Request resolved: pytorch#166641
Approved by: https://github.com/Skylion007, https://github.com/malfet
Anallear added a commit to Anallear/pytorch that referenced this pull request Jan 19, 2026
Anallear added a commit to Anallear/pytorch that referenced this pull request Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged meta-exported module: cpu CPU specific problem (e.g., perf, algorithm) release notes: cpu (aarch64) release notes category for aarch64, arm, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants