[PyTorch] Improve conversion from/to FP16 on aarch64+sve #166306

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85533271

Conversation

@Nicoshev Nicoshev (Contributor) commented Oct 27, 2025

Summary:
Conversion from/to float16 was not covered by the vectorized conversion templates, because those templates used float16_t as the data type rather than PyTorch's custom at::Half.

We add a shim so that conversions involving at::Half dispatch to the autovectorized float16 routines; a minimal sketch of the idea follows.
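
For illustration, here is a minimal sketch of the shim idea, not the actual ATen code: at::Half is bit-compatible with the native float16_t, so a thin overload can reinterpret the buffer and forward to the conversion template that the compiler already autovectorizes. The names convert_impl and convert_from_half, and the stand-in at::Half definition, are hypothetical.

```cpp
// Sketch only; assumes an aarch64 target compiled with fp16 support
// (e.g. armv9-a+sve2+fp16).
#include <arm_fp16.h>  // float16_t
#include <cstdint>

namespace at { struct Half { uint16_t x; }; }  // stand-in for at::Half

// Plain conversion loop over native types; with SVE2+FP16 enabled the
// compiler autovectorizes this into SVE conversion instructions.
template <typename To, typename From>
void convert_impl(To* dst, const From* src, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = static_cast<To>(src[i]);
  }
}

// Shim: at::Half on its own never matches the template above, so
// reinterpret it as float16_t and reuse the autovectorized path.
template <typename To>
void convert_from_half(To* dst, const at::Half* src, int64_t n) {
  static_assert(sizeof(at::Half) == sizeof(float16_t),
                "at::Half must have the same layout as float16_t");
  convert_impl(dst, reinterpret_cast<const float16_t*>(src), n);
}
```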

We observed the following performance improvements when targeting armv9-a+sve2+fp16 (throughput gain = before/after - 1):

| Conversion | Before | After | Throughput gain |
| --- | --- | --- | --- |
| float16_t -> uint8 -> float16_t | 657.489 µs | 181.216 µs | 263% |
| float16_t -> int8 -> float16_t | 656.518 µs | 179.821 µs | 265% |
| float16_t -> int16 -> float16_t | 668.998 µs | 183.417 µs | 265% |
| float16_t -> int64 -> float16_t | 618.444 µs | 459.897 µs | 35% |
| float16_t -> double -> float16_t | 439.728 µs | 351.276 µs | 25% |

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85533271

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01


pytorch-bot bot commented Oct 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166306

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8cf40e6 with merge base e214af6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 27, 2025

meta-codesync bot commented Oct 27, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85533271.

@Nicoshev Nicoshev requested a review from mcfi October 27, 2025 14:46
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 27, 2025
@Nicoshev Nicoshev (Contributor, Author) commented:

@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"

@pytorch-bot pytorch-bot bot added release notes: cpu (aarch64) release notes category for aarch64, arm, etc. topic: not user facing topic category labels Oct 27, 2025
@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

tianrengao pushed a commit that referenced this pull request Oct 30, 2025

Pull Request resolved: #166306
Approved by: https://github.com/mcfi, https://github.com/ezyang
@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025