[PyTorch] Improve conversion from/to FP16 on aarch64+sve #166306

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85533271

Conversation

@Nicoshev Nicoshev (Contributor) commented Oct 27, 2025

Summary:
Conversion from/to float16 was not covered by the vectorized conversion templates, because those templates used float16_t as the data type rather than PyTorch's custom at::Half.

We add a shim so that conversions involving at::Half dispatch to the autovectorized float16 routines; a minimal sketch of the idea follows.
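
For illustration, here is a minimal sketch of the shim idea, not the actual ATen code: at::Half is bit-compatible with the native float16_t, so a thin overload can reinterpret the buffer and forward to the conversion template that the compiler already autovectorizes. The names convert_impl and convert_from_half, and the stand-in at::Half definition, are hypothetical.

```cpp
// Sketch only; assumes an aarch64 target compiled with fp16 support
// (e.g. armv9-a+sve2+fp16).
#include <arm_fp16.h>  // float16_t
#include <cstdint>

namespace at { struct Half { uint16_t x; }; }  // stand-in for at::Half

// Plain conversion loop over native types; with SVE2+FP16 enabled the
// compiler autovectorizes this into SVE conversion instructions.
template <typename To, typename From>
void convert_impl(To* dst, const From* src, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = static_cast<To>(src[i]);
  }
}

// Shim: at::Half on its own never matches the template above, so
// reinterpret it as float16_t and reuse the autovectorized path.
template <typename To>
void convert_from_half(To* dst, const at::Half* src, int64_t n) {
  static_assert(sizeof(at::Half) == sizeof(float16_t),
                "at::Half must have the same layout as float16_t");
  convert_impl(dst, reinterpret_cast<const float16_t*>(src), n);
}
```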

We observed the following performance improvements when targeting armv9-a+sve2+fp16 (throughput gain = before/after - 1):

| Conversion | Before | After | Throughput gain |
| --- | --- | --- | --- |
| float16_t -> uint8 -> float16_t | 657.489 µs | 181.216 µs | 263% |
| float16_t -> int8 -> float16_t | 656.518 µs | 179.821 µs | 265% |
| float16_t -> int16 -> float16_t | 668.998 µs | 183.417 µs | 265% |
| float16_t -> int64 -> float16_t | 618.444 µs | 459.897 µs | 35% |
| float16_t -> double -> float16_t | 439.728 µs | 351.276 µs | 25% |

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85533271

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01


pytorch-bot bot commented Oct 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166306

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8cf40e6 with merge base e214af6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 27, 2025

meta-codesync bot commented Oct 27, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85533271.

@Nicoshev Nicoshev requested a review from mcfi October 27, 2025 14:46
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 27, 2025
@Nicoshev Nicoshev (Contributor, Author) commented:

@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"

@pytorch-bot pytorch-bot bot added release notes: cpu (aarch64) release notes category for aarch64, arm, etc. topic: not user facing topic category labels Oct 27, 2025
@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

tianrengao pushed a commit that referenced this pull request Oct 30, 2025

Pull Request resolved: #166306
Approved by: https://github.com/mcfi, https://github.com/ezyang
@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025