[PyTorch] Improve conversion from/to bool on aarch64+sve #166330
Closed
Nicoshev wants to merge 1 commit into pytorch:main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166330
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit bb6f1bf with merge base 2dc5645.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"
9721e28 to 10df531
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 27, 2025)
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 27, 2025)
b5b988c to 1686b10
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 27, 2025)
mcfi approved these changes (Oct 27, 2025)
1686b10 to 15f7a1c
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 28, 2025)
15f7a1c to b80567f
pytorch-bot bot pushed a commit that referenced this pull request (Oct 28, 2025)
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 28, 2025)
b80567f to e9ddff5
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request (Oct 28, 2025)
e9ddff5 to 0d48af2
Skylion007 reviewed (Oct 28, 2025)
0d48af2 to bb6f1bf
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Summary:
We are adding autovec routines to convert to/from boolean values.
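As a rough illustration only (this is not the code from the PR), a conversion routine that a compiler can auto-vectorize might look like the sketch below. The names `from_bool` and `to_bool` are hypothetical. The loops are branch-free and have no cross-iteration dependencies, which is what usually lets an optimizing compiler emit SIMD code; whether it actually does so depends on the compiler, the optimization level, and the target flags.

```cpp
#include <cstddef>

// Hypothetical helper (not from the PR): convert bool storage to T.
// A C++ bool converts to exactly 0 or 1, so this is a plain widening
// loop that an optimizing compiler is free to vectorize.
template <typename T>
void from_bool(const bool* src, T* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(src[i]);
  }
}

// Hypothetical helper (not from the PR): convert T to bool.
// Any nonzero value maps to true; the comparison is branch-free.
template <typename T>
void to_bool(const T* src, bool* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = (src[i] != T(0));
  }
}
```

Calling `from_bool` and then `to_bool` on the same buffer reproduces the bool->T->bool round trip that the measurements in this summary refer to.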
We observed the following performance improvements when compiling for armv9-a+sve2+fp16+bf16:
before:
bool->uint8->bool ===> 447.854us
bool->int8->bool ===> 445.609us
bool->int16->bool ===> 312.425us
bool->int32->bool ===> 324.368us
bool->float->bool ===> 320.929us
bool->float16->bool ===> 290.825us
bool->bfloat16->bool ===> 437.250us
after:
bool->uint8->bool ===> 78.988us (467% higher throughput)
bool->int8->bool ===> 78.494us (468% higher throughput)
bool->int16->bool ===> 107.993us (189% higher throughput)
bool->int32->bool ===> 186.887us (74% higher throughput)
bool->float->bool ===> 188.048us (71% higher throughput)
bool->float16->bool ===> 102.789us (183% higher throughput)
bool->bfloat16->bool ===> 105.809us (313% higher throughput)
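The "higher throughput" percentages are consistent with the usual formula (before / after - 1) * 100 applied to the two latencies. A minimal sketch of that arithmetic follows; the function names are hypothetical and not part of the PR.

```cpp
#include <cmath>

// Hypothetical helper: percentage throughput gain implied by two
// latencies measured for the same amount of work.
double throughput_gain_percent(double before_us, double after_us) {
  return (before_us / after_us - 1.0) * 100.0;
}

// Same value rounded to the nearest whole percent, as in the listing.
long rounded_gain_percent(double before_us, double after_us) {
  return std::lround(throughput_gain_percent(before_us, after_us));
}
```

For example, the uint8 row gives 447.854 / 78.988 ≈ 5.67, i.e. about 467% more work per unit of time.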
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
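For readers without access to the internal benchmark target, a stand-alone timing of a bool->uint8->bool round trip could be sketched as below. This is an assumption-laden stand-in, not the operator benchmark used for the numbers above: `average_us` and `round_trip_u8` are hypothetical names, and the loop works on raw buffers rather than on tensors.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical harness: average wall-clock microseconds per call of fn.
template <typename Fn>
double average_us(Fn&& fn, int iterations) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    fn();
  }
  const std::chrono::duration<double, std::micro> elapsed =
      clock::now() - start;
  return elapsed.count() / iterations;
}

// Hypothetical workload: bool -> uint8 -> bool over raw buffers.
// Returns the number of true values so the work has an observable result.
std::size_t round_trip_u8(const bool* in, std::uint8_t* mid, bool* out,
                          std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    mid[i] = static_cast<std::uint8_t>(in[i]);
  }
  std::size_t trues = 0;
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (mid[i] != 0);
    trues += out[i] ? 1 : 0;
  }
  return trues;
}
```

Passing a lambda that calls `round_trip_u8` to `average_us` yields a per-call latency in microseconds. Absolute values will differ from the figures in the summary, since buffer sizes, hardware, and compiler flags are not specified here.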
Reviewed By: mcfi
Differential Revision: D85533284
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01