
[Pytorch] Enable autovec on aarch64 for type conversion #166049

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85213420

Conversation

@Nicoshev (Contributor) commented Oct 22, 2025

Summary:
Implement an autovec template for type conversions on aarch64 NEON.

Generated code can be seen here: https://godbolt.org/z/1K6T1d9TE
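
For intuition, the pattern being autovectorized is a plain element-wise cast loop. The sketch below is illustrative only; the template name and signature are assumptions, not the actual ATen code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the kind of conversion template this PR adds.
template <typename dst_t, typename src_t>
void convert_autovec(dst_t* dst, const src_t* src, std::size_t n) {
  // A simple, dependency-free loop: at -O3 with -march=armv9-a+sve2,
  // clang lowers this to vector loads, converts, and narrowing stores
  // (see the godbolt link above) instead of scalar code.
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<dst_t>(src[i]);
  }
}

// Instantiations matching the benchmarked round trips below:
template void convert_autovec<uint8_t, float>(uint8_t*, const float*, std::size_t);
template void convert_autovec<float, uint8_t>(float*, const uint8_t*, std::size_t);
```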

We've seen significant performance improvements for converting to and from bytes when compiling with clang and -march=armv9-a+sve2:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput
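
(For reference, "N% higher throughput" here is before/after minus one: e.g., 683.212us / 198.204us ≈ 3.45x, i.e., ~245% higher throughput, or about 71% less time per conversion.)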

Test Plan:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Differential Revision: D85213420

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

pytorch-bot bot commented Oct 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166049

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 4614763 with merge base 13cda9b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 22, 2025
meta-codesync bot commented Oct 22, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85213420.

@Nicoshev (Contributor Author) commented:

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 22, 2025
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Summary:

Implement an autovec template for type conversions on aarch64 NEON.

We've seen significant performance improvements for converting to and from bytes:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85213420
@Nicoshev Nicoshev force-pushed the export-D85213420 branch 2 times, most recently from 34ffa32 to 7b0e870 on October 22, 2025 04:36
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
@Nicoshev Nicoshev requested a review from malfet October 22, 2025 04:50
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
@malfet malfet added the ciflow/op-benchmark Trigger microbenchmark for operations. label Oct 22, 2025
@malfet malfet requested a review from aditew01 October 22, 2025 14:26
@malfet malfet added topic: improvements topic category release notes: cpu (aarch64) release notes category for aarch64, arm, etc. labels Oct 22, 2025
@malfet (Contributor) commented Oct 22, 2025

> @pytorchbot label "topic: not user facing"

If it's not user facing, then what's the point of this change? (Labels should be "release notes: cpu (aarch64)" / "topic: performance".)

Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
Summary:

Implement an autovec template for type conversions on aarch64 NEON.

We've seen significant performance improvements for converting to and from bytes:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85213420
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
@malfet (Contributor) left a comment

PR title claims throughput improvements, but does not update any benchmark files.

Moreover, convertImpl feels like just a code duplicate of the default template implementation, doesn't it?

@Nicoshev (Contributor Author) commented:

> PR title claims throughput improvements, but does not update any benchmark files.
>
> Moreover, convertImpl feels like just a code duplicate of the default template implementation, doesn't it?

@malfet Improvements are observed when targeting SVE, so there was no point in adding a specific benchmark to the OSS repo. It is not a duplicate of the existing implementation, as the internally supplied benchmarks show.
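
One way to sanity-check that the win comes from the SVE target (a sketch assuming a recent clang is available; not part of this PR) is to compile the same cast loop for both targets and diff the assembly:

```cpp
// convert.cpp -- compile twice and compare, e.g.:
//   clang++ -O3 -march=armv8.2-a    -S convert.cpp -o neon.s
//   clang++ -O3 -march=armv9-a+sve2 -S convert.cpp -o sve2.s
#include <cstddef>
#include <cstdint>

void f32_to_u8(uint8_t* dst, const float* src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // With SVE2 enabled, expect predicated fcvtzu and narrowing stores
    // rather than a scalar or fixed-width NEON loop.
    dst[i] = static_cast<uint8_t>(src[i]);
  }
}
```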

@Nicoshev Nicoshev requested review from malfet and mcfi and removed request for malfet October 24, 2025 19:54
@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot bot commented Oct 25, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Nicoshev (Contributor Author) commented:

@pytorchbot merge -f "Benchmark failures are a pre-existing issue"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status.

@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025
@ezyang (Contributor) commented Oct 31, 2025

@pytorchbot revert -c nosignal -m "broke arm builds"

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator) commented:

Reverting PR 166049 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit b31bad1b8f1331bf43d47f46602cf6141db56844 returned non-zero exit code 1

Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_convert.h
CONFLICT (content): Merge conflict in aten/src/ATen/cpu/vec/vec128/vec128_convert.h
Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_float_neon.h
Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h
error: could not revert b31bad1b8f1... [Pytorch] Enable autovec on aarch64 for type conversion (#166049)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team (raised by workflow job).

@Nicoshev (Contributor Author) commented:

@ezyang The issue got fixed in #166739.


Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/op-benchmark (Trigger microbenchmark for operations)
ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
meta-exported
module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
module: cpu (CPU specific problem, e.g., perf, algorithm)
release notes: cpu (aarch64) (release notes category for aarch64, arm, etc.)
topic: improvements (topic category)
topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants