
[Pytorch] Enable autovec on aarch64 for type conversion #166049

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85213420

Conversation

@Nicoshev (Contributor) commented Oct 22, 2025

Summary:
Implement an autovec template for type conversions on aarch64 NEON.

Generated code can be seen here: https://godbolt.org/z/1K6T1d9TE
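
For intuition, the pattern being autovectorized is a plain element-wise cast loop. The sketch below is illustrative only; the template name and signature are assumptions, not the actual ATen code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the kind of conversion template this PR adds.
template <typename dst_t, typename src_t>
void convert_autovec(dst_t* dst, const src_t* src, std::size_t n) {
  // A simple, dependency-free loop: at -O3 with -march=armv9-a+sve2,
  // clang lowers this to vector loads, converts, and narrowing stores
  // (see the godbolt link above) instead of scalar code.
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<dst_t>(src[i]);
  }
}

// Instantiations matching the benchmarked round trips below:
template void convert_autovec<uint8_t, float>(uint8_t*, const float*, std::size_t);
template void convert_autovec<float, uint8_t>(float*, const uint8_t*, std::size_t);
```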

We've seen significant performance improvements for converting to and from bytes when compiling with clang and -march=armv9-a+sve2:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput
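
(For reference, "N% higher throughput" here is before/after minus one: e.g., 683.212us / 198.204us ≈ 3.45x, i.e., ~245% higher throughput, or about 71% less time per conversion.)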

Test Plan:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Differential Revision: D85213420

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

pytorch-bot bot commented Oct 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166049

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 4614763 with merge base 13cda9b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 22, 2025
meta-codesync bot commented Oct 22, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85213420.

@Nicoshev (Contributor Author) commented:

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 22, 2025
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Summary:

Implement an autovec template for type conversions on aarch64 NEON.

We've seen significant performance improvements for converting to and from bytes:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85213420
@Nicoshev Nicoshev force-pushed the export-D85213420 branch 2 times, most recently from 34ffa32 to 7b0e870 on October 22, 2025 04:36
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
@Nicoshev Nicoshev requested a review from malfet October 22, 2025 04:50
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 22, 2025
@malfet malfet added the ciflow/op-benchmark Trigger microbenchmark for operations. label Oct 22, 2025
@malfet malfet requested a review from aditew01 October 22, 2025 14:26
@malfet malfet added topic: improvements topic category release notes: cpu (aarch64) release notes category for aarch64, arm, etc. labels Oct 22, 2025
@malfet (Contributor) commented Oct 22, 2025

> @pytorchbot label "topic: not user facing"

If it's not user facing, then what's the point of this change? (Labels should be "release notes: cpu (aarch64)" / "topic: performance".)

Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
Summary:

Implement an autovec template for type conversions on aarch64 NEON.

We've seen significant performance improvements for converting to and from bytes:

Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85213420
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 24, 2025
@malfet (Contributor) left a comment

PR title claims throughput improvements, but does not update any benchmark files.

Moreover, convertImpl feels like just a code duplicate of the default template implementation, doesn't it?

@Nicoshev (Contributor Author) commented:

> PR title claims throughput improvements, but does not update any benchmark files.
>
> Moreover, convertImpl feels like just a code duplicate of the default template implementation, doesn't it?

@malfet Improvements are observed when targeting SVE, so there was no point in adding a specific benchmark to the OSS repo. It is not a duplicate of the existing implementation, as the internally supplied benchmarks show.
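
One way to sanity-check that the win comes from the SVE target (a sketch assuming a recent clang is available; not part of this PR) is to compile the same cast loop for both targets and diff the assembly:

```cpp
// convert.cpp -- compile twice and compare, e.g.:
//   clang++ -O3 -march=armv8.2-a    -S convert.cpp -o neon.s
//   clang++ -O3 -march=armv9-a+sve2 -S convert.cpp -o sve2.s
#include <cstddef>
#include <cstdint>

void f32_to_u8(uint8_t* dst, const float* src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // With SVE2 enabled, expect predicated fcvtzu and narrowing stores
    // rather than a scalar or fixed-width NEON loop.
    dst[i] = static_cast<uint8_t>(src[i]);
  }
}
```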

@Nicoshev Nicoshev requested review from malfet and mcfi and removed request for malfet October 24, 2025 19:54
@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot bot commented Oct 25, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Nicoshev (Contributor Author) commented:

@pytorchbot merge -f "Benchmark failures are a pre-existing issue"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status.

@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025
@ezyang (Contributor) commented Oct 31, 2025

@pytorchbot revert -c nosignal -m "broke arm builds"

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator) commented:

Reverting PR 166049 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit b31bad1b8f1331bf43d47f46602cf6141db56844 returned non-zero exit code 1

Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_convert.h
CONFLICT (content): Merge conflict in aten/src/ATen/cpu/vec/vec128/vec128_convert.h
Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_float_neon.h
Auto-merging aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h
error: could not revert b31bad1b8f1... [Pytorch] Enable autovec on aarch64 for type conversion (#166049)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team (raised by workflow job).

@Nicoshev (Contributor Author) commented:

@ezyang The issue got fixed in #166739.


Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/op-benchmark (Trigger microbenchmark for operations)
ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
meta-exported
module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
module: cpu (CPU specific problem, e.g., perf, algorithm)
release notes: cpu (aarch64) (release notes category for aarch64, arm, etc.)
topic: improvements (topic category)
topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants