[Pytorch] Use exp_u20 for aarch64's erf#166594

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85730452

Conversation

Contributor

@Nicoshev Nicoshev commented Oct 29, 2025

Summary:
After a precision study, we concluded it is acceptable to use ACL's exp function in the f32 erf() implementation. This lets erf remain inline.

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiled with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
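For context on why a faster exp speeds up erf: vectorizable erf approximations typically evaluate a polynomial scaled by exp(-x*x), so a single exp call sits on the hot path. A minimal sketch using the classic Abramowitz and Stegun 7.1.26 approximation (illustrative only; this is not PyTorch's actual aarch64 kernel):

```python
import math

# Abramowitz & Stegun 7.1.26: max absolute error ~1.5e-7.
# Illustrative stand-in only -- PyTorch's aarch64 kernel uses its own
# vectorized polynomial, with exp_u20 supplying the exp() call.
def erf_approx(x: float) -> float:
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
             + t * (-1.453152027 + t * 1.061405429))))
    # The only transcendental call: this is where a fast exp pays off.
    return sign * (1.0 - poly * math.exp(-x * x))
```

Replacing that one exp() with a cheaper vectorized approximation such as exp_u20 is what drives the throughput gain, at a small, bounded cost in precision.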

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
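Outside the buck2 harness, a rough stand-in measurement can be made with the standard library alone. This sketch is hypothetical: it times scalar math.erf over 1M inputs, not PyTorch's vectorized kernel, so its absolute numbers are not comparable to the figures above:

```python
import math
import timeit

# Hypothetical micro-benchmark: one pass of erf over 1M inputs,
# reported in microseconds, mirroring the shape of the numbers above.
N = 1_000_000
xs = [i / N for i in range(N)]

def erf_pass():
    # math.erf stands in for the vectorized f32 erf kernel under test.
    return [math.erf(x) for x in xs]

elapsed_us = timeit.timeit(erf_pass, number=1) * 1e6
print(f"f32 erf (scalar stand-in): {elapsed_us:.3f}us")
```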

Differential Revision: D85730452

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 29, 2025

pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166594

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2c1cbce with merge base b4403bf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-codesync bot commented Oct 29, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85730452.

@Nicoshev Nicoshev requested a review from mcfi October 30, 2025 00:07
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 30, 2025
Contributor Author

@pytorchbot label "topic: perf improvements" "release notes: cpu (aarch64)"


pytorch-bot bot commented Oct 30, 2025

Didn't find following labels among repository labels: topic: perf improvements

@pytorch-bot pytorch-bot bot added the release notes: cpu (aarch64) release notes category for aarch64, arm, etc. label Oct 30, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Summary:

After a precision study, we concluded it is acceptable to use ACL's exp function in the f32 erf() implementation.

We've moved ACL's exp implementation into an inline function and call it from both exp_u20() and erf(). This lets erf remain inline.

Benchmarks show about 108% higher throughput on clang-19:

Before:
f32 erf: 2539.179us

After:
f32 erf: 1221.083us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85730452
@Nicoshev Nicoshev force-pushed the export-D85730452 branch 2 times, most recently from 9235e06 to 36fb135 Compare October 30, 2025 01:00
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
@Nicoshev Nicoshev force-pushed the export-D85730452 branch 2 times, most recently from 00b4e0b to f0cfce4 Compare October 30, 2025 01:08
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Collaborator

@fadara01 fadara01 left a comment


LGTM - You'll see an even higher speedup once we enable the SVE128 vectorizer (SVE128 exp_u20() is faster than Neon's).

Also, this implementation doesn't come from Arm Compute Library (ACL); it's from Arm Optimized Routines (AOR).

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025
BoyuanFeng pushed a commit that referenced this pull request Oct 31, 2025
Pull Request resolved: #166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025

Labels

ciflow/linux-aarch64: linux aarch64 CI workflow
ciflow/trunk: Trigger trunk jobs on your pull request
fb-exported
Merged
meta-exported
module: arm: Related to ARM architectures builds of PyTorch. Includes Apple M1
module: cpu: CPU specific problem (e.g., perf, algorithm)
release notes: cpu (aarch64): release notes category for aarch64, arm, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants