[Pytorch] Improve float32 erf() on aarch64 #166262
Nicoshev wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166262
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 8dfa4e0 with merge base 86f9f1d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary: The float32 data type has a vectorized routine that computes erf(). That routine previously called std::exp() individually for each float in the vector being processed. We now use Sleef's vectorized routine to compute exp, improving erf's performance. The AVX2/AVX512 backends likewise have a custom erf implementation that uses Sleef to compute exp.

We observed a 25% throughput increase when testing on tensors of 1M elements:

Before: f32 erf: 3175.977us
After: f32 erf: 2539.446us

Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: YifanYuan3, mcfi
Differential Revision: D85522651
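A minimal sketch of the approach, for illustration only (the function name `erf_f32x4` and the freestanding NEON form are ours, not the PyTorch source): the Abramowitz-Stegun erf approximation evaluated on one `float32x4_t` lane, with the `exp(-x*x)` term computed by a Sleef vectorized exp (`Sleef_expf4_u10` here) instead of a per-lane `std::exp` loop.

```cpp
#include <arm_neon.h>
#include <sleef.h>

// Sketch: Abramowitz & Stegun 7.1.26 erf approximation on one NEON lane.
static inline float32x4_t erf_f32x4(float32x4_t x) {
  const float32x4_t one = vdupq_n_f32(1.0f);
  const float32x4_t p  = vdupq_n_f32(0.3275911f);
  const float32x4_t a1 = vdupq_n_f32(0.254829592f);
  const float32x4_t a2 = vdupq_n_f32(-0.284496736f);
  const float32x4_t a3 = vdupq_n_f32(1.421413741f);
  const float32x4_t a4 = vdupq_n_f32(-1.453152027f);
  const float32x4_t a5 = vdupq_n_f32(1.061405429f);

  // t = 1 / (1 + p*|x|)
  float32x4_t ax = vabsq_f32(x);
  float32x4_t t  = vdivq_f32(one, vfmaq_f32(one, p, ax));

  // Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5
  float32x4_t r = vfmaq_f32(a4, a5, t);
  r = vfmaq_f32(a3, r, t);
  r = vfmaq_f32(a2, r, t);
  r = vfmaq_f32(a1, r, t);
  r = vmulq_f32(r, t);

  // exp(-x*x) via Sleef's vectorized exp (the substance of this PR);
  // previously each lane went through std::exp individually.
  float32x4_t e = Sleef_expf4_u10(vnegq_f32(vmulq_f32(x, x)));

  // erf(|x|) = 1 - r*exp(-x*x); copy the sign bit of x back in.
  float32x4_t res = vfmsq_f32(one, r, e);
  uint32x4_t sign = vandq_u32(vreinterpretq_u32_f32(x), vdupq_n_u32(0x80000000u));
  return vreinterpretq_f32_u32(vorrq_u32(vreinterpretq_u32_f32(res), sign));
}
```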
@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"
LGTM. If you want something faster and are happy with accuracy within 2 ULPs, consider using exp_u20, introduced in #161049, which is a lot faster than Sleef's implementation of exp.
I think you'd also get more dramatic speedups for erf and other vectorized ops by using implementations from Arm Optimized Routines - e.g. here's their implementation for erf: https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/erff.c
@fadara01 Thanks for the suggestion. I went with Sleef's exp because it is what the x86 routines use; keeping errors similar across both platforms is a consideration.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
Summary:
After a precision study, we concluded it is acceptable to use ACL's exp function in f32's erf().
We moved ACL's exp implementation into an inline function and call it from both exp_u20() and erf().
This way, erf() can remain inline.
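A minimal sketch of that refactoring, under stated assumptions (the names `vexpq_f32_core`, `vexpq_u20`, and `verf_exp_term` are illustrative, and the exp body below is a generic range-reduction polynomial rather than ACL's actual kernel): the exp core lives in one `static inline` helper that both entry points call, so erf() keeps inlining instead of crossing an out-of-line call boundary.

```cpp
#include <arm_neon.h>

// Generic inline exp core standing in for ACL's kernel:
// exp(x) = 2^n * exp(r), with n = round(x/ln2) and |r| <= ln2/2.
// Simplified: no overflow/underflow/NaN handling.
static inline float32x4_t vexpq_f32_core(float32x4_t x) {
  const float32x4_t inv_ln2 = vdupq_n_f32(1.44269504f);   // 1/ln(2)
  const float32x4_t ln2     = vdupq_n_f32(0.693147181f);  // ln(2)
  float32x4_t n = vrndnq_f32(vmulq_f32(x, inv_ln2));  // n = round(x/ln2)
  float32x4_t r = vfmsq_f32(x, n, ln2);               // r = x - n*ln2
  // exp(r) ~ degree-5 Taylor polynomial, Horner form
  float32x4_t pp = vdupq_n_f32(1.0f / 120.0f);
  pp = vfmaq_f32(vdupq_n_f32(1.0f / 24.0f), pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f / 6.0f),  pp, r);
  pp = vfmaq_f32(vdupq_n_f32(0.5f),         pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f),         pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f),         pp, r);
  // Scale by 2^n by building the float's exponent bits directly.
  int32x4_t pow2n = vshlq_n_s32(vaddq_s32(vcvtq_s32_f32(n), vdupq_n_s32(127)), 23);
  return vmulq_f32(pp, vreinterpretq_f32_s32(pow2n));
}

// Both entry points share the inline body, so a call from erf() inlines fully.
static inline float32x4_t vexpq_u20(float32x4_t x) { return vexpq_f32_core(x); }

static inline float32x4_t verf_exp_term(float32x4_t x) {
  // exp(-x*x), the transcendental piece of the erf approximation
  return vexpq_f32_core(vnegq_f32(vmulq_f32(x, x)));
}
```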
Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiled with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
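For reproducing the measurement outside the internal operator_benchmark suite, a rough standalone timing sketch (the tensor size matches the summary; the iteration count and the use of the LibTorch C++ API are our assumptions, not the PR's setup):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <cstdio>

int main() {
  auto x = torch::randn({1000000});  // 1M float32 elements
  torch::erf(x);                     // warm-up
  constexpr int kIters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) torch::erf(x);
  auto t1 = std::chrono::steady_clock::now();
  // Report the mean per-call latency in microseconds.
  std::printf("f32 erf: %.3fus\n",
      std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);
}
```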
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85522651
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01