[Pytorch] Improve float32 erf() on aarch64 #166262

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85522651

Conversation

@Nicoshev
Contributor

@Nicoshev Nicoshev commented Oct 26, 2025

Summary:
After a precision study, we concluded that it is OK to use ACL's exp function in the float32 erf().
We've moved ACL's exp implementation into an inline function and call it from both exp_u20() and erf().

This way erf can remain inline.
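
The restructuring described above, one inline exp kernel shared by exp_u20() and erf(), can be sketched in portable scalar C++. Everything in this block is an assumption made for illustration: the names `exp_inline_sketch`, `exp_u20_sketch`, and `erf_sketch` are hypothetical, the exp kernel is a plain Taylor polynomial rather than ACL's implementation, and the erf formula is the Abramowitz and Stegun 7.1.26 approximation, which may differ from what the PR actually computes.

```cpp
#include <cmath>

// Hypothetical stand-in for the shared inline exp kernel (NOT ACL's code).
// Range reduction: exp(x) = 2^n * exp(r), where n = round(x / ln2) and
// r = x - n * ln2, so |r| <= ln2 / 2. No special handling of NaN or
// out-of-range inputs; this is only meant for moderate finite x.
inline float exp_inline_sketch(float x) {
  const float kLog2e = 1.44269504f;  // 1 / ln(2)
  const float kLn2 = 0.693147181f;   // ln(2)
  const float n = std::round(x * kLog2e);
  const float r = x - n * kLn2;
  // Degree-6 Taylor polynomial for exp(r), evaluated with Horner's scheme.
  float p = 1.0f / 720.0f;
  p = p * r + 1.0f / 120.0f;
  p = p * r + 1.0f / 24.0f;
  p = p * r + 1.0f / 6.0f;
  p = p * r + 0.5f;
  p = p * r + 1.0f;
  p = p * r + 1.0f;
  // Scale by 2^n.
  return std::ldexp(p, static_cast<int>(n));
}

// First caller: the exp entry point simply forwards to the shared kernel.
inline float exp_u20_sketch(float x) {
  return exp_inline_sketch(x);
}

// Second caller: erf via Abramowitz and Stegun 7.1.26,
//   erf(x) ~= 1 - (a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5) * exp(-x^2),
//   t = 1 / (1 + p*x), for x >= 0; odd symmetry handles x < 0.
inline float erf_sketch(float x) {
  const float ax = std::fabs(x);
  const float t = 1.0f / (1.0f + 0.3275911f * ax);
  float poly = 1.061405429f;
  poly = poly * t - 1.453152027f;
  poly = poly * t + 1.421413741f;
  poly = poly * t - 0.284496736f;
  poly = poly * t + 0.254829592f;
  poly = poly * t;
  const float y = 1.0f - poly * exp_inline_sketch(-ax * ax);
  return std::copysign(y, x);
}
```

Because both callers see the kernel's definition, the compiler is free to inline it into each of them, which is the property the summary relies on.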

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, when compiled with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
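
For reference, the 91% figure follows from the two timings above, taken as times for the same fixed workload. The helper names below are hypothetical and only spell out the arithmetic:

```cpp
// Relative throughput gain for a fixed workload: before / after - 1.
constexpr double throughput_gain(double before_us, double after_us) {
  return before_us / after_us - 1.0;
}

// Relative latency reduction for the same workload: 1 - after / before.
constexpr double latency_reduction(double before_us, double after_us) {
  return 1.0 - after_us / before_us;
}

// Timings quoted in the PR description (microseconds).
constexpr double kErfGain = throughput_gain(2539.179, 1329.063);        // about 0.91
constexpr double kErfReduction = latency_reduction(2539.179, 1329.063); // about 0.48
```

Expressed as latency, the same measurement is roughly a 48% reduction in time per call.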

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85522651

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

@pytorch-bot

pytorch-bot bot commented Oct 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166262

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8dfa4e0 with merge base 86f9f1d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 26, 2025
@meta-codesync

meta-codesync bot commented Oct 26, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85522651.

Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 27, 2025
Summary:

The float32 data type has a vectorized routine that computes erf(). That routine currently calls std::exp() individually on each float in the vector being processed.
We now use Sleef's vectorized routine to compute exp, improving the performance of erf.

AVX2/AVX512 also have a custom erf implementation, which uses Sleef to compute exp.

We've observed a throughput increase of 25% when tested on tensors containing 1M elements.

Before:
f32 erf: 3175.977us

After:
f32 erf: 2539.446us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: YifanYuan3, mcfi

Differential Revision: D85522651
@Nicoshev Nicoshev added the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/linux-aarch64 (linux aarch64 CI workflow) labels Oct 27, 2025
@Nicoshev Nicoshev requested review from fadara01 and jgong5 October 27, 2025 01:24
@Nicoshev
Contributor Author

@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"

@pytorch-bot pytorch-bot bot added the release notes: cpu (aarch64) (release notes category for aarch64, arm, etc.) and topic: not user facing (topic category) labels Oct 27, 2025
@Nicoshev Nicoshev requested a review from mcfi October 27, 2025 02:36
Collaborator

@fadara01 fadara01 left a comment

LGTM. If you want something faster and are happy with accuracy within 2 ULPs, consider using exp_u20 introduced here: #161049, which is a lot faster than Sleef's implementation of exp.

I think you'd also get more dramatic speedups for erf and other vectorized ops by using implementations from Arm Optimized Routines - e.g. here's their implementation for erf: https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/erff.c
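
An accuracy bound stated in ULPs, like the one mentioned above, can be checked with a small harness that compares a float routine against a double-precision reference. The sketch below is an assumption about how one might measure it, not the method used in the precision study; `ulp_error` and `max_ulp_error` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Error of a float result in ULPs, where one ULP is the spacing between the
// float nearest to the reference and the next float away from zero.
// Assumes the reference is finite.
inline double ulp_error(float approx, double reference) {
  const float ref_f = static_cast<float>(reference);
  const float away =
      std::copysign(std::numeric_limits<float>::infinity(), ref_f);
  const float next = std::nextafter(ref_f, away);
  const double ulp =
      std::fabs(static_cast<double>(next) - static_cast<double>(ref_f));
  return std::fabs(static_cast<double>(approx) - reference) / ulp;
}

// Largest ULP error of `approx` against `ref` over evenly spaced samples in
// [lo, hi]. Requires samples >= 2.
template <typename Approx, typename Ref>
double max_ulp_error(Approx approx, Ref ref, float lo, float hi, int samples) {
  double worst = 0.0;
  for (int i = 0; i < samples; ++i) {
    const float x = lo + (hi - lo) * static_cast<float>(i) /
                             static_cast<float>(samples - 1);
    worst = std::max(worst, ulp_error(approx(x), ref(static_cast<double>(x))));
  }
  return worst;
}
```

A candidate float exp could be passed as `approx` with a lambda wrapping the double-precision `std::exp` as `ref`; a result below 2.0 over the range of interest would be consistent with the bound discussed here, although evenly spaced sampling is not an exhaustive check.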

@fadara01 fadara01 added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 27, 2025
@Nicoshev
Contributor Author

LGTM. If you want something faster and are happy with accuracy within 2 ULPs, consider using exp_u20 introduced here: #161049, which is a lot faster than Sleef's implementation of exp.

I think you'd also get more dramatic speedups for erf and other vectorized ops by using implementations from Arm Optimized Routines - e.g. here's their implementation for erf: https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/erff.c

@fadara01 Thanks for the suggestion. I went with Sleef's exp because the x86 routines already use it. Keeping the error characteristics similar across both platforms was a consideration.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


tianrengao pushed a commit that referenced this pull request Oct 30, 2025
Summary:
The float32 data type has a vectorized routine that computes erf(). That routine currently calls std::exp() individually on each float in the vector being processed.
We now use Sleef's vectorized routine to compute exp, improving the performance of erf.

AVX2/AVX512 also have a custom erf implementation, which uses Sleef to compute exp.

We've observed a throughput increase of 25% when tested on tensors containing 1M elements.

Before:
f32 erf: 3175.977us

After:
f32 erf: 2539.446us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85522651

Pull Request resolved: #166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
meta-exported
module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
module: cpu (CPU specific problem, e.g., perf, algorithm)
release notes: cpu (aarch64) (release notes category for aarch64, arm, etc.)
topic: not user facing (topic category)

6 participants