[Pytorch] Use exp_u20 for aarch64's erf#166594

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D85730452

Conversation

Contributor

@Nicoshev Nicoshev commented Oct 29, 2025

Summary:
After a precision study, we concluded it is acceptable to use ACL's exp function in the f32 erf() implementation. This lets erf remain inline.

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiled with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
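For context on why a faster exp speeds up erf: vectorizable erf approximations typically evaluate a polynomial scaled by exp(-x*x), so a single exp call sits on the hot path. A minimal sketch using the classic Abramowitz and Stegun 7.1.26 approximation (illustrative only; this is not PyTorch's actual aarch64 kernel):

```python
import math

# Abramowitz & Stegun 7.1.26: max absolute error ~1.5e-7.
# Illustrative stand-in only -- PyTorch's aarch64 kernel uses its own
# vectorized polynomial, with exp_u20 supplying the exp() call.
def erf_approx(x: float) -> float:
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
             + t * (-1.453152027 + t * 1.061405429))))
    # The only transcendental call: this is where a fast exp pays off.
    return sign * (1.0 - poly * math.exp(-x * x))
```

Replacing that one exp() with a cheaper vectorized approximation such as exp_u20 is what drives the throughput gain, at a small, bounded cost in precision.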

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
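Outside the buck2 harness, a rough stand-in measurement can be made with the standard library alone. This sketch is hypothetical: it times scalar math.erf over 1M inputs, not PyTorch's vectorized kernel, so its absolute numbers are not comparable to the figures above:

```python
import math
import timeit

# Hypothetical micro-benchmark: one pass of erf over 1M inputs,
# reported in microseconds, mirroring the shape of the numbers above.
N = 1_000_000
xs = [i / N for i in range(N)]

def erf_pass():
    # math.erf stands in for the vectorized f32 erf kernel under test.
    return [math.erf(x) for x in xs]

elapsed_us = timeit.timeit(erf_pass, number=1) * 1e6
print(f"f32 erf (scalar stand-in): {elapsed_us:.3f}us")
```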

Differential Revision: D85730452

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 29, 2025

pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166594

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2c1cbce with merge base b4403bf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-codesync bot commented Oct 29, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85730452.

@Nicoshev Nicoshev requested a review from mcfi October 30, 2025 00:07
@Nicoshev Nicoshev added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 30, 2025
Contributor Author

@pytorchbot label "topic: perf improvements" "release notes: cpu (aarch64)"


pytorch-bot bot commented Oct 30, 2025

Didn't find following labels among repository labels: topic: perf improvements

@pytorch-bot pytorch-bot bot added the release notes: cpu (aarch64) release notes category for aarch64, arm, etc. label Oct 30, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Summary:

After a precision study, we concluded it is acceptable to use ACL's exp function in the f32 erf() implementation.

We've moved ACL's exp implementation into an inline function and call it from both exp_u20() and erf(). This lets erf remain inline.

Benchmarks show about 108% higher throughput on clang-19:

Before:
f32 erf: 2539.179us

After:
f32 erf: 1221.083us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85730452
@Nicoshev Nicoshev force-pushed the export-D85730452 branch 2 times, most recently from 9235e06 to 36fb135 Compare October 30, 2025 01:00
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
@Nicoshev Nicoshev force-pushed the export-D85730452 branch 2 times, most recently from 00b4e0b to f0cfce4 Compare October 30, 2025 01:08
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Oct 30, 2025
Collaborator

@fadara01 fadara01 left a comment


LGTM - You'll see an even higher speedup once we enable the SVE128 vectorizer (SVE128 exp_u20() is faster than Neon's).

Also, this implementation doesn't come from Arm Compute Library (ACL); it's from Arm Optimized Routines (AOR).

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@Nicoshev Nicoshev added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Oct 31, 2025
BoyuanFeng pushed a commit that referenced this pull request Oct 31, 2025
Pull Request resolved: #166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025

Labels

ciflow/linux-aarch64: linux aarch64 CI workflow
ciflow/trunk: Trigger trunk jobs on your pull request
fb-exported
Merged
meta-exported
module: arm: Related to ARM architectures builds of PyTorch. Includes Apple M1
module: cpu: CPU specific problem (e.g., perf, algorithm)
release notes: cpu (aarch64): release notes category for aarch64, arm, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants