[Pytorch] Improve float32 erf() on aarch64 #166262
Nicoshev wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166262
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 8dfa4e0 with merge base 86f9f1d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary: The float32 data type has a vectorized routine that computes erf(). That routine previously called std::exp() individually for each float in the vector being processed. We now use Sleef's vectorized routine to compute exp, improving erf's performance. The AVX2/AVX512 backends likewise have a custom erf implementation that uses Sleef to compute exp.

We observed a 25% throughput increase when testing on tensors of 1M elements:

Before: f32 erf: 3175.977us
After: f32 erf: 2539.446us

Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: YifanYuan3, mcfi
Differential Revision: D85522651
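A minimal sketch of the approach, for illustration only (the function name `erf_f32x4` and the freestanding NEON form are ours, not the PyTorch source): the Abramowitz-Stegun erf approximation evaluated on one `float32x4_t` lane, with the `exp(-x*x)` term computed by a Sleef vectorized exp (`Sleef_expf4_u10` here) instead of a per-lane `std::exp` loop.

```cpp
#include <arm_neon.h>
#include <sleef.h>

// Sketch: Abramowitz & Stegun 7.1.26 erf approximation on one NEON lane.
static inline float32x4_t erf_f32x4(float32x4_t x) {
  const float32x4_t one = vdupq_n_f32(1.0f);
  const float32x4_t p  = vdupq_n_f32(0.3275911f);
  const float32x4_t a1 = vdupq_n_f32(0.254829592f);
  const float32x4_t a2 = vdupq_n_f32(-0.284496736f);
  const float32x4_t a3 = vdupq_n_f32(1.421413741f);
  const float32x4_t a4 = vdupq_n_f32(-1.453152027f);
  const float32x4_t a5 = vdupq_n_f32(1.061405429f);

  // t = 1 / (1 + p*|x|)
  float32x4_t ax = vabsq_f32(x);
  float32x4_t t  = vdivq_f32(one, vfmaq_f32(one, p, ax));

  // Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5
  float32x4_t r = vfmaq_f32(a4, a5, t);
  r = vfmaq_f32(a3, r, t);
  r = vfmaq_f32(a2, r, t);
  r = vfmaq_f32(a1, r, t);
  r = vmulq_f32(r, t);

  // exp(-x*x) via Sleef's vectorized exp (the substance of this PR);
  // previously each lane went through std::exp individually.
  float32x4_t e = Sleef_expf4_u10(vnegq_f32(vmulq_f32(x, x)));

  // erf(|x|) = 1 - r*exp(-x*x); copy the sign bit of x back in.
  float32x4_t res = vfmsq_f32(one, r, e);
  uint32x4_t sign = vandq_u32(vreinterpretq_u32_f32(x), vdupq_n_u32(0x80000000u));
  return vreinterpretq_f32_u32(vorrq_u32(vreinterpretq_u32_f32(res), sign));
}
```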
@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"
LGTM. If you want something faster and are happy with accuracy within 2 ULPs, consider using exp_u20, introduced in #161049, which is a lot faster than Sleef's implementation of exp.
I think you'd also get more dramatic speedups for erf and other vectorized ops by using implementations from Arm Optimized Routines - e.g. here's their implementation for erf: https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/erff.c
@fadara01 Thanks for the suggestion. I went with Sleef's exp because it is what the x86 routines use; keeping errors similar across both platforms is a consideration.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
Summary:
After a precision study, we concluded it is acceptable to use ACL's exp function in f32's erf().
We moved ACL's exp implementation into an inline function and call it from both exp_u20() and erf().
This way, erf() can remain inline.
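A minimal sketch of that refactoring, under stated assumptions (the names `vexpq_f32_core`, `vexpq_u20`, and `verf_exp_term` are illustrative, and the exp body below is a generic range-reduction polynomial rather than ACL's actual kernel): the exp core lives in one `static inline` helper that both entry points call, so erf() keeps inlining instead of crossing an out-of-line call boundary.

```cpp
#include <arm_neon.h>

// Generic inline exp core standing in for ACL's kernel:
// exp(x) = 2^n * exp(r), with n = round(x/ln2) and |r| <= ln2/2.
// Simplified: no overflow/underflow/NaN handling.
static inline float32x4_t vexpq_f32_core(float32x4_t x) {
  const float32x4_t inv_ln2 = vdupq_n_f32(1.44269504f);   // 1/ln(2)
  const float32x4_t ln2     = vdupq_n_f32(0.693147181f);  // ln(2)
  float32x4_t n = vrndnq_f32(vmulq_f32(x, inv_ln2));  // n = round(x/ln2)
  float32x4_t r = vfmsq_f32(x, n, ln2);               // r = x - n*ln2
  // exp(r) ~ degree-5 Taylor polynomial, Horner form
  float32x4_t pp = vdupq_n_f32(1.0f / 120.0f);
  pp = vfmaq_f32(vdupq_n_f32(1.0f / 24.0f), pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f / 6.0f),  pp, r);
  pp = vfmaq_f32(vdupq_n_f32(0.5f),         pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f),         pp, r);
  pp = vfmaq_f32(vdupq_n_f32(1.0f),         pp, r);
  // Scale by 2^n by building the float's exponent bits directly.
  int32x4_t pow2n = vshlq_n_s32(vaddq_s32(vcvtq_s32_f32(n), vdupq_n_s32(127)), 23);
  return vmulq_f32(pp, vreinterpretq_f32_s32(pow2n));
}

// Both entry points share the inline body, so a call from erf() inlines fully.
static inline float32x4_t vexpq_u20(float32x4_t x) { return vexpq_f32_core(x); }

static inline float32x4_t verf_exp_term(float32x4_t x) {
  // exp(-x*x), the transcendental piece of the erf approximation
  return vexpq_f32_core(vnegq_f32(vmulq_f32(x, x)));
}
```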
Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiled with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
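For reproducing the measurement outside the internal operator_benchmark suite, a rough standalone timing sketch (the tensor size matches the summary; the iteration count and the use of the LibTorch C++ API are our assumptions, not the PR's setup):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <cstdio>

int main() {
  auto x = torch::randn({1000000});  // 1M float32 elements
  torch::erf(x);                     // warm-up
  constexpr int kIters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) torch::erf(x);
  auto t1 = std::chrono::steady_clock::now();
  // Report the mean per-call latency in microseconds.
  std::printf("f32 erf: %.3fus\n",
      std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);
}
```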
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85522651
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01