[Pytorch] Use exp_u20 for aarch64's erf (#166594)
Conversation
Dr. CI: ✅ No failures as of commit 2c1cbce with merge base b4403bf.
@pytorchbot label "topic: perf improvements" "release notes: cpu (aarch64)"

Didn't find the following labels among repository labels: topic: perf improvements
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32 erf(). We've moved the ACL exp implementation into an inline function and call it from both exp_u20() and erf(); this way erf can stay inline.
Benchmarks show about 108% higher throughput with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1221.083us
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85730452
fadara01
left a comment
LGTM - you'll see an even higher speedup once we enable the SVE128 vectorizer (the SVE128 exp_u20() is faster than Neon's).
Also, this implementation doesn't come from Arm Compute Library (ACL); it comes from Arm Optimized Routines (AOR).
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32 erf(). We can keep erf inline this way.
Benchmarks show about 91% higher throughput when processing a 1M-element tensor, compiled with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
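A quick sanity check of the quoted figure: throughput is inversely proportional to per-element latency, so the gain follows directly from the two timings above.

```python
# Throughput gain = old_time / new_time - 1, using the benchmark
# numbers quoted in the summary (microseconds per run).
before_us = 2539.179
after_us = 1329.063
gain = before_us / after_us - 1.0
print(f"{gain:.1%}")  # ~91.1%, consistent with the "about 91%" claim
```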
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85730452
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01