Improve bmm performance on CPU by applying TensorAccessor #20266
Conversation
Happened to check on VTune yesterday and realized that direct Tensor indexing has quite a heavy performance overhead in this scenario, accounting for 2/3 or even 3/4 of the time spent in bmm. I should have combined this with #19338 beforehand...
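For context: indexing a `Tensor` with `operator[]` materializes a temporary `Tensor` (and hence a `TensorImpl`) on every access, whereas a `TensorAccessor` is a thin wrapper over the data pointer and strides. A minimal sketch of the contrast, with made-up variable names rather than the PR's actual code:

```cpp
// Hypothetical illustration of the indexing overhead, not the exact PR code.
#include <ATen/ATen.h>

void index_styles(const at::Tensor& self) {
  // Direct indexing: every self[b][i][j] constructs temporary Tensor objects
  // (TensorImpl construction/destruction), which dominates a tight inner loop.
  // float v = self[b][i][j].item<float>();

  // TensorAccessor: resolve the data pointer and strides once, then index cheaply.
  auto acc = self.accessor<float, 3>();   // 3-D CPU float tensor assumed
  float sum = 0;
  for (int64_t b = 0; b < acc.size(0); ++b) {
    for (int64_t i = 0; i < acc.size(1); ++i) {
      for (int64_t j = 0; j < acc.size(2); ++j) {
        sum += acc[b][i][j];              // plain pointer arithmetic
      }
    }
  }
  (void)sum;
}
```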
3x speedup sounds awesome :-) cc @BIT-silence since you have recently been looking at the perf of a model that uses bmm.
facebook-github-bot
left a comment
@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
xiaomengy
left a comment
Thanks for this improvement!
```cpp
const int lda = is_transposed(mat1[0]) ? mat1[0].stride(1) : mat1[0].stride(0);
const int ldb = is_transposed(mat2[0]) ? mat2[0].stride(1) : mat2[0].stride(0);
const int lda = is_transposed(mat1_acc[0]) ? mat1_acc[0].stride(1) : mat1_acc[0].stride(0);
```
Line over 80 characters?
up to 119 is okay
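As background for the snippet under review above: `lda`/`ldb` are the BLAS leading dimensions, i.e. the memory stride between consecutive rows (or between consecutive columns, for a transposed matrix), so they are taken from whichever stride matches the layout. A simplified sketch of the idea, assuming 2-D matrices; the actual `is_transposed` helper and GEMM plumbing in ATen may differ:

```cpp
// Simplified illustration of deriving BLAS leading dimensions from strides;
// not the literal helper used in ATen.
#include <ATen/ATen.h>

static bool is_transposed(const at::Tensor& t) {
  // Treat a 2-D matrix as "transposed" when its columns are contiguous in
  // memory (e.g. the result of .t() on a contiguous matrix).
  return t.stride(0) == 1 && t.stride(1) == t.size(0);
}

static int64_t leading_dim(const at::Tensor& t) {
  // Non-transposed: stride(0) is the distance between rows.
  // Transposed: stride(1) is the distance between columns.
  return is_transposed(t) ? t.stride(1) : t.stride(0);
}
```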
The Travis lint failure is in a Python file untouched by this PR.
Summary: Currently `bmm()` has very heavy performance overhead on CPU due to construction/deconstruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.

I tested this on the `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores @ 2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Considering that `bmm()` takes only **14%** of the total time, a 10% overall improvement indicates that `bmm()` itself improves by roughly **3x**.

Before:

```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```

After:

```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```

Pull Request resolved: pytorch/pytorch#20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
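As a rough illustration of the pattern the summary describes (a hypothetical sketch, not the literal diff from this PR; function and variable names are made up), the idea is to resolve accessors once outside the batch loop so per-batch indexing is reduced to pointer arithmetic:

```cpp
// Hypothetical sketch of the pattern the PR applies; not the actual diff.
#include <ATen/ATen.h>

void bmm_outer_loop(at::Tensor& result, const at::Tensor& batch1,
                    const at::Tensor& batch2) {
  // Resolve data pointers and strides once, outside the batch loop,
  // instead of constructing a Tensor per batch via result[b], batch1[b], ...
  auto r  = result.accessor<float, 3>();
  auto b1 = batch1.accessor<float, 3>();
  auto b2 = batch2.accessor<float, 3>();
  const int64_t num_batches = r.size(0);

  for (int64_t b = 0; b < num_batches; ++b) {
    // r[b], b1[b], b2[b] are cheap 2-D accessors (pointer + strides);
    // their .data() pointers would be handed to the per-batch gemm call here.
    auto r_b  = r[b];
    auto b1_b = b1[b];
    auto b2_b = b2[b];
    // gemm(..., b1_b.data(), lda, b2_b.data(), ldb, r_b.data(), ldc);
    (void)r_b; (void)b1_b; (void)b2_b;
  }
}
```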