
Conversation

@mingfeima
Collaborator

Currently bmm() has very heavy performance overhead on CPU due to the construction and destruction of TensorImpl objects. Using TensorAccessor when indexing tensor data greatly improves performance.

I tested this on the fairseq Transformer model. Results on Xeon 6148 (20*2 cores @2.5GHz) indicate this PR improves Transformer training performance by approximately 10% (seconds per iteration reduced from 3.60 to 3.21). Given that bmm() accounts for only 14% of the total time, a 10% overall improvement implies that bmm() itself is roughly 3x faster.

Before:

```
| epoch 001:   0%| | 43/25337 [02:34<25:17:11,  3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```

After:

```
| epoch 001:   0%| | 23/25337 [01:13<22:32:48,  3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
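
To make the indexing difference concrete, here is a minimal sketch assuming a contiguous 3-D float CPU tensor; the helper names are illustrative and not from this PR's actual kernel. Both functions walk the batch dimension and collect each matrix's data pointer, as a per-batch GEMM call would need. The first slices with Tensor indexing, which constructs and destroys a TensorImpl on every iteration; the second goes through a TensorAccessor, which is a plain struct of a data pointer plus sizes/strides, so indexing inside the loop is just pointer arithmetic.

```
#include <ATen/ATen.h>
#include <vector>

// Direct Tensor indexing: mat[b] materializes a temporary Tensor (and its
// TensorImpl) for every batch just to reach the underlying data pointer.
std::vector<const float*> batch_ptrs_via_tensor(const at::Tensor& mat) {
  std::vector<const float*> ptrs;
  for (int64_t b = 0; b < mat.size(0); ++b) {
    at::Tensor m = mat[b];                 // TensorImpl constructed/destroyed per batch
    ptrs.push_back(m.data_ptr<float>());
  }
  return ptrs;
}

// TensorAccessor: built once outside the loop; indexing it afterwards is
// plain pointer arithmetic with no Tensor construction.
std::vector<const float*> batch_ptrs_via_accessor(const at::Tensor& mat) {
  auto acc = mat.accessor<float, 3>();     // just a data pointer + sizes/strides
  std::vector<const float*> ptrs;
  for (int64_t b = 0; b < acc.size(0); ++b) {
    ptrs.push_back(acc[b].data());         // no TensorImpl involved here
  }
  return ptrs;
}
```

In a batch loop that mostly just hands small matrices to GEMM, the per-batch TensorImpl churn in the first variant can dominate the runtime, which is what the accessor-based pattern avoids.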

@pytorchbot added the module: mkl (Related to our MKL support) and module: operators labels on May 8, 2019
@mingfeima
Collaborator Author

I happened to check this with VTune yesterday and realized that direct Tensor indexing has quite a heavy performance overhead in this scenario, accounting for 2/3 or even 3/4 of the time spent in bmm(). I should have combined this with #19338 beforehand...

@bddppq
Contributor

bddppq commented May 8, 2019

A 3x speedup sounds awesome :-) cc @BIT-silence, since you have recently been looking at the perf of a model that uses bmm.

Contributor

@facebook-github-bot left a comment


@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@xiaomengy left a comment


Thanks for this improvement!


```
const int lda = is_transposed(mat1[0]) ? mat1[0].stride(1) : mat1[0].stride(0);
const int ldb = is_transposed(mat2[0]) ? mat2[0].stride(1) : mat2[0].stride(0);
const int lda = is_transposed(mat1_acc[0]) ? mat1_acc[0].stride(1) : mat1_acc[0].stride(0);
```
Copy link
Contributor


Line over 80 characters?

Copy link
Contributor


Up to 119 characters is okay.

@cpuhrsch
Contributor

The Travis lint failure is in a Python file untouched by this PR.

@facebook-github-bot
Contributor

@cpuhrsch merged this pull request in 21ef4cc.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 16, 2019
Summary:
Currently `bmm()` has very heavy performance overhead on CPU due to the construction and destruction of `TensorImpl` objects. Using `TensorAccessor` when indexing tensor data greatly improves performance.

I tested this on the `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores @2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Given that `bmm()` accounts for only **14%** of the total time, a 10% overall improvement implies that `bmm()` itself is roughly **3x** faster.

Before:
```
| epoch 001:   0%| | 43/25337 [02:34<25:17:11,  3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```

After:
```
| epoch 001:   0%| | 23/25337 [01:13<22:32:48,  3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
Pull Request resolved: pytorch/pytorch#20266

Differential Revision: D15262201

Pulled By: cpuhrsch

fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35

Labels

Merged · module: mkl (Related to our MKL support) · open source
