Improve bmm performance on CPU by applying TensorAccessor #20266
Conversation
Happened to check on VTune yesterday and realized that direct Tensor indexing has quite a heavy performance overhead in this scenario, accounting for 2/3 or even 3/4 of the time spent in bmm. I should have combined this with #19338 beforehand...
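For context: indexing a `Tensor` with `operator[]` materializes a temporary `Tensor` (and hence a `TensorImpl`) on every access, whereas a `TensorAccessor` is a thin wrapper over the data pointer and strides. A minimal sketch of the contrast, with made-up variable names rather than the PR's actual code:

```cpp
// Hypothetical illustration of the indexing overhead, not the exact PR code.
#include <ATen/ATen.h>

void index_styles(const at::Tensor& self) {
  // Direct indexing: every self[b][i][j] constructs temporary Tensor objects
  // (TensorImpl construction/destruction), which dominates a tight inner loop.
  // float v = self[b][i][j].item<float>();

  // TensorAccessor: resolve the data pointer and strides once, then index cheaply.
  auto acc = self.accessor<float, 3>();   // 3-D CPU float tensor assumed
  float sum = 0;
  for (int64_t b = 0; b < acc.size(0); ++b) {
    for (int64_t i = 0; i < acc.size(1); ++i) {
      for (int64_t j = 0; j < acc.size(2); ++j) {
        sum += acc[b][i][j];              // plain pointer arithmetic
      }
    }
  }
  (void)sum;
}
```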
3x speedup sounds awesome :-) cc @BIT-silence since you have recently been looking at the perf of a model that uses bmm.
facebook-github-bot
left a comment
@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
xiaomengy
left a comment
Thanks for this improvement!
```cpp
const int lda = is_transposed(mat1[0]) ? mat1[0].stride(1) : mat1[0].stride(0);
const int ldb = is_transposed(mat2[0]) ? mat2[0].stride(1) : mat2[0].stride(0);
const int lda = is_transposed(mat1_acc[0]) ? mat1_acc[0].stride(1) : mat1_acc[0].stride(0);
```
Line over 80 characters?
up to 119 is okay
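As background for the snippet under review above: `lda`/`ldb` are the BLAS leading dimensions, i.e. the memory stride between consecutive rows (or between consecutive columns, for a transposed matrix), so they are taken from whichever stride matches the layout. A simplified sketch of the idea, assuming 2-D matrices; the actual `is_transposed` helper and GEMM plumbing in ATen may differ:

```cpp
// Simplified illustration of deriving BLAS leading dimensions from strides;
// not the literal helper used in ATen.
#include <ATen/ATen.h>

static bool is_transposed(const at::Tensor& t) {
  // Treat a 2-D matrix as "transposed" when its columns are contiguous in
  // memory (e.g. the result of .t() on a contiguous matrix).
  return t.stride(0) == 1 && t.stride(1) == t.size(0);
}

static int64_t leading_dim(const at::Tensor& t) {
  // Non-transposed: stride(0) is the distance between rows.
  // Transposed: stride(1) is the distance between columns.
  return is_transposed(t) ? t.stride(1) : t.stride(0);
}
```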
The Travis lint failure is in a Python file untouched by this PR.
Summary: Currently `bmm()` has very heavy performance overhead on CPU due to construction/deconstruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.

I tested this on the `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores @ 2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Considering that `bmm()` takes only **14%** of the total time, a 10% overall improvement indicates that `bmm()` itself improves by roughly **3x**.

Before:

```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```

After:

```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```

Pull Request resolved: pytorch/pytorch#20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
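As a rough illustration of the pattern the summary describes (a hypothetical sketch, not the literal diff from this PR; function and variable names are made up), the idea is to resolve accessors once outside the batch loop so per-batch indexing is reduced to pointer arithmetic:

```cpp
// Hypothetical sketch of the pattern the PR applies; not the actual diff.
#include <ATen/ATen.h>

void bmm_outer_loop(at::Tensor& result, const at::Tensor& batch1,
                    const at::Tensor& batch2) {
  // Resolve data pointers and strides once, outside the batch loop,
  // instead of constructing a Tensor per batch via result[b], batch1[b], ...
  auto r  = result.accessor<float, 3>();
  auto b1 = batch1.accessor<float, 3>();
  auto b2 = batch2.accessor<float, 3>();
  const int64_t num_batches = r.size(0);

  for (int64_t b = 0; b < num_batches; ++b) {
    // r[b], b1[b], b2[b] are cheap 2-D accessors (pointer + strides);
    // their .data() pointers would be handed to the per-batch gemm call here.
    auto r_b  = r[b];
    auto b1_b = b1[b];
    auto b2_b = b2[b];
    // gemm(..., b1_b.data(), lda, b2_b.data(), ldb, r_b.data(), ldc);
    (void)r_b; (void)b1_b; (void)b2_b;
  }
}
```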