
Conversation

@mingfeima
Collaborator

This PR aims to improve Transformer performance on CPU, bmm() is one of the major bottlenecks now.

The current logic of bmm() on CPU uses MKL batch gemm only when the inputs A and B are contiguous or transposed. So when A or B is a slice of a larger tensor, it falls back to a slower path.

A and B are both 3D tensors. MKL can handle the batch matrix multiplication as long as A.stride(1) == 1 || A.stride(2) == 1 and B.stride(1) == 1 || B.stride(2) == 1.
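
To make the stride condition concrete, here is a minimal PyTorch sketch (illustrative only, not part of this PR's C++ changes): a slice of a larger tensor is not contiguous, but its last dimension still has stride 1, so it satisfies the condition above.

```python
import torch

# A batch of 8 matrices taken as a slice of a larger tensor.
big = torch.randn(8, 64, 192)
A = big[:, :, :64]        # non-contiguous view, sizes (8, 64, 64)
B = torch.randn(8, 64, 32)

print(A.is_contiguous())  # False -> the old logic fell back to the slow path
print(A.stride())         # (12288, 192, 1): A.stride(2) == 1
print(B.stride())         # (2048, 32, 1):   B.stride(2) == 1

# Such inputs satisfy the stride condition, so MKL batch gemm can
# consume them directly without an extra copy.
C = torch.bmm(A, B)       # (8, 64, 32)
```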

In the fairseq implementation of Transformer, multi-head attention calls bmm() in two places, here and here; q, k, v are all slices of a larger tensor, so bmm() currently falls back to the slow path.
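
As a rough, hypothetical sketch of that pattern (names and shapes are made up here, this is not the actual fairseq code): q, k and v come out of one fused projection, and after the usual reshape/transpose the per-head matrices handed to bmm() are strided slices of a larger buffer.

```python
import torch

bsz, num_heads, tgt_len, head_dim = 16, 8, 128, 64
embed_dim = num_heads * head_dim

x = torch.randn(tgt_len, bsz, embed_dim)
in_proj = torch.nn.Linear(embed_dim, 3 * embed_dim)

# One fused projection, then chunk into q, k, v.
q, k, v = in_proj(x).chunk(3, dim=-1)

# Reshape to (bsz * num_heads, tgt_len, head_dim); the transpose leaves a
# strided view whose last dimension still has stride 1.
q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
k = k.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
v = v.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
print(q.is_contiguous(), q.stride())  # False (64, 8192, 1)

# The two bmm() calls of scaled dot-product attention hit exactly this
# strided case, which previously missed the MKL fast path.
attn_weights = torch.bmm(q, k.transpose(1, 2)) / head_dim ** 0.5
attn = torch.bmm(torch.softmax(attn_weights, dim=-1), v)
```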

Results on Xeon 6148 (20*2 cores @ 2.5 GHz) indicate that this PR improves Transformer training performance by 48% (seconds per iteration reduced from 5.48 to 3.70); inference performance should also be boosted.

Before:

| epoch 001:   0%| | 27/25337 [02:27<38:31:26,  5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,

After:

| epoch 001:   0%| | 97/25337 [05:58<25:55:49,  3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
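
For a quick standalone sanity check of the bmm() path itself (independent of the fairseq run above; the shapes below are arbitrary and not the configuration used for these numbers), one could time bmm() on sliced versus contiguous inputs:

```python
import time
import torch

def bench(a, b, iters=50):
    torch.bmm(a, b)                       # warm-up
    start = time.time()
    for _ in range(iters):
        torch.bmm(a, b)
    return (time.time() - start) / iters

big = torch.randn(192, 512, 128)
a_slice = big[:, :, :64]                  # non-contiguous slice of a larger tensor
a_contig = a_slice.contiguous()
b = torch.randn(192, 64, 64)

print("sliced A:     %.5f s/iter" % bench(a_slice, b))
print("contiguous A: %.5f s/iter" % bench(a_contig, b))
```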

@VitalyFedyunin VitalyFedyunin self-requested a review April 17, 2019 17:55
@VitalyFedyunin VitalyFedyunin added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 17, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in b8fb6ea.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 18, 2019
…(#19338)

Summary:
This PR aims to improve Transformer performance on CPU, `bmm()` is one of the major bottlenecks now.

The current logic of `bmm()` on CPU uses MKL batch gemm only when the inputs `A` and `B` are contiguous or transposed. So when `A` or `B` is a slice of a larger tensor, it falls back to a slower path.

`A` and `B` are both 3D tensors. MKL can handle the batch matrix multiplication as long as `A.stride(1) == 1 || A.stride(2) == 1` and `B.stride(1) == 1 || B.stride(2) == 1`.

In the [fairseq](https://github.com/pytorch/fairseq) implementation of Transformer, multi-head attention calls `bmm()` in two places, [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L167) and [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L197); `q`, `k`, `v` are all slices of a larger tensor, so `bmm()` currently falls back to the slow path.

Results on Xeon 6148 (20*2 cores @ 2.5 GHz) indicate that this PR improves Transformer training performance by **48%** (seconds per iteration reduced from **5.48** to **3.70**); inference performance should also be boosted.

Before:
```
| epoch 001:   0%| | 27/25337 [02:27<38:31:26,  5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,
```
After:
```
| epoch 001:   0%| | 97/25337 [05:58<25:55:49,  3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
```
Pull Request resolved: pytorch/pytorch#19338

Differential Revision: D14986346

Pulled By: soumith

fbshipit-source-id: 827106245af908b8a4fda69ed0288d322b028f08
zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
…ytorch#19338)
