
Conversation

@mingfeima
Collaborator

This PR aims to improve Transformer performance on CPU, bmm() is one of the major bottlenecks now.

The current logic of bmm() on CPU uses MKL batch gemm only when the inputs A and B are contiguous or transposed. So when A or B is a slice of a larger tensor, it falls back to a slower path.

A and B are both 3D tensors. MKL can handle the batch matrix multiplication as long as A.stride(1) == 1 || A.stride(2) == 1 and B.stride(1) == 1 || B.stride(2) == 1.
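
To make the stride condition concrete, here is a minimal PyTorch sketch (illustrative only, not part of this PR's C++ changes): a slice of a larger tensor is not contiguous, but its last dimension still has stride 1, so it satisfies the condition above.

```python
import torch

# A batch of 8 matrices taken as a slice of a larger tensor.
big = torch.randn(8, 64, 192)
A = big[:, :, :64]        # non-contiguous view, sizes (8, 64, 64)
B = torch.randn(8, 64, 32)

print(A.is_contiguous())  # False -> the old logic fell back to the slow path
print(A.stride())         # (12288, 192, 1): A.stride(2) == 1
print(B.stride())         # (2048, 32, 1):   B.stride(2) == 1

# Such inputs satisfy the stride condition, so MKL batch gemm can
# consume them directly without an extra copy.
C = torch.bmm(A, B)       # (8, 64, 32)
```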

In the fairseq implementation of Transformer, multi-head attention calls bmm() in two places, here and here; q, k, v are all slices of a larger tensor, so bmm() currently falls back to the slow path.
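
As a rough, hypothetical sketch of that pattern (names and shapes are made up here, this is not the actual fairseq code): q, k and v come out of one fused projection, and after the usual reshape/transpose the per-head matrices handed to bmm() are strided slices of a larger buffer.

```python
import torch

bsz, num_heads, tgt_len, head_dim = 16, 8, 128, 64
embed_dim = num_heads * head_dim

x = torch.randn(tgt_len, bsz, embed_dim)
in_proj = torch.nn.Linear(embed_dim, 3 * embed_dim)

# One fused projection, then chunk into q, k, v.
q, k, v = in_proj(x).chunk(3, dim=-1)

# Reshape to (bsz * num_heads, tgt_len, head_dim); the transpose leaves a
# strided view whose last dimension still has stride 1.
q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
k = k.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
v = v.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
print(q.is_contiguous(), q.stride())  # False (64, 8192, 1)

# The two bmm() calls of scaled dot-product attention hit exactly this
# strided case, which previously missed the MKL fast path.
attn_weights = torch.bmm(q, k.transpose(1, 2)) / head_dim ** 0.5
attn = torch.bmm(torch.softmax(attn_weights, dim=-1), v)
```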

Results on Xeon 6148 (20*2 cores @ 2.5 GHz) indicate that this PR improves Transformer training performance by 48% (seconds per iteration reduced from 5.48 to 3.70); inference performance should also be boosted.

Before:

| epoch 001:   0%| | 27/25337 [02:27<38:31:26,  5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,

After:

| epoch 001:   0%| | 97/25337 [05:58<25:55:49,  3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
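
For a quick standalone sanity check of the bmm() path itself (independent of the fairseq run above; the shapes below are arbitrary and not the configuration used for these numbers), one could time bmm() on sliced versus contiguous inputs:

```python
import time
import torch

def bench(a, b, iters=50):
    torch.bmm(a, b)                       # warm-up
    start = time.time()
    for _ in range(iters):
        torch.bmm(a, b)
    return (time.time() - start) / iters

big = torch.randn(192, 512, 128)
a_slice = big[:, :, :64]                  # non-contiguous slice of a larger tensor
a_contig = a_slice.contiguous()
b = torch.randn(192, 64, 64)

print("sliced A:     %.5f s/iter" % bench(a_slice, b))
print("contiguous A: %.5f s/iter" % bench(a_contig, b))
```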

@VitalyFedyunin VitalyFedyunin self-requested a review April 17, 2019 17:55
@VitalyFedyunin VitalyFedyunin added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 17, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in b8fb6ea.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 18, 2019
…(#19338)

Summary:
This PR aims to improve Transformer performance on CPU, `bmm()` is one of the major bottlenecks now.

The current logic of `bmm()` on CPU uses MKL batch gemm only when the inputs `A` and `B` are contiguous or transposed. So when `A` or `B` is a slice of a larger tensor, it falls back to a slower path.

`A` and `B` are both 3D tensors. MKL can handle the batch matrix multiplication as long as `A.stride(1) == 1 || A.stride(2) == 1` and `B.stride(1) == 1 || B.stride(2) == 1`.

In the [fairseq](https://github.com/pytorch/fairseq) implementation of Transformer, multi-head attention calls `bmm()` in two places, [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L167) and [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L197); `q`, `k`, `v` are all slices of a larger tensor, so `bmm()` currently falls back to the slow path.

Results on Xeon 6148 (20*2 cores @ 2.5 GHz) indicate that this PR improves Transformer training performance by **48%** (seconds per iteration reduced from **5.48** to **3.70**); inference performance should also be boosted.

Before:
```
| epoch 001:   0%| | 27/25337 [02:27<38:31:26,  5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,
```
After:
```
| epoch 001:   0%| | 97/25337 [05:58<25:55:49,  3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
```
Pull Request resolved: pytorch/pytorch#19338

Differential Revision: D14986346

Pulled By: soumith

fbshipit-source-id: 827106245af908b8a4fda69ed0288d322b028f08
zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
…ytorch#19338)
