[pytorch] Improve/fix heuristics for using mkldnn vs native conv #46675
Conversation
This pull request was exported from Phabricator. Differential Revision: D24452071
Force-pushed from 2dc3af7 to 907aa6c
💊 CI failures summary and remediations
As of commit 2079f69 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 ongoing upstream failure: probably caused by upstream breakages that are not fixed yet
🚧 1 fixed upstream failure: probably caused by upstream breakages that were already fixed
Force-pushed from 907aa6c to 77c0379
Force-pushed from 77c0379 to b94a6d1
[pytorch] Improve/fix heuristics for using mkldnn vs native conv (pytorch#46675)

Summary:
Pull Request resolved: pytorch#46675

We've found a few heuristics for using/not using mkldnn that seem to generally improve performance on 2d and 3d conv.
- 1x1 convolutions are basically batch matmuls, and mkldnn's implementation appears to usually be slower than using the native conv (which lowers to aten::mm, which in turn calls mkl gemm).
- 3d conv was often not using mkldnn even when it's beneficial, because the heuristic was checking the kernel depth rather than height/width. mkldnn seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which are allowed by the new heuristic.

Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893

Also, I've run a conv fuzzer, and it generally supports these heuristics. I'm not sure how to best share the data since there's a lot of it (I tried about 50k parameter combinations). For the 1x1 case, about 70% were faster with "native". I played with constructing a decision tree (using scikit-learn) and found that switching back to MKL for batch size > 16 might be slightly better still, but I'm not sure it's worth complicating the heuristic.

Results for some popular shapes in tabular format:
```
[------------------------- conv2d_1x1 ------------------------]
                                        |   base  |   diff
1 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  3665.3 |  2838.4
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  3174.7 |  3164.0
  [1, 64, 56, 56] [256, 64, 1, 1]       |  2249.1 |  1468.8
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  3158.2 |  3147.7
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  8191.8 |  3973.9
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  7901.2 |  3861.6
  [1, 256, 28, 28] [512, 256, 1, 1]     |  3103.9 |  2775.9
2 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  1973.7 |  1475.8
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  2265.0 |  1603.0
  [1, 64, 56, 56] [256, 64, 1, 1]       |  1445.4 |   789.8
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  2298.8 |  1620.0
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  6350.7 |  1995.0
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  6471.2 |  1903.7
  [1, 256, 28, 28] [512, 256, 1, 1]     |  1932.3 |  1524.2
4 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  1198.8 |   785.6
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  1305.0 |   901.6
  [1, 64, 56, 56] [256, 64, 1, 1]       |   791.0 |   472.9
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  1311.2 |   908.5
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  3958.6 |   997.7
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  4099.6 |  1023.1
  [1, 256, 28, 28] [512, 256, 1, 1]     |  1120.3 |   740.8

Times are in microseconds (us).

[--------------------- conv2d_7x7 ---------------------]
                                   |  base  |  diff
1 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |  209.3 |  229.3
  [1, 3, 384, 288] [64, 3, 7, 7]   |   68.9 |   72.3
2 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |  116.0 |  117.6
  [1, 3, 384, 288] [64, 3, 7, 7]   |   40.4 |   38.7
4 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |   64.2 |   66.5
  [1, 3, 384, 288] [64, 3, 7, 7]   |   21.4 |   21.9

Times are in milliseconds (ms).

[---------------------------- conv3d ---------------------------]
                                            |  base  |  diff
1 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  602.8 |  296.2
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   52.5 |   26.5
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   50.0 |   50.3
2 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  351.0 |  168.1
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   38.5 |   14.9
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   24.8 |   26.2
4 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  212.6 |   96.0
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   21.5 |    7.6
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   12.7 |   13.3

Times are in milliseconds (ms).
```

Reviewed By: jansel
Differential Revision: D24452071
fbshipit-source-id: 85c32d3b582cd18a6e4e91f1c7c9670488bfac26
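For reference, a minimal sketch (not the script used to produce the tables above) of how per-thread-count timings like these could be collected with torch.utils.benchmark; the two shapes are taken from the conv2d_1x1 table, while the repetition count and everything else are arbitrary choices for illustration:

```python
# Sketch: time conv2d on a couple of the 1x1 shapes from the table above at
# several thread counts. Absolute numbers will differ from the table.
import torch
from torch.utils import benchmark

shapes = [
    ((1, 128, 56, 56), (256, 128, 1, 1)),
    ((1, 1024, 7, 7), (2048, 1024, 1, 1)),
]

results = []
for num_threads in (1, 2, 4):
    for input_shape, weight_shape in shapes:
        x = torch.randn(input_shape)
        w = torch.randn(weight_shape)
        results.append(
            benchmark.Timer(
                stmt="torch.nn.functional.conv2d(x, w)",
                globals={"torch": torch, "x": x, "w": w},
                num_threads=num_threads,
                label="conv2d_1x1",
                sub_label=f"{list(input_shape)} {list(weight_shape)}",
            ).timeit(100)
        )

# Group the measurements by shape and thread count, like the tables above.
benchmark.Compare(results).print()
```

Running this against a before/after build of PyTorch would give the "base" and "diff" columns, respectively.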
Force-pushed from b94a6d1 to 2079f69
Great, can we add the benchmark results for the shape in #40610? Thanks.

Sure thing, in progress now.

@lly-zero-one Here are the benchmark results from that PR:
Overall it looks like there are several nice wins. A few rows show tiny slowdowns, but that's almost certainly noise.

Also, here are benchmarks for some more popular shapes (1x1 2d, 7x7 2d, and 3d):

We will soon be updating ideep and oneDNN to improve NCHW performance, so some of that performance tuning and algorithm work will overlap with your effort here. Ping me or @pinzhenx on Slack and we can help update your detailed data. 😊
This pull request has been merged in 2397c8d. |
Summary:
Pull Request resolved: #52909

PR #46675 introduced heuristics to use thnn_conv2d for 1x1 convolutions, since mkldnn had a bug that was slowing those cases down. Unfortunately, the test plan for that PR only tested single-threaded convolutions; mkldnn is considerably faster on multithreaded convolutions. An example from yolov3, on 24 cores of a Xeon Platinum 8175M CPU @ 2.50GHz:
```
input:{1, 64, 192, 256}, weight:{32, 64, 1, 1}
thnn_conv2d:        GFLOPS/s=104.574G/s
mkldnn_convolution: GFLOPS/s=467.357G/s
```
ghstack-source-id: 122627564

Test Plan: Multithreaded 1x1 convolutions

Reviewed By: wconstab, xuzhao9

Differential Revision: D26685272

fbshipit-source-id: e8e05db89e43856969e26570a170c13b3e73ac74
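For context, a minimal sketch of how a GFLOPS/s figure like the ones above could be estimated for this shape with torch.utils.benchmark. The FLOP count (2·N·Cout·Cin·H·W for a stride-1, unpadded 1x1 conv) is an assumption of the sketch, and it times whichever backend the convolution heuristic dispatches to rather than forcing thnn_conv2d or mkldnn_convolution individually:

```python
# Sketch: estimate GFLOPS/s for the yolov3 1x1 shape quoted above.
import torch
from torch.utils import benchmark

N, Cin, H, W, Cout = 1, 64, 192, 256, 32
x = torch.randn(N, Cin, H, W)
w = torch.randn(Cout, Cin, 1, 1)

# One multiply-add counted as 2 FLOPs (assumption of this sketch).
flops = 2 * N * Cout * Cin * H * W

t = benchmark.Timer(
    stmt="torch.nn.functional.conv2d(x, w)",
    globals={"torch": torch, "x": x, "w": w},
    num_threads=torch.get_num_threads(),
).timeit(100)

print(f"GFLOPS/s = {flops / t.median / 1e9:.3f}")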
Summary:
We've found a few heuristics for using/not using mkldnn that seem to generally
improve performance on 2d and 3d conv.
- 1x1 convolutions are basically batch matmuls, and mkldnn's implementation appears to usually be slower than using the native conv (which lowers to aten::mm, which in turn calls mkl gemm); see the sketch after this list.
- 3d conv was often not using mkldnn even when it's beneficial, because the heuristic was checking the kernel depth rather than height/width. mkldnn seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which are allowed by the new heuristic.
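As a quick illustration of the 1x1 point, a stride-1, unpadded 1x1 convolution computes the same result as a plain matmul over the flattened spatial positions (this shows only the mathematical equivalence, not the exact aten::mm lowering path):

```python
# Sketch: a 1x1, stride-1, unpadded conv2d equals a matmul over H*W positions.
import torch

N, Cin, H, W, Cout = 1, 64, 56, 56, 256
x = torch.randn(N, Cin, H, W)
w = torch.randn(Cout, Cin, 1, 1)

conv_out = torch.nn.functional.conv2d(x, w)

# (Cout, Cin) @ (N, Cin, H*W) broadcasts over the batch dimension.
mm_out = (w.view(Cout, Cin) @ x.view(N, Cin, H * W)).view(N, Cout, H, W)

print(torch.allclose(conv_out, mm_out, atol=1e-4))  # True up to float error
```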
Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893
I need to figure out the right way to share notebooks, and also probably clean
this up into a simple text table for GitHub...
Also, I've run a conv fuzzer, and it generally supports these heuristics. I'm
not sure how to best share the data since there's a lot of it (I tried about
50k parameter combinations).
For the 1x1 case, about 70% were faster with "native". I played with
constructing a decision tree (using scikit-learn) and found that switching back
to MKL for batch size > 16 might be slightly better still, but I'm not sure
it's worth complicating the heuristic.
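For illustration only, a minimal sketch of how such a decision tree could be fit. The CSV path and column names (batch, in_channels, out_channels, height, width, mkldnn_us, native_us) are hypothetical stand-ins for the fuzzer output, not the real data:

```python
# Hypothetical sketch: fit a shallow decision tree on conv fuzzer timings to
# look for a simple mkldnn-vs-native rule. File name and columns are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("conv_fuzzer_results.csv")  # hypothetical fuzzer output
features = ["batch", "in_channels", "out_channels", "height", "width"]
label = df["mkldnn_us"] < df["native_us"]  # True where mkldnn is faster

tree = DecisionTreeClassifier(max_depth=3).fit(df[features], label)
print(export_text(tree, feature_names=features))
```

A shallow tree like this reads off as one or two threshold checks, which is the kind of batch-size split mentioned above.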
Reviewed By: jansel
Differential Revision: D24452071
Thanks to:
- @jansel for finding these convolution shapes, showing that they were underperforming, and developing optimized implementations
- @ngimel for realizing that the MKL heuristics were broken in the 3d conv case
- @robieta for providing the conv fuzzing and benchmarking tools