[pytorch] Improve/fix heuristics for using mkldnn vs native conv #46675
Conversation
This pull request was exported from Phabricator. Differential Revision: D24452071
Force-pushed from 2dc3af7 to 907aa6c
💊 CI failures summary and remediations
As of commit 2079f69 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 ongoing upstream failure: probably caused by upstream breakages that are not fixed yet
🚧 1 fixed upstream failure: probably caused by upstream breakages that were already fixed
Force-pushed from 907aa6c to 77c0379
Force-pushed from 77c0379 to b94a6d1
[pytorch] Improve/fix heuristics for using mkldnn vs native conv (pytorch#46675)

Summary:
Pull Request resolved: pytorch#46675

We've found a few heuristics for using/not using mkldnn that seem to generally improve performance on 2d and 3d conv.
- 1x1 convolutions are basically batch matmuls, and mkldnn's implementation appears to usually be slower than using the native conv (which lowers to aten::mm, which in turn calls mkl gemm).
- 3d conv was often not using mkldnn even when it's beneficial, because the heuristic was checking the kernel depth rather than height/width. mkldnn seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which are allowed by the new heuristic.

Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893

Also, I've run a conv fuzzer, and it generally supports these heuristics. I'm not sure how to best share the data since there's a lot of it (I tried about 50k parameter combinations). For the 1x1 case, about 70% were faster with "native". I played with constructing a decision tree (using scikit-learn) and found that switching back to MKL for batch size > 16 might be slightly better still, but I'm not sure it's worth complicating the heuristic.

Results for some popular shapes in tabular format:
```
[------------------------- conv2d_1x1 ------------------------]
                                        |   base  |   diff
1 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  3665.3 |  2838.4
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  3174.7 |  3164.0
  [1, 64, 56, 56] [256, 64, 1, 1]       |  2249.1 |  1468.8
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  3158.2 |  3147.7
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  8191.8 |  3973.9
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  7901.2 |  3861.6
  [1, 256, 28, 28] [512, 256, 1, 1]     |  3103.9 |  2775.9
2 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  1973.7 |  1475.8
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  2265.0 |  1603.0
  [1, 64, 56, 56] [256, 64, 1, 1]       |  1445.4 |   789.8
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  2298.8 |  1620.0
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  6350.7 |  1995.0
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  6471.2 |  1903.7
  [1, 256, 28, 28] [512, 256, 1, 1]     |  1932.3 |  1524.2
4 threads: ----------------------------------------------------
  [1, 128, 56, 56] [256, 128, 1, 1]     |  1198.8 |   785.6
  [1, 512, 14, 14] [1024, 512, 1, 1]    |  1305.0 |   901.6
  [1, 64, 56, 56] [256, 64, 1, 1]       |   791.0 |   472.9
  [1, 1024, 14, 14] [512, 1024, 1, 1]   |  1311.2 |   908.5
  [1, 1024, 7, 7] [2048, 1024, 1, 1]    |  3958.6 |   997.7
  [1, 2048, 7, 7] [1024, 2048, 1, 1]    |  4099.6 |  1023.1
  [1, 256, 28, 28] [512, 256, 1, 1]     |  1120.3 |   740.8

Times are in microseconds (us).

[--------------------- conv2d_7x7 ---------------------]
                                   |  base  |  diff
1 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |  209.3 |  229.3
  [1, 3, 384, 288] [64, 3, 7, 7]   |   68.9 |   72.3
2 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |  116.0 |  117.6
  [1, 3, 384, 288] [64, 3, 7, 7]   |   40.4 |   38.7
4 threads: ---------------------------------------------
  [25, 3, 48, 320] [64, 3, 7, 7]   |   64.2 |   66.5
  [1, 3, 384, 288] [64, 3, 7, 7]   |   21.4 |   21.9

Times are in milliseconds (ms).

[---------------------------- conv3d ---------------------------]
                                            |  base  |  diff
1 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  602.8 |  296.2
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   52.5 |   26.5
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   50.0 |   50.3
2 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  351.0 |  168.1
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   38.5 |   14.9
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   24.8 |   26.2
4 threads: ------------------------------------------------------
  [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]     |  212.6 |   96.0
  [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]      |   21.5 |    7.6
  [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]   |   12.7 |   13.3

Times are in milliseconds (ms).
```

Reviewed By: jansel
Differential Revision: D24452071
fbshipit-source-id: 85c32d3b582cd18a6e4e91f1c7c9670488bfac26
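For reference, a minimal sketch (not the script used to produce the tables above) of how per-thread-count timings like these could be collected with torch.utils.benchmark; the two shapes are taken from the conv2d_1x1 table, while the repetition count and everything else are arbitrary choices for illustration:

```python
# Sketch: time conv2d on a couple of the 1x1 shapes from the table above at
# several thread counts. Absolute numbers will differ from the table.
import torch
from torch.utils import benchmark

shapes = [
    ((1, 128, 56, 56), (256, 128, 1, 1)),
    ((1, 1024, 7, 7), (2048, 1024, 1, 1)),
]

results = []
for num_threads in (1, 2, 4):
    for input_shape, weight_shape in shapes:
        x = torch.randn(input_shape)
        w = torch.randn(weight_shape)
        results.append(
            benchmark.Timer(
                stmt="torch.nn.functional.conv2d(x, w)",
                globals={"torch": torch, "x": x, "w": w},
                num_threads=num_threads,
                label="conv2d_1x1",
                sub_label=f"{list(input_shape)} {list(weight_shape)}",
            ).timeit(100)
        )

# Group the measurements by shape and thread count, like the tables above.
benchmark.Compare(results).print()
```

Running this against a before/after build of PyTorch would give the "base" and "diff" columns, respectively.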
Force-pushed from b94a6d1 to 2079f69
Great, can we add the benchmark results for the shape in #40610? Thanks.

Sure thing, in progress now.

@lly-zero-one Here are the benchmark results from that PR:
Overall it looks like there are several nice wins. A few rows show tiny slowdowns, but that's almost certainly noise.

Also, here are benchmarks for some more popular shapes (1x1 2d, 7x7 2d, and 3d):

We will soon be updating ideep and oneDNN to improve NCHW performance, so some of that performance tuning and algorithm work will overlap with your effort here. Ping me or @pinzhenx on Slack and we can help update your detailed data. 😊
This pull request has been merged in 2397c8d. |
Summary:
Pull Request resolved: #52909

PR #46675 introduced heuristics to use thnn_conv2d for 1x1 convolutions, since mkldnn had a bug that was slowing those cases down. Unfortunately, the test plan for that PR only tested single-threaded convolutions; mkldnn is considerably faster on multithreaded convolutions. An example from yolov3, on 24 cores of a Xeon Platinum 8175M CPU @ 2.50GHz:
```
input:{1, 64, 192, 256}, weight:{32, 64, 1, 1}
thnn_conv2d:        GFLOPS/s=104.574G/s
mkldnn_convolution: GFLOPS/s=467.357G/s
```
ghstack-source-id: 122627564

Test Plan: Multithreaded 1x1 convolutions

Reviewed By: wconstab, xuzhao9

Differential Revision: D26685272

fbshipit-source-id: e8e05db89e43856969e26570a170c13b3e73ac74
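For context, a minimal sketch of how a GFLOPS/s figure like the ones above could be estimated for this shape with torch.utils.benchmark. The FLOP count (2·N·Cout·Cin·H·W for a stride-1, unpadded 1x1 conv) is an assumption of the sketch, and it times whichever backend the convolution heuristic dispatches to rather than forcing thnn_conv2d or mkldnn_convolution individually:

```python
# Sketch: estimate GFLOPS/s for the yolov3 1x1 shape quoted above.
import torch
from torch.utils import benchmark

N, Cin, H, W, Cout = 1, 64, 192, 256, 32
x = torch.randn(N, Cin, H, W)
w = torch.randn(Cout, Cin, 1, 1)

# One multiply-add counted as 2 FLOPs (assumption of this sketch).
flops = 2 * N * Cout * Cin * H * W

t = benchmark.Timer(
    stmt="torch.nn.functional.conv2d(x, w)",
    globals={"torch": torch, "x": x, "w": w},
    num_threads=torch.get_num_threads(),
).timeit(100)

print(f"GFLOPS/s = {flops / t.median / 1e9:.3f}")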
Summary:
We've found a few heuristics for using/not using mkldnn that seem to generally
improve performance on 2d and 3d conv.
- 1x1 convolutions are basically batch matmuls, and mkldnn's implementation appears to usually be slower than using the native conv (which lowers to aten::mm, which in turn calls mkl gemm); see the sketch after this list.
- 3d conv was often not using mkldnn even when it's beneficial, because the heuristic was checking the kernel depth rather than height/width. mkldnn seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which are allowed by the new heuristic.
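As a quick illustration of the 1x1 point, a stride-1, unpadded 1x1 convolution computes the same result as a plain matmul over the flattened spatial positions (this shows only the mathematical equivalence, not the exact aten::mm lowering path):

```python
# Sketch: a 1x1, stride-1, unpadded conv2d equals a matmul over H*W positions.
import torch

N, Cin, H, W, Cout = 1, 64, 56, 56, 256
x = torch.randn(N, Cin, H, W)
w = torch.randn(Cout, Cin, 1, 1)

conv_out = torch.nn.functional.conv2d(x, w)

# (Cout, Cin) @ (N, Cin, H*W) broadcasts over the batch dimension.
mm_out = (w.view(Cout, Cin) @ x.view(N, Cin, H * W)).view(N, Cout, H, W)

print(torch.allclose(conv_out, mm_out, atol=1e-4))  # True up to float error
```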
Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893
I need to figure out the right way to share notebooks, and also probably clean
this up into a simple text table for GitHub...
Also, I've run a conv fuzzer, and it generally supports these heuristics. I'm
not sure how to best share the data since there's a lot of it (I tried about
50k parameter combinations).
For the 1x1 case, about 70% were faster with "native". I played with
constructing a decision tree (using scikit-learn) and found that switching back
to MKL for batch size > 16 might be slightly better still, but I'm not sure
it's worth complicating the heuristic.
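For illustration only, a minimal sketch of how such a decision tree could be fit. The CSV path and column names (batch, in_channels, out_channels, height, width, mkldnn_us, native_us) are hypothetical stand-ins for the fuzzer output, not the real data:

```python
# Hypothetical sketch: fit a shallow decision tree on conv fuzzer timings to
# look for a simple mkldnn-vs-native rule. File name and columns are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("conv_fuzzer_results.csv")  # hypothetical fuzzer output
features = ["batch", "in_channels", "out_channels", "height", "width"]
label = df["mkldnn_us"] < df["native_us"]  # True where mkldnn is faster

tree = DecisionTreeClassifier(max_depth=3).fit(df[features], label)
print(export_text(tree, feature_names=features))
```

A shallow tree like this reads off as one or two threshold checks, which is the kind of batch-size split mentioned above.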
Reviewed By: jansel
Differential Revision: D24452071
Thanks to:
- @jansel for finding these convolution shapes, showing that they were underperforming, and developing optimized implementations
- @ngimel for realizing that the MKL heuristics were broken in the 3d conv case
- @robieta for providing the conv fuzzing and benchmarking tools