Conversation

@bertmaher (Contributor) commented Oct 21, 2020

Summary:
We've found a few heuristics for when to use (or not use) mkldnn that generally
improve performance for 2d and 3d convolutions.

  • 1x1 convolutions are basically batched matmuls, and mkldnn's implementation
    usually appears to be slower than the native conv (which lowers to
    aten::mm, which in turn calls MKL gemm); see the sketch after this list.

  • 3d conv often wasn't using mkldnn even when it was beneficial, because the
    heuristic checked the kernel depth rather than the height/width. mkldnn
    seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which the
    new heuristic allows.
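To make the first point concrete, here is a minimal sketch (not code from this
PR) showing that a 1x1 convolution is numerically just a matmul over the
channel dimension:

```
import torch

# A 1x1 convolution with stride 1 and no padding is a matmul over channels:
# flatten the spatial dims and multiply by the [C_out, C_in] weight matrix.
x = torch.randn(1, 64, 56, 56)   # [N, C_in, H, W]
w = torch.randn(256, 64, 1, 1)   # [C_out, C_in, 1, 1]

y_conv = torch.nn.functional.conv2d(x, w)

n, c_in, h, wd = x.shape
c_out = w.shape[0]
y_mm = (w.view(c_out, c_in) @ x.view(n, c_in, h * wd)).view(n, c_out, h, wd)

assert torch.allclose(y_conv, y_mm, atol=1e-3)
```

This is why routing these shapes to the native path (and thus MKL gemm) can
beat mkldnn's convolution kernels.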

Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893

I need to figure out the right way to share notebooks, and also probably clean
this up into a simple text table for GitHub...

Also, I've run a conv fuzzer, and its results generally support these
heuristics. I'm not sure of the best way to share the data, since there's a lot
of it (I tried about 50k parameter combinations); a sketch of the approach is
below.
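For reference, here is a minimal sketch of that kind of fuzzing loop using
torch.utils.benchmark (the tooling @robieta provided); the grid below is
illustrative, not the actual ~50k-point sweep:

```
import itertools

import torch
from torch.utils import benchmark

# Time a small grid of 1x1 conv shapes and print a comparison table.
results = []
for c_in, c_out, hw in itertools.product([64, 256], [256, 1024], [14, 56]):
    x = torch.randn(1, c_in, hw, hw)
    w = torch.randn(c_out, c_in, 1, 1)
    timer = benchmark.Timer(
        stmt="torch.nn.functional.conv2d(x, w)",
        globals={"torch": torch, "x": x, "w": w},
        label="conv2d_1x1",
        sub_label=f"[1, {c_in}, {hw}, {hw}] [{c_out}, {c_in}, 1, 1]",
    )
    results.append(timer.blocked_autorange(min_run_time=0.5))
benchmark.Compare(results).print()
```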

For the 1x1 case, about 70% of the sampled configurations were faster with
"native". I played with constructing a decision tree (using scikit-learn) and
found that switching back to MKL for batch sizes > 16 might be slightly better
still, but I'm not sure it's worth complicating the heuristic. Roughly, the
dispatch rule looks like the sketch below.
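The following is an illustrative sketch of that dispatch rule; the function
name, thresholds, and the batch-size branch are hypothetical (the batch-size
refinement was not adopted), not PyTorch's actual internal code:

```
def prefer_mkldnn(weight_shape, batch_size):
    # weight_shape: (out_channels, in_channels, *kernel_dims), where
    # kernel_dims is (kH, kW) for 2d conv or (kD, kH, kW) for 3d conv.
    kernel = weight_shape[2:]
    if all(k == 1 for k in kernel):
        # 1x1 convs lower to a gemm; the native path is usually faster,
        # though the fuzzer data hinted MKL wins again for batch size > 16.
        return batch_size > 16
    # For 3d conv, check the spatial height/width rather than the kernel
    # depth, so (1, 7, 7) and (3, 7, 7) kernels qualify for mkldnn.
    return all(k >= 3 for k in kernel[-2:])
```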

Reviewed By: jansel

Differential Revision: D24452071

Thanks to:
@jansel for finding these convolution shapes, showing that they were underperforming, and developing optimized implementations
@ngimel for realizing that the MKL heuristics were broken in the 3d conv case
@robieta for providing the conv fuzzing and benchmarking tools

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24452071

@dr-ci bot commented Oct 21, 2020

💊 CI failures summary and remediations

As of commit 2079f69 (more details on the Dr. CI page):

None of the CI failures appear to be your fault 💚

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is newer than viable/strict, you can try basing on an older, stable commit:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
```

If your commit is older than viable/strict:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI.


@bertmaher bertmaher requested a review from robieta October 21, 2020 21:07

@lly-zero-one (Contributor)

Great, can we add the benchmark results for the shapes in #40610? Thanks.

@bertmaher (Contributor, Author)

> Great, can we add the benchmark results for the shapes in #40610? Thanks.

Sure thing, in progress now.

@bertmaher (Contributor, Author)

@lly-zero-one Here are the benchmark results from that PR (each row appears to
list N, C_in, H, W, C_out, kH, kW, strideH, strideW, padH, padW, groups):

```
[------------------------------- pr40611 -------------------------------]
                                                     |   base   |   diff
1 threads: --------------------------------------------------------------
      [1, 3, 224, 224, 64, 7, 7, 2, 2, 3, 3, 1]      |  4085.8  |  4109.8
      [1, 64, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]      |  1361.5  |   876.2
      [1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]    |  4472.6  |  4511.7
      [1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]     |  3709.7  |  3217.5
      [1, 64, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]      |  2316.4  |  1668.3
      [1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]     |  3717.6  |  3000.9
      [1, 256, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]     |  6736.8  |  5815.6
      [1, 256, 56, 56, 256, 3, 3, 2, 2, 1, 1, 32]    |  1831.3  |  1853.2
      [1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]     |  3261.2  |  2816.4
      [1, 256, 56, 56, 512, 1, 1, 2, 2, 0, 0, 1]     |  3695.8  |  3658.0
      [1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]     |  3247.3  |  2846.1
      [1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]    |  1490.2  |  1475.2
      [1, 512, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]     |  6171.7  |  5465.1
      [1, 512, 28, 28, 512, 3, 3, 2, 2, 1, 1, 32]    |  1141.6  |  1157.5
      [1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]    |  3205.5  |  3052.0
      [1, 512, 28, 28, 1024, 1, 1, 2, 2, 0, 0, 1]    |  3595.8  |  3409.8
      [1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]    |  3237.4  |  3214.2
      [1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]    |   980.1  |  1043.9
      [1, 1024, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]   |  6582.8  |  5898.5
      [1, 1024, 14, 14, 1024, 3, 3, 2, 2, 1, 1, 32]  |   980.2  |   980.2
      [1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]     |  8376.2  |  3998.3
      [1, 1024, 14, 14, 2048, 1, 1, 2, 2, 0, 0, 1]   |  8916.0  |  7748.4
      [1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1]     |  8085.1  |  3953.2
      [1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32]    |   829.3  |   884.9

Times are in microseconds (us).
```

@bertmaher (Contributor, Author)

Overall, it looks like there are several nice wins. A few rows show tiny
slowdowns, but those are almost certainly noise.

@bertmaher (Contributor, Author)
Also, here are benchmarks for some more popular shapes (1x1 2d, 7x7 2d, and 3d):

```
[------------------------- conv2d_1x1 ------------------------]
                                           |   base   |   diff
1 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  3665.3  |  2838.4
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  3174.7  |  3164.0
      [1, 64, 56, 56] [256, 64, 1, 1]      |  2249.1  |  1468.8
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  3158.2  |  3147.7
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  8191.8  |  3973.9
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  7901.2  |  3861.6
      [1, 256, 28, 28] [512, 256, 1, 1]    |  3103.9  |  2775.9
2 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  1973.7  |  1475.8
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  2265.0  |  1603.0
      [1, 64, 56, 56] [256, 64, 1, 1]      |  1445.4  |   789.8
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  2298.8  |  1620.0
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  6350.7  |  1995.0
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  6471.2  |  1903.7
      [1, 256, 28, 28] [512, 256, 1, 1]    |  1932.3  |  1524.2
4 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  1198.8  |   785.6
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  1305.0  |   901.6
      [1, 64, 56, 56] [256, 64, 1, 1]      |   791.0  |   472.9
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  1311.2  |   908.5
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  3958.6  |   997.7
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  4099.6  |  1023.1
      [1, 256, 28, 28] [512, 256, 1, 1]    |  1120.3  |   740.8

Times are in microseconds (us).

[--------------------- conv2d_7x7 ---------------------]
                                      |   base  |   diff
1 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |  209.3  |  229.3
      [1, 3, 384, 288] [64, 3, 7, 7]  |   68.9  |   72.3
2 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |  116.0  |  117.6
      [1, 3, 384, 288] [64, 3, 7, 7]  |   40.4  |   38.7
4 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |   64.2  |   66.5
      [1, 3, 384, 288] [64, 3, 7, 7]  |   21.4  |   21.9

Times are in milliseconds (ms).

[---------------------------- conv3d ---------------------------]
                                               |   base  |   diff
1 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  602.8  |  296.2
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   52.5  |   26.5
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   50.0  |   50.3
2 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  351.0  |  168.1
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   38.5  |   14.9
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   24.8  |   26.2
4 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  212.6  |   96.0
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   21.5  |    7.6
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   12.7  |   13.3

Times are in milliseconds (ms).
```

@CaoZhongZ (Contributor)
We will soon be updating ideep and oneDNN with NCHW performance improvements, and some of that performance tuning and algorithm work will overlap with yours. Ping me or @pinzhenx on Slack and we can update your detailed data. 😊

@facebook-github-bot (Contributor)

This pull request has been merged in 2397c8d.

facebook-github-bot pushed a commit that referenced this pull request Mar 1, 2021
Summary:
Pull Request resolved: #52909

PR #46675 introduced heuristics to use thnn_conv2d for 1x1
convolutions, since mkldnn had a bug that was slowing those cases
down. Unfortunately, the test plan for that PR only tested single-threaded
convolutions; mkldnn is considerably faster on multithreaded convolutions.

An example from yolov3, on 24 cores of a Xeon Platinum 8175M CPU @ 2.50GHz
```
input:{1, 64, 192, 256}, weight:{32, 64, 1, 1}
thnn_conv2d: GFLOPS/s=104.574G/s
mkldnn_convolution: GFLOPS/s=467.357G/s
```
ghstack-source-id: 122627564

Test Plan: Multithreaded 1x1 convolutions

Reviewed By: wconstab, xuzhao9

Differential Revision: D26685272

fbshipit-source-id: e8e05db89e43856969e26570a170c13b3e73ac74
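For context, a minimal sketch of how one might reproduce that single- versus
multi-threaded comparison with torch.utils.benchmark (shapes from the yolov3
example above; the numbers are machine-dependent):

```
import torch
from torch.utils import benchmark

# yolov3 1x1 conv: [1, 64, 192, 256] input, [32, 64, 1, 1] weight.
x = torch.randn(1, 64, 192, 256)
w = torch.randn(32, 64, 1, 1)

for num_threads in (1, 24):
    timer = benchmark.Timer(
        stmt="torch.nn.functional.conv2d(x, w)",
        globals={"torch": torch, "x": x, "w": w},
        num_threads=num_threads,
        label="conv2d_1x1 (yolov3)",
        description=f"{num_threads} threads",
    )
    print(timer.blocked_autorange())
```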
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021