
Conversation

@XiaobingSuper (Collaborator) commented Mar 30, 2020

Stack from ghstack:

Differential Revision: D22102408

[ghstack-poisoned]
@dr-ci (bot) commented Mar 30, 2020

💊 CI failures summary and remediations

As of commit 91a0981 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_bazel_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 19 02:23:59 TIMEOUT: //:integration_test (Summary)
Jun 19 02:23:59              for (int k = 0; k < cross_chunk_shuffle_count; ++k) { 
Jun 19 02:23:59                              ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~ 
Jun 19 02:23:59 test/cpp/api/dataloader.cpp:2204:13: warning: unused variable 'offset' [-Wunused-variable] 
Jun 19 02:23:59          int offset = 0; 
Jun 19 02:23:59              ^~~~~~ 
Jun 19 02:23:59 test/cpp/api/dataloader.cpp: In member function 'virtual void DataLoaderTest_CustomPreprocessPolicy_Test::TestBody()': 
Jun 19 02:23:59 test/cpp/api/dataloader.cpp:2294:29: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] 
Jun 19 02:23:59            for (int i = 0; i < batch_result.size(); i += chunk_size) { 
Jun 19 02:23:59                            ~~^~~~~~~~~~~~~~~~~~~~~ 
Jun 19 02:23:59  
Jun 19 02:23:59 TIMEOUT: //:integration_test (Summary) 
Jun 19 02:23:59       /var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/bazel-out/k8-fastbuild/testlogs/integration_test/test.log 
Jun 19 02:23:59 INFO: From Testing //:integration_test: 
Jun 19 02:23:59 ==================== Test output for //:integration_test: 
Jun 19 02:23:59 Running main() from gmock_main.cc 
Jun 19 02:23:59 Note: Google Test filter = -*CUDA 
Jun 19 02:23:59 [==========] Running 1 test from 1 test suite. 
Jun 19 02:23:59 [----------] Global test environment set-up. 
Jun 19 02:23:59 [----------] 1 test from IntegrationTest 
Jun 19 02:23:59 [ RUN      ] IntegrationTest.CartPole 
Jun 19 02:23:59 -- Test timed out at 2020-06-19 02:23:44 UTC -- 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@XiaobingSuper (Collaborator, Author) commented Mar 30, 2020

@ngimel, @VitalyFedyunin, this PR enables the DNNL 3d ops, including conv, pooling, and batchnorm. For resnext3d-101, tested on the real dataset UCF101 (input size 10x3x32x128x170), we get a ~13x performance improvement compared with the native CPU path on SKX-8180. You can see the details in resnext3d-101. Thanks!

@vincentqb (Contributor)

@VitalyFedyunin -- could you review this PR?

@vincentqb vincentqb added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 30, 2020
@XiaobingSuper XiaobingSuper requested a review from ngimel March 31, 2020 07:12
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, please help review this code, thanks!

@yinghai (Contributor) commented Apr 1, 2020

@lly-zero-one Could you comment on the perf side?

@lly-zero-one (Contributor)

We have a few internal changes to the current Conv3d implementation for performance improvement, which will be upstreamed next week. So I am wondering whether we could do a full performance benchmark. For the 2d case, we found the mkldnn conv is 2x slower than the native implementation on a specific production model (I will file a repro).

@mingfeima (Collaborator)

> We have a few internal changes to the current Conv3d implementation for performance improvement, which will be upstreamed next week. So I am wondering whether we could do a full performance benchmark. For the 2d case, we found the mkldnn conv is 2x slower than the native implementation on a specific production model (I will file a repro).

A 2x performance diff with the native implementation is serious... In the future, if you have similar issues, you can also raise them in the Teams channel; we will get hands on it asap.
@Jianhui-Li, @uyongw, @jgong5

@lly-zero-one (Contributor)

#35937 is for tracking the issue.

@VitalyFedyunin (Contributor) left a comment

This is inconsistent with the approach we use for operator naming: we always explicitly specify 1d, 2d, and 3d operators and let the Python nn module dispatch to the proper one.

Using this approach, you will not only follow the convention but also avoid introducing backward-incompatible changes.

This comment applies to all PRs in the stack.
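
For readers, here is a minimal sketch (not code from this PR) of the convention being asked for: dimension-specific operators, with the choice made on the Python side, using the plain torch.conv2d/torch.conv3d ops as stand-ins for the mkldnn-specific ones.

```python
# Sketch only: explicitly named 2d/3d operators, selected by the Python layer
# based on input dimensionality, instead of one dimension-agnostic entry point.
import torch

def conv_dispatch(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    if x.dim() == 4:   # N, C, H, W -> 2d operator
        return torch.conv2d(x, weight, bias, stride, padding, dilation, groups)
    if x.dim() == 5:   # N, C, D, H, W -> 3d operator
        return torch.conv3d(x, weight, bias, stride, padding, dilation, groups)
    raise ValueError("unsupported input dimension: {}".format(x.dim()))
```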

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

@XiaobingSuper XiaobingSuper requested a review from albanD May 14, 2020 01:15
@XiaobingSuper (Collaborator, Author)

@pinzhenx pinzhenx mentioned this pull request Jun 16, 2020
facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2020
Summary:
- Bump DNNL to 1.5
- Bug fixes and improvements in ideep
  - suppress g++ Wreorder warning
  - avoid rebuilding `libmkldnn.so` uxlfoundation/oneDNN#743
  - enable conv3d (integration code was checked in by Xiaobing #35662)
Pull Request resolved: #40088

Differential Revision: D22071530

Pulled By: albanD

fbshipit-source-id: e7a53d7421e8a7a03e36a7dfb68edc565a2f00df
@XiaobingSuper (Collaborator, Author)

@ngimel, please help merge those PRs, thanks!

        self.bias = state[1].to_mkldnn()
        self.training = state[2]

class MkldnnConv3d(torch.jit.ScriptModule):
A Member commented:

FYI I believe this is kind of a legacy API as we now compile nn Modules recursively. Cc @eellison

A Contributor commented:

Yeah, correct. It's better to inherit from torch.nn.Module; you shouldn't need any other changes.
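
For reference, a tiny example (using the public API, not code from this PR) of the recursive-scripting approach being suggested, where a plain torch.nn.Module is compiled with torch.jit.script instead of subclassing ScriptModule:

```python
# Sketch of the suggested approach: no ScriptModule subclass needed, because
# torch.jit.script compiles nn.Module instances (and their submodules) recursively.
import torch
import torch.nn as nn

class TinyConv3d(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

scripted = torch.jit.script(TinyConv3d())
print(scripted(torch.randn(1, 3, 4, 8, 8)).shape)  # torch.Size([1, 8, 4, 8, 8])
```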

@XiaobingSuper (Collaborator, Author) replied:

Thanks, I will change it, and also the other cases in a next step.

@XiaobingSuper (Collaborator, Author) replied:

@eellison, if it inherits from torch.nn.Module, there will be a problem with the torch.jit.save method: for an MKLDNN module, the parameters are MKLDNN tensors, which are opaque tensors (they do not have storage), so we first call .to_dense() in __getstate__ to save this script module. I will change it once this problem can be solved. Thanks!
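
To make the constraint concrete, here is a rough sketch (simplified from the pattern used in this PR stack; the weight-reorder call and constructor details are omitted) of how the script module round-trips its parameters through dense tensors for serialization:

```python
# Because MKLDNN tensors are opaque (no storage), __getstate__ converts the
# parameters back to dense tensors so torch.jit.save can serialize the module,
# and __setstate__ converts them back to the MKLDNN layout on load.
import torch

class MkldnnConv3dSketch(torch.jit.ScriptModule):
    def __init__(self, dense_module):
        super().__init__()
        self.register_buffer('weight', dense_module.weight.to_mkldnn())
        self.register_buffer('bias', dense_module.bias.to_mkldnn())

    @torch.jit.script_method
    def __getstate__(self):
        return (self.weight.to_dense(), self.bias.to_dense(), self.training)

    @torch.jit.script_method
    def __setstate__(self, state):
        self.weight = state[0].to_mkldnn()
        self.bias = state[1].to_mkldnn()
        self.training = state[2]
```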


    @torch.jit.script_method
    def forward(self, x):
        return torch.conv3d(
A Collaborator commented:

What would happen here if the parameters are not supported by mkldnn (use_mkldnn would return false, e.g. because of dilation, or because x is the wrong type), but the weight is already reordered?
Also, suppose mkldnn_convolution is indeed called from Convolution.cpp; what happens next? In mkldnn_convolution there's only

  ideep::tensor mkldnn_output = _mkldnn_conv2d(
      mkldnn_input,
      mkldnn_weight,
      mkldnn_bias,
      padding,
      stride,
      dilation,
      groups);

If it is able to handle conv3d, then at the very least it is confusingly named.

@XiaobingSuper (Collaborator, Author) replied:

For the first question, DNNL also supports dilation for convNd, so I can enable it. Then there is only one case not supported by DNNL: x is not a float tensor. But in that case the weight should also be a non-float tensor, and an error will be reported to the user when reordering the weight, because the reorder needs to call .to_mkldnn() first, which checks the tensor's type.

For the second question: yes, the name is confusing, I will change it.
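
For illustration, a small check of the type-check path mentioned above (the exact error message is not guaranteed; only the fact that non-float tensors are rejected is relied on here):

```python
# The reorder path calls .to_mkldnn() on the weight, and to_mkldnn() only accepts
# float tensors, so a double weight fails up front with a user-visible error.
import torch

w_float = torch.randn(8, 3, 3, 3, 3)                        # float32 conv3d weight: OK
w_mkldnn = w_float.to_mkldnn()

w_double = torch.randn(8, 3, 3, 3, 3, dtype=torch.double)   # wrong dtype
try:
    w_double.to_mkldnn()
except RuntimeError as err:
    print("to_mkldnn rejected the non-float tensor:", err)
```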

A Collaborator replied:

Ok, please enable dilated convolution then. I agree that for correct user inputs this situation should not happen, but even in the case of incorrect user inputs (the user had a float weight when creating the module but is sending a double tensor, or a CUDA tensor, to forward), the error message should be clear and helpful, and I don't know what will happen in this case.

@XiaobingSuper (Collaborator, Author) replied:

The DNNL dilated conv is enabled now. Thanks!

  return (input.is_mkldnn()) || // input is mkldnn Tensor
         (input.options().backend() == at::Backend::CPU &&
          input.scalar_type() == kFloat && // only on CPU Float Tensors
          !is_dilated() && // doesn't support dilation
A Collaborator commented:

Should you also remove the is_dilated check from here, if dilation is actually supported by mkldnn?

@XiaobingSuper (Collaborator, Author) replied:

There is another PR to do that; see #40220.

    IntArrayRef dilation,
    int64_t groups) {

  auto stride_vec = expand_param_if_needed(stride, "stride", 3);
A Collaborator commented:

Out of curiosity, why do you need to expand stride etc. here, but don't need to do it for 2d conv? If it were not for these expansion calls, the reorder_weight functions would be exactly the same for 2d and 3d.

@XiaobingSuper (Collaborator, Author) replied:

Yes, we don't need to expand them; they have already been expanded at

kernel_size = _pair(kernel_size)
stride = _pair(stride)
padding = _pair(padding)
dilation = _pair(dilation)
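
(For reference, a quick runnable check of that expansion; the internal _pair/_triple helpers are used here purely for illustration.)

```python
# The Python nn modules expand scalar conv parameters to per-dimension tuples
# before they ever reach the C++ op, so the native code receives full-length lists.
import torch.nn as nn
from torch.nn.modules.utils import _pair, _triple

print(_pair(2), _triple(2))            # (2, 2) (2, 2, 2)

conv3d = nn.Conv3d(3, 8, kernel_size=3, stride=2, padding=1)
print(conv3d.stride, conv3d.padding)   # (2, 2, 2) (1, 1, 1)
```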

@ngimel (Collaborator) commented Jun 18, 2020

Currently, if I send a tensor of the wrong type (e.g. double) to mkldnn convolution, I get an error

RuntimeError: tensor.scalar_type() == ScalarType::Float INTERNAL ASSERT FAILED at "../aten/src/ATen/native/mkldnn/MKLDNNCommon.cpp":70, please report a bug to PyTorch. itensor_view_from_dense expects float tensor input

which is the wrong error type, thrown by TORCH_INTERNAL_ASSERT. Should be TORCH_CHECK instead.

@XiaobingSuper (Collaborator, Author)

> Currently, if I send a tensor of the wrong type (e.g. double) to mkldnn convolution, I get an error
>
> RuntimeError: tensor.scalar_type() == ScalarType::Float INTERNAL ASSERT FAILED at "../aten/src/ATen/native/mkldnn/MKLDNNCommon.cpp":70, please report a bug to PyTorch. itensor_view_from_dense expects float tensor input
>
> which is the wrong error type, thrown by TORCH_INTERNAL_ASSERT. Should be TORCH_CHECK instead.

Changed to TORCH_CHECK now.

@XiaobingSuper XiaobingSuper requested a review from ngimel June 19, 2020 04:33
xwang233 pushed a commit to xwang233/pytorch that referenced this pull request Jun 20, 2020
Summary:
- Bump DNNL to 1.5
- Bug fixes and improvements in ideep
  - suppress g++ Wreorder warning
  - avoid rebuilding `libmkldnn.so` uxlfoundation/oneDNN#743
  - enable conv3d (integration code was checked in by Xiaobing pytorch#35662)
Pull Request resolved: pytorch#40088

Differential Revision: D22071530

Pulled By: albanD

fbshipit-source-id: e7a53d7421e8a7a03e36a7dfb68edc565a2f00df
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

@facebook-github-bot (Contributor)

@VitalyFedyunin merged this pull request in 6ba807c.

@VitalyFedyunin (Contributor)

Hi @XiaobingSuper, we had to revert this stack (see the https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master logs); could you please create new PRs?

@facebook-github-bot facebook-github-bot deleted the gh/xiaobingsuper/9/head branch June 26, 2020 14:16

Labels: Merged, open source, triaged