Conversation

@MlWoo (Contributor) commented Jan 24, 2018

Refer to the Issue.
Vol2col and col2vol in Conv3D could not be parallelized with CPU backend. Actually THNN_(unfolded_copy_vol) function is OK to use OpenMP. But test case will fail when using OpenMP in THNN_(unfolded_acc_vol) function . The reason why the transforms could not be parallized with CPU backend is read-write confict with multi-thread. However, it will avert the problem if we adopt another coding strategy as that in GPU backend.

@soumith (Contributor) commented Jan 24, 2018

@fmassa can you review this tomorrow.

@MlWoo (Contributor, Author) commented Jan 25, 2018

I am sorry that I did not use git correctly; there must be a better way to keep all the commits in this PR. I have added #ifdef _OPENMP according to the suggestion provided by @soumith.

@fmassa (Member) left a comment

I think this looks good!

This implementation closely follows the CUDA implementation, so it is easier to review with that in mind, especially because there are a number of intermediate variables used to cache intermediate results whose meaning might be a bit obscure.

One thing: I think it might be better to replace all instances of long with a fixed-size type like int64_t. Those long variables were already there before this PR, so this can be changed at a later time, in which case we should open an issue to keep track of it.

@soumith Our speed regression tests do not check for correctness, right? This should not matter for this PR, but for #2764 the optimizations were not being tested for correctness because the tensor sizes used were small enough not to hit the OMP_THRESHOLD, so the old code path was used in the tests. It might be useful to have tests somewhere that exercise those code paths, but they might be slow and would thus need to run separately from the common tests?
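
For readers outside the thread, the gating being described follows roughly this pattern (the function name and OMP_THRESHOLD value below are made up for illustration): small inputs fall through to the serial path, so a test built only on small tensors never exercises the OpenMP branch.

```c
#define OMP_THRESHOLD 100000   /* illustrative value only, not the real constant */

/* hypothetical element-wise op showing the dispatch pattern */
void scale_inplace(float *data, long n, float alpha) {
#ifdef _OPENMP
  if (n > OMP_THRESHOLD) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
      data[i] *= alpha;
    return;
  }
#endif
  /* serial fallback: the only path exercised when n <= OMP_THRESHOLD */
  for (long i = 0; i < n; i++)
    data[i] *= alpha;
}
```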

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa It may be a good idea to set an environment variable OMP_THRESHOLD to control the code path, similar to how intrinsics are chosen for vectorizing operations.

@fmassa (Member) commented Jan 30, 2018

@MlWoo yes, we should not use long types anymore, but instead int64_t, as we are working towards having Windows support and long can be 32 bits.

About having an environment variable for OMP_THRESHOLD, yeah, that might be useful for testing purposes (and might make it easier to tune this parameter for specific machines).

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa
Is it OK to read the env variable in the init.c file and set a macro which can be accessed in every op file in the THNN module? Maybe it could be handled in a similar way in the TH module.

@fmassa (Member) commented Jan 30, 2018

@MlWoo that could be a possibility, but wouldn't it be overkill at the THNN level in its current state? Or do you have plans to tune other operations in THNN that would benefit from fine-grained control?

I was initially thinking about adding such an env var in TH to make the tests check the OMP path as well, without it being too slow because of large tensors.
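
A sketch of what such an override could look like (the function name and default value are hypothetical, not the actual TH code): read the variable once at init time and let it replace the compile-time default, so the test harness can set, say, OMP_THRESHOLD=1 and force small tensors down the parallel path.

```c
#include <stdlib.h>

static long th_omp_threshold = 100000;   /* assumed compile-time default */

/* hypothetical init hook, called once when the library is loaded */
void THSetOMPThresholdFromEnv(void) {
  const char *env = getenv("OMP_THRESHOLD");
  if (env != NULL) {
    long v = strtol(env, NULL, 10);
    if (v > 0)
      th_omp_threshold = v;   /* e.g. OMP_THRESHOLD=1 in the test environment */
  }
}
```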

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa The env variable at the THNN or TH level would be used to test both code paths, with and without OpenMP, not to tune the appropriate OpenMP threshold. If PyTorch wants to control OpenMP more finely, maybe it should provide a macro table to look up. Tuning the operations in the THNN and TH modules for different CPU platforms is a trivial job; we could design a script to help with that work if PyTorch thinks it's necessary.

@fmassa (Member) commented Jan 30, 2018

@MlWoo ok. I think it might be good to do something like that for TH, but on a separate PR.

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa That is a very good idea. We can start with some ops. Could you provide a table listing the common and typical ops? We will also consider some ops related to our own work first.

@MlWoo (Contributor, Author) commented Jan 31, 2018

It's very strange that the latest modification results in the dataloader test failure, since it only touches the intermediate computation of Conv3D on CPU. And it only fails the test on Linux with CUDA. I am afraid this may delay the PR.

@MlWoo (Contributor, Author) commented Jan 31, 2018

@fmassa I have compared the logs of this PR and another PR, #4953. The same error occurs in both PRs. My log failed and terminated, but the other one is OK. Could you help me check what is happening?

@MlWoo (Contributor, Author) commented Feb 1, 2018

@fmassa I suspect there have been some problems with the test platform over the past hours. The test results differ from run to run, but the code is the same.

@fmassa (Member) commented Feb 1, 2018

@MlWoo I believe the build was broken by #4943, but this should have been fixed by #4980

@MlWoo (Contributor, Author) commented Feb 2, 2018

I really want to say something about the test platform, but whatever, the code passes it now. @fmassa Could you review the code again? Especially unfolded_copy_vol; I have re-implemented it with a new method. I think conv2d, conv3d, and their variants will benefit even more from the new method. Maybe I should open a new issue.

@MlWoo (Contributor, Author) commented Feb 5, 2018

Could you spare some time to review the code? If it's OK, I could probably open a new issue on the other conv ops to improve their performance.

@fmassa (Member) commented Feb 5, 2018

@MlWoo I'll have another look at it today and I'll let you know

@fmassa (Member) left a comment

Thanks a lot for the PR @MlWoo!
I think this looks very good, I have no further comments.

And yes, I have the impression that conv2d could benefit from similar optimizations, and that would be a great addition! Just note though that conv2d dispatches to NNPack in some cases, which should be very efficient, both memory-wise as well as speed-wise.

Quick comment in case you decide to work on improving the other variants of conv3d:
Looking around, I realized that we unfortunately have a lot of code duplication in VolumetricFullDilatedConvolution and VolumetricConvolutionMM, and we have two versions of their unfold (or vol2col) implementations. The main difference seems to be that one of them does not store the finput for the whole batch, saving memory but losing speed, because we can't parallelize over the batch dimension in the same way. A similar thing happens in conv2d and conv2d_transpose, where we use two different implementations of im2col (or unfold).
It would be great at some point to remove the redundancy and use a single implementation, but that would require discussing the trade-offs between speed and memory.

@fmassa (Member) commented Feb 6, 2018

@soumith I think this can be merged

@MlWoo (Contributor, Author) commented Feb 6, 2018

@fmassa Thanks a lot for your praise. The trade-offs between speed and memory do need more discussion. Your point is very professional. Thank you.

@soumith merged commit 13ef843 into pytorch:master on Feb 6, 2018
@soumith (Contributor) commented Feb 6, 2018

Thanks @MlWoo!

@MlWoo deleted the dev-omp branch on February 7, 2018 at 00:23.