Conversation

@MlWoo (Contributor) commented Jan 24, 2018

Refer to the Issue.
Vol2col and col2vol in Conv3D could not be parallelized with CPU backend. Actually THNN_(unfolded_copy_vol) function is OK to use OpenMP. But test case will fail when using OpenMP in THNN_(unfolded_acc_vol) function . The reason why the transforms could not be parallized with CPU backend is read-write confict with multi-thread. However, it will avert the problem if we adopt another coding strategy as that in GPU backend.

@soumith (Contributor) commented Jan 24, 2018

@fmassa can you review this tomorrow.

@MlWoo (Contributor, Author) commented Jan 25, 2018

I am sorry that I did not use git correctly; there must be a better way to keep all the commits in this PR. I have added #ifdef _OPENMP according to the suggestion provided by @soumith.

@fmassa (Member) left a comment

I think this looks good!

This implementation closely follows the CUDA implementation, so it is easier to review with that in mind, especially because there are a number of intermediate variables used to cache intermediate results whose meaning might be a bit obscure.

One thing: I think it might be better to replace all instances of long with a fixed-size type like int64_t. Those long variables were already there before this PR, so this can be changed at a later time, in which case we should open an issue to keep track of it.

@soumith Our speed regression tests do not check for correctness, right? This should not matter for this PR, but for #2764 the optimizations were not being tested for correctness because the tensor sizes used were small enough not to hit the OMP_THRESHOLD, so the old code path was used in the tests. It might be useful to have tests somewhere that exercise those code paths, but they might be slow and would thus need to run separately from the common tests?
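
For readers outside the thread, the gating being described follows roughly this pattern (the function name and OMP_THRESHOLD value below are made up for illustration): small inputs fall through to the serial path, so a test built only on small tensors never exercises the OpenMP branch.

```c
#define OMP_THRESHOLD 100000   /* illustrative value only, not the real constant */

/* hypothetical element-wise op showing the dispatch pattern */
void scale_inplace(float *data, long n, float alpha) {
#ifdef _OPENMP
  if (n > OMP_THRESHOLD) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
      data[i] *= alpha;
    return;
  }
#endif
  /* serial fallback: the only path exercised when n <= OMP_THRESHOLD */
  for (long i = 0; i < n; i++)
    data[i] *= alpha;
}
```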

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa It may be a good idea to set an environment variable OMP_THRESHOLD to control the code path, similar to how intrinsics are chosen for vectorizing operations.

@fmassa (Member) commented Jan 30, 2018

@MlWoo yes, we should not use long types anymore, but instead int64_t, as we are working towards having Windows support and long can be 32 bits.

About having an environment variable for OMP_THRESHOLD, yeah, that might be useful for testing purposes (and might make it easier to tune this parameter for specific machines).

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa
Is it OK to read the env variable in the init.c file and set a macro which can be accessed in every op file in the THNN module? Maybe it could be handled in a similar way in the TH module.

@fmassa (Member) commented Jan 30, 2018

@MlWoo that could be a possibility, but wouldn't it be overkill at the THNN level in its current state? Or do you have plans to tune other operations in THNN that would benefit from fine-grained control?

I was initially thinking about adding such an env var in TH to make the tests check the OMP path as well, without it being too slow because of large tensors.
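
A sketch of what such an override could look like (the function name and default value are hypothetical, not the actual TH code): read the variable once at init time and let it replace the compile-time default, so the test harness can set, say, OMP_THRESHOLD=1 and force small tensors down the parallel path.

```c
#include <stdlib.h>

static long th_omp_threshold = 100000;   /* assumed compile-time default */

/* hypothetical init hook, called once when the library is loaded */
void THSetOMPThresholdFromEnv(void) {
  const char *env = getenv("OMP_THRESHOLD");
  if (env != NULL) {
    long v = strtol(env, NULL, 10);
    if (v > 0)
      th_omp_threshold = v;   /* e.g. OMP_THRESHOLD=1 in the test environment */
  }
}
```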

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa The env variable at the THNN or TH level would be used to test both code paths, with and without OpenMP, not to tune the appropriate OpenMP threshold. If PyTorch wants to control OpenMP more finely, maybe it should provide a macro table to look up. Tuning the operations in the THNN and TH modules for different CPU platforms is a trivial job; we could design a script to help with that work if PyTorch thinks it's necessary.

@fmassa (Member) commented Jan 30, 2018

@MlWoo ok. I think it might be good to do something like that for TH, but on a separate PR.

@MlWoo (Contributor, Author) commented Jan 30, 2018

@fmassa That is a very good idea. We can start with some ops. Could you provide a table listing the common and typical ops? We will also consider some ops related to our own work first.

@MlWoo (Contributor, Author) commented Jan 31, 2018

It's very strange that the latest modification results in the dataloader test failure, since it only touches the intermediate computation of Conv3D on CPU. And it only fails the test on Linux with CUDA. I am afraid this may delay the PR.

@MlWoo (Contributor, Author) commented Jan 31, 2018

@fmassa I have compared the logs of this PR and another PR, #4953. The same error occurs in both PRs. My log failed and terminated, but the other one is OK. Could you help me check what is happening?

@MlWoo (Contributor, Author) commented Feb 1, 2018

@fmassa I suspect there have been some problems with the test platform over the past hours. The test results differ from run to run, but the code is the same.

@fmassa (Member) commented Feb 1, 2018

@MlWoo I believe the build was broken by #4943, but this should have been fixed by #4980

@MlWoo (Contributor, Author) commented Feb 2, 2018

I really want to say something about the test platform, but whatever, the code passes it now. @fmassa Could you review the code again? Especially unfolded_copy_vol; I have re-implemented it with a new method. I think conv2d, conv3d, and their variants will benefit even more from the new method. Maybe I should open a new issue.

@MlWoo (Contributor, Author) commented Feb 5, 2018

Could you spare some time to review the code? If it's OK, I could probably open a new issue on the other conv ops to improve their performance.

@fmassa (Member) commented Feb 5, 2018

@MlWoo I'll have another look at it today and I'll let you know

@fmassa (Member) left a comment

Thanks a lot for the PR @MlWoo!
I think this looks very good, I have no further comments.

And yes, I have the impression that conv2d could benefit from similar optimizations, and that would be a great addition! Just note though that conv2d dispatches to NNPack in some cases, which should be very efficient, both memory-wise as well as speed-wise.

Quick comment in case you decide to work on improving the other variants of conv3d:
Looking around, I realized that we unfortunately have a lot of code duplication in VolumetricFullDilatedConvolution and VolumetricConvolutionMM, and we have two versions of their unfold (or vol2col) implementations. The main difference seems to be that one of them does not store the finput for the whole batch, saving memory but losing speed, because we can't parallelize over the batch dimension in the same way. A similar thing happens in conv2d and conv2d_transpose, where we use two different implementations of im2col (or unfold).
It would be great at some point to remove the redundancy and use a single implementation, but that would require discussing the trade-offs between speed and memory.

@fmassa (Member) commented Feb 6, 2018

@soumith I think this can be merged

@MlWoo (Contributor, Author) commented Feb 6, 2018

@fmassa Thanks a lot for your praise. The trade-offs between speed and memory do need more discussion. Your point is very professional. Thank you.

@soumith merged commit 13ef843 into pytorch:master on Feb 6, 2018
@soumith (Contributor) commented Feb 6, 2018

Thanks @MlWoo!

@MlWoo deleted the dev-omp branch on February 7, 2018 at 00:23.