parallelize vol2col and col2vol of Conv3D with CPU backend #4824
Conversation
@fmassa can you review this tomorrow?
I am sorry that I did not use git correctly; there must be a better way to keep all the commits in this PR. I have added […]
fmassa left a comment:
I think this looks good!
This implementation closely follows the CUDA implementation, so it is easier to review with that in mind, especially because a number of intermediate variables are used to cache intermediate results and their meaning might otherwise be a bit obscure.
One thing is that I think it might be better to replace all instances of `long` with a fixed-size type like `int64_t`. Those `long` variables were already there before this PR, so this can be changed at a later time; in that case we should open an issue to keep track of it.
@soumith Our speed regression tests do not check for correctness, right? This should not matter for this PR, but for #2764 the optimizations were not being tested for correctness, because the tensor sizes used were small enough not to hit the OMP_THRESHOLD, so the old code-path was used in the tests. It might be useful to have tests somewhere that exercise those codepaths, but they might be slow and thus should run separately from the common tests.
@fmassa It might be a good idea to set an environment variable for […]
@MlWoo yes, we should not use […]. About having an environment variable for […]
@MlWoo that could be a possibility, but wouldn't it be overkill at the […]? I was initially thinking about adding such an env var in […]
@fmassa The env variable in the […]
@MlWoo ok. I think it might be good to do something like that for […]
@fmassa That is a very good idea. We can start with some OPs. Could you provide a table listing the common and typical OPs? We would also first consider the OPs related to our own work.
It's very strange that the latest modification makes the dataloader test fail, since it only involves the intermediate computation of Conv3D on the CPU. And it only fails on Linux with CUDA. I am afraid this may delay the PR.
@fmassa I suspect there have been problems with the test platform over the past hours: the same code produced different test results at different times.
I really want to say something about the test platform, but in any case, the code passes it now. @fmassa Could you review the code again? Especially the […]
Could you spare some time to review the code? If it's OK, I could open a new issue on the other convolutions to improve their performance as well.
@MlWoo I'll have another look at it today and I'll let you know.
fmassa left a comment:
Thanks a lot for the PR @MlWoo!
I think this looks very good, I have no further comments.
And yes, I have the impression that conv2d could benefit from similar optimizations, and that would be a great addition! Just note though that conv2d dispatches to NNPack in some cases, which should be very efficient, both memory-wise as well as speed-wise.
A quick comment in case you decide to work on improving the other variants of conv3d:
Looking around, I realized that we unfortunately have a lot of code duplication between VolumetricFullDilatedConvolution and VolumetricConvolutionMM, and we have two versions of their unfold (or vol2col) implementations. The main difference seems to be that one of them does not store the `finput` for the whole batch, saving memory but losing speed, because we then can't parallelize over the batch dimension in the same way. A similar thing happens in conv2d and conv2d_transpose, where we use two different implementations of im2col (or unfold).
It would be great at some point to remove the redundancy and use a single implementation, but that would require discussing the trade-offs between speed/memory.
@soumith I think this can be merged.
@fmassa Thanks a lot for the kind words. The speed/memory trade-offs indeed need more discussion. Thank you.
Thanks @MlWoo!
Refer to the Issue.
Vol2col and col2vol in Conv3D could not be parallelized with the CPU backend. Actually, the `THNN_(unfolded_copy_vol)` function is fine to parallelize with OpenMP, but the test case fails when OpenMP is used in `THNN_(unfolded_acc_vol)`. The reason the transform could not be parallelized on the CPU backend is a read-write conflict between threads. However, the problem is averted if we adopt the same coding strategy as the GPU backend.