Include support for the scatter gather cuda kernels to allow for comp… #124809
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124809
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 6bdcc8d with merge base b96b1e8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@janeyx99 Please let me know if I need to do anything else.
Requesting review from @mikaylagawarecki, who's worked with scatter_gather, and also from @ptrblck regarding the cuda side.
@mikaylagawarecki @eqy I see that the pipeline is failing because of a linting issue. I used lintrunner; I do not believe I touched this line myself. https://github.com/pytorch/pytorch/actions/runs/8824920734/job/24228735042#step:11:244 I can change it if need be.
@eqy Anything that needs to be done on my end?
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Hi folks, since I'm new to PyTorch, I'm curious what the procedure is now. Will this be reviewed by a core maintainer/contributor by way of triage? Do I need to do anything on my end?
@pytorchbot merge
I'm going to revert this, as multiple internal builds failed. @mikaylagawarecki, see D56861849 if you want to help re-land this.
@pytorchbot revert -m "breaking internal builds" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
…for comp… (#124809)" This reverts commit 9e24c26. Reverted #124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](#124809 (comment)))
@ZelboK your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
@mikaylagawarecki Really sorry for the trouble :(
Is there a way for me to replicate one of these environments? Hopefully a devcontainer? I thought there was a workflow for internal builds that needed to pass before merging.
pytorch#124809) Fixes pytorch#121965 This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: pytorch#124809 Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
…for comp… (#124809)" This reverts commit e09f98c. Reverted #124809 on behalf of https://github.com/clee2000 because the Windows build failure is real; https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line. Ignore the statement saying the build passed, the batch script's error codes aren't propagating again ([comment](#124809 (comment)))
#124809) Fixes #121965 This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: #124809 Approved by: https://github.com/mikaylagawarecki
GPU_ATOMIC_INTEGER(Mul, a * b, int32_t)
GPU_ATOMIC_INTEGER(Mul, a * b, int64_t)

inline __device__ c10::complex<float> gpuAtomicMul(c10::complex<float> *address, c10::complex<float> val){
The build failure is totally not your fault, as it can't be seen from external CI; we only see it when internal workflows run after the PR is merged.
Looking at the failure and pattern matching a bit, it looks like maybe we need __host__ __device__ here as well as for complex_max on line 433.
Does this change make sense? I can import the PR and see whether this fixes the internal build tomorrow morning.
Unfortunately, since complex_min and complex_max both use CUDA intrinsics, they won't compile if you also make them __host__ functions. The use of __fsqrt_rn, for example, should lead to more performant code/better CUDA assembly. CUDA intrinsics should be taken advantage of, imo, because this is kernel code and complex-number arithmetic is heavier in general.
The easiest solution would be to add an overload for complex, when compiled with CUDA, so that an operator*= is available as __host__ __device__.
Just adding this
#if defined(__CUDACC__) || defined(__HIPCC__)
template <typename U>
C10_HOST_DEVICE constexpr complex<T>& operator*=(const complex<U>& rhs) {
// (a + bi) * (c + di) = (a*c - b*d) + (a * d + b * c) i
T a = real_;
T b = imag_;
U c = rhs.real();
U d = rhs.imag();
real_ = a * c - b * d;
imag_ = a * d + b * c;
return *this;
}
#endif
in complex.h should fix this problem.
Also, on second look, I made an oversight in the complex_max and complex_min functions. They should be using regular comparisons and not std::max, given they are __device__ functions. So on that note, it's actually good that this PR got reverted! I will push those changes and things should build on your end.
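For illustration, a rough sketch of what that revised helper could look like; this is purely an assumption based on the snippet under review (the names complex_max, a_magnitude, and b_magnitude come from that snippet), not the exact code that ends up being pushed:

__device__ inline c10::complex<float> complex_max(c10::complex<float> a, c10::complex<float> b) {
  // Compare by modulus, computed with CUDA intrinsics since this is device-only code.
  float a_magnitude = __fsqrt_rn(__fmul_rn(a.real(), a.real()) + __fmul_rn(a.imag(), a.imag()));
  float b_magnitude = __fsqrt_rn(__fmul_rn(b.real(), b.real()) + __fmul_rn(b.imag(), b.imag()));
  // Plain comparison instead of std::max, as discussed above.
  return (a_magnitude > b_magnitude) ? a : b;
}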
@mikaylagawarecki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@ZelboK I think the *= update in complex.h you mentioned is still needed, as the internal build still hits errors of that kind. Also, you seem to have accidentally committed ideep and TensorBase.cpp in this PR 😅 I can import this again to check after you make these fixes, so do let me know when! fyi: in case you don't get a response from me, I wanted to let you know that I will be out this coming week but will be back on Monday (5/13).
😂 My bad, didn't mean to commit that. All good, thanks for the prompt assistance! I enjoyed working through this, build failures and all. I appreciate the helpfulness from the team ❤️ I'll push.
    old = atomicCAS(addr_as_ull, assumed, new_val);
  } while (assumed != old);

  return *reinterpret_cast<c10::complex<float>*>(&addr_as_ull);
This isn't atomic? You need to return csum directly, otherwise the value at addr_as_ull may change underneath you.
In fact this is also wrong as atomic read-modify-write ops return the old value, not the new value. So this should be bit-casting assumed.
> In fact this is also wrong as atomic read-modify-write ops return the old value, not the new value. So this should be bit-casting assumed.

Sorry for the oversight. Could you help me understand? I know that atomicCAS returns the old value, but what exactly are you referring to there? I understand that addr_as_ull shouldn't be returned, as another thread can change it, correct? But why should we use assumed and not csum?
assumed is the value before performing the update, which is what is returned by normal atomicAdd, atomicMax, etc.
See the CAS implementation for half as an example (pytorch/aten/src/ATen/cuda/Atomic.cuh, lines 33 to 34 in 6bdcc8d):

hsum.x = (size_t)address & 2 ? (old >> 16) : (old & 0xffff);
return hsum;
Oh, you're right. I got tunnel-visioned on the line with the actual call to atomicCAS; yes, it should be assumed. I forgot that I am actually implementing an atomic operation here and that it should follow suit lol
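Putting the two points from this review thread together, a corrected loop might look roughly like the sketch below. This is only a hedged illustration (variable names approximate the snippet under review, and the code that eventually gets pushed may differ): the new value is recomputed from assumed on each iteration, and the pre-update value is what gets bit-cast and returned.

inline __device__ c10::complex<float> gpuAtomicMul(c10::complex<float>* address, c10::complex<float> val) {
  unsigned long long int* addr_as_ull = reinterpret_cast<unsigned long long int*>(address);
  unsigned long long int old = *addr_as_ull;
  unsigned long long int assumed;
  do {
    assumed = old;
    // Recompute the product from the value observed in this iteration.
    c10::complex<float> csum = *reinterpret_cast<c10::complex<float>*>(&assumed);
    csum *= val;  // relies on the __host__ __device__ operator*= discussed earlier
    old = atomicCAS(addr_as_ull, assumed, *reinterpret_cast<unsigned long long int*>(&csum));
  } while (assumed != old);
  // Bit-cast the pre-update value, matching what atomicAdd/atomicMax return.
  return *reinterpret_cast<c10::complex<float>*>(&assumed);
}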
      __fmul_rn(b.imag(), b.imag())
    )
  );
  return (a_magnitude > b_magnitude) ? a : b;
Is there any precedent for this definition of complex max/min in PyTorch?
I am not experienced enough with PyTorch to answer that. Aside from using magnitudes, how else would you order them? I followed convention from other ecosystems, and from my research this is how it is done across different disciplines/domains, is it not?
Exactly, they cannot be ordered (in mathematical terms, complex numbers are not an ordered field).
We should error in these cases, the same as we error when we call max on a complex tensor. If people want to use these ops on complex tensors, they can do a view_as_real and perform some transformations on the output to define the order they want.
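For example, a minimal sketch of that approach using the C++ frontend (purely illustrative; the tensor shape and the choice of squared modulus as the ordering are assumptions, not prescribed API behaviour):

#include <torch/torch.h>
#include <iostream>

int main() {
  auto z = torch::randn({5}, torch::kComplexFloat);

  // torch::max(z) would error on a complex tensor; instead, define the
  // ordering explicitly on top of the existing API, e.g. by modulus.
  auto mag = torch::view_as_real(z).pow(2).sum(-1);  // squared modulus per element
  auto zmax = z.index({mag.argmax()});

  std::cout << zmax << std::endl;
  return 0;
}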
I was under the impression that some contexts use magnitude for ordering complex numbers, like spectral analysis in DSP. I also took motivation from https://www.mathworks.com/help/matlab/ref/max.html.
Could you comment on whether you had a use case for scattering complex numbers? What kind of work were you trying to do? Would you know if ordering of complex numbers is practically useful?
It may be implemented and it may be useful, but we don't implement that in PyTorch at a kernel level. As mentioned above, all these orderings can often be simulated with the current API and a bit of imagination :)
I see - in that case I'll wait until @mikaylagawarecki has a chance to review again. Thanks for taking a look!
@ZelboK I am working on tensor networks, which require a series of matrix multiplications. With complex elements, torch scatter cannot be used on the GPU. As far as I can tell right now, we do not use sorting here, just element-wise multiplication.
By the way, we have worked out a way around it: we transform the complex numbers into Euler form, which turns the multiplication into an addition of angles and a multiplication of magnitudes.
Your excellent work has gone beyond my knowledge base, so I cannot give any more advice. But thanks!
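To make that workaround concrete, here is a small hedged sketch (standalone C++ with arbitrary values, not code from this PR) of the Euler-form idea: a product of complex numbers becomes a sum of angles plus a sum of log-magnitudes, both of which map onto scatter-add instead of a scatter-multiply kernel.

#include <cmath>
#include <complex>
#include <cstdio>

int main() {
  std::complex<float> z1(1.0f, 2.0f), z2(0.5f, -1.5f);

  // Euler form: z = r * exp(i*theta), so z1*z2 has magnitude r1*r2
  // (a sum of logs) and angle theta1 + theta2 (a plain sum).
  float log_mag = std::log(std::abs(z1)) + std::log(std::abs(z2));
  float angle = std::arg(z1) + std::arg(z2);

  std::complex<float> via_euler = std::polar(std::exp(log_mag), angle);
  std::complex<float> direct = z1 * z2;

  std::printf("euler:  %f %+fi\ndirect: %f %+fi\n",
              via_euler.real(), via_euler.imag(), direct.real(), direct.imag());
  return 0;
}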
> I am working on tensor networks ... we transform the complex numbers into Euler form, which turns the multiplication into an addition of angles and a multiplication of magnitudes.
Thanks a lot for responding, I was genuinely curious. This helps give me perspective :)
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using PyTorch.
2) This is my first contribution to PyTorch.

Environment: 3080 & WSL 2. `nvcc` is at 12.4.

cc @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal