Use fast integer division algorithm to avoid division ops inside kernels. #5054
Conversation
- OffsetInfo and OffsetIterator pre-compute the necessary coordinate
  change along each dimension, so that each successive offset can be
  computed using only additions/subtractions/comparisons.
- Added IntDivider, which supports "magic division" for uint32_t, thus
  eliminating integer divisions altogether for offset calculation, as
  long as indices fit in 32 bits.
- In code paths with statically determined dimensions (Dims=1 or 2),
  kernel arguments now contain only the necessary data (instead of
  MAX_CUTORCH_DIMS of everything).
- Fixed index overflow errors: for tensors with >= 2G elements, we used
  to have incorrect results or an infinite loop inside the kernel.

TODO: The following pattern is broken for tensors with >= 2G elements.
It will result in overflow even if IndexType is uint64_t, because the
index arithmetic is evaluated in 32-bit unsigned math before it is
widened. Need to search and replace these.

```cuda
for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
     linearIndex < totalElements;
     linearIndex += gridDim.x * blockDim.x) {
```
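One way to make the pattern overflow-safe (a sketch, not part of this PR; the helper names are hypothetical) is to force the arithmetic into IndexType up front, since `blockIdx.x * blockDim.x` is otherwise evaluated in 32-bit unsigned math:

```cuda
// Sketch: cast before multiplying, so a 64-bit IndexType actually protects
// the multiply instead of widening an already-wrapped 32-bit result.
template <typename IndexType>
__device__ __forceinline__ IndexType firstLinearIndex() {
  return (IndexType) blockIdx.x * blockDim.x + threadIdx.x;
}

template <typename IndexType>
__device__ __forceinline__ IndexType gridStride() {
  return (IndexType) gridDim.x * blockDim.x;
}

// Usage in a kernel body:
//   for (IndexType i = firstLinearIndex<IndexType>();
//        i < totalElements; i += gridStride<IndexType>()) { ... }
```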
This PR improves some float operations by ~20% (and some operations on ByteTensor by up to ~45%), but in general the performance impact seems small, unless one uses a lot of non-contiguous tensors and/or broadcasting with large dimensions. On a GTX 1080 I could get ~20% improvement in at least one case. I also found at least one case where it becomes slower by ~5%, but such cases seem to be rare, so I still think it's a net performance win on average, albeit a small one. Raw benchmark results are at https://github.com/yongjik/pt_test/tree/master/results/offset in case anybody's interested.
@pytorchbot add to whitelist
wickedfoo left a comment:
I'm not convinced that the code as you have it results in a performance win, and it makes the code a lot more complicated. 7.6 us to 6.0 us is within the realm of noise, and such changes are sensitive to heuristics used in the register allocator and in other places.
Replacing the linear index with a per-dimension index will bloat out the register count, and the code within the new iteration stuff looks like it has divergent/predicated execution paths as well.
However, I do believe that constant integer division via multiplication/shift by magic constants is worth trying. Your magic-number division algorithm can be simplified by restricting its usage to divisors in the range 2 to the max signed int (see comments).
Can you do a more minimal diff that keeps the old kernel structure and the linear index -> offset lookup, but tries the faster version of the magic-constant division algorithm, with a fallback to normal integer div/mod if the divisor falls outside the range under consideration?
For performance testing, I would concentrate on sufficiently large tensor sizes, say a transposed multi-hundred-MB tensor on which you perform pointwise operations. A kernel that executes in just a few microseconds is, I think, likely to fall within the margin of noise.
Also, I would inspect the SASS to see what instructions it was emitting before for integer div/mod (I believe it tries to map it to a floating-point inverse, from what I recall when looking a long time ago), and see what instructions it actually issues for __umulhi as well.
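For anyone following along, one way to do that inspection is to compile a toy kernel and disassemble it with `cuobjdump -sass`; a minimal sketch (file name, arch flag, and kernel bodies are made up for illustration):

```cuda
// div_test.cu -- compare the SASS emitted for runtime division vs. __umulhi.
// Build and disassemble with something like:
//   nvcc -arch=sm_61 -cubin div_test.cu -o div_test.cubin
//   cuobjdump -sass div_test.cubin
__global__ void plain_div(unsigned int* out, const unsigned int* in,
                          unsigned int d) {
  // Runtime division: expect a longish multiply-based instruction sequence,
  // since the GPU has no 32-bit integer divide instruction.
  out[threadIdx.x] = in[threadIdx.x] / d;
}

__global__ void magic_div(unsigned int* out, const unsigned int* in,
                          unsigned int m1, unsigned int shift) {
  // Magic-number path: __umulhi should lower to a short high-multiply
  // sequence plus the add and shift.
  unsigned int n = in[threadIdx.x];
  out[threadIdx.x] = (__umulhi(n, m1) + n) >> shift;
}
```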
aten/src/THC/THCIntegerDivider.cuh (outdated):

```cuda
#ifdef __CUDA_ARCH__
  // 't' is the higher 32-bits of unsigned 32-bit multiplication of 'n' and
  // 'm1'.
  unsigned int t = __umulhi(n, m1);
  unsigned int t2 = t + ((n - t) >> s1);
```
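For context, here is a self-contained sketch of the unsigned magic-division scheme these snippets belong to (Hacker's Delight-style; a simplified reconstruction, not the exact file contents). The host precomputes a (magic, shift) pair; the device then divides with one high multiply, an add, and a shift. Restricting both divisor and dividend to at most INT32_MAX keeps the add from wrapping, which is what the later canUse32BitIndexMath change buys:

```cuda
#include <cassert>
#include <cstdint>

struct IntDivider {
  // Host-side precomputation: find (m1, shift) such that
  //   n / d == (umulhi(n, m1) + n) >> shift  for all 0 <= n <= INT32_MAX.
  IntDivider() = default;
  explicit IntDivider(uint32_t d) : divisor(d) {
    assert(d >= 1 && d <= INT32_MAX);
    for (shift = 0; shift < 32; ++shift)  // smallest shift with 2^shift >= d
      if ((1ULL << shift) >= d) break;
    uint64_t one = 1;
    uint64_t magic = ((one << 32) * ((one << shift) - d)) / d + 1;
    m1 = (uint32_t) magic;
    assert(m1 == magic);                  // magic must fit in 32 bits
  }

  __host__ __device__ uint32_t div(uint32_t n) const {
#ifdef __CUDA_ARCH__
    uint32_t t = __umulhi(n, m1);         // high 32 bits of n * m1
#else
    uint32_t t = (uint32_t)(((uint64_t) n * m1) >> 32);
#endif
    return (t + n) >> shift;              // no wrap while n <= INT32_MAX
  }

  struct DivMod { uint32_t div, mod; };

  __host__ __device__ DivMod divmod(uint32_t n) const {
    uint32_t q = div(n);
    return { q, n - q * divisor };        // remainder: one multiply-subtract
  }

  uint32_t divisor;  // d
  uint32_t m1;       // magic multiplier
  uint32_t shift;    // ceil(log2(d))
};
```

For example, d = 3 yields m1 = 0x55555556 and shift = 2, so div(9) computes (__umulhi(9, m1) + 9) >> 2 = (3 + 9) >> 2 = 3.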
aten/src/THC/THCOffsetInfo.cuh (outdated):

```cuda
{
  bool carry = false;

  for (int i = dims - 1; i > 0; --i) {
    IndexType index = indices[i] + increments[i] + (IndexType) carry;
```
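This iterator-style increment was dropped in the later revision, but for context, here is a sketch of what the quoted loop appears to be doing (a hypothetical host-side reconstruction): mixed-radix addition, treating each dimension as a digit with radix sizes[i], and assuming increments[] is the per-dimension decomposition of the linear step, so that increments[i] < sizes[i]:

```cuda
#include <cstdint>

// Advance a multi-dimensional coordinate by a fixed per-dimension increment
// using only adds and compares, like grade-school addition with a carry.
template <int Dims>
void advance(uint32_t (&indices)[Dims],
             const uint32_t (&increments)[Dims],  // assumed < sizes[i]
             const uint32_t (&sizes)[Dims]) {
  bool carry = false;
  for (int i = Dims - 1; i > 0; --i) {
    uint32_t index = indices[i] + increments[i] + (uint32_t) carry;
    carry = index >= sizes[i];       // did this digit overflow its radix?
    if (carry) index -= sizes[i];    // at most one subtraction is needed
    indices[i] = index;
  }
  indices[0] += increments[0] + (uint32_t) carry;  // outermost digit
}
```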
aten/src/THC/THCApply.cuh (outdated):

```cuda
          typename IndexType,
          int ADims>
#if __CUDA_ARCH__ >= 350
__launch_bounds__(THC_APPLY_THREADS_PER_BLOCK, THC_APPLY_BLOCKS_PER_SM)
```
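As a reminder of what this qualifier does: `__launch_bounds__` tells the compiler the maximum block size and a desired minimum number of resident blocks per SM, which caps the registers the allocator may spend per thread; that budget is exactly what the register-pressure concern above trades against. A standalone sketch with illustrative constants (not the THC values):

```cuda
#define EXAMPLE_THREADS_PER_BLOCK 256  // max threads per block
#define EXAMPLE_BLOCKS_PER_SM 4        // desired min resident blocks per SM

__global__ void
__launch_bounds__(EXAMPLE_THREADS_PER_BLOCK, EXAMPLE_BLOCKS_PER_SM)
scaleKernel(float* out, const float* in, unsigned int n) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}
```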
aten/src/THC/THCOffsetInfo.cuh (outdated):

```cuda
IndexType next = index + step;

// The second condition is necessary to handle overflow (e.g., when step is
// 2GB and limit is 3GB, assuming 32-bit index).
```
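To make the hazard concrete (a standalone sketch, not the PR's code): with a 32-bit index, `index + step` can wrap modulo 2^32 and land below `limit` even though the true sum is past it. The wrap is cheap to detect, because an unsigned addition overflowed exactly when the result is smaller than either operand:

```cuda
#include <cstdint>

// E.g., index = 2.5G, step = 2G, limit = 3G: index + step wraps to ~0.5G,
// which would pass a naive `next < limit` check.
static inline bool inBounds(uint32_t index, uint32_t step, uint32_t limit) {
  uint32_t next = index + step;   // may wrap modulo 2^32
  bool wrapped = next < index;    // true iff the addition overflowed
  return !wrapped && next < limit;
}
```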
```cuda
                const OffsetInfo<Tb, IndexType, BDims> b,
                IndexType totalElements,
                Op op) {
  for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
```
Hi @wickedfoo, thanks for the detailed review, and I understand your point that the code is too complicated for the (rather unimpressive) speedup. I'll try just using the constant division algorithm and get back to you. Might take a few days. On the other hand, I do think there's a measurable speedup for some cases. Ironically, in one case I found, using an even larger tensor doesn't show a larger speedup, because then (I suppose) memory bandwidth dominates everything.
@yongjik also take a look at https://github.com/milakov/int_fastdiv
Also the reason that IndexType was
I think the integer-division-by-magic-constants code in the Caffe2 source will be faster than int_fastdiv if you exclude the -1 / 1 case. They're basically the same code, more or less, except you avoid this additional work: https://github.com/milakov/int_fastdiv/blob/master/int_fastdiv.h#L126
@yongjik I suffered a lot tuning the
- Also changed canUse32BitIndexMath so that the max index for 32-bit math is INT32_MAX, instead of UINT32_MAX. This also simplifies the division operation.
Hi @wickedfoo, I updated the code to remove the increment stuff and only leave the int division algorithm. Could you take another look? Regarding signed vs. unsigned integers, I think the point is moot: in the references I found, the fast division algorithm for signed integers always needs more operations than the unsigned version, so signed dividers wouldn't give us any benefit here.
Hi guys, any thoughts on this PR?
Hi @wickedfoo, could you give your opinion? If this PR still looks like too much complication, I understand if you don't want to merge this, but I'd appreciate a decision rather than this PR staying in limbo forever. Thanks!
@yongjik I think he does not get GitHub notification emails. I will ping him directly. Sorry for the delay.
Looking now.
wickedfoo left a comment:
Looks good to me. Any idea what the performance change of this is (i.e., is it worth it, and for what sizes)?
aten/src/THC/THCOffsetInfo.cuh (outdated):

```cuda
__host__ __device__ T* get(IndexType linearIndex) const {
  IndexType offset = 0;

  for (int i = tinfo.dims - 1; i > 0; --i) {
```
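For context, the loop presumably continues along these lines (a sketch reconstructing the idea, reusing the IntDivider sketch from earlier; the names are illustrative, not the THC definitions): peel off one coordinate per dimension with the divider's combined div/mod instead of `/` and `%`, accumulating each coordinate times its stride.

```cuda
// Linear index -> element pointer with no hardware division in the loop.
template <typename T, int Dims>
struct OffsetInfoSketch {
  T* data;
  IntDivider sizes[Dims];   // one precomputed magic divider per dimension
  uint32_t strides[Dims];

  __host__ __device__ T* get(uint32_t linearIndex) const {
    uint32_t offset = 0;
    // Innermost dimension first; each round costs one magic division.
    for (int i = Dims - 1; i > 0; --i) {
      auto qr = sizes[i].divmod(linearIndex);
      linearIndex = qr.div;            // index remaining for outer dims
      offset += qr.mod * strides[i];   // this coordinate times its stride
    }
    return data + offset + linearIndex * strides[0];
  }
};
```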
aten/src/THC/THCOffsetInfo.cuh (outdated, new file):

```cuda
#ifndef THC_OFFSET_INFO_INC
```
Hi @wickedfoo, thanks for the review. I ran several hundred configurations of tensor operations (on GTX 1080 / CUDA 9.1). We also see speedups for float operations, though not as dramatic as the biggest win: for some float operations I observed a speedup of ~25% for mid-size tensors (around 1000x128), but it becomes smaller as tensors get bigger (~9% for 1024x1024, ~3% for 8000x3000), probably because memory latency dominates everything for these tensors.
I don't know if I'm doing it right, but I followed the advice of the failed test log and ran the perf test on the clean branch (2726550, three runs) and then with this PR on top of it. So I think there's no meaningful difference on GTX 1080, but other GPUs might report different numbers, I guess.
The GPU perf tests have been flaky recently, so you should ignore them for the purposes of assessing this PR.
Thanks @yongjik. Sorry for the delay in review.
No worries! Half of the delay was mine, after all. Thanks for the review.