Add GPU kernel for tf.bincount
#13813
Conversation
Can one of the admins verify this patch?
Note that in the case of weights being passed to bincount, it is equivalent to unsorted_segment_sum. I would prefer a solution that just replaces bincount with a call to unsorted_segment_sum in this case (ideally in the graph after the output size is determined). For the case where no weights are given and they are assumed to be one, then it would be best to have the implementation call CUB as it will be significantly faster.
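The equivalence described here can be sketched with NumPy stand-ins (illustrative only — `np.add.at` plays the role of `unsorted_segment_sum`, not the actual TF kernels):

```python
import numpy as np

# Weighted bincount: out[v] accumulates weights[i] for every i with arr[i] == v.
arr = np.array([0, 2, 2, 1])
weights = np.array([0.5, 1.0, 2.0, 3.0])
bc = np.bincount(arr, weights=weights)

# The same result as an "unsorted segment sum": scatter-add each weight
# into the segment (bin) named by arr[i].
seg = np.zeros(arr.max() + 1)
np.add.at(seg, arr, weights)

assert np.allclose(bc, seg)  # both give [0.5, 3.0, 3.0]
```

With weights of all ones, both reduce to a plain histogram, which is the case a CUB-based kernel can handle directly.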
@ringw can you take a look at this? Let me know if you want me to find someone else for the GPU kernel code review.
Missed @ekelsen's comment, probably should do that :)
@yongtang thanks for taking this on! CUB for the weights == 1 case will be great. Avoiding code duplication by calling the GPU kernel that already exists for unsorted_segment_sum in the case where weights != None will also be very helpful!
Force-pushed from 5681489 to 2370aee
@ekelsen @vrv The PR has been updated with As
ekelsen left a comment:
Thanks for these changes, they look good. I think we're almost there. Thanks for the help!
```cpp
BincountOp<GPUDevice, type>)
TF_CALL_NUMBER_TYPES(REGISTER);
TF_CALL_float(REGISTER_KERNELS);
```
I think bincount is most useful for integer types, you should make sure to also add these (look at what TF_CALL_NUMBER_TYPES does, I believe there is a macro for only integer types)
Thanks. Done. Though on my dev machine it seems only int32 and float work. Might be because sm = 20? I added int32 to the list. Don't know if additional registration works with a different dev machine or not.
```cpp
#define REGISTER_GPU_SPEC(type) \
  template struct functor::BincountFunctor<GPUDevice, type>;

TF_CALL_float(REGISTER_GPU_SPEC)
```
again, I think the most useful types will be integer ones, not float. I think not working on half or double is not a problem.
Done.
```python
self.assertAllEqual(math_ops.bincount([1], maxlength=3).eval(), [0, 1])
self.assertAllEqual(math_ops.bincount([], maxlength=3).eval(), [])

def test_random_with_weights(self):
```
Can you also add a test_random_without_weights ? I would feel better knowing some more sizes and types have been tested.
Thanks @ekelsen. test_random_without_weights has been added.
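What a test like this checks can be sketched with a NumPy reference implementation (illustrative stand-in for the TF op, not the PR's actual test code):

```python
import numpy as np

# Without weights, bincount should count occurrences: each element
# contributes an implicit weight of 1 to its bin.
def bincount_ref(arr, size):
    out = np.zeros(size, dtype=np.int64)
    np.add.at(out, arr, 1)
    return out

# Exercise several sizes and integer dtypes, as the review asks.
rng = np.random.RandomState(42)
for num_samples in [10, 100, 1000]:
    for dtype in [np.int32, np.int64]:
        arr = rng.randint(0, 1000, num_samples).astype(dtype)
        assert np.array_equal(bincount_ref(arr, 1000),
                              np.bincount(arr, minlength=1000))
```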
```cpp
auto output = output_t->flat<T>();
OP_REQUIRES_OK(
    ctx, functor::BincountFunctor<Device, T>::Compute(ctx, arr, output));
} else {
```
Please handle the case of weights being None in the graph (ie edit the python code for the bincount op) by calling unsorted_segment_sum there after reduce_max. The C++ code for bincount only needs to handle the case where the weights are implicitly one.
It's better to not have high level ops alias to the same underlying calculation; it only makes analyzing the graph harder.
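A hedged sketch of the graph-level shape being requested, with NumPy stand-ins for the TF ops (a max-reduction to size the output, and `np.add.at` standing in for `unsorted_segment_sum`; the function name is illustrative):

```python
import numpy as np

def bincount_sketch(arr, weights=None, minlength=0):
    # A reduce_max over arr determines the output size before dispatching.
    size = max(int(arr.max()) + 1 if arr.size else 0, int(minlength))
    if weights is None:
        # Only this case would reach the C++ bincount kernel (implicit weight 1).
        out = np.zeros(size, dtype=np.int64)
        np.add.at(out, arr, 1)
    else:
        # Weighted case routed through a segment sum instead of the kernel.
        out = np.zeros(size, dtype=weights.dtype)
        np.add.at(out, arr, weights)
    return out
```

The point of the split is that the graph stays analyzable: the weighted path is visibly an unsorted segment sum, rather than a second op aliasing the same computation.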
Thanks @ekelsen. That part has been moved to python side.
Sorry for the delay! Looks good pending @ekelsen's comments.
Force-pushed from a1efcaa to 8c0c7b5
```cpp
OP_REQUIRES(
    ctx, (weights_t.NumElements() == 0),
    errors::InvalidArgument("Weights should not be passed as it should be "
```
I'm sorry, I should've said this earlier, the CPU implementation needs to retain the ability to handle weights != None. This is because of backwards compatibility - graphs that were created must still run correctly in a new version of TF. (This is only true for the CPU version, as the GPU version didn't exist).
So basically, the CPU code needs to continue to be able to handle weights, but the graph code should stay as it is so that all future graphs use unsorted_segment_sum instead. Once six months have passed, we can remove the CPU code, although more likely, it will simply stay around forever.
Thanks @ekelsen. The CPU implementation has been added back for weight != None.
```python
with self.test_session(use_gpu=True):
  np.random.seed(42)
  for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
    arr = np.random.randint(0, 1000, num_samples)
```
Right now dtype is unused, I think you want to switch the numpy function depending on the type (and only test int32 and float32 if using a gpu for the test).
Thanks @ekelsen. The tests have been updated.
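One way to act on this review note (a sketch, not the PR's actual test code): pick the NumPy generator by dtype, drawing integers with `randint` and floats with `random_sample`.

```python
import numpy as np

# Switch the numpy function on the dtype, as the review suggests:
# integer dtypes use randint, float dtypes use uniform random values.
def make_values(dtype, num_samples, rng):
    if np.issubdtype(dtype, np.integer):
        return rng.randint(0, 1000, num_samples).astype(dtype)
    return (rng.random_sample(num_samples) * 1000).astype(dtype)

rng = np.random.RandomState(42)
for dtype in [np.int32, np.int64, np.float32, np.float64]:
    vals = make_values(dtype, 100, rng)
    assert vals.dtype == dtype and vals.shape == (100,)
```

On a GPU run the loop would be restricted to int32 and float32, per the review comment.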
Thanks @ekelsen for the review. The PR has been updated. Please take a look.
ekelsen left a comment:
One final request - can you provide some benchmarks on both CPU and GPU of the new version?
Then I think we're ready. Thanks so much!
Force-pushed from 2a91c25 to e73cc24
Thanks @ekelsen. The PR has been updated with tests added. Here are some numbers for the new version:
@tensorflow-jenkins test this please

@ekelsen can you take one final look?
Thanks @gunan for the help. Let me take a look and fix it.
This commit splits BincountOp into GPU and CPU versions. GPU implementation to follow. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the issue raised in #11554 where there is no GPU support for `tf.bincount`. This fix adds GPU support for `tf.bincount`. This fix fixes #11554. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Some run result:
```
Running main() from test_main.cc
Benchmark                   Time(ns) Iterations
----------------------------------------------
BM_Bincount_cpu_32_1000       114922       5150   285.1M items/s
BM_Bincount_cpu_32_2000       124291       5524   263.6M items/s
BM_Bincount_cpu_32_5000       159548       4287   205.4M items/s
BM_Bincount_cpu_64_1000       145006       4793   452.0M items/s
BM_Bincount_cpu_64_2000       150301       4457   436.0M items/s
BM_Bincount_cpu_64_5000       180001       3880   364.1M items/s
BM_Bincount_cpu_128_1000      204993       3405   639.4M items/s
BM_Bincount_cpu_128_2000      209144       3311   626.7M items/s
BM_Bincount_cpu_128_5000      231580       3003   566.0M items/s
BM_Bincount_gpu_32_1000        61178      10000   535.6M items/s
BM_Bincount_gpu_32_2000        61021      10000   537.0M items/s
BM_Bincount_gpu_32_5000        61177      10000   535.6M items/s
BM_Bincount_gpu_64_1000        61317      10000  1068.8M items/s
BM_Bincount_gpu_64_2000        60726      10000  1079.2M items/s
BM_Bincount_gpu_64_5000        61721      10000  1061.8M items/s
BM_Bincount_gpu_128_1000       69935      10000  1874.2M items/s
BM_Bincount_gpu_128_2000       79760       9852  1643.3M items/s
BM_Bincount_gpu_128_5000      100407       6974  1305.4M items/s
```
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
libraries so that Jenkins could pass. Bincount should not be added to `hidden_ops.txt` as it will cause the compatibility test to fail. And the following libs should not be needed in CI/CD:
```
- "@local_config_cuda//cuda:cublas",
- "@local_config_cuda//cuda:cuda_driver",
- "@local_config_cuda//cuda:cudnn",
- "@local_config_cuda//cuda:cufft",
- "@local_config_cuda//cuda:curand",
```
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Force-pushed from e73cc24 to 95ec3de
@gunan The api compatibility test failure is caused by addition of

For the gpu test linkage failure, on my dev machine previously I added the following so that it could pass locally:

It seems that on CI/CD those libs should not be specified explicitly. I have removed those extra dependencies from the bazel
@gunan The PR has been updated. Can you start the Jenkins build again? Thanks for the help and patience during the process; I really appreciate it.
Thank you for the updates.
This fix tries to address the issue raised in #11554 where there is no GPU support for `tf.bincount`. This fix adds GPU support for `tf.bincount`. This fix fixes #11554.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>