Conversation

@yongtang
Member

This fix addresses the issue raised in #11554, where there is no GPU support for `tf.bincount`.

This fix adds GPU support for `tf.bincount`.

This fix fixes #11554.
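
For illustration, a minimal usage sketch (TF 1.x graph mode; session setup omitted, and the explicit device placement is an assumption for this example):

```python
import tensorflow as tf

# With this change, the histogram itself can run on the GPU.
with tf.device("/gpu:0"):
    counts = tf.bincount(tf.constant([1, 1, 2, 5]))
# After evaluation: counts == [0, 2, 1, 0, 0, 1]
```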

Signed-off-by: Yong Tang yong.tang.github@outlook.com

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@ekelsen
Contributor

ekelsen commented Oct 18, 2017

Note that in the case of weights being passed to bincount, it is equivalent to unsorted_segment_sum. I would prefer a solution that just replaces bincount with a call to unsorted_segment_sum in this case (ideally in the graph after the output size is determined).

For the case where no weights are given and they are assumed to be one, it would be best to have the implementation call CUB, as it will be significantly faster.
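
For concreteness, a small NumPy sketch of the equivalence described above (the values and weights here are made up for illustration):

```python
import numpy as np

values = np.array([1, 2, 2, 3])
weights = np.array([0.5, 1.0, 1.5, 2.0])

# Weighted bincount: out[i] is the sum of weights[j] over all j with values[j] == i.
out = np.bincount(values, weights=weights)   # [0. , 0.5, 2.5, 2. ]

# The same result as an unsorted segment sum: scatter-add each weight into its bin.
seg = np.zeros(values.max() + 1)
np.add.at(seg, values, weights)

assert np.allclose(out, seg)
```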

@vrv vrv self-assigned this Oct 18, 2017
@vrv vrv requested a review from ringw October 18, 2017 18:31
@vrv

vrv commented Oct 18, 2017

@ringw can you take a look at this? Let me know if you want me to find someone else for the GPU kernel code review.

@vrv vrv added the awaiting review Pull request awaiting review label Oct 18, 2017
@vrv

vrv commented Oct 18, 2017

Missed @ekelsen's comment, probably should do that :)

@yongtang
Member Author

@ekelsen @vrv Thanks for the suggestion. Will spend some time and get a CUB implementation added.

@vrv vrv requested review from ekelsen and removed request for ringw October 19, 2017 03:45
@vrv vrv added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Oct 19, 2017
@ekelsen
Contributor

ekelsen commented Oct 19, 2017

@yongtang thanks for taking this on! CUB for the weights == 1 case will be great. Avoiding code duplication by calling the GPU kernel that already exists for unsorted_segment_sum in the case where weights != None will also be very helpful!

@yongtang
Member Author

@ekelsen @vrv The PR has been updated with CUB used for weights.size() == 0 and unsorted_segment_sum used for weights.size() != 0. Please take a look and let me know if there are any issues.

As cub::DeviceHistogram::HistogramEven depends on atomicAdd, which does not have double and half overloads, only float is enabled for CUB for now. I will investigate further to see if there are ways to create wrappers for the double and half versions of atomicAdd.

@vrv vrv added awaiting review Pull request awaiting review and removed stat:awaiting response Status - Awaiting response from author labels Oct 20, 2017
Contributor

@ekelsen ekelsen left a comment

Thanks for these changes; they look good. I think we're almost there. Thanks for the help!

```
                        BincountOp<GPUDevice, type>)

TF_CALL_NUMBER_TYPES(REGISTER);
TF_CALL_float(REGISTER_KERNELS);
```
Contributor

I think bincount is most useful for integer types, so you should make sure to also add those (look at what TF_CALL_NUMBER_TYPES does; I believe there is a macro for integer types only).

Member Author

Thanks. Done. Though on my dev machine it seems only int32 and float work; that might be because sm = 20? I added int32 to the list. I don't know whether the additional registrations would work on a different dev machine.

```
#define REGISTER_GPU_SPEC(type) \
  template struct functor::BincountFunctor<GPUDevice, type>;

TF_CALL_float(REGISTER_GPU_SPEC)
```
Contributor

Again, I think the most useful types will be the integer ones, not float. Not working on half or double is not a problem.

Member Author

Done.

```
self.assertAllEqual(math_ops.bincount([1], maxlength=3).eval(), [0, 1])
self.assertAllEqual(math_ops.bincount([], maxlength=3).eval(), [])

def test_random_with_weights(self):
```
Contributor

Can you also add a test_random_without_weights? I would feel better knowing some more sizes and types have been tested.

Member Author

Thanks @ekelsen. test_random_without_weights has been added.

```
auto output = output_t->flat<T>();
OP_REQUIRES_OK(
    ctx, functor::BincountFunctor<Device, T>::Compute(ctx, arr, output));
} else {
```
Contributor

Please handle the case of weights not being None in the graph (i.e. edit the Python code for the bincount op) by calling unsorted_segment_sum there, after reduce_max. The C++ code for bincount only needs to handle the case where the weights are implicitly one.

It's better not to have high-level ops that alias the same underlying calculation; it only makes analyzing the graph harder.
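
To make the requested dispatch concrete, here is a minimal sketch of the Python-side wrapper (module paths follow TF's python/ops layout; the exact helper names and the gen_math_ops.bincount signature are assumptions, not the code that landed):

```python
# Sketch only: weighted calls are lowered to unsorted_segment_sum in the
# graph, so the C++ bincount kernel only ever sees implicit weights of 1.
from tensorflow.python.framework import constant_op, dtypes
from tensorflow.python.ops import gen_math_ops, math_ops


def bincount(arr, weights=None, minlength=None, maxlength=None,
             dtype=dtypes.int32):
    # The output size is determined in the graph: max(arr) + 1, clamped by
    # minlength/maxlength.
    output_size = math_ops.reduce_max(arr) + 1
    if minlength is not None:
        output_size = math_ops.maximum(output_size, minlength)
    if maxlength is not None:
        output_size = math_ops.minimum(output_size, maxlength)
    if weights is not None:
        # Weighted case: reuse unsorted_segment_sum, which already has a
        # GPU kernel.
        return math_ops.unsorted_segment_sum(weights, arr, output_size)
    # Unweighted case: call the C++ kernel with an empty weights tensor
    # (assumed signature).
    return gen_math_ops.bincount(arr, output_size,
                                 constant_op.constant([], dtype))
```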

Member Author

Thanks @ekelsen. That part has been moved to the Python side.

@ringw
Contributor

ringw commented Oct 21, 2017

Sorry for the delay! Looks good pending @ekelsen's comments.

@yongtang
Member Author

Thanks @ekelsen @ringw for the review. The PR has been updated. Please take a look and let me know if there are any issues.


```
OP_REQUIRES(
    ctx, (weights_t.NumElements() == 0),
    errors::InvalidArgument("Weights should not be passed as it should be "
```
Contributor

I'm sorry, I should've said this earlier: the CPU implementation needs to retain the ability to handle weights != None. This is because of backwards compatibility; graphs that were already created must still run correctly in a new version of TF. (This is only true for the CPU version, as the GPU version didn't exist.)

So basically, the CPU code needs to continue to be able to handle weights, but the graph code should stay as it is so that all future graphs use unsorted_segment_sum instead. Once six months have passed, we can remove the CPU code, although more likely it will simply stay around forever.

Member Author

Thanks @ekelsen. The CPU implementation has been added back for weights != None.

```
with self.test_session(use_gpu=True):
  np.random.seed(42)
  for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
    arr = np.random.randint(0, 1000, num_samples)
```
Contributor

Right now dtype is unused; I think you want to switch the numpy function depending on the type (and only test int32 and float32 if using a GPU for the test).
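
For instance, something along these lines, continuing the test snippet above (a sketch of the suggested fix, not the exact test that landed; num_samples and self come from the surrounding test method):

```python
# Sketch: pick the numpy generator per dtype; on the GPU only the int32 and
# float32 kernels are registered, so the dtype list is trimmed accordingly.
for dtype in [dtypes.int32, dtypes.float32]:
    arr = np.random.randint(0, 1000, num_samples)
    if dtype == dtypes.int32:
        weights = np.random.randint(-100, 100, num_samples)
    else:
        weights = np.random.random(num_samples)
    self.assertAllClose(
        math_ops.bincount(arr, weights).eval(), np.bincount(arr, weights))
```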

Member Author

Thanks @ekelsen. The tests have been updated.

@yongtang
Member Author

Thanks @ekelsen for the review. The PR has been updated. Please take a look.

Contributor

@ekelsen ekelsen left a comment

One final request - can you provide some benchmarks on both CPU and GPU of the new version?

Then I think we're ready. Thanks so much!

@yongtang
Member Author

Thanks @ekelsen. The PR has been updated with tests added. Here are some numbers for the new version:

```
Running main() from test_main.cc
Benchmark                  Time(ns) Iterations
----------------------------------------------
BM_Bincount_cpu_32_1000      114922       5150   285.1M items/s
BM_Bincount_cpu_32_2000      124291       5524   263.6M items/s
BM_Bincount_cpu_32_5000      159548       4287   205.4M items/s
BM_Bincount_cpu_64_1000      145006       4793   452.0M items/s
BM_Bincount_cpu_64_2000      150301       4457   436.0M items/s
BM_Bincount_cpu_64_5000      180001       3880   364.1M items/s
BM_Bincount_cpu_128_1000     204993       3405   639.4M items/s
BM_Bincount_cpu_128_2000     209144       3311   626.7M items/s
BM_Bincount_cpu_128_5000     231580       3003   566.0M items/s

BM_Bincount_gpu_32_1000       61178      10000   535.6M items/s
BM_Bincount_gpu_32_2000       61021      10000   537.0M items/s
BM_Bincount_gpu_32_5000       61177      10000   535.6M items/s
BM_Bincount_gpu_64_1000       61317      10000   1068.8M items/s
BM_Bincount_gpu_64_2000       60726      10000   1079.2M items/s
BM_Bincount_gpu_64_5000       61721      10000   1061.8M items/s
BM_Bincount_gpu_128_1000      69935      10000   1874.2M items/s
BM_Bincount_gpu_128_2000      79760       9852   1643.3M items/s
BM_Bincount_gpu_128_5000     100407       6974   1305.4M items/s
```

@vrv

vrv commented Oct 25, 2017

@tensorflow-jenkins test this please

@ekelsen can you take one final look?

@vrv vrv added the kokoro:force-run Tests on submitted change label Oct 25, 2017
@yongtang
Member Author

yongtang commented Nov 4, 2017

Thanks @gunan for the help. Let me take a look and fix it.

This commit splits BincountOp into GPU and CPU versions.
GPU implementation to follow.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Remove Bincount from `hidden_ops.txt` and extra CUDA
libraries so that Jenkins could pass.

Bincount should not be added to `hidden_ops.txt`, as it will cause
the compatibility test to fail.

And the following libs should not be needed in CI/CD:
```
-        "@local_config_cuda//cuda:cublas",
-        "@local_config_cuda//cuda:cuda_driver",
-        "@local_config_cuda//cuda:cudnn",
-        "@local_config_cuda//cuda:cufft",
-        "@local_config_cuda//cuda:curand",
```

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Member Author

yongtang commented Nov 5, 2017

@gunan The API compatibility test failure is caused by the addition of Bincount to `hidden_ops.txt`. I have removed it from `hidden_ops.txt`.

For the GPU test linkage failure: previously, on my dev machine, I added the following so that the build could pass locally:

        "@local_config_cuda//cuda:cublas",
        "@local_config_cuda//cuda:cuda_driver",
        "@local_config_cuda//cuda:cudnn",
        "@local_config_cuda//cuda:cufft",
        "@local_config_cuda//cuda:curand",

It seems that on CI/CD those libs should not be specified explicitly. I have removed those extra dependencies from the Bazel BUILD file. I think that will resolve the issue on Jenkins CI/CD.

@yongtang
Member Author

yongtang commented Nov 5, 2017

@gunan The PR has been updated. Can you start the Jenkins build again? Thanks for the help and patience during the process; I really appreciate it.

@gunan
Contributor

gunan commented Nov 5, 2017

Thank you for the updates.
Jenkins, test this please.

@gunan gunan added the kokoro:force-run Tests on submitted change label Nov 5, 2017
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 5, 2017
@gunan gunan merged commit 9389c25 into tensorflow:master Nov 5, 2017
@yongtang yongtang deleted the 11554-bincount-gpu branch November 5, 2017 20:14