Conversation

@yongtang
Member

This fix addresses the issue raised in #11554, where there is no GPU support for `tf.bincount`.

This fix adds GPU support for `tf.bincount`.

This fix fixes #11554.
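
For illustration, a minimal usage sketch (TF 1.x graph mode; session setup omitted, and the explicit device placement is an assumption for this example):

```python
import tensorflow as tf

# With this change, the histogram itself can run on the GPU.
with tf.device("/gpu:0"):
    counts = tf.bincount(tf.constant([1, 1, 2, 5]))
# After evaluation: counts == [0, 2, 1, 0, 0, 1]
```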

Signed-off-by: Yong Tang yong.tang.github@outlook.com

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@ekelsen
Contributor

ekelsen commented Oct 18, 2017

Note that in the case of weights being passed to bincount, it is equivalent to unsorted_segment_sum. I would prefer a solution that just replaces bincount with a call to unsorted_segment_sum in this case (ideally in the graph after the output size is determined).

For the case where no weights are given and they are assumed to be one, it would be best to have the implementation call CUB, as it will be significantly faster.
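
For concreteness, a small NumPy sketch of the equivalence described above (the values and weights here are made up for illustration):

```python
import numpy as np

values = np.array([1, 2, 2, 3])
weights = np.array([0.5, 1.0, 1.5, 2.0])

# Weighted bincount: out[i] is the sum of weights[j] over all j with values[j] == i.
out = np.bincount(values, weights=weights)   # [0. , 0.5, 2.5, 2. ]

# The same result as an unsorted segment sum: scatter-add each weight into its bin.
seg = np.zeros(values.max() + 1)
np.add.at(seg, values, weights)

assert np.allclose(out, seg)
```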

@vrv vrv self-assigned this Oct 18, 2017
@vrv vrv requested a review from ringw October 18, 2017 18:31
@vrv

vrv commented Oct 18, 2017

@ringw can you take a look at this? Let me know if you want me to find someone else for the GPU kernel code review.

@vrv vrv added the awaiting review Pull request awaiting review label Oct 18, 2017
@vrv

vrv commented Oct 18, 2017

Missed @ekelsen's comment, probably should do that :)

@yongtang
Member Author

@ekelsen @vrv Thanks for the suggestion. Will spend some time and get a CUB implementation added.

@vrv vrv requested review from ekelsen and removed request for ringw October 19, 2017 03:45
@vrv vrv added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Oct 19, 2017
@ekelsen
Contributor

ekelsen commented Oct 19, 2017

@yongtang thanks for taking this on! CUB for the weights == 1 case will be great. Avoiding code duplication by calling the GPU kernel that already exists for unsorted_segment_sum in the case where weights != None will also be very helpful!

@yongtang
Member Author

@ekelsen @vrv The PR has been updated with CUB used for weights.size() == 0 and unsorted_segment_sum used for weights.size() != 0. Please take a look and let me know if there are any issues.

As cub::DeviceHistogram::HistogramEven depends on atomicAdd, which does not have double and half overloads, only float is enabled for CUB for now. I will investigate further to see if there are ways to create wrappers for the double and half versions of atomicAdd.

@vrv vrv added awaiting review Pull request awaiting review and removed stat:awaiting response Status - Awaiting response from author labels Oct 20, 2017
Contributor

@ekelsen ekelsen left a comment

Thanks for these changes; they look good. I think we're almost there. Thanks for the help!

```
                        BincountOp<GPUDevice, type>)

TF_CALL_NUMBER_TYPES(REGISTER);
TF_CALL_float(REGISTER_KERNELS);
```
Contributor

I think bincount is most useful for integer types, so you should make sure to also add those (look at what TF_CALL_NUMBER_TYPES does; I believe there is a macro for integer types only).

Member Author

Thanks. Done. Though on my dev machine it seems only int32 and float work; that might be because sm = 20? I added int32 to the list. I don't know whether the additional registrations would work on a different dev machine.

```
#define REGISTER_GPU_SPEC(type) \
  template struct functor::BincountFunctor<GPUDevice, type>;

TF_CALL_float(REGISTER_GPU_SPEC)
```
Contributor

Again, I think the most useful types will be the integer ones, not float. Not working on half or double is not a problem.

Member Author

Done.

```
self.assertAllEqual(math_ops.bincount([1], maxlength=3).eval(), [0, 1])
self.assertAllEqual(math_ops.bincount([], maxlength=3).eval(), [])

def test_random_with_weights(self):
```
Contributor

Can you also add a test_random_without_weights? I would feel better knowing some more sizes and types have been tested.

Member Author

Thanks @ekelsen. test_random_without_weights has been added.

```
auto output = output_t->flat<T>();
OP_REQUIRES_OK(
    ctx, functor::BincountFunctor<Device, T>::Compute(ctx, arr, output));
} else {
```
Contributor

Please handle the case of weights not being None in the graph (i.e. edit the Python code for the bincount op) by calling unsorted_segment_sum there, after reduce_max. The C++ code for bincount only needs to handle the case where the weights are implicitly one.

It's better not to have high-level ops that alias the same underlying calculation; it only makes analyzing the graph harder.
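
To make the requested dispatch concrete, here is a minimal sketch of the Python-side wrapper (module paths follow TF's python/ops layout; the exact helper names and the gen_math_ops.bincount signature are assumptions, not the code that landed):

```python
# Sketch only: weighted calls are lowered to unsorted_segment_sum in the
# graph, so the C++ bincount kernel only ever sees implicit weights of 1.
from tensorflow.python.framework import constant_op, dtypes
from tensorflow.python.ops import gen_math_ops, math_ops


def bincount(arr, weights=None, minlength=None, maxlength=None,
             dtype=dtypes.int32):
    # The output size is determined in the graph: max(arr) + 1, clamped by
    # minlength/maxlength.
    output_size = math_ops.reduce_max(arr) + 1
    if minlength is not None:
        output_size = math_ops.maximum(output_size, minlength)
    if maxlength is not None:
        output_size = math_ops.minimum(output_size, maxlength)
    if weights is not None:
        # Weighted case: reuse unsorted_segment_sum, which already has a
        # GPU kernel.
        return math_ops.unsorted_segment_sum(weights, arr, output_size)
    # Unweighted case: call the C++ kernel with an empty weights tensor
    # (assumed signature).
    return gen_math_ops.bincount(arr, output_size,
                                 constant_op.constant([], dtype))
```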

Member Author

Thanks @ekelsen. That part has been moved to the Python side.

@ringw
Contributor

ringw commented Oct 21, 2017

Sorry for the delay! Looks good pending @ekelsen's comments.

@yongtang
Member Author

Thanks @ekelsen @ringw for the review. The PR has been updated. Please take a look and let me know if there are any issues.


```
OP_REQUIRES(
    ctx, (weights_t.NumElements() == 0),
    errors::InvalidArgument("Weights should not be passed as it should be "
```
Contributor

I'm sorry, I should've said this earlier: the CPU implementation needs to retain the ability to handle weights != None. This is because of backwards compatibility; graphs that were already created must still run correctly in a new version of TF. (This is only true for the CPU version, as the GPU version didn't exist.)

So basically, the CPU code needs to continue to be able to handle weights, but the graph code should stay as it is so that all future graphs use unsorted_segment_sum instead. Once six months have passed, we can remove the CPU code, although more likely it will simply stay around forever.

Member Author

Thanks @ekelsen. The CPU implementation has been added back for weights != None.

```
with self.test_session(use_gpu=True):
  np.random.seed(42)
  for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
    arr = np.random.randint(0, 1000, num_samples)
```
Contributor

Right now dtype is unused; I think you want to switch the numpy function depending on the type (and only test int32 and float32 if using a GPU for the test).
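
For instance, something along these lines, continuing the test snippet above (a sketch of the suggested fix, not the exact test that landed; num_samples and self come from the surrounding test method):

```python
# Sketch: pick the numpy generator per dtype; on the GPU only the int32 and
# float32 kernels are registered, so the dtype list is trimmed accordingly.
for dtype in [dtypes.int32, dtypes.float32]:
    arr = np.random.randint(0, 1000, num_samples)
    if dtype == dtypes.int32:
        weights = np.random.randint(-100, 100, num_samples)
    else:
        weights = np.random.random(num_samples)
    self.assertAllClose(
        math_ops.bincount(arr, weights).eval(), np.bincount(arr, weights))
```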

Member Author

Thanks @ekelsen. The tests have been updated.

@yongtang
Member Author

Thanks @ekelsen for the review. The PR has been updated. Please take a look.

Contributor

@ekelsen ekelsen left a comment

One final request - can you provide some benchmarks on both CPU and GPU of the new version?

Then I think we're ready. Thanks so much!

@yongtang
Member Author

Thanks @ekelsen. The PR has been updated with tests added. Here are some numbers for the new version:

```
Running main() from test_main.cc
Benchmark                  Time(ns) Iterations
----------------------------------------------
BM_Bincount_cpu_32_1000      114922       5150   285.1M items/s
BM_Bincount_cpu_32_2000      124291       5524   263.6M items/s
BM_Bincount_cpu_32_5000      159548       4287   205.4M items/s
BM_Bincount_cpu_64_1000      145006       4793   452.0M items/s
BM_Bincount_cpu_64_2000      150301       4457   436.0M items/s
BM_Bincount_cpu_64_5000      180001       3880   364.1M items/s
BM_Bincount_cpu_128_1000     204993       3405   639.4M items/s
BM_Bincount_cpu_128_2000     209144       3311   626.7M items/s
BM_Bincount_cpu_128_5000     231580       3003   566.0M items/s

BM_Bincount_gpu_32_1000       61178      10000   535.6M items/s
BM_Bincount_gpu_32_2000       61021      10000   537.0M items/s
BM_Bincount_gpu_32_5000       61177      10000   535.6M items/s
BM_Bincount_gpu_64_1000       61317      10000   1068.8M items/s
BM_Bincount_gpu_64_2000       60726      10000   1079.2M items/s
BM_Bincount_gpu_64_5000       61721      10000   1061.8M items/s
BM_Bincount_gpu_128_1000      69935      10000   1874.2M items/s
BM_Bincount_gpu_128_2000      79760       9852   1643.3M items/s
BM_Bincount_gpu_128_5000     100407       6974   1305.4M items/s
```

@vrv

vrv commented Oct 25, 2017

@tensorflow-jenkins test this please

@ekelsen can you take one final look?

@vrv vrv added the kokoro:force-run Tests on submitted change label Oct 25, 2017
@yongtang
Member Author

yongtang commented Nov 4, 2017

Thanks @gunan for the help. Let me take a look and fix it.

This commit splits BincountOp into GPU and CPU versions.
GPU implementation to follow.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Remove Bincount from `hidden_ops.txt` and extra CUDA
libraries so that Jenkins could pass.

Bincount should not be added to `hidden_ops.txt`, as it will cause
the compatibility test to fail.

And the following libs should not be needed in CI/CD:
```
-        "@local_config_cuda//cuda:cublas",
-        "@local_config_cuda//cuda:cuda_driver",
-        "@local_config_cuda//cuda:cudnn",
-        "@local_config_cuda//cuda:cufft",
-        "@local_config_cuda//cuda:curand",
```

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Member Author

yongtang commented Nov 5, 2017

@gunan The API compatibility test failure is caused by the addition of Bincount to `hidden_ops.txt`. I have removed it from `hidden_ops.txt`.

For the GPU test linkage failure: previously, on my dev machine, I added the following so that the build could pass locally:

        "@local_config_cuda//cuda:cublas",
        "@local_config_cuda//cuda:cuda_driver",
        "@local_config_cuda//cuda:cudnn",
        "@local_config_cuda//cuda:cufft",
        "@local_config_cuda//cuda:curand",

It seems that on CI/CD those libs should not be specified explicitly. I have removed those extra dependencies from the Bazel BUILD file. I think that will resolve the issue on Jenkins CI/CD.

@yongtang
Member Author

yongtang commented Nov 5, 2017

@gunan The PR has been updated. Can you start the Jenkins build again? Thanks for the help and patience during the process; I really appreciate it.

@gunan
Contributor

gunan commented Nov 5, 2017

Thank you for the updates.
Jenkins, test this please.

@gunan gunan added the kokoro:force-run Tests on submitted change label Nov 5, 2017
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 5, 2017
@gunan gunan merged commit 9389c25 into tensorflow:master Nov 5, 2017
@yongtang yongtang deleted the 11554-bincount-gpu branch November 5, 2017 20:14