[determinism] Add segment reduction op exceptions for GPU determinism #47772
Conversation
```cpp
      const T& value) {
    GpuAtomicAdd(dest, value);
  }
  const bool deterministic_for_float = false;
```
This should say `is_associative`, since individual floating-point additions are deterministic. And I'd suggest writing this as:

```cpp
constexpr bool is_associative = !std::is_floating_point<T>::value;
```

so that the "caller" does not have to check `T` again.

Same for the other instances.
Very nice. Will do.
I made this into a `static constexpr` (class member). `constexpr` is not allowed on a non-static data member of a class, presumably because such a member would have to be assigned/initialized (even if `const`) when the class is instantiated, at run time rather than at compile time. This change, which I think is what we were looking for, may facilitate more compile-time elimination of some of the logic used to decide branching.
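A minimal standalone sketch of the point above (not the PR's actual code; `NonAtomicSumOpGpuSketch` is an illustrative name): a `constexpr` data member must be `static` because it is a compile-time constant of the class rather than per-instance state, and callers can then use it at compile time.

```cpp
#include <iostream>
#include <type_traits>

template <typename T>
struct NonAtomicSumOpGpuSketch {
  // OK: a static constexpr member, fixed at compile time and shared by all
  // instances, so callers can branch on it without re-checking T.
  static constexpr bool is_associative = std::is_integral<T>::value;
  // By contrast, a non-static `constexpr bool is_associative = ...;` data
  // member does not compile: it would need per-instance initialization.
};

int main() {
  std::cout << std::boolalpha
            << NonAtomicSumOpGpuSketch<int>::is_associative << "\n"     // true
            << NonAtomicSumOpGpuSketch<float>::is_associative << "\n";  // false
  return 0;
}
```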
Also, I used `std::is_integral<T>::value` because `!std::is_floating_point<T>::value` would not exclude the complex types. I'm also more confident that all of the integral types are associative than I am that all types other than the floating-point types (including the complex types) are associative.
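A standalone check (again, not the PR's code) of the trait behavior being described: `std::is_floating_point` is false for `std::complex`, so the negated predicate would treat complex types as associative, while `std::is_integral` restricts the flag to the integer types only.

```cpp
#include <complex>
#include <iostream>
#include <type_traits>

int main() {
  std::cout << std::boolalpha;
  // The trait actually used: true only for integer types.
  std::cout << std::is_integral<int>::value << "\n";                  // true
  std::cout << std::is_integral<float>::value << "\n";                // false
  std::cout << std::is_integral<std::complex<float>>::value << "\n";  // false

  // The rejected alternative: "not floating point" is true for complex,
  // which would wrongly mark complex sums as associative.
  std::cout << !std::is_floating_point<std::complex<float>>::value << "\n";  // true
  return 0;
}
```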
```diff
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentSum", type, index_type, functor::Zero<type>,                 \
-      functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>);    \
+      functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>,     \
+      /*deterministic_for_float=*/false);                                  \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentProd", type, index_type, functor::One<type>,                 \
-      functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>);  \
+      functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>,   \
+      /*deterministic_for_float=*/false);                                  \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentMin", type, index_type, functor::Highest<type>,              \
-      functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>);    \
+      functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>,     \
+      /*deterministic_for_float=*/true);                                   \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentMax", type, index_type, functor::Lowest<type>,               \
-      functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>);
+      functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>,     \
+      /*deterministic_for_float=*/true);
```
Can the OpKernels infer `deterministic_for_float` from the `AtomicFooOpGpu` functors?
I wish they could. Without completely rewriting a lot of this code, I couldn't figure out a way of doing this the same way for both the unsorted ops and the sorted ops, because the code is structured differently for the two sets of ops. For the sorted ops, the reduction functor types bypass the GPU-specific code and appear in non-device-specific code. If I moved the exception-testing into that code, I would need to condition it on the device. This repetition-inducing asymmetry has been bothering me too. Let me take another look and see if I can clean it up using a different approach.
It looks like maybe I misunderstood something about the structure of the code. I'm attempting again to clean this up.
Okay, I'm back to understanding the actual problem that the submitted code solves. The place where we would ideally throw the exception for the sorted segment reduction ops (in order to use the `is_associative` state from `AtomicFooOpGpu`) runs asynchronous compute but doesn't have access to the appropriate `DoneCallback` needed to call `OP_REQUIRES_ASYNC`. Trying to solve this another way ...
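A hedged sketch of the constraint being described, assuming TensorFlow's `AsyncOpKernel` framework; the class and helper names (`SortedSegmentReductionAsyncSketch`, `RequireDeterminism`) are illustrative, not the PR's identifiers. The point is that `OP_REQUIRES_ASYNC` needs the `DoneCallback` so it can signal completion when it returns early.

```cpp
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

// Illustrative stand-in for whatever helper reads TF_DETERMINISTIC_OPS.
bool RequireDeterminism();

template <typename ReductionF>
class SortedSegmentReductionAsyncSketch : public AsyncOpKernel {
 public:
  using AsyncOpKernel::AsyncOpKernel;

  void ComputeAsync(OpKernelContext* context, DoneCallback done) override {
    // Without `done`, OP_REQUIRES_ASYNC cannot be used here, which is why
    // the exception is thrown from a different place in the submitted code.
    OP_REQUIRES_ASYNC(
        context, !RequireDeterminism() || ReductionF::is_associative,
        errors::Unimplemented("Deterministic GPU implementation of sorted "
                              "segment reduction op not available."),
        done);
    // ... launch the asynchronous GPU reduction, then call done() ...
  }
};

}  // namespace tensorflow
```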
I think this is good enough now (with the commit I just added). Please let me know if you can see a cleaner way of doing it.
@gbaned: please
@sanjoy: I have added a lot of additional information to the original comment. My intention is to more thoroughly document this PR, both for myself and for anyone else who reviews it in the future. Please take a look at it, particularly the sections on graph vs eager mode and XLA.
## 1

The following commit breaks the following unit-tests on ROCm: tensorflow@92e0315

```
//tensorflow/compiler/xla/service/gpu/tests:gpu_infeed_test FAILED in 3 out of 3 in 1.8s
//tensorflow/compiler/xla/tests:local_client_execute_test_gpu FAILED in 3 out of 3 in 13.7s
//tensorflow/compiler/xla/tests:outfeed_in_nested_computation_test_gpu FAILED in 3 out of 3 in 2.0s
//tensorflow/compiler/xla/tests:while_test_gpu FAILED in 3 out of 3 in 9.9s
```

The cause is that part of the changes were not enabled for ROCm, and this commit fixes that.

## 2

The following commit breaks the following unit-test on ROCm: tensorflow#47772

```
//tensorflow/python/kernel_tests:segment_reduction_ops_deterministic_test_gpu FAILED in 3 out of 3 in 5.3s
```

The ROCm platform does not yet support the complex datatypes for the segment reduction ops, which is what leads to the failure. This commit modifies the test case to skip testing the complex datatypes on ROCm.
Note that non-sparse GPU-deterministic segment reduction ops, both sorted and unsorted, will be added by TensorFlow PR 51392, probably then appearing in release 2.7. The aforementioned PR also adds a (deterministic) GPU implementation of
High-Level Summary
This current PR adds and tests the following functionality:
When the environment variable `TF_DETERMINISTIC_OPS` is set to `"true"` or `"1"`, an attempt to run the following ops on a GPU will throw `tf.errors.UnimplementedError` (with an understandable message) when `data` is a floating-point type, including complex types (if supported):

- `tf.math.segment_prod`
- `tf.math.segment_sum`
- `tf.math.unsorted_segment_mean`
- `tf.math.unsorted_segment_sqrt_n`
- `tf.math.unsorted_segment_prod`
- `tf.math.unsorted_segment_sum`

`tf.convert_to_tensor`, when running on a GPU and when `value` is of type `tf.IndexedSlices`, will also throw `tf.errors.UnimplementedError` (because it uses `tf.math.unsorted_segment_sum`). This is confirmed by a test included in this PR.

The output of `tf.gather`'s backprop is of type `tf.IndexedSlices`. Therefore, when running on a GPU, if the output of `tf.gather`'s backprop is passed to `tf.convert_to_tensor`, such as when updating a word embedding, then that will also cause `tf.errors.UnimplementedError` to be thrown. This is confirmed by a test included in this PR.

Please see RFC: Enhancing determinism in TF (being added via tensorflow/community PR 346).
Additional Notes
Sorted Segment Mean
`tf.math.segment_mean` is currently implemented on the CPU only, operates deterministically, and will not throw `tf.errors.UnimplementedError`. The tests included in the current PR confirm that it does not throw an exception.

Unsorted Segment Mean and Square Root

`tf.math.unsorted_segment_mean` and `tf.math.unsorted_segment_sqrt_n` are both currently implemented on top of `tf.math.unsorted_segment_sum`. Running these two ops on the GPU, with floating-point-based data types, will cause `tf.errors.UnimplementedError` to be thrown as if by `tf.math.unsorted_segment_sum`.

Min and Max Segment Reductions
The following ops will not throw `tf.errors.UnimplementedError`, even for floating-point-based types, because min and max operations are associative (ignoring potential non-commutativity and non-associativity in mathematical corner cases). That they do not throw exceptions when run on a GPU is confirmed by the tests included in this current PR. A standalone illustration of this distinction follows the list below.

- `tf.math.segment_max`
- `tf.math.segment_min`
- `tf.math.unsorted_segment_max`
- `tf.math.unsorted_segment_min`
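A standalone illustration (unrelated to TF's kernel code) of why float-type sums are excluded while min/max are not: adding the same floats in different orders can change the rounded result, whereas the maximum is the same regardless of order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<float> v = {1e8f, 1.0f, -1e8f, 1.0f};

  float sum_fwd = 0.0f, sum_rev = 0.0f;
  float max_fwd = v.front(), max_rev = v.front();
  for (std::size_t i = 0; i < v.size(); ++i) sum_fwd += v[i];
  for (std::size_t i = v.size(); i-- > 0;) sum_rev += v[i];
  for (float x : v) max_fwd = std::max(max_fwd, x);
  for (std::size_t i = v.size(); i-- > 0;) max_rev = std::max(max_rev, v[i]);

  // Floating-point addition is not associative: the two orders differ.
  std::printf("sum  forward: %g  reverse: %g\n", sum_fwd, sum_rev);
  // Max is associative and commutative: the order does not matter.
  std::printf("max  forward: %g  reverse: %g\n", max_fwd, max_rev);
  return 0;
}
```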
Data Types

Where GPU implementations of the sorted segment (non-min/max) reduction ops exist, they support the following data types: `tf.float16`, `tf.float32`, and `tf.float64`. The tests confirm that `tf.errors.UnimplementedError` is thrown for all of these, regardless of the integer index data type.

Where GPU implementations of the unsorted segment (non-min/max) reduction ops exist, they support the following data types: `tf.float16`, `tf.float32`, `tf.float64`, and `tf.int32` (except `tf.math.unsorted_segment_sqrt_n`, which does not support `tf.int32`). The tests confirm that `tf.errors.UnimplementedError` is thrown for all ops on all the floating-point types and not for any of the ops with `tf.int32`, regardless of the integer index data type.

The GPU implementation of `tf.math.unsorted_segment_sum` (and therefore also `tf.math.unsorted_segment_mean` and `tf.math.unsorted_segment_sqrt_n`) supports `tf.complex64` and `tf.complex128`. The tests confirm that `tf.errors.UnimplementedError` is thrown for these ops and data types, regardless of the integer index data type.

Graph Mode vs Eager Mode Testing
All the tests run in eager mode, and some tests run in graph mode as well. I was not able to get all the tests to work in graph mode, for reasons that seemed unrelated to the actual intentions of the tests. I didn't want to spend more time on it, and I felt that enough of the exception throwing had been shown to translate, as expected, to graph mode.
XLA
The tests that run in graph mode would also end up being run with XLA auto-jit enabled. This would cause tests to fail where the expected exceptions were not generated by the XLA implementations of the ops. For this reason, XLA auto-jit is disabled for these tests in this PR. I don't know whether the XLA implementations of these ops are deterministic, but I imagine that they probably are. I propose that when we implement deterministic versions of these ops, the determinism tests (the ones that actually check determinism) should also be run on the XLA functionality.
Dense Image Warp
`tfa.image.dense_image_warp`'s backprop to `image` (but not `flow`) is nondeterministic on a GPU because it uses `tf.gather`. Therefore, after this PR, enabling determinism should cause `tfa.image.dense_image_warp` to throw an exception up from `tf.math.unsorted_segment_sum` if its backprop is run on a GPU. This behavior is not confirmed by the tests included with this current PR.