
Conversation

@duncanriach duncanriach (Contributor) commented Mar 12, 2021

High-Level Summary

This current PR adds and tests the following functionality:

When the environment variable TF_DETERMINISTIC_OPS is set to "true" or "1", an attempt to run any of the following ops on a GPU will throw tf.errors.UnimplementedError (with an understandable message) when the data is of a floating-point type, including complex types where supported (see the sketch after this list):

tf.math.segment_prod
tf.math.segment_sum
tf.math.unsorted_segment_mean
tf.math.unsorted_segment_sqrt_n
tf.math.unsorted_segment_prod
tf.math.unsorted_segment_sum
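
A minimal sketch (not the PR's test code) of the behavior described above, assuming a GPU is available and that TF_DETERMINISTIC_OPS is set before the op runs:

```python
import os
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # enable deterministic-op checks

import tensorflow as tf

data = tf.constant([1.0, 2.0, 3.0, 4.0], dtype=tf.float32)
segment_ids = tf.constant([0, 0, 1, 1], dtype=tf.int32)

try:
    with tf.device("/GPU:0"):
        tf.math.segment_sum(data, segment_ids)
except tf.errors.UnimplementedError as e:
    # With this PR, the floating-point GPU kernel reports that no
    # deterministic implementation is available.
    print("Caught expected error:", e)
```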

tf.convert_to_tensor, when running on a GPU and when value is of type tf.IndexedSlices, will also throw tf.errors.UnimplementedError (because it uses tf.math.unsorted_segment_sum). This is confirmed by a test included in this PR.

The output of tf.gather's backprop is of type tf.IndexedSlices. Therefore, when running on a GPU, if the output of tf.gather's backprop is passed to tf.convert_to_tensor, such as when updating a word embedding, tf.errors.UnimplementedError will also be thrown. This is confirmed by a test included in this PR.
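
A hedged sketch of that tf.gather-backprop path (illustrative only; the embedding shape and indices are made up, and determinism is assumed to be enabled as in the earlier snippet):

```python
import tensorflow as tf

params = tf.Variable(tf.random.normal([10, 4]))  # stand-in for a word embedding
indices = tf.constant([0, 0, 3])                 # repeated index, as in real lookups

with tf.device("/GPU:0"):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.gather(params, indices))
    grad = tape.gradient(loss, params)  # a tf.IndexedSlices, not a dense tensor
    try:
        dense_grad = tf.convert_to_tensor(grad)  # converts via unsorted_segment_sum
    except tf.errors.UnimplementedError as e:
        print("Caught expected error:", e)
```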

Please see RFC: Enhancing determinism in TF (being added via tensorflow/community PR 346).

Additional Notes

Sorted Segment Mean

tf.math.segment_mean is currently implemented on the CPU only, operates deterministically, and will not throw tf.errors.UnimplementedError. The tests included in the current PR confirm that it does not throw an exception.
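
A small sketch of that behavior (no GPU kernel exists for this op, so it executes on the CPU and simply returns a result):

```python
import tensorflow as tf

data = tf.constant([1.0, 2.0, 3.0, 4.0])
segment_ids = tf.constant([0, 0, 1, 1])
# CPU-only kernel; deterministic, and no exception even with
# TF_DETERMINISTIC_OPS set.
print(tf.math.segment_mean(data, segment_ids))  # [1.5, 3.5]
```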

Unsorted Segment Mean and Square Root

tf.math.unsorted_segment_mean and tf.math.unsorted_segment_sqrt_n are both currently implemented on top of tf.math.unsorted_segment_sum. Running these two ops on the GPU, with floating-point-based data types, will cause tf.errors.UnimplementedError to be thrown as if by tf.math.unsorted_segment_sum.
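
A sketch of that pass-through behavior (same assumptions as the earlier snippets):

```python
import tensorflow as tf

data = tf.constant([1.0, 2.0, 3.0], dtype=tf.float32)
ids = tf.constant([0, 1, 1], dtype=tf.int32)

try:
    with tf.device("/GPU:0"):
        tf.math.unsorted_segment_mean(data, ids, num_segments=2)
except tf.errors.UnimplementedError as e:
    # Raised by the underlying tf.math.unsorted_segment_sum.
    print("Caught expected error:", e)
```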

Min and Max Segment Reductions

The following ops will not throw tf.errors.UnimplementedError, even for floating-point-based types, because min and max operations are associative (setting aside potential non-commutativity and non-associativity in mathematical corner cases); see the sketch after this list. That they do not throw exceptions when run on a GPU is confirmed by the tests included in this current PR.

  • tf.math.segment_max
  • tf.math.segment_min
  • tf.math.unsorted_segment_max
  • tf.math.unsorted_segment_min
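
A sketch of the expected (non-throwing) behavior for the min/max variants, with determinism enabled as above:

```python
import tensorflow as tf

data = tf.constant([1.0, 5.0, 3.0, 2.0], dtype=tf.float32)
ids = tf.constant([0, 0, 1, 1], dtype=tf.int32)

with tf.device("/GPU:0"):
    print(tf.math.segment_max(data, ids))                           # [5. 3.]
    print(tf.math.unsorted_segment_min(data, ids, num_segments=2))  # [1. 2.]
```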

Data Types

Where GPU implementations of the sorted segment (non-min/max) reduction ops exist, they support the following data types: tf.float16, tf.float32, and tf.float64. The tests confirm that tf.errors.UnimplementedError is thrown for all of these, regardless of the integer index data type.

Where GPU implementations of the unsorted segment (non-min/max) reduction ops exist, they support the following data types: tf.float16, tf.float32, tf.float64, and tf.int32 (except tf.math.unsorted_segment_sqrt_n, which does not support tf.int32). The tests confirm that tf.errors.UnimplementedError is thrown for all ops on all the floating-point types and not for any of the ops with int32, regardless of the integer index data type.

The GPU implementation of tf.math.unsorted_segment_sum (and therefore also tf.math.unsorted_segment_mean and tf.math.unsorted_segment_sqrt_n) supports tf.complex64 and tf.complex128. The tests confirm that tf.errors.UnimplementedError is thrown for these ops and data types, regardless of the integer index data type.
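
A sketch of the data-type coverage described above (the dtype lists come from the prose; a GPU and enabled determinism are assumed, as before):

```python
import tensorflow as tf

ids = tf.constant([0, 0, 1, 1], dtype=tf.int32)

for dtype in (tf.float16, tf.float32, tf.float64, tf.complex64, tf.complex128):
    data = tf.cast(tf.constant([1.0, 2.0, 3.0, 4.0]), dtype)
    try:
        with tf.device("/GPU:0"):
            tf.math.unsorted_segment_sum(data, ids, num_segments=2)
        print(dtype, "did not raise")
    except tf.errors.UnimplementedError:
        print(dtype, "raised, as expected")

# int32 is expected not to raise, since integer addition is associative.
with tf.device("/GPU:0"):
    print(tf.math.unsorted_segment_sum(
        tf.constant([1, 2, 3, 4], dtype=tf.int32), ids, num_segments=2))
```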

Graph Mode vs Eager Mode Testing

All the tests run in eager mode, and some run in graph mode as well. I was not able to get all the tests to work in graph mode, for reasons that seemed unrelated to the actual intentions of the tests. I didn't want to spend more time on it, and I felt that enough of the exception throwing had been shown to translate, as expected, to graph mode.

XLA

The tests that run in graph mode would also end up getting run with the XLA auto-jit enabled. This would cause tests to fail where the expected exceptions were not generated by the XLA implementations of the ops. For this reason, in this PR, XLA auto-jit is disabled. I don't know whether the XLA implementations of these ops are deterministic, but I imagine that they probably are. I propose that when we implement deterministic versions of these ops, the determinism tests (those that actually check determinism) should also be run on the XLA functionality.

Dense Image Warp

tfa.image.dense_image_warp's backprop to image (but not flow) is nondeterministic on a GPU because it uses tf.gather. Therefore, after this PR, enabling determinism should cause tfa.image.dense_image_warp to throw an exception up from tf.math.unsorted_segment_sum if its backprop is run on a GPU. This behavior is not confirmed by the tests included with this current PR.
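
Since this path is not covered by the PR's tests, the following is only a hedged sketch of where the exception would be expected to surface; it assumes tensorflow_addons is installed and determinism is enabled as above:

```python
import tensorflow as tf
import tensorflow_addons as tfa

image = tf.Variable(tf.random.normal([1, 8, 8, 3]))
flow = tf.random.normal([1, 8, 8, 2])

with tf.device("/GPU:0"):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tfa.image.dense_image_warp(image, flow))
    try:
        # Backprop to `image` goes through tf.gather; the expectation is that
        # the error surfaces from the internal unsorted_segment_sum.
        grad = tape.gradient(loss, image)
    except tf.errors.UnimplementedError as e:
        print("Caught expected error:", e)
```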

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label Mar 12, 2021
@google-cla google-cla bot added the cla: yes label Mar 12, 2021
@duncanriach (Contributor Author)

@sanjoy @reedwm : I'm hoping that this can get into TF 2.5

@gbaned gbaned self-assigned this Mar 14, 2021
@gbaned gbaned requested a review from sanjoy March 14, 2021 18:22
```cpp
const T& value) {
  GpuAtomicAdd(dest, value);
}
const bool deterministic_for_float = false;
```
Contributor

This should say is_associative, since individual floating-point additions are deterministic. And I'd suggest writing this as:

```cpp
constexpr bool is_associative = !std::is_floating_point<T>::value;
```

so that the "caller" does not have to check T again.

Same for the other instances.

Contributor Author

Very nice. Will do.

@duncanriach duncanriach (Contributor Author) Mar 17, 2021

I made this into a static constexpr (a class-level constant). constexpr is not allowed on a non-static data member of a class, presumably because such a member has to be assigned/initialized (even if const) when the class is instantiated (at runtime, not at compile time). This change, which I think is what we were looking for, may facilitate more compile-time elimination of some of the logic used to decide branching.

Contributor Author

Also, I used std::is_integral<T>::value because !std::is_floating_point<T>::value would not catch the complex types (std::is_floating_point is false for them, so they would incorrectly be treated as associative). I'm also more confident that addition is associative for all of the integral types than I am that it is associative for all types other than the floating-point types (including the complex types).

Comment on lines 132 to 147
```diff
  REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                        \
      "SegmentSum", type, index_type, functor::Zero<type>,                  \
-     functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>);     \
+     functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>,      \
+     /*deterministic_for_float=*/false);                                   \
  REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                        \
      "SegmentProd", type, index_type, functor::One<type>,                  \
-     functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>);   \
+     functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>,    \
+     /*deterministic_for_float=*/false);                                   \
  REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                        \
      "SegmentMin", type, index_type, functor::Highest<type>,               \
-     functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>);     \
+     functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>,      \
+     /*deterministic_for_float=*/true);                                    \
  REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                        \
      "SegmentMax", type, index_type, functor::Lowest<type>,                \
-     functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>);
+     functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>,      \
+     /*deterministic_for_float=*/true);
```
Contributor

Can the OpKernels infer deterministic_for_float from the AtomicFooOpGpu functors?

Contributor Author

I wish they could. Without completely rewriting a lot of this code, I couldn't figure out a way of doing this the same way for the unsorted ops and for the sorted ops, because the code is structured differently for the two sets of ops. For the sorted ops, the reduction functor types bypass the GPU-specific code and appear in non-device-specific code. If I moved the exception-testing into that code, I would need to condition it on the device. This repetition-inducing asymmetry has been bothering me too. Let me take a look at this again and see if I can clean it up using a different approach.

@duncanriach duncanriach (Contributor Author) Mar 16, 2021

It looks like maybe I misunderstood something about the structure of the code. I'm attempting again to clean this up.

Contributor Author

Okay, I'm back to understanding the actual problem that the submitted code solved. The place where we would ideally throw the exception for the sorted segment reduction ops (in order to use the is_associative state from AtomicFooOpGpu) is running asynchronous compute but doesn't have access to the appropriate DoneCallback needed to call OP_REQUIRES_ASYNC. Trying to solve this another way ...

@duncanriach duncanriach (Contributor Author) Mar 17, 2021

I think this is good enough now (with the commit I just added). Please let me know if you can see a cleaner way of doing it.

@duncanriach duncanriach requested a review from sanjoy March 17, 2021 00:31
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 17, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 17, 2021
@gbaned gbaned added the kokoro:force-run Tests on submitted change label Mar 17, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 17, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Mar 17, 2021
@duncanriach (Contributor Author)

@gbaned: please kokoro:force-run; I fixed an Ubuntu Sanity error with the most recent commit.

@duncanriach duncanriach (Contributor Author) commented Mar 17, 2021

@sanjoy: I have added a lot of additional information to the original comment. My intention is to more thoroughly document this PR, both for myself and for anyone else who reviews it in the future. Please take a look at it, particularly the sections on graph vs eager mode and XLA.

@gbaned gbaned requested a review from sanjoy March 18, 2021 03:36
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 18, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 18, 2021
@copybara-service copybara-service bot merged commit 5168a84 into tensorflow:master Mar 18, 2021
deven-amd added a commit to ROCm/tensorflow-upstream that referenced this pull request Mar 19, 2021
## 1

The following commit breaks the following unit-tests on ROCm

tensorflow@92e0315

```
//tensorflow/compiler/xla/service/gpu/tests:gpu_infeed_test              FAILED in 3 out of 3 in 1.8s
//tensorflow/compiler/xla/tests:local_client_execute_test_gpu            FAILED in 3 out of 3 in 13.7s
//tensorflow/compiler/xla/tests:outfeed_in_nested_computation_test_gpu   FAILED in 3 out of 3 in 2.0s
//tensorflow/compiler/xla/tests:while_test_gpu                           FAILED in 3 out of 3 in 9.9s
```

The cause is that part of the changes were not enabled for ROCm, and this commit fixes that.

## 2

The following commit breaks the following unit-test on ROCm

tensorflow#47772

```
//tensorflow/python/kernel_tests:segment_reduction_ops_deterministic_test_gpu FAILED in 3 out of 3 in 5.3s
```

The ROCm platform does not yet support the complex datatypes for the segment reduction ops, which is what leads to the failure.

This commit modifies the test case to skip testing the complex datatypes on ROCm.
deven-amd added a commit to ROCm/tensorflow-upstream that referenced this pull request Mar 22, 2021
@duncanriach (Contributor Author)

Note that non-sparse GPU-deterministic segment reduction ops, both sorted and unsorted, will be added by TensorFlow PR 51392, probably then appearing in release 2.7. The aforementioned PR also adds a (deterministic) GPU implementation of tf.math.segment_mean (which hitherto was implemented on CPU only).
