[determinism] Add segment reduction op exceptions for GPU determinism #47772
Conversation
```cpp
      const T& value) {
    GpuAtomicAdd(dest, value);
  }
  const bool deterministic_for_float = false;
```
This should say `is_associative`, since individual floating-point additions are deterministic. And I'd suggest writing this as:

```cpp
constexpr bool is_associative = !std::is_floating_point<T>::value;
```

so that the "caller" does not have to check `T` again.

Same for the other instances.
Very nice. Will do.
I made this into a `static constexpr` (class member). `constexpr` is not allowed on a non-static data member of a class, presumably because such a member would have to be assigned/initialized (even if `const`) when the class is instantiated, at run time rather than at compile time. This change, which I think is what we were looking for, may facilitate more compile-time elimination of some of the logic used to decide branching.
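A minimal standalone sketch of the point above (not the PR's actual code; `NonAtomicSumOpGpuSketch` is an illustrative name): a `constexpr` data member must be `static` because it is a compile-time constant of the class rather than per-instance state, and callers can then use it at compile time.

```cpp
#include <iostream>
#include <type_traits>

template <typename T>
struct NonAtomicSumOpGpuSketch {
  // OK: a static constexpr member, fixed at compile time and shared by all
  // instances, so callers can branch on it without re-checking T.
  static constexpr bool is_associative = std::is_integral<T>::value;
  // By contrast, a non-static `constexpr bool is_associative = ...;` data
  // member does not compile: it would need per-instance initialization.
};

int main() {
  std::cout << std::boolalpha
            << NonAtomicSumOpGpuSketch<int>::is_associative << "\n"     // true
            << NonAtomicSumOpGpuSketch<float>::is_associative << "\n";  // false
  return 0;
}
```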
Also, I used `std::is_integral<T>::value` because `!std::is_floating_point<T>::value` would not exclude the complex types. I'm also more confident that all of the integral types are associative than I am that all types other than the floating-point types (including the complex types) are associative.
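A standalone check (again, not the PR's code) of the trait behavior being described: `std::is_floating_point` is false for `std::complex`, so the negated predicate would treat complex types as associative, while `std::is_integral` restricts the flag to the integer types only.

```cpp
#include <complex>
#include <iostream>
#include <type_traits>

int main() {
  std::cout << std::boolalpha;
  // The trait actually used: true only for integer types.
  std::cout << std::is_integral<int>::value << "\n";                  // true
  std::cout << std::is_integral<float>::value << "\n";                // false
  std::cout << std::is_integral<std::complex<float>>::value << "\n";  // false

  // The rejected alternative: "not floating point" is true for complex,
  // which would wrongly mark complex sums as associative.
  std::cout << !std::is_floating_point<std::complex<float>>::value << "\n";  // true
  return 0;
}
```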
```diff
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentSum", type, index_type, functor::Zero<type>,                 \
-      functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>);    \
+      functor::NonAtomicSumOpGpu<type>, functor::AtomicSumOpGpu<type>,     \
+      /*deterministic_for_float=*/false);                                  \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentProd", type, index_type, functor::One<type>,                 \
-      functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>);  \
+      functor::NonAtomicProdOpGpu<type>, functor::AtomicProdOpGpu<type>,   \
+      /*deterministic_for_float=*/false);                                  \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentMin", type, index_type, functor::Highest<type>,              \
-      functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>);    \
+      functor::NonAtomicMinOpGpu<type>, functor::AtomicMinOpGpu<type>,     \
+      /*deterministic_for_float=*/true);                                   \
   REGISTER_GPU_KERNEL_SORTEDSEGMENT(                                       \
       "SegmentMax", type, index_type, functor::Lowest<type>,               \
-      functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>);
+      functor::NonAtomicMaxOpGpu<type>, functor::AtomicMaxOpGpu<type>,     \
+      /*deterministic_for_float=*/true);
```
Can the OpKernels infer `deterministic_for_float` from the `AtomicFooOpGpu` functors?
I wish they could. Without completely rewriting a lot of this code, I couldn't figure out a way of doing this the same way for both the unsorted ops and the sorted ops, because the code is structured differently for the two sets of ops. For the sorted ops, the reduction functor types bypass the GPU-specific code and appear in non-device-specific code. If I moved the exception-testing into that code, I would need to condition it on the device. This repetition-inducing asymmetry has been bothering me too. Let me take another look and see if I can clean it up using a different approach.
It looks like maybe I misunderstood something about the structure of the code. I'm attempting again to clean this up.
Okay, I'm back to understanding the actual problem that the submitted code solves. The place where we would ideally throw the exception for the sorted segment reduction ops (in order to use the `is_associative` state from `AtomicFooOpGpu`) runs asynchronous compute but doesn't have access to the appropriate `DoneCallback` needed to call `OP_REQUIRES_ASYNC`. Trying to solve this another way ...
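A hedged sketch of the constraint being described, assuming TensorFlow's `AsyncOpKernel` framework; the class and helper names (`SortedSegmentReductionAsyncSketch`, `RequireDeterminism`) are illustrative, not the PR's identifiers. The point is that `OP_REQUIRES_ASYNC` needs the `DoneCallback` so it can signal completion when it returns early.

```cpp
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

// Illustrative stand-in for whatever helper reads TF_DETERMINISTIC_OPS.
bool RequireDeterminism();

template <typename ReductionF>
class SortedSegmentReductionAsyncSketch : public AsyncOpKernel {
 public:
  using AsyncOpKernel::AsyncOpKernel;

  void ComputeAsync(OpKernelContext* context, DoneCallback done) override {
    // Without `done`, OP_REQUIRES_ASYNC cannot be used here, which is why
    // the exception is thrown from a different place in the submitted code.
    OP_REQUIRES_ASYNC(
        context, !RequireDeterminism() || ReductionF::is_associative,
        errors::Unimplemented("Deterministic GPU implementation of sorted "
                              "segment reduction op not available."),
        done);
    // ... launch the asynchronous GPU reduction, then call done() ...
  }
};

}  // namespace tensorflow
```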
I think this is good enough now (with the commit I just added). Please let me know if you can see a cleaner way of doing it.
@gbaned: please
@sanjoy: I have added a lot of additional information to the original comment. My intention is to more thoroughly document this PR, both for myself and for anyone else who reviews it in the future. Please take a look at it, particularly the sections on graph vs eager mode and XLA.
## 1

The following commit breaks the following unit-tests on ROCm: tensorflow@92e0315

```
//tensorflow/compiler/xla/service/gpu/tests:gpu_infeed_test FAILED in 3 out of 3 in 1.8s
//tensorflow/compiler/xla/tests:local_client_execute_test_gpu FAILED in 3 out of 3 in 13.7s
//tensorflow/compiler/xla/tests:outfeed_in_nested_computation_test_gpu FAILED in 3 out of 3 in 2.0s
//tensorflow/compiler/xla/tests:while_test_gpu FAILED in 3 out of 3 in 9.9s
```

The cause is that part of the changes were not enabled for ROCm, and this commit fixes that.

## 2

The following commit breaks the following unit-test on ROCm: tensorflow#47772

```
//tensorflow/python/kernel_tests:segment_reduction_ops_deterministic_test_gpu FAILED in 3 out of 3 in 5.3s
```

The ROCm platform does not yet support the complex datatypes for the segment reduction ops, which is what leads to the failure. This commit modifies the test case to skip testing the complex datatypes on ROCm.
Note that non-sparse GPU-deterministic segment reduction ops, both sorted and unsorted, will be added by TensorFlow PR 51392, probably then appearing in release 2.7. The aforementioned PR also adds a (deterministic) GPU implementation of
High-Level Summary
This current PR adds and tests the following functionality:
When the environment variable `TF_DETERMINISTIC_OPS` is set to `"true"` or `"1"`, an attempt to run the following ops on a GPU will throw `tf.errors.UnimplementedError` (with an understandable message) when `data` is a floating-point type, including complex types (if supported):

- `tf.math.segment_prod`
- `tf.math.segment_sum`
- `tf.math.unsorted_segment_mean`
- `tf.math.unsorted_segment_sqrt_n`
- `tf.math.unsorted_segment_prod`
- `tf.math.unsorted_segment_sum`

`tf.convert_to_tensor`, when running on a GPU and when `value` is of type `tf.IndexedSlices`, will also throw `tf.errors.UnimplementedError` (because it uses `tf.math.unsorted_segment_sum`). This is confirmed by a test included in this PR.

The output of `tf.gather`'s backprop is of type `tf.IndexedSlices`. Therefore, when running on a GPU, if the output of `tf.gather`'s backprop is passed to `tf.convert_to_tensor`, such as when updating a word embedding, then that will also cause `tf.errors.UnimplementedError` to be thrown. This is confirmed by a test included in this PR.

Please see RFC: Enhancing determinism in TF (being added via tensorflow/community PR 346).
Additional Notes
Sorted Segment Mean
`tf.math.segment_mean` is currently implemented on the CPU only, operates deterministically, and will not throw `tf.errors.UnimplementedError`. The tests included in the current PR confirm that it does not throw an exception.

Unsorted Segment Mean and Square Root

`tf.math.unsorted_segment_mean` and `tf.math.unsorted_segment_sqrt_n` are both currently implemented on top of `tf.math.unsorted_segment_sum`. Running these two ops on the GPU, with floating-point-based data types, will cause `tf.errors.UnimplementedError` to be thrown as if by `tf.math.unsorted_segment_sum`.

Min and Max Segment Reductions
The following ops will not throw `tf.errors.UnimplementedError`, even for floating-point-based types, because min and max operations are associative (ignoring potential non-commutativity and non-associativity in mathematical corner cases). That they do not throw exceptions when run on a GPU is confirmed by the tests included in this current PR. A standalone illustration of this distinction follows the list below.

- `tf.math.segment_max`
- `tf.math.segment_min`
- `tf.math.unsorted_segment_max`
- `tf.math.unsorted_segment_min`
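A standalone illustration (unrelated to TF's kernel code) of why float-type sums are excluded while min/max are not: adding the same floats in different orders can change the rounded result, whereas the maximum is the same regardless of order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<float> v = {1e8f, 1.0f, -1e8f, 1.0f};

  float sum_fwd = 0.0f, sum_rev = 0.0f;
  float max_fwd = v.front(), max_rev = v.front();
  for (std::size_t i = 0; i < v.size(); ++i) sum_fwd += v[i];
  for (std::size_t i = v.size(); i-- > 0;) sum_rev += v[i];
  for (float x : v) max_fwd = std::max(max_fwd, x);
  for (std::size_t i = v.size(); i-- > 0;) max_rev = std::max(max_rev, v[i]);

  // Floating-point addition is not associative: the two orders differ.
  std::printf("sum  forward: %g  reverse: %g\n", sum_fwd, sum_rev);
  // Max is associative and commutative: the order does not matter.
  std::printf("max  forward: %g  reverse: %g\n", max_fwd, max_rev);
  return 0;
}
```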
Data Types

Where GPU implementations of the sorted segment (non-min/max) reduction ops exist, they support the following data types: `tf.float16`, `tf.float32`, and `tf.float64`. The tests confirm that `tf.errors.UnimplementedError` is thrown for all of these, regardless of the integer index data type.

Where GPU implementations of the unsorted segment (non-min/max) reduction ops exist, they support the following data types: `tf.float16`, `tf.float32`, `tf.float64`, and `tf.int32` (except `tf.math.unsorted_segment_sqrt_n`, which does not support `tf.int32`). The tests confirm that `tf.errors.UnimplementedError` is thrown for all ops on all the floating-point types and not for any of the ops with `tf.int32`, regardless of the integer index data type.

The GPU implementation of `tf.math.unsorted_segment_sum` (and therefore also `tf.math.unsorted_segment_mean` and `tf.math.unsorted_segment_sqrt_n`) supports `tf.complex64` and `tf.complex128`. The tests confirm that `tf.errors.UnimplementedError` is thrown for these ops and data types, regardless of the integer index data type.

Graph Mode vs Eager Mode Testing
All the tests run in eager mode, and some tests run in graph mode as well. I was not able to get all the tests to work in graph mode, for reasons that seemed unrelated to the actual intentions of the tests. I didn't want to spend more time on it, and I felt that enough of the exception throwing had been shown to translate, as expected, to graph mode.
XLA
The tests that run in graph mode would also end up being run with XLA auto-jit enabled. This would cause tests to fail where the expected exceptions were not generated by the XLA implementations of the ops. For this reason, XLA auto-jit is disabled for these tests in this PR. I don't know whether the XLA implementations of these ops are deterministic, but I imagine that they probably are. I propose that when we implement deterministic versions of these ops, the determinism tests (the ones that actually check determinism) should also be run on the XLA functionality.
Dense Image Warp
`tfa.image.dense_image_warp`'s backprop to `image` (but not `flow`) is nondeterministic on a GPU because it uses `tf.gather`. Therefore, after this PR, enabling determinism should cause `tfa.image.dense_image_warp` to throw an exception up from `tf.math.unsorted_segment_sum` if its backprop is run on a GPU. This behavior is not confirmed by the tests included with this current PR.