[determinism] Add GPU determinism for fp types in GPU SparseTensorDenseMatMul #47749
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here. What to do if you already signed the CLA: see the instructions for individual signers and corporate signers. ℹ️ Googlers: go here for more info.
Force-pushed from 8541446 to bf5ce24.
Force-pushed from bf5ce24 to b2bffac.
Typo in PR title: determinisim -> determinism
First-pass review completed. I'm happy to chat with you, one-on-one, about any of this.
After you have addressed these comments, I'll review again, looking deeper.
    with self.session(force_gpu=True):
      if data_type in unimplemented_types:
        with self.assertRaises(errors.UnimplementedError):
Please use self.assertRaisesRegex to check the exception message.
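As a rough sketch of the suggested pattern (assuming the test file's existing imports and the error message used elsewhere in this PR):

```python
# Sketch: assert on the exception type and on the message text, not just the type.
with self.assertRaisesRegex(
    errors.UnimplementedError,
    "No deterministic GPU implementation of sparse_dense_matmul "
    "available for data of type tf.float64 or tf.complex128"):
  result = sparse_ops.sparse_tensor_dense_matmul(sparse_input, dense_input)
  self.evaluate(result)
```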
      return _timeit


    def sparse_tensor_dense_vs_dense_matmul_benchmark(thresh,
I think that this function, and the previous two private functions that it calls, should be left in their original location in sparse_tensor_dense_matmul_op_test.py. They should be kept with the code that calls them, to minimize complexity. I don't see any reason to put them in the base file and make them available to the deterministic tests.
    @@ -1,4 +1,4 @@
    - # Copyright 2015 The TensorFlow Authors. All Rights Reserved.
    + # Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Change this to 2015-2021 after moving the benchmark code back in here.
        /*default_val=*/false,
        &deterministic_ops));
    if (deterministic_ops) {
      OP_REQUIRES(
This check should not be run here, because it will run on either a CPU or a GPU. You need to find a place in the codepath (in the CUDA kernel launch code, and not inside the CUDA kernel) where the code will definitely only be running on a GPU.
It seems as though you've added the integral type check to prevent exceptions being thrown on CPU. This will not cover the case where the op is forced to run on CPU, and the type is float64 or complex128, which presumably do operate deterministically on the CPU. Therefore this code will throw incorrect exceptions on CPU.
Making this check GPU-only will simplify it and reduce complexity.
It's a TF standard to put OP_REQUIRES in the *.cc file instead of the *.cu.cc file; otherwise a compile error occurs. The reason is that functor::compute in *.cu.cc returns a Status, but compute in *.cc returns void. If there is an OP_REQUIRES branch in functor::compute, a return value is mandatory.
It is possible, depending on the way the op is coded, to use OP_REQUIRES and OP_REQUIRES_ASYNC in the CUDA kernel launch code in the *.cu.cc file, and some ops do this. It's also possible to add the error to the context, as some other ops do. If, as in this particular op, the signature of the per-device implementation functor returns Status, then it suggests that we need to work with that pattern (set an error as well as, or via, returning a Status). The objectives have to be met while working within reasonable constraints, but we should not default to introducing bugs, reducing performance significantly, or introducing unnecessary complexity because one apparent pattern seems to conflict with another apparent pattern. It's sometimes necessary to wrestle a little bit longer with this kind of thing and/or seek consultation in order to meet the requirements of what needs to be implemented.
One part-way solution could be to put the OP_REQUIRES into the existing if (std::is_same<Device, GPUDevice>::value) block in the *.cc file; this would simplify and accelerate. The RequireDeterminism function could be defined in the *.cu.cc file, declared in a shared header, and reused. However, I think it's better (simpler/faster) to check RequireDeterminism() once, in one place (in the *.cu.cc file), assuming a good way to do that can be figured out.
It looks like maybe we just have to return errors::Unimplemented(...).
Looking at, for example, barrier_ops.cc: it contains functions that return Status and also throw exceptions.
When the status is passed up the stack like that, a higher level handles it (with OP_REQUIRES_OK), putting it into the context. If it's not passed up (as in functions that return void) the OP_REQUIRES puts it into the context.
In summary: we either throw the exception or we pass it up via returned Status.
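For illustration, a minimal sketch of the two patterns being discussed (function names and message text here are placeholders, not the PR's actual code):

```cpp
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/lib/core/errors.h"

// Pattern 1: a functor that returns Status passes the error up the stack;
// the caller records it on the context with OP_REQUIRES_OK(ctx, status).
tensorflow::Status LaunchSketch(bool type_supported) {
  if (!type_supported) {
    return tensorflow::errors::Unimplemented(
        "No deterministic GPU implementation available for this type");
  }
  // ... launch the kernel ...
  return tensorflow::Status::OK();
}

// Pattern 2: a void Compute-style function records the error on the
// context directly and returns early via OP_REQUIRES.
void ComputeSketch(tensorflow::OpKernelContext* ctx, bool type_supported) {
  OP_REQUIRES(ctx, type_supported,
              tensorflow::errors::Unimplemented(
                  "No deterministic GPU implementation available for this type"));
  // ... rest of the compute ...
}
```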
        errors::InvalidArgument(
            "Cannot use GPU when output.shape[1] * nnz(a) > 2^31"));
    bool deterministic_ops = false;
    TF_CHECK_OK(ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
This should not be checked every time the op compute runs. Please follow the existing pattern by caching the environment variable in a static variable inside a function.
Given that this check will/should move down into the _gpu.cu.cc code, I don't think you'll need to do it twice; you can do the exception check/throw in the same place as choosing the deterministic functionality. So this pattern here should be eliminated.
    EIGEN_ALWAYS_INLINE EIGEN_DEVICE_FUNC bool DeterminismTypesCheck() {
      return TF_PREDICT_TRUE(
          std::is_integral<T>::value ||
          (!std::is_same<T, double>::value && !std::is_same<T, complex128>::value));
The integral check should be eliminated.
In the kernel-launching code, in the RequireDeterminism() == true branch, there are two options: (1) if std::is_same<T, double>::value || std::is_same<T, complex128>::value then throw the exception, else (2) do the deterministic kernel launch.
That should not need to be put in a function or inlined, and I think it's going to be really easy to understand.
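Sketched out, the branch in the GPU launch code might look roughly like this (RequireDeterminism and the message are from the PR; the surrounding functor is elided and the helper name here is hypothetical):

```cpp
#include <type_traits>
#include "tensorflow/core/framework/numeric_types.h"  // tensorflow::complex128
#include "tensorflow/core/lib/core/errors.h"

bool RequireDeterminism();  // cached TF_DETERMINISTIC_OPS check, defined in the PR

// Hypothetical helper standing in for the PR's Status-returning GPU functor.
template <typename T>
tensorflow::Status LaunchSparseDenseMatmulSketch() {
  if (RequireDeterminism()) {
    // (1) Reject the types with no deterministic GPU kernel ...
    if (std::is_same<T, double>::value ||
        std::is_same<T, tensorflow::complex128>::value) {
      return tensorflow::errors::Unimplemented(
          "No deterministic GPU implementation of sparse_dense_matmul "
          "available for data of type tf.float64 or tf.complex128");
    }
    // (2) ... otherwise do the deterministic kernel launch.
  } else {
    // Existing nondeterministic launch.
  }
  return tensorflow::Status::OK();
}
```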
@googlebot I signed it!
duncanriach left a comment:
I have completed a second review pass. Please make changes and then I will review again, possibly going deeper.
    template <typename Tsrc, typename Tdst>
    __global__ void DownCast(
        const int size, const Tsrc* src, Tdst* __restrict__ dst) {
It's not clear to me whether the compiler's default assumption of pointer aliasing could slow down these DownCast functions (though it's better to be safe and let it know that there is no pointer aliasing). However, to rule out aliasing, doesn't __restrict__ have to be applied to both src and dst? I just reviewed this blog post (again).
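For reference, a sketch of the kernel with __restrict__ on both pointers (same body as the PR's DownCast, assuming TF's GPU_1D_KERNEL_LOOP helper):

```cuda
#include "tensorflow/core/util/gpu_kernel_helper.h"

// Marking both pointers __restrict__ tells the compiler they never alias,
// which can allow more aggressive load/store scheduling.
template <typename Tsrc, typename Tdst>
__global__ void DownCast(const int size, const Tsrc* __restrict__ src,
                         Tdst* __restrict__ dst) {
  GPU_1D_KERNEL_LOOP(index, size) {
    dst[index] = static_cast<Tdst>(src[index]);
  }
}
```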
    with self.assertRaisesRegex(
        errors.UnimplementedError,
        "No deterministic GPU implementation of *"):
Do you have a reason for not checking that all the words/characters in the message are correct?
To me, it seems better to check it completely. If the message changes, either intentionally or accidentally, this test will then fail, and it will bring conscious attention to this being tested.
Note that self.assertRaisesRegex can test against a regular expression (as you have used it), but it can also be used to compare against a string literal (a type of non-variable regular expression).
    required_determinisim = \
        True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False

    unimplemented_types = (np.float64, np.complex128)
I think that this variable name should be more explicit because in the context of this test code, "unimplemented types" means that matmul is not implemented at all for those types, which is not what this variable represents. What this variable is capturing is the types for which GPU-determinism is not implemented. Therefore, an unambiguous name for it would be gpu_determinism_unimplemented_types, otherwise its meaning can only be discerned by carefully examining how it's later used, and, ideally, we want to be able to read and understand code without having to decipher variable meaning by mentally simulating variable use.
Variables should be named such that, in the contexts in which they appear, the meaning of their contents is unambiguous.
    def _sparse_tensor_dense_vs_dense_matmul_benchmark_dense(x, y, adjoint_a,
                                                              adjoint_b):

      def body(t, prev):
        with ops.control_dependencies([prev]):
          return (t + 1, math_ops.matmul(
              x,
              y,
              transpose_a=adjoint_a,
              transpose_b=adjoint_b,
              a_is_sparse=True,
              b_is_sparse=False))

      t0 = constant_op.constant(0)
      v0 = constant_op.constant(0.0)

      def _timeit(iterations, _):
        (_, final) = control_flow_ops.while_loop(
            lambda t, _: t < iterations,
            body, (t0, v0),
            parallel_iterations=1,
            back_prop=False,
            shape_invariants=(tensor_shape.TensorShape(()),
                              tensor_shape.TensorShape(None)))
        return [final]

      return _timeit


    def _sparse_tensor_dense_vs_dense_matmul_benchmark_sparse(x_ind, x_val, x_shape,
                                                               y, adjoint_a,
                                                               adjoint_b):
      sp_x = sparse_tensor.SparseTensor(
          indices=x_ind, values=x_val, dense_shape=x_shape)

      def body(t, prev):
        with ops.control_dependencies([prev]):
          return (t + 1, sparse_ops.sparse_tensor_dense_matmul(
              sp_x, y, adjoint_a=adjoint_a, adjoint_b=adjoint_b))

      t0 = constant_op.constant(0)
      v0 = constant_op.constant(0.0)

      def _timeit(iterations, _):
        (_, final) = control_flow_ops.while_loop(
            lambda t, _: t < iterations,
            body, (t0, v0),
            parallel_iterations=1,
            back_prop=False,
            shape_invariants=(tensor_shape.TensorShape(()),
                              tensor_shape.TensorShape(None)))
        return [final]

      return _timeit
Please refer to this comment in my previous review cycle. These two functions are essentially subroutines of the benchmark function, which is not referenced in the base file or the deterministic test file at all. These functions should be returned to their original locations in sparse_tensor_dense_matmul_op_test.py.
    required_determinisim = \
        True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False
Please refer to section 3.2 (Line Length) of the Google Python Style Guide: "Do not use backslash line continuation except for with statements requiring three or more context managers. Make use of Python’s implicit line joining inside parentheses, brackets and braces. If necessary, you can add an extra pair of parentheses around an expression."
There is another instance of incorrect use of backslash for line continuation in the code below.
Also, this variable name includes an incorrect spelling of "determinism." Please call it determinism_required.
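For example, a sketch of the style-guide-compliant form (with the corrected variable name):

```python
import os

# Implicit line joining inside parentheses instead of a backslash; the
# membership test already yields a bool, so no conditional expression is needed.
determinism_required = (
    os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True'))
```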
    GPU_1D_KERNEL_LOOP(index, size) {
        dst[index] = static_cast<Tdst>(src[index]);
    }
I think this line of code should be indented with multiples of two (not four) spaces: the outer line at two, the inner at four.
         DataTypeToEnum<Tupcast>::value,
         TensorShape({out.dimension(0), out.dimension(1)}), &temp_out_t));
This block of code should be indented with four spaces, not five.
       SparseTensorDenseMatMulKernel<T, Tupcast, Tindices, ADJ_A, ADJ_B>,
       config.block_count, config.thread_per_block, 0, d.stream(), nnz, m,
       b_rows, b_cols, p, a_indices.data(), a_values.data(), b.data(),
       maybe_temp_out_data));
This block of code should be indented with four spaces, not three.
Please review all the code that you're adding and make sure that it follows the indentation rules. Each level of statement indent should add two spaces. Starting a list of parameters on a new line (ending the previous line with a "(") should indent four spaces.
It might seem pedantic, but it's important to generate code in a standard way to make it easy to read and maintain. Search for, and review, all 30 instances of the word "indent" in the Google C++ Style Guide.
    cuda_py_test(
        name = "sparse_tensor_dense_matmul_op_deterministic_test",
        size = "small",
        srcs = ["sparse_tensor_dense_matmul_op_deterministic_test.py"],
        deps = [
            ":sparse_tensor_dense_matmul_op_base",
        ],
    )
    /* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
I don't think these three blank lines should be removed from the copyright header. They're part of the standard copyright header format/structure.
    bool RequireDeterminism() {
      static bool require_determinism = [] {
        bool deterministic_ops = false;
        TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
                                                   /*default_val=*/false,
                                                   &deterministic_ops));
        return deterministic_ops;
      }();
      return require_determinism;
    }
Third review pass completed. Please review my comments and revise. Let me know when you're done by requesting a re-review. I'm happy to discuss any of this with you.
    }

    template <typename Tout, typename Tin>
    __device__ void reduction(Tout* out_location, Tin val){
There should always be a space between the ) and the {. See here. I don't know if these kind of formatting issues are currently auto-fixed, break a CI lint check, or just slip through, but I think it's worth fixing them up-front. Please check that this formatting error is not repeated elsewhere in your PR. I see it in the next function definition, for example.
Note to self: this has been addressed.
tensorflow/python/kernel_tests/BUILD (outdated diff)

        name = "sparse_tensor_dense_matmul_op_test",
        size = "medium",
        srcs = ["sparse_tensor_dense_matmul_op_test.py"],
        tfrt_enabled = True,
Nice catch. I'm wondering how this got into the PR in the first place.
    }

    template<>
    __device__ void reduction(complex128* out_location, complex64 val){
Function names should start with a capital letter; see here.
Reduction is a relatively generic and non-descriptive noun. What does this function actually do? It re-casts and then adds. Following the existing name, how about GpuRecastAtomicAdd? This function recast-adds.
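As an illustration only, the renamed complex specialization might look something like this (a sketch that assumes TF's GpuAtomicAdd overload for double and the standard layout of std::complex as two adjacent components):

```cuda
// Sketch: recast a complex64 product to complex128 and atomically accumulate
// it; real and imaginary parts are added separately because there is no
// native atomic add for complex types.
__device__ void GpuRecastAtomicAdd(complex128* out_location, complex64 val) {
  double* out_parts = reinterpret_cast<double*>(out_location);
  const float* val_parts = reinterpret_cast<const float*>(&val);
  GpuAtomicAdd(out_parts, static_cast<double>(val_parts[0]));      // real
  GpuAtomicAdd(out_parts + 1, static_cast<double>(val_parts[1]));  // imaginary
}
```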
Note to self: this has been addressed.
    GpuAtomicAdd(out_location,
                 static_cast<Tsum>(a_value) * static_cast<Tsum>(b_value));

    reduction<Tsum, T>(out_location, a_value * b_value);
I notice that you haven't replaced the call to GpuAtomicAdd earlier in this function. Do you have a reason for or against that?
The earlier GpuAtomicAdd doesn't need to do the recast because the template passed in is Tsum already.
I see and understand that now. Thank you.
    // Note: The reinterpret cast is only required to avoid a compilation
    // error; it is only used if Tsum == T.
    maybe_temp_out_data = reinterpret_cast<Tsum*>(out.data());
    out.device(d) = out.constant(T(0));
I wonder why this code is using this slower form of zeroing. Perhaps this should be made to run faster (?) by using SetZero (as in the other branch of the conditional).
Probably nobody had caught that before. Fixed it.
Note to self: this has now been addressed.
        maybe_temp_out_data));

    if (sum_type_is_different) {
      out.device(d) = temp_out_t.matrix<Tsum>().template cast<T>();
Isn't this line of code basically the same as our DownCast function, but possibly slower?
Isn't the only difference between the deterministic and nondeterministic codepaths in the choice of Tsum?
Could this code be re-written more simply to run exactly the same way for both regular and deterministic operation except that for regular operation, Tsum = typename SumType<T>::type while for deterministic operation, Tsum = typename SumType<T>::type_for_determinism?
Maybe I'm missing some other difference between the two code paths.
To echo my earlier thought: this line of code is designed to handle fp16 in particular. For the complex types, temp_out_t.matrix<Tsum>().template cast<T>() cannot handle the cast properly, so they have to go through our DownCast kernel. But we can unify the definition of Tsum.
Right. The DownCast kernel will perform the same functionality, and more, and with higher performance (we believe). My question is why not replace this line with a DownCast launch.
The larger point here is that both the regular and deterministic paths in SparseTensorDenseMatMulFunctor do the same thing. The only difference is the choice of Tsum. I'm wondering why these two codepaths cannot be merged into one codepath. As it stands, it looks like the same codepath replicated.
Merging the code paths requires some careful thought, of course, but it seems doable.
The regular codepath handles Tsum == T, which is fine. This goes beyond what the deterministic codepath needs. In the implemented deterministic modes, Tsum != T, and the regular codepath handles that case too. It seems that the regular codepath can be augmented very slightly to support the deterministic functionality as well. Let me know if you would like to discuss and I can go through this with you.
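To make the idea concrete, here is a small self-contained sketch of the kind of trait that would let both modes share one launch path; the type_for_determinism member is hypothetical and not part of the PR:

```cpp
#include <complex>

// Sketch: one trait supplies the accumulator type for both modes, so the
// launch code can be written once and instantiated with a different Tsum.
template <typename T>
struct SumTypeSketch {
  using type = T;                  // regular (nondeterministic) accumulator
  using type_for_determinism = T;  // fallback
};
template <>
struct SumTypeSketch<float> {
  using type = float;
  using type_for_determinism = double;  // accumulate float32 in float64
};
template <>
struct SumTypeSketch<std::complex<float>> {
  using type = std::complex<float>;
  using type_for_determinism = std::complex<double>;
};
// The launch helper would then be templated on Tsum and called with either
// SumTypeSketch<T>::type or SumTypeSketch<T>::type_for_determinism.
```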
    if (len(gpus) > 0 and determinism_required and
        x.dtype in (np.float64, np.complex128)):
To limit the amount of determinism-related stuff in this base test, you could replace len(gpus) > 0 and determinism_required with self.__class__ == SparseTensorDenseMatmulDeterministicTest. This will remove the requirement for the three lines of code above that set determinism_required and will make this base-test code agnostic to how determinism is enabled (or whether it's enabled at all).
Also, (np.float64, np.complex128) can be replaced with self._getGpuDeterminismUnimplementedTypes(), a private function, defined only in the SparseTensorDenseMatmulDeterministicTest class, which returns (np.float64, np.complex128) and can also be called from testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This way, the list of unimplemented types will only be defined once, in one place.
Really good suggestion.
Thanks.
| "No deterministic GPU implementation of sparse_dense_matmul " | ||
| "available for data of type tf.float64 or tf.complex128"): |
This string could be returned by self._getGpuDeterminismUnimplementedErrorString(), a private function, defined only in the SparseTensorDenseMatmulDeterministicTest class that returns the string and can also be used by (and called from) testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This will result in the string only being defined once, in one place.
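Taken together with the previous comment, the deterministic test class might centralize both pieces of information along these lines (method names from the comments above; a sketch that assumes the test file's existing imports):

```python
class SparseTensorDenseMatmulDeterministicTest(SparseTensorDenseMatMulTestBase):
  """Sketch: single source of truth for the unsupported types and message."""

  def _getGpuDeterminismUnimplementedTypes(self):
    # Types for which GPU-determinism is not implemented by this PR.
    return (np.float64, np.complex128)

  def _getGpuDeterminismUnimplementedErrorString(self):
    return ("No deterministic GPU implementation of sparse_dense_matmul "
            "available for data of type tf.float64 or tf.complex128")
```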
Note to self: this has now been addressed.
    gpu_determinism_implemented_types = (np.float16, np.float32, np.complex64)

    for data_type in gpu_determinism_implemented_types:
      sparse_input, dense_input = self._gen_data(
          m=2430, k=615, n=857, nnz=(1<<16)+243, row_occupied_rate=0.02,
          data_type=data_type)

      repeat_count = 5
      with self.session(force_gpu=True):
        if data_type in (np.float64, np.complex128):
          with self.assertRaisesRegex(
              errors.UnimplementedError,
              "No deterministic GPU implementation of sparse_dense_matmul "
              "available for data of type tf.float64 or tf.complex128"):
            result_ = sparse_ops.sparse_tensor_dense_matmul(
                sparse_input, dense_input)
            self.evaluate(result_)
        else:
          result_a = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                           dense_input)
          for _ in range(repeat_count):
            result_b = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                             dense_input)
            self.assertAllEqual(result_a, result_b)
This code will never test the exception cases. We want to test implemented and unimplemented types.
Note to self: this has now been addressed.
@wenscarl: This is feedback about your commit messages. The following three commits do much more than what the associated messages suggest.
The first line of a commit message should summarize what the commit does, not list one of the many things it does. In cases like this, where each commit addresses a review step, I recommend replacing them. Commit messages should be meaningful so that the list of commits in the PR makes sense and also so that the commits make sense in the commit log of the master branch after the merge (assuming the commits don't get squashed).
    sp_x_value = sparse_tensor.SparseTensorValue(
        indices=x_indices, values=x_values, dense_shape=x_shape)

    if (self.__class__ != SparseTensorDenseMatMulTestBase and
Nice solution to the problem of the SparseTensorDenseMatmulDeterministicTest not being available in this scope.
This solution is a little brittle, however, in that it relies on SparseTensorDenseMatmulTest not inheriting from SparseTensorDenseMatMulTestBase (as SparseTensorDenseMatmulDeterministicTest does) but instead being a direct reference to SparseTensorDenseMatMulTestBase.
An alternative way of checking if this code is running inside an instance of the SparseTensorDenseMatmulDeterministicTest class, without having access to its token, is
self.__class__.__name__ == "SparseTensorDenseMatmulDeterministicTest"
Another option is to test for duck type:
callable(getattr(self, "_getGpuDeterminismUnimplementedTypes", None))
| "No deterministic GPU implementation of sparse_dense_matmul " | ||
| "available for data of type tf.float64 or tf.complex128"): |
Note to self: this has now been addressed.
Review step 4. There's also still a typo in the PR title; see this earlier comment. We're also still discussing some items from the previous review step.
Force-pushed from 8f53dfc to d11a6f8.
@wenscarl, there is no review step 0. I reviewed it four times: 1, 2, 3, 4. The first time I reviewed this PR was step 1, or pass 1, or round 1.
Force-pushed from d11a6f8 to 94cac89.
@wenscarl Can you please resolve conflicts? Thanks!
I only skimmed this PR so far, but it appears it achieves determinism by summing float16 and float32 values with a float64 accumulator, using AtomicAdd. But I think this is still nondeterministic in certain cases. For example, suppose you sum the following float32 numbers but with a float64 accumulator: 1 + -1 + 2**-60. If you compute (1 + -1) + 2**-60, you get 2**-60, but if you compute 1 + (-1 + 2**-60), you get 0.
I'm trying to think of a way to resolve this, but I haven't thought of anything yet. @wenscarl @duncanriach @sanjoy any ideas?
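A quick NumPy illustration of the order dependence (not part of the PR, just the arithmetic):

```python
import numpy as np

# float32 inputs accumulated in float64: the result still depends on order.
a, b, c = np.float32(1.0), np.float32(-1.0), np.float32(2.0 ** -60)
left_first = (np.float64(a) + np.float64(b)) + np.float64(c)
right_first = np.float64(a) + (np.float64(b) + np.float64(c))
print(left_first)   # 2**-60 (~8.7e-19)
print(right_first)  # 0.0: -1 + 2**-60 rounds back to -1.0 in float64
```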
Would Kahan summation help?
Related old discussion here: https://forums.developer.nvidia.com/t/atomicadd-kahan-summation/37155/4
That's a good suggestion, but unfortunately even Kahan summation can cause nondeterminism :( In particular, consider summing …
@reedwm: This is a really good, and valid, challenge to this approach. Thank you. I didn't think of this when I challenged it, and then studied it, on PR 39751. I hope we can find a solution.
@ebrevdo, thanks for the suggestion. I had to look up Kahan summation. To extend what @reedwm said, the challenge is not necessarily to minimize error (which is what Kahan summation attempts to do) but to get the same error no matter what the order of operations is. It seems to me that it would be tricky and/or very costly to parallelize Kahan summation in any case. In its serial form, Kahan summation would be deterministic, but so would normal floating-point summation. BTW, @wenscarl and I worked on other approaches to making this op deterministic, approaches that ensure reproducible ordering of operations, but we found those to be much slower (after a certain amount of time spent tuning them), and harder to make deadlock-safe, than this one. Hopefully we can make this one work. However, it does seem to have a fundamental flaw.
I tried thinking of a solution, but unfortunately have not been able to think of any way to make this op deterministic on the GPU. What approaches have you and @wenscarl come up with? The only solution I know of would be to run the op on the CPU when determinism is enabled, but this might have significant performance impacts and unfortunately would also mean we could not use most of this PR (we could keep the tests, modifying them to only run on the CPU).
The other solutions are variations on the following algorithm, which @wenscarl came up with: Within each thread-block, each thread writes its matmul result into shared memory. Then bitonic sort (using all threads in the block) deterministically sorts the results into groups with matching output addresses. Then the right-most thread (the accumulator thread) in each group serially (and therefore deterministically) reduces the results for the group and uses CUDA atomicAdd to write the result out to memory. All the accumulator threads in the same thread-block will be, by definition, writing to different addresses, so there is no concern about nondeterministic ordering of those atomic additions, at least from within the same thread-block. On the other hand, multiple thread-blocks may be writing to the same address using CUDA atomicAdd. To ensure a deterministic result in this case, the accumulator threads wait for their thread-block's turn to write to memory using a GPU memory semaphore based on the thread-block ID. Variations include inter-thread-block synchronization mechanism details, whether to use atomics, and different approaches to intra-thread-block reduction.
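A very stripped-down sketch of the block-ordering part of that idea is below (hypothetical code, not from the PR; it omits the shared-memory sort and, as noted, making the busy-wait deadlock-safe is one of the hard parts):

```cuda
// Accumulator threads take turns by thread-block ID, so atomic adds to the
// same output address always happen in the same order across runs.
__device__ unsigned int g_turn;  // assumed zeroed (e.g. via cudaMemset) before launch

__global__ void OrderedAccumulateSketch(float* out, const float* block_partials) {
  // ... per-block reduction of same-address products would happen here ...
  if (threadIdx.x == 0) {
    float block_sum = block_partials[blockIdx.x];
    // Busy-wait for this block's turn. WARNING: if lower-numbered blocks are
    // not yet resident, naive spinning like this can deadlock.
    while (atomicAdd(&g_turn, 0u) != blockIdx.x) {
    }
    atomicAdd(out, block_sum);  // deterministic: always applied in block order
    __threadfence();            // publish the update before releasing the next block
    atomicAdd(&g_turn, 1u);
  }
}
```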
Note that the approach that is currently in this PR, the approach that seems to be fatally flawed, was developed when the op only supported … Switching to a different approach will likely take significant extra development time, even if it's one that we've already explored. It seems to me that defaulting to running an op on CPU is an alternative approach to throwing exceptions; I'm wondering if it's a better approach (there are pros and cons either way). I think that, for now, we should stick with the plan of throwing exceptions (rather than running on CPU) and consider revisiting the plan.
Thoughts on the following plan, @reedwm? Leave this PR open for a few days, maybe a week, to see if a solution bubbles up for the apparently unsolvable problem with this approach. If no way of recovering this approach arises, then we close this PR to memorialize the dead-end without painting over it. Then either I or @wenscarl will open either:
…
Or possibly (1) followed by (2).
That sounds good to me. Agreed we should not go with running on the CPU for now; we can reconsider in the future.
In terms of an alternative approach, it might be possible to recast this op as something that looks a bit like a weighted sparse segment reduction. If so, the deterministic GPU kernel in #47974 might be useful. (Actually, I already have a follow-up PR to that one that adds weights in order to support the Grad implementation.)
Are the tests from this PR still usable? If the CPU version is deterministic, the tests can be added now, to be run on the CPU only. Otherwise, they can be added once either the CPU or GPU version has a deterministic version.
The CPU version is deterministic, although this PR did not run the tests on CPU (because it was focused on providing GPU-determinism). The PR that adds the exception-throwing for this op will also reuse the determinism tests from this PR and run them on the CPU, as well as the exception tests from this PR.
Add GPU determinism for fp types in GPU SparseTensorDenseMatMul. The supported types are fp32, fp16, and complex64. For non-supported types, including fp64 and complex128, an unimplemented exception will be thrown if the environment variable
TF_DETERMINISTIC_OPS is set to 1 or true.