Conversation

@wenscarl
Contributor

@wenscarl wenscarl commented Mar 12, 2021

Add GPU determinism for fp types in GPU SparseTensorDenseMatMul. The supported types are fp32, fp16, and complex64. For the unsupported types, fp64 and complex128, an UnimplementedError will be thrown if the environment variable TF_DETERMINISTIC_OPS is set to "1" or "true".
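
For illustration, here is a minimal sketch of what a caller would see with the environment variable set (the shapes, values, and printed wording are placeholders for this example, not text from the PR):

import os
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # must be set before the op executes

import numpy as np
import tensorflow as tf

# fp64 is one of the types for which this PR does not add a deterministic
# GPU kernel, so the op is expected to raise when placed on a GPU.
sp = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 2]],
    values=np.array([1.0, 2.0], dtype=np.float64),
    dense_shape=[2, 3])
dense = tf.constant(np.ones((3, 4), dtype=np.float64))

try:
  with tf.device("/GPU:0"):
    print(tf.sparse.sparse_dense_matmul(sp, dense))
except tf.errors.UnimplementedError as e:
  print("No deterministic GPU kernel for this dtype:", e)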

@google-ml-butler google-ml-butler bot added the size:XL CL Change Size:Extra Large label Mar 12, 2021
@google-cla

google-cla bot commented Mar 12, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label Mar 12, 2021
@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from 8541446 to bf5ce24 Compare March 12, 2021 04:25

@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from bf5ce24 to b2bffac Compare March 12, 2021 04:29

@gbaned gbaned self-assigned this Mar 12, 2021
@wenscarl wenscarl changed the title [determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul WIP:[determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul Mar 12, 2021
@duncanriach
Contributor

Typo in PR title: determinisim -> determinism

@duncanriach
Contributor

duncanriach commented Mar 12, 2021

@gbaned: please let me review this PR first. Then we'll move it out of WIP/draft and then it can be further reviewed, if considered necessary.

Tagging @sanjoy for visibility.

Contributor

@duncanriach duncanriach left a comment

First-pass review completed. I'm happy to chat with you, one-on-one, about any of this.

After you have addressed these comments, I'll review again, looking deeper.

with self.session(force_gpu=True):

if data_type in unimplemented_types:
with self.assertRaises(errors.UnimplementedError):
Contributor

Please use self.assertRaisesRegex to check the exception message.
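
For example, something along these lines (a sketch; the exact message text is illustrative):

with self.assertRaisesRegex(
    errors.UnimplementedError,
    "No deterministic GPU implementation of sparse_dense_matmul"):
  result = sparse_ops.sparse_tensor_dense_matmul(sparse_input, dense_input)
  self.evaluate(result)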

return _timeit


def sparse_tensor_dense_vs_dense_matmul_benchmark(thresh,
Contributor

I think that this function, and the previous two private functions that it calls, should be left in their original location in sparse_tensor_dense_matmul_op_test.py. They should be kept with the code that calls them, to minimize complexity. I don't see any reason to put them in the base file and make them available to the deterministic tests.

@@ -1,4 +1,4 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Contributor

Change this to 2015-2021 after moving the benchmark code back in here.

/*default_val=*/false,
&deterministic_ops));
if (deterministic_ops) {
OP_REQUIRES(
Contributor

This check should not be run here, because it will run on either a CPU or a GPU. You need to find a place in the codepath (in the CUDA kernel launch code, and not inside the CUDA kernel) where the code will definitely only be running on a GPU.

It seems as though you've added the integral type check to prevent exceptions being thrown on CPU. This will not cover the case where the op is forced to run on CPU, and the type is float64 or complex128, which presumably do operate deterministically on the CPU. Therefore this code will throw incorrect exceptions on CPU.

Making this check GPU-only will simplify it and reduce complexity.

Contributor Author

It's a TF standard to put OP_REQUIRES in the *.cc file instead of the *.cu.cc file; otherwise a compile error occurs. The reason is that functor::compute in *.cu.cc returns a Status, but compute in *.cc returns void. If there is an OP_REQUIRES branch in functor::compute, a return value is mandatory.

Contributor

@duncanriach duncanriach Mar 12, 2021

It is possible, depending on the way the op is coded, to use OP_REQUIRES and OP_REQUIRES_ASYNC in the CUDA kernel launch code in the *.cu.cc file, and some ops do this. It's also possible to add the error to the context, as some other ops do. If, as in this particular op, the signature of the per-device implementation functor returns Status, then it suggests that we need to work with that pattern (set an error as well as, or via, returning a Status). The objectives have to be met while working within reasonable constraints, but we should not default to introducing bugs, reducing performance significantly, or introducing unnecessary complexity because one apparent pattern seems to conflict with another apparent pattern. It's sometimes necessary to wrestle a little bit longer with this kind of thing and/or seek consultation in order to meet the requirements of what needs to be implemented.

Contributor

@duncanriach duncanriach Mar 12, 2021

One part-way solution could be to put the OP_REQUIRES into the existing if (std::is_same<Device, GPUDevice>::value) block in the *.cc file; this would simplify and accelerate. The RequireDeterminism function could be defined in the *.cu.cc file, declared in a shared header, and reused. However, I think it's better (simpler/faster) to check for RequireDeterminism() once, in one place (in the *.cu.cc file), assuming a good way to do that can be figured out.

Contributor

@duncanriach duncanriach Mar 12, 2021

It looks like maybe we just have to return errors::Unimplemented(...).

Looking at, for example, barrier_ops.cc, which contains functions that return Status and throw exceptions.

Contributor

@duncanriach duncanriach Mar 12, 2021

When the status is passed up the stack like that, a higher level handles it (with OP_REQUIRES_OK), putting it into the context. If it's not passed up (as in functions that return void) the OP_REQUIRES puts it into the context.

Contributor

In summary: we either throw the exception or we pass it up via returned Status.

Contributor

@duncanriach duncanriach Mar 12, 2021

Follow-up from phone conversation with @wenscarl: I had missed (the context) that this code was already inside the if (std::is_same<Device, GPUDevice>::value) block, so the "part-way" solution that I mentioned above is how it already was.

errors::InvalidArgument(
"Cannot use GPU when output.shape[1] * nnz(a) > 2^31"));
bool deterministic_ops = false;
TF_CHECK_OK(ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
Contributor

This should not be checked every time the op compute runs. Please follow the existing pattern by caching the environment variable in a static variable inside a function.

Given that this check will/should move down into the _gpu.cu.cc code, I don't think you'll need to do it twice; you can do the exception check/throw in the same place as choosing the deterministic functionality. So this pattern here should be eliminated.

EIGEN_ALWAYS_INLINE EIGEN_DEVICE_FUNC bool DeterminismTypesCheck() {
return TF_PREDICT_TRUE(
std::is_integral<T>::value ||
(!std::is_same<T, double>::value && !std::is_same<T, complex128>::value));
Contributor

The integral check should be eliminated.

In the kernel-launching code, in the RequireDeterminism() == true branch, there are two options: (1) if std::is_same<T, double>::value || std::is_same<T, complex128>::value then throw the exception, else (2) do the deterministic kernel launch.

That should not need to be put in a function or inlined, and I think it's going to be really easy to understand.

@wenscarl
Contributor Author

@googlebot I signed it!

@google-cla google-cla bot added cla: yes and removed cla: no labels Mar 12, 2021
Contributor

@duncanriach duncanriach left a comment

I have completed a second review pass. Please make changes and then I will review again, possibly going deeper.


template <typename Tsrc, typename Tdst>
__global__ void DownCast(
const int size, const Tsrc* src, Tdst* __restrict__ dst) {
Contributor

It's not clear to me whether the compiler's default assumption of pointer aliasing could slow down the operation in these DownCast functions (but it's better to be safe and to let it know that there is no pointer aliasing). However, to avoid pointer aliasing, doesn't __restrict__ have to be applied to both src and dst? I just reviewed this blog post (again).


with self.assertRaisesRegex(
errors.UnimplementedError,
"No deterministic GPU implementation of *"):
Contributor

Do you have a reason for not checking that all the words/characters in the message are correct?

To me, it seems better to check it completely. If the message changes, either intentionally or accidentally, this test will then fail, and it will bring conscious attention to this being tested.

Note that self.assertRaisesRegex can test against a regular expression (as you have used it), but it can also be used to compare against a string literal (a type of non-variable regular expression).
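
For example, to require that the full message match exactly (a sketch; the message is the one used elsewhere in this PR, and re.escape guards against regex metacharacters such as the dots in the type names):

import re

expected_message = (
    "No deterministic GPU implementation of sparse_dense_matmul "
    "available for data of type tf.float64 or tf.complex128")
with self.assertRaisesRegex(errors.UnimplementedError,
                            re.escape(expected_message)):
  self.evaluate(sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                      dense_input))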

required_determinisim = \
True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False

unimplemented_types = (np.float64, np.complex128)
Contributor

I think that this variable name should be more explicit because in the context of this test code, "unimplemented types" means that matmul is not implemented at all for those types, which is not what this variable represents. What this variable is capturing is the types for which GPU-determinism is not implemented. Therefore, an unambiguous name for it would be gpu_determinism_unimplemented_types, otherwise its meaning can only be discerned by carefully examining how it's later used, and, ideally, we want to be able to read and understand code without having to decipher variable meaning by mentally simulating variable use.

Variables should be named such that, in the contexts in which they appear, the meaning of their contents is unambiguous.

Comment on lines 265 to 318
def _sparse_tensor_dense_vs_dense_matmul_benchmark_dense(x, y, adjoint_a,
                                                         adjoint_b):

  def body(t, prev):
    with ops.control_dependencies([prev]):
      return (t + 1, math_ops.matmul(
          x,
          y,
          transpose_a=adjoint_a,
          transpose_b=adjoint_b,
          a_is_sparse=True,
          b_is_sparse=False))

  t0 = constant_op.constant(0)
  v0 = constant_op.constant(0.0)

  def _timeit(iterations, _):
    (_, final) = control_flow_ops.while_loop(
        lambda t, _: t < iterations,
        body, (t0, v0),
        parallel_iterations=1,
        back_prop=False,
        shape_invariants=(tensor_shape.TensorShape(()),
                          tensor_shape.TensorShape(None)))
    return [final]

  return _timeit


def _sparse_tensor_dense_vs_dense_matmul_benchmark_sparse(x_ind, x_val, x_shape,
                                                          y, adjoint_a,
                                                          adjoint_b):
  sp_x = sparse_tensor.SparseTensor(
      indices=x_ind, values=x_val, dense_shape=x_shape)

  def body(t, prev):
    with ops.control_dependencies([prev]):
      return (t + 1, sparse_ops.sparse_tensor_dense_matmul(
          sp_x, y, adjoint_a=adjoint_a, adjoint_b=adjoint_b))

  t0 = constant_op.constant(0)
  v0 = constant_op.constant(0.0)

  def _timeit(iterations, _):
    (_, final) = control_flow_ops.while_loop(
        lambda t, _: t < iterations,
        body, (t0, v0),
        parallel_iterations=1,
        back_prop=False,
        shape_invariants=(tensor_shape.TensorShape(()),
                          tensor_shape.TensorShape(None)))
    return [final]

  return _timeit
Contributor

Please refer to this comment in my previous review cycle. These two functions are essentially subroutines of the benchmark function, which is not referenced in the base file or the deterministic test file at all. These functions should be returned to their original locations in sparse_tensor_dense_matmul_op_test.py.

Comment on lines 72 to 73
required_determinisim = \
True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False
Contributor

@duncanriach duncanriach Mar 16, 2021

Please refer to section 3.2 (Line Length) of the Google Python Style Guide: "Do not use backslash line continuation except for with statements requiring three or more context managers. Make use of Python’s implicit line joining inside parentheses, brackets and braces. If necessary, you can add an extra pair of parentheses around an expression."

There is another instance of incorrect use of backslash for line continuation in the code below.

Also, this variable name includes an incorrect spelling of "determinism." Please call it determinism_required.
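
For example, either of the following avoids the backslash (a sketch using the corrected name; the redundant True if ... else False is also dropped, since the in test already yields a bool):

determinism_required = os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True')

# or, if wrapping is needed, rely on implicit line joining inside parentheses:
determinism_required = (
    os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True'))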

Comment on lines 46 to 48
GPU_1D_KERNEL_LOOP(index, size) {
dst[index] = static_cast<Tdst>(src[index]);
}
Contributor

I think this code should be indented with multiples of two (not four) spaces: outer=2, inner=4.

Comment on lines 150 to 151
DataTypeToEnum<Tupcast>::value,
TensorShape({out.dimension(0), out.dimension(1)}), &temp_out_t));
Contributor

This block of code should be indented with four spaces, not five.

Comment on lines 160 to 168
SparseTensorDenseMatMulKernel<T, Tupcast, Tindices, ADJ_A, ADJ_B>,
config.block_count, config.thread_per_block, 0, d.stream(), nnz, m,
b_rows, b_cols, p, a_indices.data(), a_values.data(), b.data(),
maybe_temp_out_data));
Contributor

This block of code should be indented with four spaces, not three.

Contributor

Please will you review all the code that you're adding and make sure that it follows the indentation rules. Each level of statement indent should add two spaces. Starting a list of parameters on a new line [ending the previous line with a "("] should indent four spaces.

Contributor

It might seem pedantic, but it's important to generate code in a standard way to make it easy to read and maintain. Search for, and review, all 30 instances of the word "indent" in the Google C++ Style Guide.

Comment on lines 3229 to 3235
cuda_py_test(
    name = "sparse_tensor_dense_matmul_op_deterministic_test",
    size = "small",
    srcs = ["sparse_tensor_dense_matmul_op_deterministic_test.py"],
    deps = [
        ":sparse_tensor_dense_matmul_op_base",
    ],
)

This comment was marked as resolved.

Comment on lines 2 to 8
/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Contributor

I don't think these three blank lines should be removed from the copyright header. They're part of the standard copyright header format/structure.

Comment on lines +107 to +123
bool RequireDeterminism() {
  static bool require_determinism = [] {
    bool deterministic_ops = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
                                               /*default_val=*/false,
                                               &deterministic_ops));
    return deterministic_ops;
  }();
  return require_determinism;
}

This comment was marked as resolved.

Contributor

@duncanriach duncanriach left a comment

Third review pass completed. Please review my comments and revise. Let me know when you're done by requesting a re-review. I'm happy to discuss any of this with you.

}

template <typename Tout, typename Tin>
__device__ void reduction(Tout* out_location, Tin val){
Contributor

There should always be a space between the ) and the {. See here. I don't know if these kinds of formatting issues are currently auto-fixed, break a CI lint check, or just slip through, but I think it's worth fixing them up-front. Please check that this formatting error is not repeated elsewhere in your PR. I see it in the next function definition, for example.

Contributor

Note to self: this has been addressed.

name = "sparse_tensor_dense_matmul_op_test",
size = "medium",
srcs = ["sparse_tensor_dense_matmul_op_test.py"],
tfrt_enabled = True,
Contributor

Nice catch. I'm wondering how this got into the PR in the first place.

}

template<>
__device__ void reduction(complex128* out_location, complex64 val){
Contributor

Function names should start with a capital letter; see here.

Reduction is a relatively generic and non-descriptive noun. What does this function actually do? It re-casts and then adds. Following the existing name, how about GpuRecastAtomicAdd? This function recast-adds.

Contributor

Note to self: this has been addressed.

GpuAtomicAdd(out_location,
static_cast<Tsum>(a_value) * static_cast<Tsum>(b_value));

reduction<Tsum, T>(out_location, a_value * b_value);
Contributor

I notice that you haven't replaced the call to GpuAtomicAdd earlier in this function. Do you have a reason for or against that?

Contributor Author

The earlier GpuAtomicAdd doesn't need to do the recast because the template argument passed in is already Tsum.

Contributor

I see and understand that now. Thank you.

// Note: The reinterpret cast is only required to avoid a compilation
// error; it is only used if Tsum == T.
maybe_temp_out_data = reinterpret_cast<Tsum*>(out.data());
out.device(d) = out.constant(T(0));
Contributor

I wonder why this code is using this slower form of zeroing. Perhaps this should be made to run faster (?) by using SetZero (as in the other branch of the conditional).

Contributor Author

@wenscarl wenscarl Mar 24, 2021

Maybe just nobody caught that before. Fixed it.

Contributor

Note to self: this has now been addressed.

maybe_temp_out_data));

if (sum_type_is_different) {
out.device(d) = temp_out_t.matrix<Tsum>().template cast<T>();
Contributor

Isn't this line of code basically the same as our DownCast function, but possibly slower?

Isn't the only difference between the deterministic and nondeterministic codepaths in the choice of Tsum?

Could this code be re-written more simply to run exactly the same way for both regular and deterministic operation except that for regular operation, Tsum = typename SumType<T>::type while for deterministic operation, Tsum = typename SumType<T>::type_for_determinism?

Maybe I'm missing some other difference between the two code paths.

Contributor Author

This echoes my earlier thought. This line of code is designed to handle fp16 in particular. For complex types, temp_out_t.matrix<Tsum>().template cast<T>() cannot handle the cast properly and has to go through our DownCast kernel. But we can indeed unify the definition of Tsum.

Contributor

@duncanriach duncanriach Mar 24, 2021

Right. The DownCast kernel will perform the same functionality, and more, and with higher performance (we believe). My question is why not replace this line with a DownCast launch.

The larger point here is that both the regular and deterministic paths in SparseTensorDenseMatMulFunctor do the same thing. The only difference is the choice of Tsum. I'm wondering why these two codepaths cannot be merged into one codepath. As it stands, it looks like the same codepath replicated.

Merging the code paths requires some careful thought, of course, but it seems doable.

Contributor

The regular codepath handles Tsum == T, which is fine. This goes beyond what the deterministic codepath needs. In the implemented deterministic modes, Tsum != T, and the regular codepath handles that case too. It seems that the regular codepath can be augmented very slightly to support the deterministic functionality as well. Let me know if you would like to discuss and I can go through this with you.

Comment on lines 74 to 75
if (len(gpus) > 0 and determinism_required and
x.dtype in (np.float64, np.complex128)):
Contributor

To limit the amount of determinism-related stuff in this base test, you could replace len(gpus) > 0 and determinism_required with self.__class__ == SparseTensorDenseMatmulDeterministicTest. This will remove the requirement for the three lines of code above that set determinism_required and will make this base-test code agnostic to how determinism is enabled (or if it's enabled at all).

Also, (np.float64, np.complex128) can be replaced with self._getGpuDeterminismUnimplementedTypes(), a private function defined only in the SparseTensorDenseMatmulDeterministicTest class, which returns (np.float64, np.complex128) and can also be used by (called from) testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This way the list of unimplemented types will only be defined once, in one place.
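
A rough sketch of the structure being suggested (the class and method names follow this comment; the body of the base test is simplified and only illustrative):

class SparseTensorDenseMatMulTestBase(test.TestCase):

  def _testMatmul(self, x, y):
    if (self.__class__ == SparseTensorDenseMatmulDeterministicTest and
        x.dtype in self._getGpuDeterminismUnimplementedTypes()):
      # Expect errors.UnimplementedError instead of a numerical result.
      ...
    else:
      ...


class SparseTensorDenseMatmulDeterministicTest(SparseTensorDenseMatMulTestBase):

  def _getGpuDeterminismUnimplementedTypes(self):
    return (np.float64, np.complex128)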

Contributor Author

Really good suggestion.

Contributor

Thanks.

Comment on lines 79 to 80
"No deterministic GPU implementation of sparse_dense_matmul "
"available for data of type tf.float64 or tf.complex128"):
Contributor

This string could be returned by self._getGpuDeterminismUnimplementedErrorString(), a private function defined only in the SparseTensorDenseMatmulDeterministicTest class, which returns the string and can also be used by (and called from) testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This will result in the string only being defined once, in one place.

Contributor

Note to self: this has now been addressed.

Comment on lines 68 to 98
gpu_determinism_implemented_types = (np.float16, np.float32, np.complex64)

for data_type in gpu_determinism_implemented_types:
  sparse_input, dense_input = self._gen_data(
      m=2430, k=615, n=857, nnz=(1<<16)+243, row_occupied_rate=0.02,
      data_type=data_type)

  repeat_count = 5
  with self.session(force_gpu=True):
    if data_type in (np.float64, np.complex128):
      with self.assertRaisesRegex(
          errors.UnimplementedError,
          "No deterministic GPU implementation of sparse_dense_matmul "
          "available for data of type tf.float64 or tf.complex128"):
        result_ = sparse_ops.sparse_tensor_dense_matmul(
            sparse_input, dense_input)
        self.evaluate(result_)
    else:
      result_a = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                       dense_input)
      for _ in range(repeat_count):
        result_b = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                         dense_input)
        self.assertAllEqual(result_a, result_b)
Contributor

This code will never test the exception cases. We want to test implemented and unimplemented types.

Contributor

Note to self: this has now been addressed.

@duncanriach
Contributor

duncanriach commented Mar 24, 2021

@wenscarl: This is feedback about your commit messages. The following three commits do much more than what the associated messages suggest:

The first line of a commit message should summarize what the commit does, not list one of the many things it does. In cases like this, where each commit is addressing a review step, I recommend,

[determinism] Address review, step n, on PR 47749,

replacing n with the review step that is being addressed by the commit. We're currently on review step 4 for this PR; your next commit will be for step 4.

These commit messages should be meaningful so that the list of commits in the PR make sense and also so that the commits make sense in the commit log of the master branch after the merge (assuming the commits don't get squashed).

sp_x_value = sparse_tensor.SparseTensorValue(
indices=x_indices, values=x_values, dense_shape=x_shape)

if (self.__class__ != SparseTensorDenseMatMulTestBase and
Contributor

Nice solution to the problem of the SparseTensorDenseMatmulDeterministicTest not being available in this scope.

This solution is a little brittle, however, in that it relies on SparseTensorDenseMatmulTest not inheriting from SparseTensorDenseMatMulTestBase (as SparseTensorDenseMatmulDeterministicTest does) but instead being a direct reference to SparseTensorDenseMatMulTestBase.

An alternative way of checking if this code is running inside an instance of the SparseTensorDenseMatmulDeterministicTest class, without having access to its token, is

self.__class__.__name__ == "SparseTensorDenseMatmulDeterministicTest"

Another option is to test for duck type:

callable(getattr(self, "_getGpuDeterminismUnimplementedTypes", None))

Comment on lines 79 to 80
"No deterministic GPU implementation of sparse_dense_matmul "
"available for data of type tf.float64 or tf.complex128"):
Contributor

Note to self: this has now been addressed.

Contributor

@duncanriach duncanriach left a comment

Review step 4. There's also still a typo in the PR title; see this earlier comment. We're also still discussing some items from the previous review step.

@wenscarl wenscarl changed the title WIP:[determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul WIP:[determinism] Add GPU determinism for fp types in GPU SparseTensorDenseMatMul Mar 24, 2021
@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from 8f53dfc to d11a6f8 Compare March 25, 2021 00:02
@duncanriach
Contributor

duncanriach commented Mar 25, 2021

@wenscarl, there is no review step 0. I reviewed it four times: 1, 2, 3, 4. The first time I reviewed this PR was step 1, or pass 1, or round 1.

768a1a6 addresses review (step 1) - there is no step zero in the review.
9ec0be6 fixes a typo - your original commit message made sense
1f6f4b6 addresses review (step 2)
a1398b2 addresses review (step 3)
d11a6f8 addresses review (step 4) but it also, arguably, changes the code in a way that could be summarized in one line.

@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from d11a6f8 to 94cac89 Compare March 25, 2021 01:58
@gbaned gbaned requested a review from sanjoy April 1, 2021 02:35
@gbaned gbaned removed kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 1, 2021
@gbaned
Contributor

gbaned commented Apr 7, 2021

@wenscarl Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Apr 7, 2021
@wenscarl
Contributor Author

wenscarl commented Apr 14, 2021

@wenscarl Can you please resolve conflicts? Thanks!

@gbaned The conflict is resolved. Please move forward. Thanks!

@gbaned gbaned removed the stat:awaiting response Status - Awaiting response from author label Apr 21, 2021
@gbaned gbaned requested a review from reedwm April 21, 2021 18:56
@gbaned gbaned added the awaiting review Pull request awaiting review label Apr 21, 2021
@reedwm
Contributor

reedwm commented Apr 23, 2021

I only skimmed this PR so far, but it appears it achieves determinism by summing float16 and float32 values with a float64 accumulator, using AtomicAdd. But I think this is still nondeterministic in certain cases. For example, suppose you sum the following float32 numbers but with a float64 accumulator: 1 + -1 + 2**-60. If you compute (1 + -1) + 2**-60, you get 2**-60, but if you compute 1 + (-1 + 2**-60), you get 0.
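
This is easy to reproduce outside the op (a standalone numpy sketch of the example above):

import numpy as np

a, b, c = np.float32(1.0), np.float32(-1.0), np.float32(2.0 ** -60)

# Same float32 inputs, float64 accumulator, two different association orders:
print((np.float64(a) + np.float64(b)) + np.float64(c))  # 8.673617379884035e-19, i.e. 2**-60
print(np.float64(a) + (np.float64(b) + np.float64(c)))  # 0.0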

I'm trying to think of a way to resolve this, but I haven't thought of anything yet. @wenscarl @duncanriach @sanjoy any ideas?

@ebrevdo
Contributor

ebrevdo commented Apr 23, 2021 via email

@ebrevdo
Contributor

ebrevdo commented Apr 23, 2021 via email

@reedwm
Contributor

reedwm commented Apr 23, 2021

That's a good suggestion, but unfortunately even Kahan summation can cause nondeterminism :(

In particular, consider summing 2**60 + 1 + 2**-60 - 1 - 2**60. The true result is 2**-60, which can be obtained with float32 arithmetic by rearranging the terms. But summing without rearranging the terms, even with float64 accumulation and Kahan summation, results in 0. The issue is that Kahan summation only stores two floating-point accumulators to represent the sum, but there are three numbers with very different magnitudes in the summation.
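
A standalone numpy sketch of this counterexample (Kahan summation with a float64 accumulator, applied to the float32 terms in the order given):

import numpy as np

def kahan_sum(values):
  s = np.float64(0.0)
  c = np.float64(0.0)  # running compensation for lost low-order bits
  for v in values:
    y = np.float64(v) - c
    t = s + y
    c = (t - s) - y
    s = t
  return s

terms = [np.float32(x) for x in (2.0**60, 1.0, 2.0**-60, -1.0, -(2.0**60))]
print(kahan_sum(terms))  # 0.0 -- the 2**-60 contribution is lost

# Reordered so that the large terms cancel first, plain float32 summation
# already recovers the true result:
print(np.float32(2.0**60) + np.float32(-2.0**60) + np.float32(1.0) +
      np.float32(-1.0) + np.float32(2.0**-60))  # 8.673617e-19, i.e. 2**-60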

@sanjoy sanjoy removed their request for review April 26, 2021 03:39
@duncanriach
Contributor

duncanriach commented Apr 26, 2021

@reedwm. This is a really good, and valid, challenge to this approach. Thank you. I didn't think of this when I challenged it, and then studied it, on PR 39751.

I hope we can find a solution.

@duncanriach
Contributor

@ebrevdo, thanks for the suggestion. I had to look up Kahan summation. To extend what @reedwm said, the challenge is not necessarily to minimize error (which is what Kahan summation attempts to do) but to get the same error no matter what the order of operations is. It seems to me that it would be tricky and/or very costly to parallelize Kahan summation in any case. In its serial form, Kahan summation would be deterministic, but so would normal floating-point summation.

BTW, @wenscarl and I worked on other approaches to making this op deterministic, approaches that ensure reproducible ordering of operations, but we found those to be much slower (after a certain amount of time spent tuning them), and harder to make deadlock-safe, than this one. Hopefully we can make this one work. However, it does seem to have a fundamental flaw.

@reedwm
Contributor

reedwm commented Apr 26, 2021

I tried thinking of a solution, but unfortunately have not been able to think of any way to make this op deterministic on the GPU. What approaches have you and @wenscarl come up with?

The only solution I know of would be to run the op on the CPU when determinism is enabled, but this might have significant performance impacts and unfortunately would also mean we could not use most of this PR (we could keep the tests, modifying them to run only on the CPU).

@duncanriach
Contributor

duncanriach commented Apr 26, 2021

The other solutions are variations on the following algorithm, which @wenscarl came up with:

Within each thread-block, each thread writes its matmul result into shared memory. Then a bitonic sort (using all threads in the block) deterministically sorts the results into groups with matching output addresses. Then the right-most thread (the accumulator thread) in each group serially (and therefore deterministically) reduces the results for the group and uses CUDA atomicAdd to write the result out to memory. All the accumulator threads in the same thread-block will be, by definition, writing to different addresses and therefore there is no concern about nondeterministic ordering of those atomic additions, at least from within the same thread-block. On the other hand, multiple thread-blocks may be writing to the same address using CUDA atomicAdd. To ensure a deterministic result in this case, the accumulator threads wait for their thread-block’s turn to write to memory using a GPU memory semaphore based on the thread-block ID.

Variations include inter-thread-block synchronization mechanism details, whether to use atomics, and different approaches to intra-thread-block reduction.

@duncanriach
Contributor

duncanriach commented Apr 26, 2021

Note that the approach that is currently in this PR, the approach that seems to be fatally flawed, was developed when the op only supported tf.float32 on GPU. The addition of support for tf.float64 and tf.complex128 on GPU meant that this approach could not be extended to all GPU data types. An advantage of switching to another approach is that it would almost certainly extend to all data types supported on GPU.

Switching to a different approach will likely take significant extra development time, even if it's one that we've already explored.

It seems to me that defaulting to running an op on CPU is an alternative approach to throwing exceptions; I'm wondering if it's a better approach (there are pros and cons either way). I think that, for now, we should stick with the plan of throwing exceptions (rather than running on CPU) and consider revisiting the plan.

@duncanriach
Contributor

Thoughts on the following plan, @reedwm?

Leave this PR open for a few days, maybe a week, to see if a solution bubbles up for the apparently unsolvable problem with this approach.

If no way of recovering this approach arises, then we close this PR to memorialize the dead-end without painting over it.

Then either I or @wenscarl will open either:

  1. a PR to add throwing of GPU determinism-unimplemented exceptions for this op or
  2. a PR to add a GPU-deterministic implementation for this op via another approach.

Or possibly (1) followed by (2).

@reedwm
Contributor

reedwm commented Apr 27, 2021

That sounds good to me. Agreed we should not go with running on the CPU for now; we can reconsider in the future.

@benbarsdell
Contributor

In terms of an alternative approach, it might be possible to recast this op as something that looks a bit like a weighted sparse segment reduction. If so, the deterministic GPU kernel in #47974 might be useful. (Actually I already have a follow-up PR to that one that adds weights in order to support the Grad implementation).

@duncanriach
Contributor

duncanriach commented May 6, 2021

Okay, @reedwm. Please will you close this PR. The next step (see the plan above) is (for me) to implement exception-throwing for GPU kernels.

@reedwm reedwm closed this May 6, 2021
@reedwm
Contributor

reedwm commented May 6, 2021

Are the tests from this PR still usable? If the CPU version is deterministic, the tests can be added now, to be run on the CPU only. Otherwise, they can be added once either the CPU or GPU version has a deterministic version.

@duncanriach
Contributor

duncanriach commented May 7, 2021

The CPU version is deterministic, although this PR did not run the tests on CPU (because it was focused on providing GPU determinism). The PR that adds the exception-throwing for this op will also reuse the determinism tests from this PR, running them on the CPU, as well as the exception tests from this PR.
