Conversation

@wenscarl
Contributor

@wenscarl wenscarl commented Mar 12, 2021

Add GPU determinism for fp types in GPU SparseTensorDenseMatMul. The supported types are fp32, fp16, and complex64. For the unsupported types, fp64 and complex128, an UnimplementedError will be thrown if the environment variable TF_DETERMINISTIC_OPS is set to "1" or "true".
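
For illustration, here is a minimal sketch of what a caller would see with the environment variable set (the shapes, values, and printed wording are placeholders for this example, not text from the PR):

import os
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # must be set before the op executes

import numpy as np
import tensorflow as tf

# fp64 is one of the types for which this PR does not add a deterministic
# GPU kernel, so the op is expected to raise when placed on a GPU.
sp = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 2]],
    values=np.array([1.0, 2.0], dtype=np.float64),
    dense_shape=[2, 3])
dense = tf.constant(np.ones((3, 4), dtype=np.float64))

try:
  with tf.device("/GPU:0"):
    print(tf.sparse.sparse_dense_matmul(sp, dense))
except tf.errors.UnimplementedError as e:
  print("No deterministic GPU kernel for this dtype:", e)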

@google-ml-butler google-ml-butler bot added the size:XL CL Change Size:Extra Large label Mar 12, 2021
@google-cla

google-cla bot commented Mar 12, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label Mar 12, 2021
@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from 8541446 to bf5ce24 Compare March 12, 2021 04:25

@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from bf5ce24 to b2bffac Compare March 12, 2021 04:29

@gbaned gbaned self-assigned this Mar 12, 2021
@wenscarl wenscarl changed the title [determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul WIP:[determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul Mar 12, 2021
@duncanriach
Contributor

Typo in PR title: determinisim -> determinism

@duncanriach
Contributor

duncanriach commented Mar 12, 2021

@gbaned: please let me review this PR first. Then we'll move it out of WIP/draft and then it can be further reviewed, if considered necessary.

Tagging @sanjoy for visibility.

Contributor

@duncanriach duncanriach left a comment

First-pass review completed. I'm happy to chat with you, one-on-one, about any of this.

After you have addressed these comments, I'll review again, looking deeper.

with self.session(force_gpu=True):

if data_type in unimplemented_types:
with self.assertRaises(errors.UnimplementedError):
Contributor

Please use self.assertRaisesRegex to check the exception message.
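
For example, something along these lines (a sketch; the exact message text is illustrative):

with self.assertRaisesRegex(
    errors.UnimplementedError,
    "No deterministic GPU implementation of sparse_dense_matmul"):
  result = sparse_ops.sparse_tensor_dense_matmul(sparse_input, dense_input)
  self.evaluate(result)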

return _timeit


def sparse_tensor_dense_vs_dense_matmul_benchmark(thresh,
Contributor

I think that this function, and the previous two private functions that it calls, should be left in their original location in sparse_tensor_dense_matmul_op_test.py. They should be kept with the code that calls them, to minimize complexity. I don't see any reason to put them in the base file and make them available to the deterministic tests.

@@ -1,4 +1,4 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Contributor

Change this to 2015-2021 after moving the benchmark code back in here.

/*default_val=*/false,
&deterministic_ops));
if (deterministic_ops) {
OP_REQUIRES(
Contributor

This check should not be run here, because it will run on either a CPU or a GPU. You need to find a place in the codepath (in the CUDA kernel launch code, and not inside the CUDA kernel) where the code will definitely only be running on a GPU.

It seems as though you've added the integral type check to prevent exceptions being thrown on CPU. This will not cover the case where the op is forced to run on CPU, and the type is float64 or complex128, which presumably do operate deterministically on the CPU. Therefore this code will throw incorrect exceptions on CPU.

Making this check GPU-only will simplify it and reduce complexity.

Contributor Author

It's a TF standard to put OP_REQUIRES in the *.cc file instead of the *.cu.cc file; otherwise a compile error occurs. The reason is that functor::compute in *.cu.cc returns a Status, but compute in *.cc returns void. If there is an OP_REQUIRES branch in functor::compute, a return value is mandatory.

Contributor

@duncanriach duncanriach Mar 12, 2021

It is possible, depending on the way the op is coded, to use OP_REQUIRES and OP_REQUIRES_ASYNC in the CUDA kernel launch code in the *.cu.cc file, and some ops do this. It's also possible to add the error to the context, as some other ops do. If, as in this particular op, the signature of the per-device implementation functor returns Status, then it suggests that we need to work with that pattern (set an error as well as, or via, returning a Status). The objectives have to be met while working within reasonable constraints, but we should not default to introducing bugs, reducing performance significantly, or introducing unnecessary complexity because one apparent pattern seems to conflict with another apparent pattern. It's sometimes necessary to wrestle a little bit longer with this kind of thing and/or seek consultation in order to meet the requirements of what needs to be implemented.

Contributor

@duncanriach duncanriach Mar 12, 2021

One part-way solution could be to put the OP_REQUIRES into the existing if (std::is_same<Device, GPUDevice>::value) block in the *.cc file; this would simplify and accelerate. The RequireDeterminism function could be defined in the *.cu.cc file, declared in a shared header, and reused. However, I think it's better (simpler/faster) to check for RequireDeterminism() once, in one place (in the *.cu.cc file), assuming a good way to do that can be figured out.

Contributor

@duncanriach duncanriach Mar 12, 2021

It looks like maybe we just have to return errors::Unimplemented(...).

Looking at, for example, barrier_ops.cc, which contains functions that return Status and throw exceptions.

Contributor

@duncanriach duncanriach Mar 12, 2021

When the status is passed up the stack like that, a higher level handles it (with OP_REQUIRES_OK), putting it into the context. If it's not passed up (as in functions that return void) the OP_REQUIRES puts it into the context.

Contributor

In summary: we either throw the exception or we pass it up via returned Status.

Contributor

@duncanriach duncanriach Mar 12, 2021

Follow-up from phone conversation with @wenscarl: I had missed (the context) that this code was already inside the if (std::is_same<Device, GPUDevice>::value) block, so the "part-way" solution that I mentioned above is how it already was.

errors::InvalidArgument(
"Cannot use GPU when output.shape[1] * nnz(a) > 2^31"));
bool deterministic_ops = false;
TF_CHECK_OK(ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
Contributor

This should not be checked every time the op compute runs. Please follow the existing pattern by caching the environment variable in a static variable inside a function.

Given that this check will/should move down into the _gpu.cu.cc code, I don't think you'll need to do it twice; you can do the exception check/throw in the same place as choosing the deterministic functionality. So this pattern here should be eliminated.

EIGEN_ALWAYS_INLINE EIGEN_DEVICE_FUNC bool DeterminismTypesCheck() {
return TF_PREDICT_TRUE(
std::is_integral<T>::value ||
(!std::is_same<T, double>::value && !std::is_same<T, complex128>::value));
Contributor

The integral check should be eliminated.

In the kernel-launching code, in the RequireDeterminism() == true branch, there are two options: (1) if std::is_same<T, double>::value || std::is_same<T, complex128>::value then throw the exception, else (2) do the deterministic kernel launch.

That should not need to be put in a function or inlined, and I think it's going to be really easy to understand.

@wenscarl
Contributor Author

@googlebot I signed it!

@google-cla google-cla bot added cla: yes and removed cla: no labels Mar 12, 2021
Contributor

@duncanriach duncanriach left a comment

I have completed a second review pass. Please make changes and then I will review again, possibly going deeper.


template <typename Tsrc, typename Tdst>
__global__ void DownCast(
const int size, const Tsrc* src, Tdst* __restrict__ dst) {
Contributor

It's not clear to me whether the compiler's default assumption of pointer aliasing could slow down the operation in these DownCast functions (but it's better to be safe and to let it know that there is no pointer aliasing). However, to avoid pointer aliasing, doesn't __restrict__ have to be applied to both src and dst? I just reviewed this blog post (again).


with self.assertRaisesRegex(
errors.UnimplementedError,
"No deterministic GPU implementation of *"):
Contributor

Do you have a reason for not checking that all the words/characters in the message are correct?

To me, it seems better to check it completely. If the message changes, either intentionally or accidentally, this test will then fail, and it will bring conscious attention to this being tested.

Note that self.assertRaisesRegex can test against a regular expression (as you have used it), but it can also be used to compare against a string literal (a type of non-variable regular expression).
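
For example, to require that the full message match exactly (a sketch; the message is the one used elsewhere in this PR, and re.escape guards against regex metacharacters such as the dots in the type names):

import re

expected_message = (
    "No deterministic GPU implementation of sparse_dense_matmul "
    "available for data of type tf.float64 or tf.complex128")
with self.assertRaisesRegex(errors.UnimplementedError,
                            re.escape(expected_message)):
  self.evaluate(sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                      dense_input))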

required_determinisim = \
True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False

unimplemented_types = (np.float64, np.complex128)
Contributor

I think that this variable name should be more explicit because in the context of this test code, "unimplemented types" means that matmul is not implemented at all for those types, which is not what this variable represents. What this variable is capturing is the types for which GPU-determinism is not implemented. Therefore, an unambiguous name for it would be gpu_determinism_unimplemented_types, otherwise its meaning can only be discerned by carefully examining how it's later used, and, ideally, we want to be able to read and understand code without having to decipher variable meaning by mentally simulating variable use.

Variables should be named such that, in the contexts in which they appear, the meaning of their contents is unambiguous.

Comment on lines 265 to 318
def _sparse_tensor_dense_vs_dense_matmul_benchmark_dense(x, y, adjoint_a,
                                                         adjoint_b):

  def body(t, prev):
    with ops.control_dependencies([prev]):
      return (t + 1, math_ops.matmul(
          x,
          y,
          transpose_a=adjoint_a,
          transpose_b=adjoint_b,
          a_is_sparse=True,
          b_is_sparse=False))

  t0 = constant_op.constant(0)
  v0 = constant_op.constant(0.0)

  def _timeit(iterations, _):
    (_, final) = control_flow_ops.while_loop(
        lambda t, _: t < iterations,
        body, (t0, v0),
        parallel_iterations=1,
        back_prop=False,
        shape_invariants=(tensor_shape.TensorShape(()),
                          tensor_shape.TensorShape(None)))
    return [final]

  return _timeit


def _sparse_tensor_dense_vs_dense_matmul_benchmark_sparse(x_ind, x_val, x_shape,
                                                          y, adjoint_a,
                                                          adjoint_b):
  sp_x = sparse_tensor.SparseTensor(
      indices=x_ind, values=x_val, dense_shape=x_shape)

  def body(t, prev):
    with ops.control_dependencies([prev]):
      return (t + 1, sparse_ops.sparse_tensor_dense_matmul(
          sp_x, y, adjoint_a=adjoint_a, adjoint_b=adjoint_b))

  t0 = constant_op.constant(0)
  v0 = constant_op.constant(0.0)

  def _timeit(iterations, _):
    (_, final) = control_flow_ops.while_loop(
        lambda t, _: t < iterations,
        body, (t0, v0),
        parallel_iterations=1,
        back_prop=False,
        shape_invariants=(tensor_shape.TensorShape(()),
                          tensor_shape.TensorShape(None)))
    return [final]

  return _timeit
Contributor

Please refer to this comment in my previous review cycle. These two functions are essentially subroutines of the benchmark function, which is not referenced in the base file or the deterministic test file at all. These functions should be returned to their original locations in sparse_tensor_dense_matmul_op_test.py.

Comment on lines 72 to 73
required_determinisim = \
True if os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True') else False
Contributor

@duncanriach duncanriach Mar 16, 2021

Please refer to section 3.2 (Line Length) of the Google Python Style Guide: "Do not use backslash line continuation except for with statements requiring three or more context managers. Make use of Python’s implicit line joining inside parentheses, brackets and braces. If necessary, you can add an extra pair of parentheses around an expression."

There is another instance of incorrect use of backslash for line continuation in the code below.

Also, this variable name includes an incorrect spelling of "determinism." Please call it determinism_required.
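
For example, either of the following avoids the backslash (a sketch using the corrected name; the redundant True if ... else False is also dropped, since the in test already yields a bool):

determinism_required = os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True')

# or, if wrapping is needed, rely on implicit line joining inside parentheses:
determinism_required = (
    os.getenv('TF_DETERMINISTIC_OPS') in ('1', 'True'))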

Comment on lines 46 to 48
GPU_1D_KERNEL_LOOP(index, size) {
dst[index] = static_cast<Tdst>(src[index]);
}
Contributor

I think this code should be indented with multiples of two (not four) spaces: outer=2, inner=4.

Comment on lines 150 to 151
DataTypeToEnum<Tupcast>::value,
TensorShape({out.dimension(0), out.dimension(1)}), &temp_out_t));
Contributor

This block of code should be indented with four spaces, not five.

Comment on lines 160 to 168
SparseTensorDenseMatMulKernel<T, Tupcast, Tindices, ADJ_A, ADJ_B>,
config.block_count, config.thread_per_block, 0, d.stream(), nnz, m,
b_rows, b_cols, p, a_indices.data(), a_values.data(), b.data(),
maybe_temp_out_data));
Contributor

This block of code should be indented with four spaces, not three.

Contributor

Please will you review all the code that you're adding and make sure that it follows the indentation rules. Each level of statement indent should add two spaces. Starting a list of parameters on a new line [ending the previous line with a "("] should indent four spaces.

Contributor

It might seem pedantic, but it's important to generate code in a standard way to make it easy to read and maintain. Search for, and review, all 30 instances of the word "indent" in the Google C++ Style Guide.

Comment on lines 3229 to 3235
cuda_py_test(
    name = "sparse_tensor_dense_matmul_op_deterministic_test",
    size = "small",
    srcs = ["sparse_tensor_dense_matmul_op_deterministic_test.py"],
    deps = [
        ":sparse_tensor_dense_matmul_op_base",
    ],
)

This comment was marked as resolved.

Comment on lines 2 to 8
/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Contributor

I don't think these three blank lines should be removed from the copyright header. They're part of the standard copyright header format/structure.

Comment on lines +107 to +123
bool RequireDeterminism() {
  static bool require_determinism = [] {
    bool deterministic_ops = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
                                               /*default_val=*/false,
                                               &deterministic_ops));
    return deterministic_ops;
  }();
  return require_determinism;
}

This comment was marked as resolved.

Contributor

@duncanriach duncanriach left a comment

Third review pass completed. Please review my comments and revise. Let me know when you're done by requesting a re-review. I'm happy to discuss any of this with you.

}

template <typename Tout, typename Tin>
__device__ void reduction(Tout* out_location, Tin val){
Contributor

There should always be a space between the ) and the {. See here. I don't know if these kinds of formatting issues are currently auto-fixed, break a CI lint check, or just slip through, but I think it's worth fixing them up-front. Please check that this formatting error is not repeated elsewhere in your PR. I see it in the next function definition, for example.

Contributor

Note to self: this has been addressed.

name = "sparse_tensor_dense_matmul_op_test",
size = "medium",
srcs = ["sparse_tensor_dense_matmul_op_test.py"],
tfrt_enabled = True,
Contributor

Nice catch. I'm wondering how this got into the PR in the first place.

}

template<>
__device__ void reduction(complex128* out_location, complex64 val){
Contributor

Function names should start with a capital letter; see here.

Reduction is a relatively generic and non-descriptive noun. What does this function actually do? It re-casts and then adds. Following the existing name, how about GpuRecastAtomicAdd? This function recast-adds.

Contributor

Note to self: this has been addressed.

GpuAtomicAdd(out_location,
static_cast<Tsum>(a_value) * static_cast<Tsum>(b_value));

reduction<Tsum, T>(out_location, a_value * b_value);
Contributor

I notice that you haven't replaced the call to GpuAtomicAdd earlier in this function. Do you have a reason for or against that?

Contributor Author

The earlier GpuAtomicAdd doesn't need to do the recast because the template argument passed in is already Tsum.

Contributor

I see and understand that now. Thank you.

// Note: The reinterpret cast is only required to avoid a compilation
// error; it is only used if Tsum == T.
maybe_temp_out_data = reinterpret_cast<Tsum*>(out.data());
out.device(d) = out.constant(T(0));
Contributor

I wonder why this code is using this slower form of zeroing. Perhaps this should be made to run faster (?) by using SetZero (as in the other branch of the conditional).

Contributor Author

@wenscarl wenscarl Mar 24, 2021

Maybe just nobody caught that before. Fixed it.

Contributor

Note to self: this has now been addressed.

maybe_temp_out_data));

if (sum_type_is_different) {
out.device(d) = temp_out_t.matrix<Tsum>().template cast<T>();
Contributor

Isn't this line of code basically the same as our DownCast function, but possibly slower?

Isn't the only difference between the deterministic and nondeterministic codepaths in the choice of Tsum?

Could this code be re-written more simply to run exactly the same way for both regular and deterministic operation except that for regular operation, Tsum = typename SumType<T>::type while for deterministic operation, Tsum = typename SumType<T>::type_for_determinism?

Maybe I'm missing some other difference between the two code paths.

Contributor Author

This echoes my earlier thought. This line of code is designed to handle fp16 in particular. For complex types, temp_out_t.matrix<Tsum>().template cast<T>() cannot handle the cast properly and has to go through our DownCast kernel. But we can indeed unify the definition of Tsum.

Contributor

@duncanriach duncanriach Mar 24, 2021

Right. The DownCast kernel will perform the same functionality, and more, and with higher performance (we believe). My question is why not replace this line with a DownCast launch.

The larger point here is that both the regular and deterministic paths in SparseTensorDenseMatMulFunctor do the same thing. The only difference is the choice of Tsum. I'm wondering why these two codepaths cannot be merged into one codepath. As it stands, it looks like the same codepath replicated.

Merging the code paths requires some careful thought, of course, but it seems doable.

Contributor

The regular codepath handles Tsum == T, which is fine. This goes beyond what the deterministic codepath needs. In the implemented deterministic modes, Tsum != T, and the regular codepath handles that case too. It seems that the regular codepath can be augmented very slightly to support the deterministic functionality as well. Let me know if you would like to discuss and I can go through this with you.

Comment on lines 74 to 75
if (len(gpus) > 0 and determinism_required and
x.dtype in (np.float64, np.complex128)):
Contributor

To limit the amount of determinism-related stuff in this base test, you could replace len(gpus) > 0 and determinism_required with self.__class__ == SparseTensorDenseMatmulDeterministicTest. This will remove the requirement for the three lines of code above that set determinism_required and will make this base-test code agnostic to how determinism is enabled (or if it's enabled at all).

Also, (np.float64, np.complex128) can be replaced with self._getGpuDeterminismUnimplementedTypes(), a private function defined only in the SparseTensorDenseMatmulDeterministicTest class, which returns (np.float64, np.complex128) and can also be used by (called from) testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This way the list of unimplemented types will only be defined once, in one place.
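
A rough sketch of the structure being suggested (the class and method names follow this comment; the body of the base test is simplified and only illustrative):

class SparseTensorDenseMatMulTestBase(test.TestCase):

  def _testMatmul(self, x, y):
    if (self.__class__ == SparseTensorDenseMatmulDeterministicTest and
        x.dtype in self._getGpuDeterminismUnimplementedTypes()):
      # Expect errors.UnimplementedError instead of a numerical result.
      ...
    else:
      ...


class SparseTensorDenseMatmulDeterministicTest(SparseTensorDenseMatMulTestBase):

  def _getGpuDeterminismUnimplementedTypes(self):
    return (np.float64, np.complex128)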

Contributor Author

Really good suggestion.

Contributor

Thanks.

Comment on lines 79 to 80
"No deterministic GPU implementation of sparse_dense_matmul "
"available for data of type tf.float64 or tf.complex128"):
Contributor

This string could be returned by self._getGpuDeterminismUnimplementedErrorString(), a private function defined only in the SparseTensorDenseMatmulDeterministicTest class, which returns the string and can also be used by (and called from) testDeterministicSparseDenseMatmul in SparseTensorDenseMatmulDeterministicTest. This will result in the string only being defined once, in one place.

Contributor

Note to self: this has now been addressed.

Comment on lines 68 to 98
gpu_determinism_implemented_types = (np.float16, np.float32, np.complex64)

for data_type in gpu_determinism_implemented_types:
  sparse_input, dense_input = self._gen_data(
      m=2430, k=615, n=857, nnz=(1<<16)+243, row_occupied_rate=0.02,
      data_type=data_type)

  repeat_count = 5
  with self.session(force_gpu=True):
    if data_type in (np.float64, np.complex128):
      with self.assertRaisesRegex(
          errors.UnimplementedError,
          "No deterministic GPU implementation of sparse_dense_matmul "
          "available for data of type tf.float64 or tf.complex128"):
        result_ = sparse_ops.sparse_tensor_dense_matmul(
            sparse_input, dense_input)
        self.evaluate(result_)
    else:
      result_a = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                       dense_input)
      for _ in range(repeat_count):
        result_b = sparse_ops.sparse_tensor_dense_matmul(sparse_input,
                                                         dense_input)
        self.assertAllEqual(result_a, result_b)
Contributor

This code will never test the exception cases. We want to test implemented and unimplemented types.

Contributor

Note to self: this has now been addressed.

@duncanriach
Contributor

duncanriach commented Mar 24, 2021

@wenscarl: This is feedback about your commit messages. The following three commits do much more than what the associated messages suggest:

The first line of a commit message should summarize what the commit does, not list one of the many things it does. In cases like this, where each commit is addressing a review step, I recommend,

[determinism] Address review, step n, on PR 47749,

replacing n with the review step that is being addressed by the commit. We're currently on review step 4 for this PR; your next commit will be for step 4.

These commit messages should be meaningful so that the list of commits in the PR make sense and also so that the commits make sense in the commit log of the master branch after the merge (assuming the commits don't get squashed).

sp_x_value = sparse_tensor.SparseTensorValue(
indices=x_indices, values=x_values, dense_shape=x_shape)

if (self.__class__ != SparseTensorDenseMatMulTestBase and
Contributor

Nice solution to the problem of the SparseTensorDenseMatmulDeterministicTest not being available in this scope.

This solution is a little brittle, however, in that it relies on SparseTensorDenseMatmulTest not inheriting from SparseTensorDenseMatMulTestBase (as SparseTensorDenseMatmulDeterministicTest does) but instead being a direct reference to SparseTensorDenseMatMulTestBase.

An alternative way of checking if this code is running inside an instance of the SparseTensorDenseMatmulDeterministicTest class, without having access to its token, is

self.__class__.__name__ == "SparseTensorDenseMatmulDeterministicTest"

Another option is to test for duck type:

callable(getattr(self, "_getGpuDeterminismUnimplementedTypes", None))

Comment on lines 79 to 80
"No deterministic GPU implementation of sparse_dense_matmul "
"available for data of type tf.float64 or tf.complex128"):
Contributor

Note to self: this has now been addressed.

Contributor

@duncanriach duncanriach left a comment

Review step 4. There's also still a typo in the PR title; see this earlier comment. We're also still discussing some items from the previous review step.

@wenscarl wenscarl changed the title WIP:[determinism] Add GPU determinisim for fp types in GPU SparseTensorDenseMatMul WIP:[determinism] Add GPU determinism for fp types in GPU SparseTensorDenseMatMul Mar 24, 2021
@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from 8f53dfc to d11a6f8 Compare March 25, 2021 00:02
@duncanriach
Contributor

duncanriach commented Mar 25, 2021

@wenscarl, there is no review step 0. I reviewed it four times: 1, 2, 3, 4. The first time I reviewed this PR was step 1, or pass 1, or round 1.

768a1a6 addresses review (step 1) - there is no step zero in the review.
9ec0be6 fixes a typo - your original commit message made sense
1f6f4b6 addresses review (step 2)
a1398b2 addresses review (step 3)
d11a6f8 addresses review (step 4) but it also, arguably, changes the code in a way that could be summarized in one line.

@wenscarl wenscarl force-pushed the determinism-gpu-SparseTensorDenseMatmul-fp-types branch from d11a6f8 to 94cac89 Compare March 25, 2021 01:58
@gbaned gbaned requested a review from sanjoy April 1, 2021 02:35
@gbaned gbaned removed kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 1, 2021
@gbaned
Contributor

gbaned commented Apr 7, 2021

@wenscarl Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Apr 7, 2021
@wenscarl
Contributor Author

wenscarl commented Apr 14, 2021

@wenscarl Can you please resolve conflicts? Thanks!

@gbaned The conflict is resolved. Please move forward. Thanks!

@gbaned gbaned removed the stat:awaiting response Status - Awaiting response from author label Apr 21, 2021
@gbaned gbaned requested a review from reedwm April 21, 2021 18:56
@gbaned gbaned added the awaiting review Pull request awaiting review label Apr 21, 2021
@reedwm
Contributor

reedwm commented Apr 23, 2021

I only skimmed this PR so far, but it appears it achieves determinism by summing float16 and float32 values with a float64 accumulator, using AtomicAdd. But I think this is still nondeterministic in certain cases. For example, suppose you sum the following float32 numbers but with a float64 accumulator: 1 + -1 + 2**-60. If you compute (1 + -1) + 2**-60, you get 2**-60, but if you compute 1 + (-1 + 2**-60), you get 0.
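
This is easy to reproduce outside the op (a standalone numpy sketch of the example above):

import numpy as np

a, b, c = np.float32(1.0), np.float32(-1.0), np.float32(2.0 ** -60)

# Same float32 inputs, float64 accumulator, two different association orders:
print((np.float64(a) + np.float64(b)) + np.float64(c))  # 8.673617379884035e-19, i.e. 2**-60
print(np.float64(a) + (np.float64(b) + np.float64(c)))  # 0.0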

I'm trying to think of a way to resolve this, but I haven't thought of anything yet. @wenscarl @duncanriach @sanjoy any ideas?

@ebrevdo
Contributor

ebrevdo commented Apr 23, 2021 via email

@ebrevdo
Contributor

ebrevdo commented Apr 23, 2021 via email

@reedwm
Contributor

reedwm commented Apr 23, 2021

That's a good suggestion, but unfortunately even Kahan summation can cause nondeterminism :(

In particular, consider summing 2**60 + 1 + 2**-60 - 1 - 2**60. The true result is 2**-60, which can be obtained with float32 arithmetic by rearranging the terms. But summing without rearranging the terms, even with float64 accumulation and Kahan summation, results in 0. The issue is that Kahan summation only stores two floating-point accumulators to represent the sum, but there are three numbers with very different magnitudes in the summation.
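
A standalone numpy sketch of this counterexample (Kahan summation with a float64 accumulator, applied to the float32 terms in the order given):

import numpy as np

def kahan_sum(values):
  s = np.float64(0.0)
  c = np.float64(0.0)  # running compensation for lost low-order bits
  for v in values:
    y = np.float64(v) - c
    t = s + y
    c = (t - s) - y
    s = t
  return s

terms = [np.float32(x) for x in (2.0**60, 1.0, 2.0**-60, -1.0, -(2.0**60))]
print(kahan_sum(terms))  # 0.0 -- the 2**-60 contribution is lost

# Reordered so that the large terms cancel first, plain float32 summation
# already recovers the true result:
print(np.float32(2.0**60) + np.float32(-2.0**60) + np.float32(1.0) +
      np.float32(-1.0) + np.float32(2.0**-60))  # 8.673617e-19, i.e. 2**-60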

@sanjoy sanjoy removed their request for review April 26, 2021 03:39
@duncanriach
Contributor

duncanriach commented Apr 26, 2021

@reedwm. This is a really good, and valid, challenge to this approach. Thank you. I didn't think of this when I challenged it, and then studied it, on PR 39751.

I hope we can find a solution.

@duncanriach
Contributor

@ebrevdo, thanks for the suggestion. I had to look up Kahan summation. To extend what @reedwm said, the challenge is not necessarily to minimize error (which is what Kahan summation attempts to do) but to get the same error no matter what the order of operations is. It seems to me that it would be tricky and/or very costly to parallelize Kahan summation in any case. In its serial form, Kahan summation would be deterministic, but so would normal floating-point summation.

BTW, @wenscarl and I worked on other approaches to making this op deterministic, approaches that ensure reproducible ordering of operations, but we found those to be much slower (after a certain amount of time spent tuning them), and harder to make deadlock-safe, than this one. Hopefully we can make this one work. However, it does seem to have a fundamental flaw.

@reedwm
Contributor

reedwm commented Apr 26, 2021

I tried thinking of a solution, but unfortunately have not been able to think of any way to make this op deterministic on the GPU. What approaches have you and @wenscarl come up with?

The only solution I know of would be to run the op on the CPU when determinism is enabled, but this might have significant performance impacts and unfortunately would also mean we could not use most of this PR (we could keep the tests, modifying them to run only on the CPU).

@duncanriach
Contributor

duncanriach commented Apr 26, 2021

The other solutions are variations on the following algorithm, which @wenscarl came up with:

Within each thread-block, each thread writes its matmul result into shared memory. Then a bitonic sort (using all threads in the block) deterministically sorts the results into groups with matching output addresses. Then the right-most thread (the accumulator thread) in each group serially (and therefore deterministically) reduces the results for the group and uses CUDA atomicAdd to write the result out to memory. All the accumulator threads in the same thread-block will be, by definition, writing to different addresses and therefore there is no concern about nondeterministic ordering of those atomic additions, at least from within the same thread-block. On the other hand, multiple thread-blocks may be writing to the same address using CUDA atomicAdd. To ensure a deterministic result in this case, the accumulator threads wait for their thread-block’s turn to write to memory using a GPU memory semaphore based on the thread-block ID.

Variations include inter-thread-block synchronization mechanism details, whether to use atomics, and different approaches to intra-thread-block reduction.

@duncanriach
Contributor

duncanriach commented Apr 26, 2021

Note that the approach that is currently in this PR, the approach that seems to be fatally flawed, was developed when the op only supported tf.float32 on GPU. The addition of support for tf.float64 and tf.complex128 on GPU meant that this approach could not be extended to all GPU data types. An advantage of switching to another approach is that it would almost certainly extend to all data types supported on GPU.

Switching to a different approach will likely take significant extra development time, even if it's one that we've already explored.

It seems to me that defaulting to running an op on CPU is an alternative approach to throwing exceptions; I'm wondering if it's a better approach (there are pros and cons either way). I think that, for now, we should stick with the plan of throwing exceptions (rather than running on CPU) and consider revisiting the plan.

@duncanriach
Contributor

Thoughts on the following plan, @reedwm?

Leave this PR open for a few days, maybe a week, to see if a solution bubbles up for the apparently unsolvable problem with this approach.

If no way of recovering this approach arises, then we close this PR to memorialize the dead-end without painting over it.

Then either I or @wenscarl will open either:

  1. a PR to add throwing of GPU determinism-unimplemented exceptions for this op or
  2. a PR to add a GPU-deterministic implementation for this op via another approach.

Or possibly (1) followed by (2).

@reedwm
Contributor

reedwm commented Apr 27, 2021

That sounds good to me. Agreed we should not go with running on the CPU for now; we can reconsider in the future.

@benbarsdell
Contributor

In terms of an alternative approach, it might be possible to recast this op as something that looks a bit like a weighted sparse segment reduction. If so, the deterministic GPU kernel in #47974 might be useful. (Actually I already have a follow-up PR to that one that adds weights in order to support the Grad implementation).

@duncanriach
Contributor

duncanriach commented May 6, 2021

Okay, @reedwm. Please will you close this PR. The next step (see the plan above) is (for me) to implement exception-throwing for GPU kernels.

@reedwm reedwm closed this May 6, 2021
@reedwm
Contributor

reedwm commented May 6, 2021

Are the tests from this PR still usable? If the CPU version is deterministic, the tests can be added now, to be run on the CPU only. Otherwise, they can be added once either the CPU or GPU version has a deterministic version.

@duncanriach
Contributor

duncanriach commented May 7, 2021

The CPU version is deterministic, although this PR did not run the tests on CPU (because it was focused on providing GPU determinism). The PR that adds the exception-throwing for this op will also reuse the determinism tests from this PR, running them on the CPU, as well as the exception tests from this PR.
