RFC: Enhancing determinism in TF #346
Creating RFC for determinism in TensorFlow
duncanriach left a comment:

Some small enhancements and/or clarifications.
rfcs/20210119-determinism.md (Outdated)

> * `tf.nn.softmax_cross_entropy_with_logits`
> * `tf.nn.sparse_softmax_cross_entropy_with_logits`
> * `tf.image.resize` with method=ResizeMethod.NEAREST
gradient
for which one?
The comment is on line 69, and referring to that op: tf.image.resize with method=ResizeMethod.NEAREST.
Note that the other two ops are loss functions, so (in training) their back-prop paths always end up getting used. I believe, though I'm not totally certain, that the nondeterminism is introduced in the gradient of those functions too, but I didn't mention it because they'll always be used in real models with the gradient (at least in training, even if the model is later used only in inference with no gradient).
When the exception-throwing code is added, it may end up being added only for the gradient kernel, because that's the only path that is actually injecting nondeterminism. Whatever the functionality ends up being, the documentation will be updated to match.
Adding @rohan100jain to review from the API owners' side.
Updating the RFC based on comments from NVIDIA.
rfcs/20210119-determinism.md (Outdated)

> ## Objective
> Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.
>
> To get deterministic behavior, users must do the following:
One question here is whether Keras is in scope here as well? I believe in some conversations with @fchollet that there might be some non-determinism introduced in the Keras framework as well
So the other source of randomness in Keras (besides variable initializers) is going to be layer calls (like dropout).
We intend these to rely on stateless ops, seeded by a seed argument in call. That seed would be either provided by a per-model stateful RNG when calling the model, or if left unspecified would be autogenerated by a global RNG.
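A minimal sketch of that stateless pattern, purely for illustration; the `dropout_call` helper and the seed plumbing here are assumptions, not the actual Keras implementation.

```python
import tensorflow as tf

# Illustrative helper: the dropout mask depends only on the explicit seed,
# so the same seed always produces the same mask, run after run.
def dropout_call(x, rate, seed):
    # tf.random.stateless_uniform is deterministic for a given seed pair.
    keep = tf.random.stateless_uniform(tf.shape(x), seed=seed) >= rate
    return tf.where(keep, x / (1.0 - rate), tf.zeros_like(x))

x = tf.ones([4, 3])
a = dropout_call(x, rate=0.5, seed=[1, 2])
b = dropout_call(x, rate=0.5, seed=[1, 2])
assert bool(tf.reduce_all(a == b))  # identical for the same seed
```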
Both variable initializers and layers currently use legacy random ops, which are made deterministic when enabling determinism (see section "Random ops"). Stateless ops are always deterministic. So Keras will be OK.
In support of what Reed wrote above, in my experience, all nondeterminism seen (and resolved) when using Keras has been due to underlying issues with non-Keras TensorFlow.
> * TF_DETERMINISTIC_OPS
> * TF_CUDNN_DETERMINISTIC
>
> tf.data also has flags for determinism. The system will throw an error message if flags are out of sync i.e. if deterministic_execution_enabled is enabled but if the tf.data option is set to ‘false’, we will throw an error. (`tf.data.Options.experimental_deterministic`). We’ll also add the necessary checks for Dataset.map and Dataset.interleave. See the [Random ops](#random-ops) section for how random Datasets, such as `tf.data.experimental.RandomDataset`, are handled.
Would we also do these inconsistency checks for the 2 environ variables mentioned above?
No, because unlike the tf.data flag, these environmental variables are treated as False by default. If users enable determinism using this API, we don't want to require them to also set environmental variables to avoid an error.
Also, the environment variables will be deprecated, and their functionality will not be further enhanced.
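For illustration, the kind of mismatch the check quoted above is meant to reject; `enable_deterministic_execution` is the API proposed in this RFC and does not exist under that name today.

```python
import tensorflow as tf

tf.config.enable_deterministic_execution(True)  # proposed API, not final

options = tf.data.Options()
options.experimental_deterministic = False  # explicitly nondeterministic tf.data

ds = tf.data.Dataset.range(10).map(lambda x: x + 1).with_options(options)
# Under the proposal, building or iterating this dataset would raise an error,
# because the tf.data option conflicts with the global determinism setting.
```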
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Custom ops? What op set is the determinism guarantee for?
Good point. Since we do not control custom ops, users must ensure custom ops (if any) are deterministic.
rfcs/20210119-determinism.md (Outdated)

> ## Objective
> Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.
I think this should be "Allow users to enable deterministic behavior in TensorFlow."
rfcs/20210119-determinism.md (Outdated)

> To get deterministic behavior, users must do the following:
>
> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
This could be misleading. You don't have to use exactly the same hardware instance, just the same hardware configuration. So, in the single accelerator case, that might be the same accelerator silicon architecture. I assume that we're not addressing multi-accelerator determinism here. That's another topic.
I suggest: Use the same hardware configuration in every run.
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
In the case of GPUs, this includes CUDA version, cuDNN version, and also (even, possibly) driver version. A change in any of these parameters between runs could lead to non-bit-exactness between runs.
Added CUDA version. Didn't add others for brevity (they are covered by the "etc")
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Python's random module is not nondeterministic. As far as I know, it supports full deterministic functionality, but to enable determinism the user just has to initialize it (as she would for any other compute process from which she desired deterministic functionality).
Suggested change: Not use constructs outside TensorFlow that are nondeterministic, such as Python’s random module (without appropriate use of its ability to be initialized/seeded) or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Used a slightly shorter version.
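For example, seeding the Python-level RNGs once at program start is enough to make them reproducible:

```python
import random
import numpy as np

random.seed(42)     # Python's random module is deterministic once seeded
np.random.seed(42)  # likewise for NumPy, if it is used alongside TensorFlow
```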
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
There is at least one case (that I'm aware of) where third-party multi-GPU training systems (in this case Horovod) can introduce nondeterminism unless they are configured specifically to operate deterministically.
Assuming such systems have a dependency on TensorFlow, they should query deterministic_execution_enabled and if it's true, either enable determinism or raise a warning/error. I don't think it's worth mentioning here.
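A sketch of how such a dependency might honor the setting; the query-function name follows this draft of the RFC and is an assumption, and the backend function is hypothetical.

```python
import tensorflow as tf

def init_multi_gpu_backend():
    # Hypothetical check inside a third-party library such as Horovod.
    if tf.config.deterministic_execution_enabled():  # proposed query API
        # Either switch to a deterministic reduction/communication order here,
        # or refuse to run rather than silently break determinism.
        raise RuntimeError(
            "Deterministic execution is enabled, but this backend is not "
            "configured for deterministic reductions.")
```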
rfcs/20210119-determinism.md (Outdated)

> * Do not use nondeterministic custom ops.
>
> ## Motivation
> There are several mission critical applications in life sciences, finance and automation that require deterministic behavior. Determinism is required so that the behavior of these applications can be accurately predicted & demonstrated in a variety of scenarios.
I would change life sciences to medicine.
rfcs/20210119-determinism.md (Outdated)

> However, supporting a subset of CPU ops is problematic: what if a rewrite pass converts Op A (which supports determinism) to Op B (which does not). Similarly, Placer may be modified in the future to place some small ops on the CPU by default instead of the GPU, which can break determinism. In general, modifications to TensorFlow that affect ops can potentially break determinism, and therefore break backwards compatibility.
>
> We don’t want TensorFlow developers to have to worry about breaking determinism when modifying TensorFlow. We could potentially allow a model to start raising a determinism error in minor releases of TF, but this is a bad user experience. Alternatively, we could rely on unit tests to catch cases where developers break determinism. Another alternative is to fully support determinism on the CPU. I and others will try to think of other ways to avoid developers inadvertently breaking determinism when modifying TensorFlow.
An op either operates nondeterministically, or throws an exception. If we miss an exception, then the nondeterminism will prevent a user from running their model deterministically with determinism enabled. So adding a valid exception throw will not realistically break anything.
When deterministic functionality is added, it's usually added with a test to confirm deterministic functionality of the op (at the op level). Testing of deterministic gradients is done using an approach that I call gradient injection.
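A rough sketch of the gradient-injection idea at the op level; the op, shapes, and repetition count are arbitrary, and such a test would only pass once the gradient kernel in question is deterministic.

```python
import tensorflow as tf

def injected_grad(op_fn, x, upstream):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = op_fn(x)
    # Inject a fixed but non-trivial upstream gradient instead of the implicit
    # all-ones gradient, so rounding-order differences become visible.
    return tape.gradient(y, x, output_gradients=upstream)

x = tf.random.stateless_normal([1, 64, 64, 8], seed=[1, 2])
upstream = tf.random.stateless_normal([1, 32, 32, 8], seed=[3, 4])
op_fn = lambda t: tf.image.resize(t, [32, 32], method='nearest')

baseline = injected_grad(op_fn, x, upstream)
for _ in range(5):
    tf.debugging.assert_equal(baseline, injected_grad(op_fn, x, upstream))
```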
rfcs/20210119-determinism.md (Outdated)

> The first function takes in a boolean value, and allows the model developer to enable/disable determinism. The second function returns a bool indicating whether determinism is enabled.
> In some cases, we have deterministic and nondeterministic versions of the kernel. In such cases, we will use this flag to run the appropriate kernels.
> For ops which do not yet have a deterministic implementation, TensorFlow will raise a `tf.errors.UnimplementedError` if the flag is enabled.
I wonder if it's worth clarifying here:
When deterministic execution is enabled, tf.errors.UnimplementedError will be thrown if a nondeterministic code path through an op would otherwise be traversed (either in eager mode or during graph execution). Note that it may still be possible to simply construct a graph containing ops that have nondeterministic code paths through them without the error being thrown.
I added the sentences:

> Certain ops will only raise an error for certain input shapes or attributes. Depending on the op, in graph mode, the error will either be raised when the op is constructed or when the op is run.

The first sentence addresses your first point, as it implies that only certain nondeterministic codepaths raise errors. For your second point: it depends on whether the op raises an error when it is constructed or run. I think both are acceptable, so in the RFC I left this up to each op's author.
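Putting the behavior quoted in this thread together, usage might look roughly like this; the function names follow this draft of the RFC and are assumptions, and the final API differs.

```python
import tensorflow as tf

tf.config.enable_deterministic_execution(True)       # proposed: takes a bool
assert tf.config.deterministic_execution_enabled()   # proposed: returns a bool

# Ops with both kernel variants silently select the deterministic kernel.
# An op whose only implementation is nondeterministic would instead raise
# tf.errors.UnimplementedError, either when constructed or when run.
```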
The RFC was changed so that only individual ops are deterministic, not TensorFlow as a whole. This limits the scope of the API, while still allowing users to write deterministic models. Additional changes:

* Add "Testing plan" section
* Add original RFC's API to "Alternatives considered"
* Add more ops to "Op Review and changes" section
* Raise FailedPreconditionError when op is fundamentally not deterministic, not a NotImplementedError
* Remove discussion section, since it doesn't apply to the new version of the RFC. There is a link to the old version of the RFC (with the discussion section) in "Alternatives considered"
* Alphabetize author list
* Move some info from "Objective" to "Design proposal"
* Grammar fixes
I significantly changed the RFC so that enabling determinism only affects ops, not all of TensorFlow. I also made various other changes. No one other than me has reviewed the updated RFC yet, so @pkanwar23 @duncanriach @sanjoy, please take a look.
Hi @reedwm, in the updated RFC there is no mention of what should happen with support in CPU implementations, or of the performance implications of this feature. Could you please clarify the direction for both of these points? Tagging @penpornk @agramesh1.
My impression was that Eigen was deterministic and that matmuls/convolutions use Eigen, and so the CPU work would be fairly simple. But I haven't actually verified this through testing yet, as I've only tested GPU models. Do you know the source of nondeterminism in matmuls/convolutions? /CC @ezhulenev as well.
rfcs/20210119-determinism.md (Outdated)

> The lack of determinism in certain ops prevents companies from launching products using models developed in TF. For a subset of these industries having deterministic behavior is a regulatory requirement.
>
> In addition, deterministic ops increases model velocity development by reducing noise, while also simplifying the debugging workflow.
increases -> increase
Alternatively, suggest: "deterministic functionality, enabled by deterministic ops, increases"
rfcs/20210119-determinism.md (Outdated)

> The first function takes in a boolean value, and allows the model developer to enable/disable deterministic ops. The second function returns a bool indicating whether deterministic ops is enabled.
>
> Once enabled, every built-in op will either be made deterministic or raise an error if determinism is not supported. For ops which we have not yet implemented a deterministic version, a `NotImplementedError` will be raised. In the long term, we plan on adding a deterministic version to all such ops. For ops which are inherently nondeterministic such as `tf.random.normal` without a seed, a `FailedPreconditionError` will be raised (the precondition being that determinism must be disabled). Certain ops will only raise an error for certain input shapes or attributes. Depending on the op, in graph mode, the error will either be raised when the op is constructed or when the op is run.
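For illustration of the distinction drawn in the quoted paragraph; `enable_deterministic_ops` is the API proposed in this revision, and the exact points at which errors are raised are not final.

```python
import tensorflow as tf

tf.config.enable_deterministic_ops(True)  # proposed API, not final

# Inherently nondeterministic: no seed anywhere, so under the proposal this
# would raise a FailedPreconditionError.
# tf.random.normal([2, 2])

# Fine: the result is fully determined by the explicit seed.
print(tf.random.stateless_normal([2, 2], seed=[1, 2]))
```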
"For ops which we have not yet implemented a deterministic version, a NotImplementedError will be raised" (grammar) -> "A tf.errors.UnimplementedError will be raised by ops for which we have not yet implemented a deterministic version."
Also, throughout, NotImplementedError -> tf.errors.UnimplementedError
"Certain ops will only raise an error for certain input shapes or attributes." -> "Some ops will only raise an error on a subset of input shapes, attributes, data types, or data-paths through the op."
rfcs/20210119-determinism.md (Outdated)

> The API allows users to write deterministic models. To do so, users must:
>
> * Enable deterministic ops with `tf.config.enable_deterministic_ops`.
> * Use same hardware configuration in every run.
"Use same hardware configuration ..." -> "Use the same hardware configuration ..."
rfcs/20210119-determinism.md (Outdated)

> * Enable deterministic ops with `tf.config.enable_deterministic_ops`.
> * Use same hardware configuration in every run.
> * Use the same software environment every run (OS, checkpoints, version of CUDA and TF, environmental variables, etc).
"Use the same software environment every run ..." -> "Use the same software environment in every run ..." in or on - consistent with last point.
> 1. We will add tests to several of the [official models](https://github.com/tensorflow/models/tree/master/official) to ensure they run deterministically. In particular, each test will train a model for several steps, then retrain it from scratch several times. The final weights after training will be asserted to be the same each time. This tests not only the `enable_deterministic_ops` API but that the entire model is deterministic. This only tests ops that the official models use.
>
> 2. We will add a special mode to TensorFlow where every time a non-stateful op is run, TensorFlow will rerun the op several times and assert the outputs are the same each time. We will then run the TensorFlow unit tests with this mode as part of the nightly tests. Doing so ensures that for each op that is run as part of a unit test, it will be tested for determinism.
Most forward tests are too small, and/or not designed in other ways, to test for determinism. For example, many tests use integers even when exercising floating-point data-paths (which will miss nondeterminism caused by non-associative floating-point rounding error). To make a test that is likely to exercise nondeterminism in a forward path, it needs to use the op in a natural way, which most unit tests don't, or don't do much of. Most test cases are too small to exercise the kind of nondeterminism we want to protect against, which shows up with test cases that span asynchronous compute engines.
For the backwards paths (which is where nondeterminism usually shows up), the way that unit tests are written, comparing analytical and numerical Jacobian matrices, will not catch nondeterminism. This is because the analytical Jacobian matrix does not capture the backprop effect on real upstream gradients. The existing tests literally cannot see the nondeterminism. This is one of the factors that motivated me to develop the gradient injection approach to testing backprop determinism. Apart from that, even if the analytical Jacobians could capture the nondeterminism, the test cases would have to be made prohibitively large. The traditional method of gradient function testing, using the Jacobians, requires that the test cases be small (because the size of the Jacobians grows as O(N_in * N_out), where N is the number of elements).
Automatic generation of effective determinism tests is a hard problem that I've been thinking about and discussing with others for some time. It will be an interesting problem to solve.
How about a test that automatically and randomly (based on a seed that makes its test case reproducible) creates a test case for one op path every time it runs? It would be a very complicated test because it would need to know, ultimately, how to use every op in TensorFlow. This test could be run at regular intervals, outside of the main CI process (I don't know what you call that; nightly?) and failures could be used to catch op-determinism regressions. It wouldn't automatically catch and block the introduction of the regression, but it would, over time, randomly re-audit every op. That test could be put in place initially for a small number of ops, to test on all devices, data types, etc., and more ops could be added to it over time. It would also need to know details of which op paths/configs would throw d9m-unimplemented exceptions. It might read that from a YAML file. It could keep a record of its progress in a single file somewhere: seed, op, params, result.
I've been discussing this with @nluehr. I completely misunderstood the proposal. You want to add this feature under op calls. Then the test suite becomes a ton of pre-generated op calls. This makes sense. Sorry I misunderstood and went off on a tangent.
Okay, so then here are the things I'm worried about:
- Floats that have no fractional component. Solution: add/sub a fractional component.
- Calls from the gradient checker where the upstream gradient input is just 1.0. Solution: replace the input with a fractional random number between 0.0 and 1.0.
- Tensors that are too small to cause compute to span asynchronous processors. There doesn't seem to be a simple/automated solution for this problem, and all backward op runs (based on `gradient_checker.compute_gradient`) are going to, necessarily, have tiny tensors.
Another class of important stimulus that would likely be missing from existing op tests is where the relationship between input elements specifically triggers the most-common cause of nondeterminism (atomic reduction between asynchronous processes). For example, in segment reduction it's necessary for indices to be repeated more than once (with a large input tensor).
If it was possible to solve the problem of making tensors larger (while not breaking the rules of the op API), then the other half of the problem could be solved by randomizing control inputs (such as segment IDs), but then that would also need to be done in such a way as to not break the rules of the op API.
I'm not sure what you originally thought I was proposing, but your issues in the first post make sense (although I didn't quite understand the Jacobian complaint) and your suggestions in the fourth post would fix most of them.
I wonder if we can somehow enlarge the tensors by tiling them on an arbitrary dimension (or trying every dimension in sequence, to make each dimension 10x larger). This is a lot trickier when there are multiple inputs, like Add and MatMul, but we can always catch errors and simply not test that op for determinism with this strategy.
I think it's probably infeasible for this technique to catch cases where the relationship between input elements causes probable nondeterminism. I doubt we would have caught segment reductions with this technique, but we can try. We can rely on other methods if necessary, like looking for uses of atomics and testing real-world models.
I think your idea in the third post of having a test suite that reads from a YAML file is reasonable as well, although it requires a small amount of per-op work. Conceptually this is very similar to the approach I suggested, except the configs are hand-written instead of automatically generated based on input shapes and attributes in unit tests. Also in my approach, the determinism tests are run during the unit tests, but perhaps instead the input shapes/attributes/dtypes should be written to a file and afterwards the determinism tests can be run. Then we could augment the file with hand-written tests as well, so that we can give larger inputs that are more likely to demonstrate nondeterminism.
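As a concrete illustration of point 1 of the testing plan, a model-level check might look something like the sketch below; the toy model, seeds, and step counts are placeholders, and the proposed deterministic-ops flag would also be enabled in a real test.

```python
import tensorflow as tf

def train_from_scratch():
    tf.keras.utils.set_random_seed(1)  # seeds Python, NumPy, and TF RNGs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='sgd', loss='mse')
    x = tf.random.stateless_normal([32, 8], seed=[1, 2])
    y = tf.random.stateless_normal([32, 1], seed=[3, 4])
    model.fit(x, y, epochs=5, verbose=0)
    return model.get_weights()

first = train_from_scratch()
for _ in range(2):
    for w1, w2 in zip(first, train_from_scratch()):
        tf.debugging.assert_equal(w1, w2)  # final weights must be identical
```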
rfcs/20210119-determinism.md (Outdated)

> 2. We will add a special mode to TensorFlow where every time a non-stateful op is run, TensorFlow will rerun the op several times and assert the outputs are the same each time. We will then run the TensorFlow unit tests with this mode as part of the nightly tests. Doing so ensures that for each op that is run as part of a unit test, it will be tested for determinism.
>
> 3. When adding determinism to an op which previously was nondeterministic, an explicit unit test will be added that checks for determinism. This is slightly redundant with the special mode described above, but the explicit unit test can be part of the presubmit tests instead of the nightly tests, and can test on inputs that are very likely to demonstrate nondeterminism if it exists.
I don't think directed testing is redundant at all. The automated testing approach, described in the previous point, might (maybe) capture some small subset of issues, at potentially great additional compute cost, but it would miss most of the nondeterminism that we're trying to defend against.
A potentially more efficient and cost-effective way of defending against regressions might be to automatically flag PRs that contain suspect code, such as the use of CUDA atomic operations or sharding processes across CPU threads; then review those and/or require directed determinism tests (or tf.errors.UnimplementedError tests) for them.
It's true, though, that once an op is proven to be thoroughly deterministic using a human-developed directed test, it's very unlikely that the op will somehow start functioning nondeterministically again. So the non-automated, directed tests do primarily provide evidence that a previously nondeterministic op has been made to operate deterministically.
As you mentioned, it's unlikely the special mode would catch errors for an op which has an explicit unit test, so there is slight redundancy. But I rephrased the sentence to avoid the word "redundancy".
I don't think it is worth it to flag PRs which have potentially unsafe constructs like atomics, since I think the effort to set that up will be high and there will be false positives. Others may disagree. /CC @sanjoy @pkanwar23
I will let @ezhulenev give a more complete answer. But I believe the Eigen threading layer can parallelize along the accumulation dimension (e.g. see here). And in that case, the number of threads in the threadpool could affect the result. I am not sure about the order of accumulation across blocks though (from the code it looks like the accumulation of the blocks is sequential).
There are some cases of ops being nondeterministic on CPU, but they're much less common than on GPU (partly because of the much smaller amount of multi-threading). A work-around has been to set the number of inter-op and intra-op threads to one. I know it's not Eigen, but here is a datapoint: I am working on adding d9m-unimplemented exceptions to

Regarding the performance cost of determinism: on the GPU, I've worked on deterministic ops that run anywhere from the same speed (for some configurations) down to 2x or even 10x slower, but still an order of magnitude or more faster than an optimized CPU implementation. (Note that deterministic algorithms that run faster than nondeterministic algorithms always end up supplanting the nondeterministic algorithms, so what we think of as deterministic algorithms is actually the slower subset of deterministic algorithms. Most GPU algorithms in TensorFlow are deterministic.) However, since only some of the ops in any model run slower for determinism, the overall slowdown seems to be less than 2x, and even less than 10% in most cases. But then, even though one run may be slower, the whole training process is greatly accelerated because of massively reduced debug and experimentation time.
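The single-thread workaround mentioned above, for reference; it must run before TensorFlow creates its thread pools, and it trades performance for reproducibility.

```python
import tensorflow as tf

# Work-around for CPU nondeterminism from multi-threaded reductions:
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
```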
Co-authored-by: Duncan Riach <duncan@nvidia.com>
AFAIK accumulation order is deterministic everywhere in Eigen; the number of threads is a good point, never thought about that. Threads can change the block size for matmuls, and might trigger a different code path inside mkldnn (can it?), and a different number of accumulations along the
What determines the number of threads in the threadpool? Is this fixed or can it decrease per op if multiple ops run in parallel?
It is the
Thanks for the info! Given the thread pool size is fixed, it will still be deterministic, unless there's some other source of nondeterminism. Users must not change the thread pool size between runs if they want deterministic behavior.
@ezhulenev, yes the size of the block might trigger different paths in oneDNN (new name for mkldnn :-)), each with a different accumulation order. But if the number of threads is constant between runs, the chunk size should be constant and the path we dispatch should be constant as well.
sanjoy left a comment:

LGTM
Open for comment until 2/4/2021
Objective

Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.

To get deterministic behavior, users must do the following:

* Enable determinism using the API proposed in this doc.
* Use same hardware in every run.
* Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
* Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.