RFC: Enhancing determinism in TF #346
Creating RFC for determinism in TensorFlow
duncanriach left a comment:

Some small enhancements and/or clarifications.
rfcs/20210119-determinism.md (Outdated)

> * `tf.nn.softmax_cross_entropy_with_logits`
> * `tf.nn.sparse_softmax_cross_entropy_with_logits`
> * `tf.image.resize` with method=ResizeMethod.NEAREST
gradient
for which one?
The comment is on line 69, and referring to that op: tf.image.resize with method=ResizeMethod.NEAREST.
Note that the other two ops are loss functions, so (in training) their back-prop paths always end up getting used. I believe, though I'm not totally certain, that the nondeterminism is introduced in the gradient of those functions too, but I didn't mention it because they'll always be used in real models with the gradient (at least in training, even if the model is later used only in inference with no gradient).
When the exception-throwing code is added, it may end up being added only for the gradient kernel, because that's the only path that is actually injecting nondeterminism. Whatever the functionality ends up being, the documentation will be updated to match.
Adding @rohan100jain to review from the API owners' side.
Updating the RFC based on comments from NVIDIA.
rfcs/20210119-determinism.md (Outdated)

> ## Objective
> Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.
>
> To get deterministic behavior, users must do the following:
One question here is whether Keras is in scope here as well? I believe in some conversations with @fchollet that there might be some non-determinism introduced in the Keras framework as well
So the other source of randomness in Keras (besides variable initializers) is going to be layer calls (like dropout).
We intend these to rely on stateless ops, seeded by a seed argument in call. That seed would be either provided by a per-model stateful RNG when calling the model, or if left unspecified would be autogenerated by a global RNG.
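A minimal sketch of that stateless pattern, purely for illustration; the `dropout_call` helper and the seed plumbing here are assumptions, not the actual Keras implementation.

```python
import tensorflow as tf

# Illustrative helper: the dropout mask depends only on the explicit seed,
# so the same seed always produces the same mask, run after run.
def dropout_call(x, rate, seed):
    # tf.random.stateless_uniform is deterministic for a given seed pair.
    keep = tf.random.stateless_uniform(tf.shape(x), seed=seed) >= rate
    return tf.where(keep, x / (1.0 - rate), tf.zeros_like(x))

x = tf.ones([4, 3])
a = dropout_call(x, rate=0.5, seed=[1, 2])
b = dropout_call(x, rate=0.5, seed=[1, 2])
assert bool(tf.reduce_all(a == b))  # identical for the same seed
```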
Both variable initializers and layers currently use legacy random ops, which are made deterministic when enabling determinism (see section "Random ops"). Stateless ops are always deterministic. So Keras will be OK.
In support of what Reed wrote above, in my experience, all nondeterminism seen (and resolved) when using Keras has been due to underlying issues with non-Keras TensorFlow.
> * TF_DETERMINISTIC_OPS
> * TF_CUDNN_DETERMINISTIC
>
> tf.data also has flags for determinism. The system will throw an error message if flags are out of sync i.e. if deterministic_execution_enabled is enabled but if the tf.data option is set to ‘false’, we will throw an error. (`tf.data.Options.experimental_deterministic`). We’ll also add the necessary checks for Dataset.map and Dataset.interleave. See the [Random ops](#random-ops) section for how random Datasets, such as `tf.data.experimental.RandomDataset`, are handled.
Would we also do these inconsistency checks for the 2 environ variables mentioned above?
No, because unlike the tf.data flag, these environmental variables are treated as False by default. If users enable determinism using this API, we don't want to require them to also set environmental variables to avoid an error.
Also, the environment variables will be deprecated, and their functionality will not be further enhanced.
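For illustration, the kind of mismatch the check quoted above is meant to reject; `enable_deterministic_execution` is the API proposed in this RFC and does not exist under that name today.

```python
import tensorflow as tf

tf.config.enable_deterministic_execution(True)  # proposed API, not final

options = tf.data.Options()
options.experimental_deterministic = False  # explicitly nondeterministic tf.data

ds = tf.data.Dataset.range(10).map(lambda x: x + 1).with_options(options)
# Under the proposal, building or iterating this dataset would raise an error,
# because the tf.data option conflicts with the global determinism setting.
```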
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Custom ops? What op set is the determinism guarantee for?
Good point. Since we do not control custom ops, users must ensure custom ops (if any) are deterministic.
rfcs/20210119-determinism.md (Outdated)

> ## Objective
> Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.
I think this should be "Allow users to enable deterministic behavior in TensorFlow."
rfcs/20210119-determinism.md (Outdated)

> To get deterministic behavior, users must do the following:
>
> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
This could be misleading. You don't have to use exactly the same hardware instance, just the same hardware configuration. So, in the single accelerator case, that might be the same accelerator silicon architecture. I assume that we're not addressing multi-accelerator determinism here. That's another topic.
I suggest: Use the same hardware configuration in every run.
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
In the case of GPUs, this includes CUDA version, cuDNN version, and also (even, possibly) driver version. A change in any of these parameters between runs could lead to non-bit-exactness between runs.
Added CUDA version. Didn't add others for brevity (they are covered by the "etc")
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Python's random module is not nondeterministic. As far as I know, it supports full deterministic functionality, but to enable determinism the user just has to initialize it (as she would for any other compute process from which she desired deterministic functionality).
Suggested change: Not use constructs outside TensorFlow that are nondeterministic, such as Python’s random module (without appropriate use of its ability to be initialized/seeded) or using multiple threads/processes in ways that influence TensorFlow’s behavior.
Used a slightly shorter version.
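For example, seeding the Python-level RNGs once at program start is enough to make them reproducible:

```python
import random
import numpy as np

random.seed(42)     # Python's random module is deterministic once seeded
np.random.seed(42)  # likewise for NumPy, if it is used alongside TensorFlow
```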
rfcs/20210119-determinism.md (Outdated)

> * Enable determinism using the API proposed in this doc.
> * Use same hardware in every run.
> * Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
> * Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
There is at least one case (that I'm aware of) where third-party multi-GPU training systems (in this case Horovod) can introduce nondeterminism unless they are configured specifically to operate deterministically.
Assuming such systems have a dependency on TensorFlow, they should query deterministic_execution_enabled and if it's true, either enable determinism or raise a warning/error. I don't think it's worth mentioning here.
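A sketch of how such a dependency might honor the setting; the query-function name follows this draft of the RFC and is an assumption, and the backend function is hypothetical.

```python
import tensorflow as tf

def init_multi_gpu_backend():
    # Hypothetical check inside a third-party library such as Horovod.
    if tf.config.deterministic_execution_enabled():  # proposed query API
        # Either switch to a deterministic reduction/communication order here,
        # or refuse to run rather than silently break determinism.
        raise RuntimeError(
            "Deterministic execution is enabled, but this backend is not "
            "configured for deterministic reductions.")
```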
rfcs/20210119-determinism.md (Outdated)

> * Do not use nondeterministic custom ops.
>
> ## Motivation
> There are several mission critical applications in life sciences, finance and automation that require deterministic behavior. Determinism is required so that the behavior of these applications can be accurately predicted & demonstrated in a variety of scenarios.
I would change life sciences to medicine.
rfcs/20210119-determinism.md (Outdated)

> However, supporting a subset of CPU ops is problematic: what if a rewrite pass converts Op A (which supports determinism) to Op B (which does not). Similarly, Placer may be modified in the future to place some small ops on the CPU by default instead of the GPU, which can break determinism. In general, modifications to TensorFlow that affect ops can potentially break determinism, and therefore break backwards compatibility.
>
> We don’t want TensorFlow developers to have to worry about breaking determinism when modifying TensorFlow. We could potentially allow a model to start raising a determinism error in minor releases of TF, but this is a bad user experience. Alternatively, we could rely on unit tests to catch cases where developers break determinism. Another alternative is to fully support determinism on the CPU. I and others will try to think of other ways to avoid developers inadvertently breaking determinism when modifying TensorFlow.
An op either operates nondeterministically, or throws an exception. If we miss an exception, then the nondeterminism will prevent a user from running their model deterministically with determinism enabled. So adding a valid exception throw will not realistically break anything.
When deterministic functionality is added, it's usually added with a test to confirm deterministic functionality of the op (at the op level). Testing of deterministic gradients is done using an approach that I call gradient injection.
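A rough sketch of the gradient-injection idea at the op level; the op, shapes, and repetition count are arbitrary, and such a test would only pass once the gradient kernel in question is deterministic.

```python
import tensorflow as tf

def injected_grad(op_fn, x, upstream):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = op_fn(x)
    # Inject a fixed but non-trivial upstream gradient instead of the implicit
    # all-ones gradient, so rounding-order differences become visible.
    return tape.gradient(y, x, output_gradients=upstream)

x = tf.random.stateless_normal([1, 64, 64, 8], seed=[1, 2])
upstream = tf.random.stateless_normal([1, 32, 32, 8], seed=[3, 4])
op_fn = lambda t: tf.image.resize(t, [32, 32], method='nearest')

baseline = injected_grad(op_fn, x, upstream)
for _ in range(5):
    tf.debugging.assert_equal(baseline, injected_grad(op_fn, x, upstream))
```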
rfcs/20210119-determinism.md (Outdated)

> The first function takes in a boolean value, and allows the model developer to enable/disable determinism. The second function returns a bool indicating whether determinism is enabled.
> In some cases, we have deterministic and nondeterministic versions of the kernel. In such cases, we will use this flag to run the appropriate kernels.
> For ops which do not yet have a deterministic implementation, TensorFlow will raise a `tf.errors.UnimplementedError` if the flag is enabled.
I wonder if it's worth clarifying here:
When deterministic execution is enabled, tf.errors.UnimplementedError will be thrown if a nondeterministic code path through an op would otherwise be traversed (either in eager mode or during graph execution). Note that it may still be possible to simply construct a graph containing ops that have nondeterministic code paths through them without the error being thrown.
I added the sentences:

> Certain ops will only raise an error for certain input shapes or attributes. Depending on the op, in graph mode, the error will either be raised when the op is constructed or when the op is run.

The first sentence addresses your first point, as it implies that only certain nondeterministic codepaths raise errors. For your second point: it depends on whether the op raises an error when it is constructed or run. I think both are acceptable, so in the RFC I left this up to each op's author.
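Putting the behavior quoted in this thread together, usage might look roughly like this; the function names follow this draft of the RFC and are assumptions, and the final API differs.

```python
import tensorflow as tf

tf.config.enable_deterministic_execution(True)       # proposed: takes a bool
assert tf.config.deterministic_execution_enabled()   # proposed: returns a bool

# Ops with both kernel variants silently select the deterministic kernel.
# An op whose only implementation is nondeterministic would instead raise
# tf.errors.UnimplementedError, either when constructed or when run.
```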
The RFC was changed so that only individual ops are deterministic, not TensorFlow as a whole. This limits the scope of the API, while still allowing users to write deterministic models. Additional changes:

* Add "Testing plan" section
* Add original RFC's API to "Alternatives considered"
* Add more ops to "Op Review and changes" section
* Raise FailedPreconditionError when op is fundamentally not deterministic, not a NotImplementedError
* Remove discussion section, since it doesn't apply to the new version of the RFC. There is a link to the old version of the RFC (with the discussion section) in "Alternatives considered"
* Alphabetize author list
* Move some info from "Objective" to "Design proposal"
* Grammar fixes
I significantly changed the RFC so that enabling determinism only affects ops, not all of TensorFlow. I also made various other changes. No one other than me has reviewed the updated RFC yet, so @pkanwar23 @duncanriach @sanjoy, please take a look.
Hi @reedwm, in the updated RFC there is no mention of what should happen with support in CPU implementations, or of the performance implications of this feature. Could you please clarify the direction for both of these points? Tagging @penpornk @agramesh1.
My impression was that Eigen was deterministic and that matmuls/convolutions use Eigen, and so the CPU work would be fairly simple. But I haven't actually verified this through testing yet, as I've only tested GPU models. Do you know the source of nondeterminism in matmuls/convolutions? /CC @ezhulenev as well.
rfcs/20210119-determinism.md (Outdated)

> The lack of determinism in certain ops prevents companies from launching products using models developed in TF. For a subset of these industries having deterministic behavior is a regulatory requirement.
>
> In addition, deterministic ops increases model velocity development by reducing noise, while also simplifying the debugging workflow.
increases -> increase
Alternatively, suggest: "deterministic functionality, enabled by deterministic ops, increases"
rfcs/20210119-determinism.md (Outdated)

> The first function takes in a boolean value, and allows the model developer to enable/disable deterministic ops. The second function returns a bool indicating whether deterministic ops is enabled.
>
> Once enabled, every built-in op will either be made deterministic or raise an error if determinism is not supported. For ops which we have not yet implemented a deterministic version, a `NotImplementedError` will be raised. In the long term, we plan on adding a deterministic version to all such ops. For ops which are inherently nondeterministic such as `tf.random.normal` without a seed, a `FailedPreconditionError` will be raised (the precondition being that determinism must be disabled). Certain ops will only raise an error for certain input shapes or attributes. Depending on the op, in graph mode, the error will either be raised when the op is constructed or when the op is run.
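For illustration of the distinction drawn in the quoted paragraph; `enable_deterministic_ops` is the API proposed in this revision, and the exact points at which errors are raised are not final.

```python
import tensorflow as tf

tf.config.enable_deterministic_ops(True)  # proposed API, not final

# Inherently nondeterministic: no seed anywhere, so under the proposal this
# would raise a FailedPreconditionError.
# tf.random.normal([2, 2])

# Fine: the result is fully determined by the explicit seed.
print(tf.random.stateless_normal([2, 2], seed=[1, 2]))
```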
"For ops which we have not yet implemented a deterministic version, a NotImplementedError will be raised" (grammar) -> "A tf.errors.UnimplementedError will be raised by ops for which we have not yet implemented a deterministic version."
Also, throughout, NotImplementedError -> tf.errors.UnimplementedError
"Certain ops will only raise an error for certain input shapes or attributes." -> "Some ops will only raise an error on a subset of input shapes, attributes, data types, or data-paths through the op."
rfcs/20210119-determinism.md (Outdated)

> The API allows users to write deterministic models. To do so, users must:
>
> * Enable deterministic ops with `tf.config.enable_deterministic_ops`.
> * Use same hardware configuration in every run.
"Use same hardware configuration ..." -> "Use the same hardware configuration ..."
rfcs/20210119-determinism.md (Outdated)

> * Enable deterministic ops with `tf.config.enable_deterministic_ops`.
> * Use same hardware configuration in every run.
> * Use the same software environment every run (OS, checkpoints, version of CUDA and TF, environmental variables, etc).
"Use the same software environment every run ..." -> "Use the same software environment in every run ..." in or on - consistent with last point.
> 1. We will add tests to several of the [official models](https://github.com/tensorflow/models/tree/master/official) to ensure they run deterministically. In particular, each test will train a model for several steps, then retrain it from scratch several times. The final weights after training will be asserted to be the same each time. This tests not only the `enable_deterministic_ops` API but that the entire model is deterministic. This only tests ops that the official models use.
>
> 2. We will add a special mode to TensorFlow where every time a non-stateful op is run, TensorFlow will rerun the op several times and assert the outputs are the same each time. We will then run the TensorFlow unit tests with this mode as part of the nightly tests. Doing so ensures that for each op that is run as part of a unit test, it will be tested for determinism.
Most forward tests are too small, and/or not designed in other ways, to test for determinism. For example, many tests use integers even when exercising floating-point data-paths (which will miss nondeterminism caused by non-associative floating-point rounding error). To make a test that is likely to exercise nondeterminism in a forward path, it needs to use the op in a natural way, which most unit tests don't, or don't do much of. Most test cases are too small to exercise the kind of nondeterminism we want to protect against, which shows up with test cases that span asynchronous compute engines.
For the backwards paths (which is where nondeterminism usually shows up), the way that unit tests are written, comparing analytical and numerical Jacobian matrices, will not catch nondeterminism. This is because the analytical Jacobian matrix does not capture the backprop effect on real upstream gradients. The existing tests literally cannot see the nondeterminism. This is one of the factors that motivated me to develop the gradient injection approach to testing backprop determinism. Apart from that, even if the analytical Jacobians could capture the nondeterminism, the test cases would have to be made prohibitively large. The traditional method of gradient function testing, using the Jacobians, requires that the test cases be small (because the size of the Jacobians grows as O(N_in * N_out), where N is the number of elements).
Automatic generation of effective determinism tests is a hard problem that I've been thinking about and discussing with others for some time. It will be an interesting problem to solve.
How about a test that automatically and randomly (based on a seed that makes its test case reproducible) creates a test case for one op path every time it runs? It would be a very complicated test because it would need to know, ultimately, how to use every op in TensorFlow. This test could be run at regular intervals, outside of the main CI process (I don't know what you call that; nightly?) and failures could be used to catch op-determinism regressions. It wouldn't automatically catch and block the introduction of the regression, but it would, over time, randomly re-audit every op. That test could be put in place initially for a small number of ops, to test on all devices, data types, etc., and more ops could be added to it over time. It would also need to know details of which op paths/configs would throw d9m-unimplemented exceptions. It might read that from a YAML file. It could keep a record of its progress in a single file somewhere: seed, op, params, result.
I've been discussing this with @nluehr. I completely misunderstood the proposal. You want to add this feature under op calls. Then the test suite becomes a ton of pre-generated op calls. This makes sense. Sorry I misunderstood and went off on a tangent.
Okay, so then here are the things I'm worried about:
- Floats that have no fractional component. Solution: add/sub a fractional component.
- Calls from the gradient checker where the upstream gradient input is just 1.0. Solution: replace the input with a fractional random number between 0.0 and 1.0.
- Tensors that are too small to cause compute to span asynchronous processors. There doesn't seem to be a simple/automated solution for this problem, and all backward op runs (based on `gradient_checker.compute_gradient`) are going to, necessarily, have tiny tensors.
Another class of important stimulus that would likely be missing from existing op tests is where the relationship between input elements specifically triggers the most-common cause of nondeterminism (atomic reduction between asynchronous processes). For example, in segment reduction it's necessary for indices to be repeated more than once (with a large input tensor).
If it was possible to solve the problem of making tensors larger (while not breaking the rules of the op API), then the other half of the problem could be solved by randomizing control inputs (such as segment IDs), but then that would also need to be done in such a way as to not break the rules of the op API.
I'm not sure what you originally thought I was proposing, but your issues in the first post make sense (although I didn't quite understand the Jacobian complaint) and your suggestions in the fourth post would fix most of them.
I wonder if we can somehow enlarge the tensors by tiling them on an arbitrary dimension (or trying every dimension in sequence, to make each dimension 10x larger). This is a lot trickier when there are multiple inputs, like Add and MatMul, but we can always catch errors and simply not test that op for determinism with this strategy.
I think it's probably infeasible for this technique to catch cases where the relationship between input elements causes probable nondeterminism. I doubt we would have caught segment reductions with this technique, but we can try. We can rely on other methods if necessary, like looking for uses of atomics and testing real-world models.
I think your idea in the third post of having a test suite that reads from a YAML file is reasonable as well, although it requires a small amount of per-op work. Conceptually this is very similar to the approach I suggested, except the configs are hand-written instead of automatically generated based on input shapes and attributes in unit tests. Also in my approach, the determinism tests are run during the unit tests, but perhaps instead the input shapes/attributes/dtypes should be written to a file and afterwards the determinism tests can be run. Then we could augment the file with hand-written tests as well, so that we can give larger inputs that are more likely to demonstrate nondeterminism.
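As a concrete illustration of point 1 of the testing plan, a model-level check might look something like the sketch below; the toy model, seeds, and step counts are placeholders, and the proposed deterministic-ops flag would also be enabled in a real test.

```python
import tensorflow as tf

def train_from_scratch():
    tf.keras.utils.set_random_seed(1)  # seeds Python, NumPy, and TF RNGs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='sgd', loss='mse')
    x = tf.random.stateless_normal([32, 8], seed=[1, 2])
    y = tf.random.stateless_normal([32, 1], seed=[3, 4])
    model.fit(x, y, epochs=5, verbose=0)
    return model.get_weights()

first = train_from_scratch()
for _ in range(2):
    for w1, w2 in zip(first, train_from_scratch()):
        tf.debugging.assert_equal(w1, w2)  # final weights must be identical
```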
rfcs/20210119-determinism.md (Outdated)

> 2. We will add a special mode to TensorFlow where every time a non-stateful op is run, TensorFlow will rerun the op several times and assert the outputs are the same each time. We will then run the TensorFlow unit tests with this mode as part of the nightly tests. Doing so ensures that for each op that is run as part of a unit test, it will be tested for determinism.
>
> 3. When adding determinism to an op which previously was nondeterministic, an explicit unit test will be added that checks for determinism. This is slightly redundant with the special mode described above, but the explicit unit test can be part of the presubmit tests instead of the nightly tests, and can test on inputs that are very likely to demonstrate nondeterminism if it exists.
I don't think directed testing is redundant at all. The automated testing approach, described in the previous point, might (maybe) capture some small subset of issues, at potentially great additional compute cost, but it would miss most of the nondeterminism that we're trying to defend against.
A potentially more efficient and cost-effective way of defending against regressions might be to automatically flag PRs that contain suspect code, such as the use of CUDA atomic operations or sharding processes across CPU threads; then review those and/or require directed determinism tests (or tf.errors.UnimplementedError tests) for them.
It's true, though, that once an op is proven to be thoroughly deterministic using a human-developed directed test, it's very unlikely that the op will somehow start functioning nondeterministically again. So the non-automated, directed tests do primarily provide evidence that a previously nondeterministic op has been made to operate deterministically.
As you mentioned, it's unlikely the special mode would catch errors for an op which has an explicit unit test, so there is slight redundancy. But I rephrased the sentence to avoid the word "redundancy".
I don't think it is worth it to flag PRs which have potentially unsafe constructs like atomics, since I think the effort to set that up will be high and there will be false positives. Others may disagree. /CC @sanjoy @pkanwar23
I will let @ezhulenev give a more complete answer. But I believe the Eigen threading layer can parallelize along the accumulation dimension (e.g. see here). And in that case, the number of threads in the threadpool could affect the result. I am not sure about the order of accumulation across blocks though (from the code it looks like the accumulation of the blocks is sequential).
There are some cases of ops being nondeterministic on CPU, but they're much less common than on GPU (partly because of the much smaller amount of multi-threading). A work-around has been to set the number of inter-op and intra-op threads to one. I know it's not Eigen, but here is a datapoint: I am working on adding d9m-unimplemented exceptions to

Regarding the performance cost of determinism: on the GPU, I've worked on deterministic ops that run anywhere from the same speed (for some configurations) down to 2x or even 10x slower, but still an order of magnitude or more faster than an optimized CPU implementation. (Note that deterministic algorithms that run faster than nondeterministic algorithms always end up supplanting the nondeterministic algorithms, so what we think of as deterministic algorithms is actually the slower subset of deterministic algorithms. Most GPU algorithms in TensorFlow are deterministic.) However, since only some of the ops in any model run slower for determinism, the overall slowdown seems to be less than 2x, and even less than 10% in most cases. But then, even though one run may be slower, the whole training process is greatly accelerated because of massively reduced debug and experimentation time.
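The single-thread workaround mentioned above, for reference; it must run before TensorFlow creates its thread pools, and it trades performance for reproducibility.

```python
import tensorflow as tf

# Work-around for CPU nondeterminism from multi-threaded reductions:
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
```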
Co-authored-by: Duncan Riach <duncan@nvidia.com>
AFAIK accumulation order is deterministic everywhere in Eigen; the number of threads is a good point, never thought about that. Threads can change the block size for matmuls, and might trigger a different code path inside mkldnn (can it?), and a different number of accumulations along the
What determines the number of threads in the threadpool? Is this fixed or can it decrease per op if multiple ops run in parallel?
It is the
Thanks for the info! Given the thread pool size is fixed, it will still be deterministic, unless there's some other source of nondeterminism. Users must not change the thread pool size between runs if they want deterministic behavior.
@ezhulenev, yes the size of the block might trigger different paths in oneDNN (new name for mkldnn :-)), each with a different accumulation order. But if the number of threads is constant between runs, the chunk size should be constant and the path we dispatch should be constant as well.
sanjoy left a comment:

LGTM
Open for comment until 2/4/2021
Objective

Allow users to enable determinism behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.

To get deterministic behavior, users must do the following:

* Enable determinism using the API proposed in this doc.
* Use same hardware in every run.
* Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
* Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.