Enable TF32 support for cuDNN #40737
Conversation
💊 CI failures summary and remediations

As of commit 2d85abe (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
I think we need something like this:

```cpp
} else if (dataType == CUDNN_DATA_FLOAT && !allow_tf32) {
#if defined(CUDNN_VERSION) && CUDNN_VERSION >= 8000
  cudnnSetRNNMatrixMathType(mut_desc(), CUDNN_FMA_MATH);
#endif
}
```

where the math mode is set for RNNs as well. (GitHub won't let me comment on the line directly.)
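For context, a minimal sketch of how the RNN descriptor could pick its math type, assuming `mut_desc()` returns a `cudnnRNNDescriptor_t` and `allow_tf32` mirrors `torch.backends.cudnn.allow_tf32` (the helper name `setRNNMathType` is hypothetical, not PyTorch's actual code):

```cpp
#include <cudnn.h>

// Hypothetical helper: applies the !allow_tf32 branch suggested above,
// alongside the usual half-precision tensor-op case.
void setRNNMathType(cudnnRNNDescriptor_t desc, cudnnDataType_t dataType,
                    bool allow_tf32) {
  if (dataType == CUDNN_DATA_HALF) {
    cudnnSetRNNMatrixMathType(desc, CUDNN_TENSOR_OP_MATH);
  } else if (dataType == CUDNN_DATA_FLOAT && !allow_tf32) {
#if defined(CUDNN_VERSION) && CUDNN_VERSION >= 8000
    // cuDNN 8 lets FP32 use TF32 tensor cores by default on Ampere;
    // CUDNN_FMA_MATH opts back into plain FP32 FMAs.
    cudnnSetRNNMatrixMathType(desc, CUDNN_FMA_MATH);
#endif
  }
  // Otherwise leave the default math type, which permits TF32 where available.
}
```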
Summary: For mobile custom build, we only generate code for ops that are used by specific models to reduce binary size. There are multiple places where we apply the op filtering:

- generated_unboxing_wrappers_*.cpp
- autograd/VariableType*.cpp
- c10 op registration (in aten/gen.py)

For c10 op registration, we filter by the main op name: all overloads that match the main op name part are kept. For generated_unboxing_wrappers_*, we filter by the full op name: only those having exactly the same overload name are kept. This PR changes the generated_unboxing_wrappers_* and autograd/VariableType*.cpp codegen to also filter by the main op name. The reasons are:

- keeping all overloads gives better backward compatibility;
- generated_unboxing_wrappers_* are relatively small, as they only contain thin wrappers for root ops;
- generated_unboxing_wrappers_* will be replaced by c10 op registration soon anyway;
- autograd/VariableType*.cpp are not included in the OSS build.

Why does it offer better backward compatibility? #40737 is an example: it introduced a new `_convolution` overload and renamed the original one to `_convolution.deprecated`. Before this PR, a model prepared by an old version of PyTorch would not be able to run on the custom mobile build generated on that PR, because `_convolution.deprecated` would not be kept in the custom build under the full-op-name matching policy. By relaxing it to a partial matching policy, the mobile custom build CI on the PR can pass. Will test the size impact for the FB production build before landing.

Pull Request resolved: #42214. Differential Revision: [D22809564](https://our.internmc.facebook.com/intern/diff/D22809564). Test Plan: Imported from OSS. Reviewed By: iseeyuan. Pulled By: ljk53. fbshipit-source-id: e2fc017da31f38b9430cc2113f33e6d21a0eaf0b
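The two matching policies differ only in how much of the schema name they compare. A minimal sketch of the distinction (the real filtering lives in the Python codegen; these helper names are made up for illustration):

```cpp
#include <string>
#include <unordered_set>

// "aten::_convolution.deprecated" -> "aten::_convolution"
std::string rootOpName(const std::string& opName) {
  auto dot = opName.find('.');
  return dot == std::string::npos ? opName : opName.substr(0, dot);
}

// Full-name matching: only exact overloads survive the custom build.
bool keepOpFullMatch(const std::string& opName,
                     const std::unordered_set<std::string>& selected) {
  return selected.count(opName) > 0;
}

// Partial (main-op-name) matching: every overload of a selected op survives,
// so an old model calling "aten::_convolution.deprecated" still loads.
bool keepOpPartialMatch(const std::string& opName,
                        const std::unordered_set<std::string>& selected) {
  return selected.count(rootOpName(opName)) > 0;
}
```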
FYI, if you rebase on #42214 the OSS mobile CIs should pass. @ngimel could you please take a look at my question on #42214 (comment)? 1) What's the reason we want to keep the old schema as `_convolution.deprecated`? (Can't the non-mobile case handle BC with the default value?) 2) If we do want to keep the old schema, can we give the new schema a different overload name instead? This way we can work around internal lite-interpreter BC problems easily.
@zasdfgbnm per @ljk53's point, it makes sense, instead of creating a new overload for `_convolution`, to just add an argument with a default value. Server-side BC should be able to handle it, and that should make the mobile fixes easier?
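To make the two options concrete, a hypothetical C++-level sketch (the real schemas live in native_functions.yaml; these signatures are simplified stand-ins, not the actual `_convolution` signature):

```cpp
struct Tensor {};

// Option A (what the PR does): keep the old schema as a separate
// "_convolution.deprecated" overload next to the extended one.
Tensor convolution_deprecated(const Tensor& input) { return {}; }
Tensor convolution(const Tensor& input, bool allow_tf32) { return {}; }

// Option B (suggested here): one schema with a defaulted trailing argument,
// so old call sites keep working without a second overload.
Tensor convolution_defaulted(const Tensor& input, bool allow_tf32 = true) {
  return {};
}
```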
When a default argument is added, it does not break backward compatibility (BC) for full-JIT, but it does break BC for mobile bytecode, as in #40737. To make the bytecode BC in this case, we:

1. Introduce kMinSupportedBytecodeVersion. The loaded model version should be between kMinSupportedBytecodeVersion and kProducedBytecodeVersion.
2. If an operator is updated and we can handle BC, bump kProducedBytecodeVersion (for example, from 3 to 4).
3. If the model version is at the older version of the operator, add an adapter function at loading. For the added default arg, we push this default arg onto the stack before calling the actual operator function (sketched below).
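A minimal sketch of that scheme, with hypothetical names and a simplified stack type (the real loader lives in the mobile interpreter; only the constant names and the 3-to-4 bump follow this thread):

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

constexpr uint64_t kMinSupportedBytecodeVersion = 0x3L;
constexpr uint64_t kProducedBytecodeVersion = 0x4L;  // bumped 3 -> 4 for the new arg

using Stack = std::vector<double>;           // stand-in for the interpreter stack
using OpFn = std::function<void(Stack&)>;

// Step 1: reject models outside the supported version window.
void checkBytecodeVersion(uint64_t modelVersion) {
  if (modelVersion < kMinSupportedBytecodeVersion ||
      modelVersion > kProducedBytecodeVersion) {
    throw std::runtime_error("unsupported bytecode version");
  }
}

// Step 3: for a model built before the op gained its default argument,
// wrap the new op so the default is pushed onto the stack before dispatch.
OpFn adaptOpWithNewDefaultArg(OpFn newOp, double defaultValue) {
  return [newOp, defaultValue](Stack& stack) {
    stack.push_back(defaultValue);  // supply the newly added trailing default
    newOp(stack);
  };
}
```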
No - it's the same for the purpose of fixing lite-interp. I asked just to check my understanding. Sounds like there was a legit reason not to pick a true/false default value for the new bool param. We have split the conversions across different PRs, so here is a summary of the status:
I mentioned 2c) just to show a possible way to work around the lite-interp BC problem in general, but it might not be the most reasonable fix for this specific case. 2b) might be a reasonable thing to do, but I'll let you guys decide. Since Martin has prepared 2a), we can go with it if we decide to add the new param.
Summary: Pull Request resolved: #42413 (the bytecode BC change described above). Test Plan: Imported from OSS. Reviewed By: xcheng16. Differential Revision: [D22898314](https://our.internmc.facebook.com/intern/diff/D22898314). Pulled By: iseeyuan. fbshipit-source-id: 90d339f8e1365f4bb178db8db7c147390173372b
#42413 has been landed. Please bump `kProducedBytecodeVersion` (in `caffe2/serialize/inline_container.h`) to `0x4L` in this PR so that it will not break backward compatibility for mobile.
@iseeyuan Great! Thanks for the reminder!
This should be ready?
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Pull Request resolved: #40737. Reviewed By: mruberry. Differential Revision: D22801525. Pulled By: ngimel. fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2