
Conversation

@zasdfgbnm
Collaborator

@zasdfgbnm zasdfgbnm commented Jun 29, 2020

No description provided.

ghstack-source-id: 9504cd9
Pull Request resolved: #39361
@dr-ci

dr-ci bot commented Jun 29, 2020

💊 CI failures summary and remediations

As of commit 2d85abe (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_rocm3_5_1_py3_6_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 21 22:12:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected ‘;’ before ‘}’ token\n int main() { return 0 }\n ^\n" }
Aug 21 22:12:24 Traceback (most recent call last): 
Aug 21 22:12:24   File "test/run_test.py", line 719, in <module> 
Aug 21 22:12:24     main() 
Aug 21 22:12:24   File "test/run_test.py", line 708, in main 
Aug 21 22:12:24     raise RuntimeError(err) 
Aug 21 22:12:24 RuntimeError: distributed/test_c10d failed! Received signal: SIGIOT 
Aug 21 22:12:24 + cleanup 
Aug 21 22:12:24 + retcode=1 
Aug 21 22:12:24 + set +x 
Aug 21 22:12:24 =================== sccache compilation log =================== 
Aug 21 22:12:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected ‘;’ before ‘}’ token\n int main() { return 0 }\n                       ^\n" } 
Aug 21 22:12:24  
Aug 21 22:12:24 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 21 22:12:24 Compile requests                155 
Aug 21 22:12:24 Compile requests executed        53 
Aug 21 22:12:24 Cache hits                        3 
Aug 21 22:12:24 Cache misses                     49 
Aug 21 22:12:24 Cache timeouts                    0 
Aug 21 22:12:24 Cache read errors                 0 
Aug 21 22:12:24 Forced recaches                   0 
Aug 21 22:12:24 Cache write errors                0 

ci.pytorch.org: 1 failed



@mcarilli
Collaborator

mcarilli commented Jul 15, 2020

I think we need something like this

} else if (dataType == CUDNN_DATA_FLOAT && !allow_tf32) {
#if defined(CUDNN_VERSION) && CUDNN_VERSION >= 8000
      cudnnSetRNNMatrixMathType(mut_desc(), CUDNN_FMA_MATH);
#endif
}

where the math mode is set for RNNs as well. (GitHub won't let me comment on the line directly.)
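For illustration, here is a hedged Python sketch of the branch in the C++ snippet above. The string constants mirror cuDNN's names, but `select_rnn_math_type` and its return values are purely illustrative, not an actual API:

```python
# Hypothetical Python mirror of the cuDNN math-mode branch above;
# select_rnn_math_type and the string values are illustrative only.
def select_rnn_math_type(data_type, allow_tf32, cudnn_version):
    # With cuDNN >= 8000, FP32 RNNs use TF32 by default, so FP32 work
    # must be explicitly forced back to FMA math when TF32 is disallowed.
    if data_type == "CUDNN_DATA_FLOAT" and not allow_tf32 and cudnn_version >= 8000:
        return "CUDNN_FMA_MATH"
    return "CUDNN_DEFAULT_MATH"

print(select_rnn_math_type("CUDNN_DATA_FLOAT", False, 8000))  # CUDNN_FMA_MATH
print(select_rnn_math_type("CUDNN_DATA_FLOAT", True, 8000))   # CUDNN_DEFAULT_MATH
```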

ljk53 added a commit that referenced this pull request Jul 30, 2020
Summary:
For mobile custom build, we only generate code for ops that are used by
specific models to reduce binary size.

There are multiple places where we apply the op filtering:
- generated_unboxing_wrappers_*.cpp
- autograd/VariableType*.cpp
- c10 op registration (in aten/gen.py)

For c10 op registration, we filter by the main op name - all overloads
that match the main op name part will be kept.

For generated_unboxing_wrappers_*, we filter by the full op name - only
ops whose full name (including the overload part) matches exactly will be kept.

This PR changes generated_unboxing_wrappers_* and autograd/VariableType*.cpp
codegen to also filter by the main op name.

The reasons are:
- keeping all overloads can have better backward compatibility;
- generated_unboxing_wrappers_* are relatively small, as they only contain
  thin wrappers for root ops.
- generated_unboxing_wrappers_* will be replaced by c10 op registration
  soon anyway.
- autograd/VariableType*.cpp are not included in OSS build.

Why does it offer better backward compatibility? #40737 is an example:
it introduced a new `_convolution` overload and renamed the original one
to `_convolution.deprecated`. Before this PR, a model prepared by an old
version of PyTorch couldn't run on the custom mobile build generated on
that PR, because `_convolution.deprecated` wasn't kept in the custom
build under the full-op-name matching policy. Relaxing it to partial
matching lets the mobile custom build CI on the PR pass.
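The difference between the two matching policies can be sketched as follows. The op names and the `main_name` helper are illustrative, not the actual codegen internals:

```python
# Sketch of the two filtering policies described above; names are
# illustrative, not the real codegen implementation.
used_root_ops = {"aten::_convolution"}

def main_name(op):
    # "aten::_convolution.deprecated" -> "aten::_convolution"
    return op.split(".")[0]

registered = [
    "aten::_convolution",
    "aten::_convolution.deprecated",
    "aten::add.Tensor",
]

# Old policy: keep only exact full-name matches.
full_match = [op for op in registered if op in used_root_ops]
# New policy: keep every overload whose main name matches.
main_match = [op for op in registered if main_name(op) in used_root_ops]

print(full_match)  # ['aten::_convolution']
print(main_match)  # ['aten::_convolution', 'aten::_convolution.deprecated']
```

Under the new policy, a model that still calls the deprecated overload keeps working, since every overload of a used root op survives the filter.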

Will test the size impact for FB production build before landing.

Differential Revision: [D22809564](https://our.internmc.facebook.com/intern/diff/D22809564)

[ghstack-poisoned]
ljk53 added a commit that referenced this pull request Jul 30, 2020
ljk53 added a commit that referenced this pull request Jul 30, 2020
ghstack-source-id: 7f87d57
Pull Request resolved: #42214
facebook-github-bot pushed a commit that referenced this pull request Jul 31, 2020
Differential Revision: D22809564

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Pulled By: ljk53

fbshipit-source-id: e2fc017da31f38b9430cc2113f33e6d21a0eaf0b
@ljk53
Contributor

ljk53 commented Jul 31, 2020

FYI, if you rebase on #42214 the OSS mobile CIs should pass.

@ngimel could you please take a look at my question on #42214 (comment): 1) what's the reason we want to keep the old schema as `_convolution.deprecated` (can't the non-mobile case handle BC with the default value)? 2) if we do want to keep the old schema, can we give the new schema a different overload name instead? That way we can work around internal lite-interpreter BC problems easily.

@ngimel
Collaborator

ngimel commented Aug 1, 2020

@zasdfgbnm per @ljk53's point, it makes sense, instead of creating a new overload for _convolution, to just add an argument with a default value - server-side BC should be able to handle it, and that should make the mobile fixes easier?
@ljk53, would just changing the _convolution signature, without adding an overload, make the lite interpreter fixes more straightforward?

iseeyuan added a commit that referenced this pull request Aug 3, 2020
When a default argument is added, it does not break backward compatibility (BC) for full-jit, but does break BC for mobile bytecode. For example, #40737. To make bytecode BC in this case, we

1. Introduce kMinSupportedBytecodeVersion. The loaded model version should be between kMinSupportedBytecodeVersion and kProducedBytecodeVersion.
2. If an operator is updated, and we can handle BC, bump the kProducedBytecodeVersion (for example, from 3 to 4).
3. If the model version predates the operator update, add an adapter function at loading time. For the added default arg, we push the default value onto the stack before calling the actual operator function.
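The three steps above can be sketched as follows. The constant names come from the text; everything else (the adapter function, the toy op) is a minimal hypothetical illustration:

```python
# Minimal sketch of the bytecode versioning scheme described above.
kMinSupportedBytecodeVersion = 3
kProducedBytecodeVersion = 4

def is_supported(model_version):
    # Step 1: accept only versions inside the supported window.
    return kMinSupportedBytecodeVersion <= model_version <= kProducedBytecodeVersion

def call_with_adapter(op, args, model_version, added_default):
    # Step 3: for models written before the default arg existed,
    # push the default before dispatching to the current operator.
    if model_version < kProducedBytecodeVersion:
        args = args + [added_default]
    return op(*args)

# Toy operator whose current schema takes the extra flag.
conv = lambda x, allow_tf32: (x, allow_tf32)
print(is_supported(3))                          # True
print(call_with_adapter(conv, ["x"], 3, True))  # ('x', True)
```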


[ghstack-poisoned]
iseeyuan added a commit that referenced this pull request Aug 3, 2020
@ljk53
Contributor

ljk53 commented Aug 4, 2020

@ljk53, would just changing _convolution signature and not adding overload make lite interpreter fixes more straightforward?

No - it's the same for the purpose of fixing the lite interpreter. I asked just to check my understanding. Sounds like there was a legitimate reason not to pick a true/false default value for the new bool param.

We have split the conversation across different PRs, so here is a summary of the status:

  1. [pytorch] include all overloads for OSS custom build #42214 already fixed the OSS mobile CIs on this PR (yay!)
  2. For the internal mobile BC CIs, we have several different options:
    2a) Martin's PR Mobile backward compatibility #42413 - which creates an adapter for the mobile interpreter to handle the old _convolution used in old models;
    2b) @ezyang's comment: [pytorch] include all overloads for OSS custom build #42214 (comment) - have we thought about not modifying the signature of "_convolution"? (e.g. keep reading the flag from global state inside _convolution?)
    2c) If we add a new overload instead of renaming the existing one (from "" to "deprecated"):
  original _convolution -> keep it unchanged (instead of renaming it to ".deprecated")
  new _convolution -> call it _convolution.tf32 or _convolution.v2

I mentioned 2c) just to show a possible way to work around the lite-interpreter BC problem in general, but it might not be the most reasonable fix for this specific case. 2b) might be a reasonable thing to do, but I'll let you guys decide. Since Martin has prepared 2a), we can go with it if we decide to add the new param.

iseeyuan added a commit that referenced this pull request Aug 21, 2020
Differential Revision: [D22898314](https://our.internmc.facebook.com/intern/diff/D22898314)

[ghstack-poisoned]
iseeyuan added a commit that referenced this pull request Aug 21, 2020
facebook-github-bot pushed a commit that referenced this pull request Aug 21, 2020
Summary:
Pull Request resolved: #42413

Test Plan: Imported from OSS

Reviewed By: xcheng16

Differential Revision: D22898314

Pulled By: iseeyuan

fbshipit-source-id: 90d339f8e1365f4bb178db8db7c147390173372b
@iseeyuan
Contributor

#42413 has been landed. Please bump kProducedBytecodeVersion (in caffe2/serialize/inline_container.h) to 0x4L in this PR so that it will not break backward compatibility for mobile.

@zasdfgbnm
Collaborator Author

@iseeyuan Great! Thanks for the reminder!

@zasdfgbnm
Collaborator Author

This should be ready?

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Sep 1, 2020
Summary: Pull Request resolved: #40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
@zasdfgbnm zasdfgbnm deleted the ci-all/tf32-cudnn branch September 2, 2020 17:07
geoffberry pushed a commit to geoffberry/glow that referenced this pull request Nov 8, 2020
