@rohan-varma rohan-varma commented Mar 31, 2022

Stack from ghstack (oldest at bottom):

Reland #74452

The issue was that older NCCL versions do not support bf16. This reland takes an approach similar to #67843 to ensure the test only runs with later NCCL versions.
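
As a rough illustration of gating a test on the NCCL version, here is a minimal sketch. It is not the code from this PR or from #67843. The helper names (`normalize_nccl_version`, `detect_nccl_version`, `nccl_supports_bf16`, `requires_bf16_nccl`) are hypothetical, and the minimum version of 2.10 is an assumption that should be checked against the NCCL release in use.

```python
import unittest

# Assumption: bf16 collectives need NCCL 2.10 or newer. Adjust for your build.
MIN_NCCL_FOR_BF16 = (2, 10)


def normalize_nccl_version(version):
    """Convert an NCCL version (tuple or integer code) to a (major, minor, patch) tuple."""
    if version is None:
        return None
    if isinstance(version, int):
        if version >= 10000:
            # Newer integer encoding: major * 10000 + minor * 100 + patch.
            return (version // 10000, (version % 10000) // 100, version % 100)
        # Older integer encoding: major * 1000 + minor * 100 + patch.
        return (version // 1000, (version % 1000) // 100, version % 100)
    return tuple(version)


def detect_nccl_version():
    """Best-effort lookup of the NCCL version PyTorch was built with, else None."""
    try:
        import torch
        return normalize_nccl_version(torch.cuda.nccl.version())
    except Exception:
        # No torch, or no NCCL support in this build: treat as unknown.
        return None


def nccl_supports_bf16(version, minimum=MIN_NCCL_FOR_BF16):
    """Return True only if the version is known and at least `minimum`."""
    version = normalize_nccl_version(version)
    return version is not None and version[:2] >= minimum


def requires_bf16_nccl(test_fn):
    """Skip the decorated test unless the detected NCCL version supports bf16."""
    reason = "bf16 needs NCCL >= %d.%d" % MIN_NCCL_FOR_BF16
    return unittest.skipIf(not nccl_supports_bf16(detect_nccl_version()), reason)(test_fn)
```

A bf16 mixed precision test would then carry `@requires_bf16_nccl`, so it is skipped rather than failing on machines with an older or undetectable NCCL.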

Original commit changeset: 99295ea4ff02

Original Phabricator Diff: D35000703

Differential Revision: D35287501

[ghstack-poisoned]
facebook-github-bot commented Mar 31, 2022

💊 CI failures summary and remediations

As of commit 01e3c46 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test"

2022-04-05T14:29:19.6976754Z processing existing schema:  text(__torch__.torch.classes.profiling.SourceRef _0) -> (str _0)
2022-04-05T14:29:19.6978200Z processing existing schema:  count(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-05T14:29:19.6979857Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-05T14:29:19.6980285Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> (__torch__.torch.classes.profiling.SourceRef _0)
2022-04-05T14:29:19.6981907Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> (Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0)
2022-04-05T14:29:19.6982784Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-05T14:29:19.6984127Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-05T14:29:19.6985303Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-05T14:29:19.6987031Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> (__torch__.torch.classes.profiling.SourceStats[] _0)
2022-04-05T14:29:19.6988535Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (NoneType _0)
2022-04-05T14:29:19.6988886Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-04-05T14:29:19.6988900Z 
2022-04-05T14:29:19.6988977Z Broken ops: [
2022-04-05T14:29:19.6989224Z 	quantized::softmax(Tensor qx, int dim, float output_scale, int output_zero_point) -> (Tensor)
2022-04-05T14:29:19.6989289Z ]
2022-04-05T14:29:19.8019105Z + cleanup
2022-04-05T14:29:19.8019519Z + retcode=1
2022-04-05T14:29:19.8019589Z + set +x
2022-04-05T14:29:19.8059390Z ##[error]Process completed with exit code 1.
2022-04-05T14:29:19.8106133Z Prepare all required actions
2022-04-05T14:29:19.8106310Z Getting action download info

@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed labels Mar 31, 2022
rohan-varma added a commit that referenced this pull request Mar 31, 2022
Original commit changeset: 99295ea4ff02

Original Phabricator Diff: D35000703

Differential Revision: [D35287501](https://our.internmc.facebook.com/intern/diff/D35287501/)

ghstack-source-id: 152704893
Pull Request resolved: #75024
@rohan-varma rohan-varma changed the title Back out "Revert D35000703: [WIP][FSDP] Mixed precision enablement" [Reland][FSDP] Mixed precision enablement" Mar 31, 2022
@mrshenli mrshenli left a comment

> The issue was that older NCCL versions do not support bf16. This reland takes an approach similar to #67843 to ensure the test only runs with later NCCL versions.

Do we need to mention this in our docs?

@rohan-varma rohan-varma added the ciflow/trunk label Apr 4, 2022
@rohan-varma

> > The issue was that older NCCL versions do not support bf16. This reland takes an approach similar to #67843 to ensure the test only runs with later NCCL versions.
>
> Do we need to mention this in our docs?

Yep, we should definitely add this. Filed #75215

rohan-varma added a commit that referenced this pull request Apr 4, 2022
Pull Request resolved: #75024

Original commit changeset: 99295ea4ff02

Original Phabricator Diff: D35000703
ghstack-source-id: 152989929

Differential Revision: [D35287501](https://our.internmc.facebook.com/intern/diff/D35287501/)
@rohan-varma rohan-varma added ciflow/trunk and removed ciflow/trunk labels Apr 4, 2022
facebook-github-bot pushed a commit that referenced this pull request Apr 5, 2022
…75024)

Summary:
Pull Request resolved: #75024

Original commit changeset: 99295ea4ff02

Original Phabricator Diff: D35000703 (6b0b088)
ghstack-source-id: 153059190

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D35287501

fbshipit-source-id: c6c9ada039de27cf9cc477561f92a7f888bdf5f7
github-actions bot commented Apr 5, 2022

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.
