-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[NCCL] Tests for WorkNCCL::wait with Timeouts #40947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 3e25ba8 (more details on the Dr. CI page):
🕵️ 3 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
| // First we reset the NCCL_BLOCKING_WAIT environment variable to prevent | ||
| // any unwanted side-effects. We must do this at all exit-points for this | ||
| // function. | ||
| setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (please feel free to ignore): does it worth it to make this a guard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It's kind of overhead to restore original env value
| setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1); | ||
| // If no exception, that is also an error since we expect the test.wait call | ||
| // to timeout and throw. | ||
| throw std::runtime_error("BOOM!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the message more informative here? Same for the above two throws.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually planning on cleaning up these messages, but just followed the same error messages as the already existing tests for now (they all do throw std::runtime_error("BOOM!") for some reason...). I'm planning on putting up another PR that fixes all of these.
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
| // First we reset the NCCL_BLOCKING_WAIT environment variable to prevent | ||
| // any unwanted side-effects. We must do this at all exit-points for this | ||
| // function. | ||
| setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It's kind of overhead to restore original env value
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error. Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/) [ghstack-poisoned]
|
This pull request has been merged in 01dcef2. |
Stack from ghstack:
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error.
Differential Revision: D22173101