[Inductor] [Triton] Capture Timeout errors without crashing the job #169064
njriasan wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169064
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure) As of commit 9566d34 with merge base 9f7fceb. FLAKY - the following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@PaulZhang12 An argument can be made that we would rather let the jobs crash early than waste compute on a job that keeps running until it hits the compile timeout. Should we gate this behavior behind an environment variable and use it solely for debugging?
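For illustration, a minimal sketch of what gating this behind an environment variable could look like. The variable name `TORCHINDUCTOR_TOLERATE_COMPILE_TIMEOUTS` and the `handle_compile_timeout` helper are hypothetical and are not introduced by this PR:

```python
import logging
import os

log = logging.getLogger(__name__)

# Hypothetical flag; this PR does not actually add this environment variable.
_TOLERATE_TIMEOUTS = (
    os.environ.get("TORCHINDUCTOR_TOLERATE_COMPILE_TIMEOUTS", "0") == "1"
)


def handle_compile_timeout(exc: TimeoutError) -> None:
    """Swallow the timeout when debugging hangs; otherwise fail fast."""
    if _TOLERATE_TIMEOUTS:
        # Debugging mode: log the timed-out candidate and keep the job alive.
        log.warning("Compile candidate timed out; skipping it: %s", exc)
    else:
        # Default mode: crash early rather than let a hung job burn compute
        # until the overall compile timeout is reached.
        raise exc
```

Defaulting to the fail-fast branch would address the concern above while still allowing the tolerant behavior for debugging runs.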
Force-pushed from 0f332d3 to 4186353.
PaulZhang12 left a comment
Overall, looks good to me!
```python
from torch._inductor.codegen.cuda.cuda_kernel import (
    CUDATemplateCaller,
)
futures[future].mark_failed()
```
nit: I see this line duplicated in the if/else block below; can we move it back up here?
Let me take a look. This change is AI-driven, so I likely missed a code quality issue. Thanks for flagging this.
Looks like a quality issue in the old code. Seems like the if/else should just be deleted. Good find!
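To make the nit concrete, here is a self-contained sketch of the duplication being discussed and the simplification the thread lands on (hoist the call and delete the branch). The `_Choice` class and function names are stand-ins, not the real Inductor caller classes:

```python
# Hypothetical stand-in for the real Inductor choice/caller object.
class _Choice:
    def __init__(self) -> None:
        self.failed = False

    def mark_failed(self) -> None:
        self.failed = True


def prune_before(choice: _Choice, is_cuda_template: bool) -> None:
    # Before: the same call appears in both branches, so the branch is dead weight.
    if is_cuda_template:
        choice.mark_failed()
    else:
        choice.mark_failed()


def prune_after(choice: _Choice) -> None:
    # After: the if/else is deleted and the single call is kept.
    choice.mark_failed()
```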
Force-pushed from 4186353 to fd6a32b.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Force-pushed from fd6a32b to 9566d34.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #169064. Approved by: https://github.com/PaulZhang12
Summary:
Captures timeout errors during compilation instead of forcing process failure. Useful for avoiding hangs in MAST jobs.
We may want to consider a configuration option for this, since never pruning bad config options can waste compute.
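As a rough illustration of the behavior described above, here is a self-contained sketch (not this PR's actual code) of pruning candidates whose precompilation times out instead of letting the exception kill the job. All names, the thread-pool setup, and the timeout value are assumptions made for the example:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeoutError


def precompile_one(config_id: int) -> int:
    """Stand-in for compiling one autotune candidate; candidate 1 'hangs'."""
    time.sleep(5.0 if config_id == 1 else 0.01)
    return config_id


def precompile_all(config_ids: list[int], timeout_s: float = 0.5) -> list[int]:
    survivors: list[int] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(precompile_one, c): c for c in config_ids}
        for future, config_id in futures.items():
            try:
                survivors.append(future.result(timeout=timeout_s))
            except FutureTimeoutError:
                # Capture the timeout: prune this candidate and keep going
                # rather than letting the whole compile job crash.
                print(f"config {config_id} timed out and was pruned")
    return survivors


if __name__ == "__main__":
    print("surviving configs:", precompile_all([0, 1, 2, 3]))
```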
Test Plan: Tested with local model reproducers.
Differential Revision: D87866423
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo