[Inductor] [Triton] Capture Timeout errors without crashing the job #169064

Closed
njriasan wants to merge 1 commit into pytorch:main from njriasan:export-D87866423

njriasan (Contributor) commented Nov 25, 2025

Summary:
Captures timeout errors during compilation instead of forcing the process to fail. This is useful for avoiding hangs in MAST jobs.

We may want to consider a configuration option for this, to avoid the compute wasted when bad config options are never pruned.

Test Plan: Tested with local model reproducers.

Differential Revision: D87866423

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
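
The behavior described in the summary can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `Choice`, `precompile_all`, and the thread pool setup are hypothetical stand-ins, and only the `mark_failed()` name is borrowed from the review snippet later in this thread. The idea is to catch the timeout raised while waiting on compile futures, mark the unfinished candidates as failed, and keep the process alive.

```python
import concurrent.futures
import logging
import time

log = logging.getLogger(__name__)


class Choice:
    """Hypothetical stand-in for an autotuning candidate (not an Inductor class)."""

    def __init__(self, name, compile_seconds):
        self.name = name
        self.compile_seconds = compile_seconds
        self.failed = False

    def precompile(self):
        time.sleep(self.compile_seconds)  # simulate compilation work

    def mark_failed(self):
        self.failed = True


def precompile_all(choices, timeout_seconds):
    """Precompile every choice; record timeouts instead of raising them."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(choices))
    futures = {executor.submit(c.precompile): c for c in choices}
    try:
        for future in concurrent.futures.as_completed(futures, timeout=timeout_seconds):
            if future.exception() is not None:
                futures[future].mark_failed()
    except concurrent.futures.TimeoutError:
        # Capture the timeout: mark whatever has not finished and carry on.
        for future, choice in futures.items():
            if not future.done():
                log.warning("Precompile of %s timed out", choice.name)
                choice.mark_failed()
    finally:
        # Do not block on stragglers (cancel_futures needs Python 3.9+).
        executor.shutdown(wait=False, cancel_futures=True)
    return [c for c in choices if not c.failed]
```

With a 0.2 s budget, a candidate that needs 1 s to compile is marked failed and dropped from the returned list, while a fast candidate survives and the caller continues normally.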

pytorch-bot bot commented Nov 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169064

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9566d34 with merge base 9f7fceb:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync bot commented Nov 25, 2025

@njriasan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87866423.

njriasan (Contributor, Author) commented Dec 1, 2025

@PaulZhang12 An argument can be made that we would rather let the jobs crash early than waste compute with a successful job reaching the compile timeout. Should we gate this behavior behind an environment variable and solely use it for debugging?
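
One way the environment-variable gate raised in this comment could look is sketched below. Everything here is an assumption for illustration: the variable name `EXAMPLE_CAPTURE_COMPILE_TIMEOUTS` and both helper functions are hypothetical and are not PyTorch settings or APIs. The default keeps the fail-fast behavior, and capture is strictly opt-in.

```python
import os

# Hypothetical variable name, chosen for illustration only; not a real PyTorch setting.
ENV_VAR = "EXAMPLE_CAPTURE_COMPILE_TIMEOUTS"


def capture_timeouts_enabled(environ=None):
    """Return True when the (hypothetical) opt-in flag is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "0") == "1"


def handle_compile_timeout(exc, choice_name, failed, environ=None):
    """Record the timeout when capture is enabled; otherwise re-raise to fail fast."""
    if capture_timeouts_enabled(environ):
        failed.append(choice_name)
        return
    raise exc
```

Defaulting to the crash path matches the concern stated above: a job that silently absorbs compile timeouts may burn compute, so swallowing them is something a user would turn on deliberately while debugging.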

njriasan added a commit to njriasan/pytorch that referenced this pull request Dec 1, 2025
njriasan added the module: logging and topic: not user facing labels Dec 1, 2025

PaulZhang12 (Contributor) left a comment

Overall, looks good to me!

from torch._inductor.codegen.cuda.cuda_kernel import (
    CUDATemplateCaller,
)
futures[future].mark_failed()
Contributor

nit: I see this line duplicated in the if else block below, can move back up here?

Contributor Author

Let me take a look. This is AI driven so I likely missed a code quality issue. Thanks for flagging this.

Contributor Author

Looks like a quality issue in the old code. Seems like the if/else should just be deleted. Good find!
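
The cleanup discussed in this review thread can be illustrated with a small sketch. The actual if/else is not shown on this page, so the code below assumes both branches did nothing but repeat the `mark_failed()` call; `Choice`, `TemplateChoice`, and both functions are hypothetical stand-ins rather than Inductor code.

```python
class Choice:
    """Hypothetical stand-in for a kernel choice; not an Inductor class."""

    def __init__(self):
        self.failed = False

    def mark_failed(self):
        self.failed = True


class TemplateChoice(Choice):
    """Hypothetical subclass standing in for a template caller type."""


def on_error_before(choice):
    # Both branches do the same thing, so the type check adds nothing.
    if isinstance(choice, TemplateChoice):
        choice.mark_failed()
    else:
        choice.mark_failed()


def on_error_after(choice):
    # Same behavior with the redundant if/else deleted.
    choice.mark_failed()
```

Under that assumption the two functions behave identically for every choice type, which is why deleting the branch is safe.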

pytorch-bot bot added the ciflow/trunk label Dec 1, 2025
njriasan added a commit to njriasan/pytorch that referenced this pull request Dec 1, 2025
Reviewed By: PaulZhang12

Differential Revision: D87866423

njriasan (Contributor, Author) commented Dec 1, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers


njriasan (Contributor, Author) commented Dec 1, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #169064
Approved by: https://github.com/PaulZhang12
Labels

ciflow/inductor, ciflow/trunk, fb-exported, Merged, meta-exported, module: inductor, module: logging, topic: not user facing
