
Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions#164591

Closed
sbak5 wants to merge 6 commits into pytorch:main from sbak5:sbak/watchdog_fix_v2.9.0

Conversation


@sbak5 sbak5 commented Oct 3, 2025

This PR includes a couple of changes that extend the FlightRecorder dump performed by the PyTorch watchdog:

  • New knobs to control the FR dump, as suggested in the public documentation, now honored by the watchdog as well
    (TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
  • Trigger the flight recorder dump on exceptions that could be raised by any CUDA / host-side error
    (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
    -> Can be used as a snapshot of the workload's progress for post-mortem analysis

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela


pytorch-bot bot commented Oct 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164591

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 080f074 with merge base 3c59351:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: dynamo module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: releng release notes category release notes: inductor (aoti) labels Oct 3, 2025

linux-foundation-easycla bot commented Oct 3, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@facebook-github-bot facebook-github-bot added oncall: jit Add this issue/PR to JIT oncall triage queue module: rocm AMD GPU support for Pytorch labels Oct 3, 2025
@albanD albanD removed their request for review October 3, 2025 18:58
@sbak5 sbak5 marked this pull request as draft October 3, 2025 18:59
@sbak5 sbak5 marked this pull request as ready for review October 3, 2025 19:15

sbak5 commented Oct 3, 2025

@fduwjj This PR includes the changes we discussed about the knobs and the extra dumping on failures.

@fduwjj fduwjj left a comment

Thanks for contributing to the FR code, and thanks for your feedback. Overall the change looks reasonable, but I am a little concerned that adding more dumps might interfere with the existing timeout dump, especially since it would overwrite the existing dump file. So can we set the default value of TORCH_NCCL_EXTRA_DUMP_ON_EXEC to false, so you can turn it on directly on your end and we can gradually roll it out on the Meta side?

Comment on lines 129 to 135
// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_NCCL_INCLUDE_STACK_TRACE = {
"TORCH_NCCL_INCLUDE_STACK_TRACE"};

// Whether to include only active collectives in the Flight Recorder trace (default false)
static std::vector<std::string> TORCH_NCCL_INCLUDE_ONLY_ACTIVE = {
"TORCH_NCCL_INCLUDE_ONLY_ACTIVE"};
Contributor

Also, since FR is now a separate, generic module, can we not make it NCCL-specific? WDYT? I mean just TORCH_INCLUDE_STACK_TRACE. Ideally these two ENV values should be defined in torch/csrc/distributed/c10d/FlightRecorder.hpp.

Contributor Author

Thanks for the suggestion. I think that should happen in a separate PR, which may include some refactoring of other env vars from ProcessGroupNCCL.hpp into FlightRecorder.hpp.

Contributor Author

Moved the env vars to FlightRecorder.hpp


sbak5 commented Oct 3, 2025

Thanks for contributing to the FR code, and thanks for your feedback. Overall the change looks reasonable, but I am a little concerned that adding more dumps might interfere with the existing timeout dump, especially since it would overwrite the existing dump file. So can we set the default value of TORCH_NCCL_EXTRA_DUMP_ON_EXEC to false, so you can turn it on directly on your end and we can gradually roll it out on the Meta side?

It doesn't trigger duplicate dumping, thanks to compare_exchange (the extra dump is triggered only when this atomic CAS is successful), but I agree with setting the var to false by default. Pushed a change.

@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 7, 2025
@fduwjj fduwjj left a comment

Overall, this looks reasonable to me and I am OK to merge this change for now. But as I mentioned in the comment, there could be a potential race condition in the dump for the same rank, which you need to be careful about.

};

// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_INCLUDE_STACK_TRACE = {
Contributor

thanks!!

Comment on lines +1899 to +1900
bool dumpStackTrace = getCvarBool(TORCH_INCLUDE_STACK_TRACE, true);
bool onlyActive = getCvarBool(TORCH_INCLUDE_ONLY_ACTIVE, false);
Contributor

nit: not blocking, but usually inside PGNCCL we first check the ENV value and store it as a member, so that we don't check it every time this function is called. Unless you plan to change the value of this ENV dynamically.

bool dumpExtraOnExec_ = getCvarBool(TORCH_NCCL_EXTRA_DUMP_ON_EXEC, false);
if (dumpExtraOnExec_) {
bool should_dump_local = false;
bool succeded = shouldDump_.compare_exchange_strong(
Contributor

Well, when you have multiple PGs, compare_exchange_strong only works on one object, but the dump (writer) is global (it is a singleton because FR itself is a singleton). I know the current design is a bit hacky, and we do want to consolidate the watchdog thread and monitor thread to be per class, but that is a different story from this PR. That is why I am a bit worried that this might override the more complete dump.

@sbak5 sbak5 Oct 8, 2025

This is intended to happen only on the first PG that observes the exception, so any other PG that sees the same CUDA exception doesn't try to dump it. In the current default path, dumping happens only on PG 0.
shouldDump_ seems to be shared by all threads, so this routine prevents duplicate dumping by process groups other than the first one. All ranks try the CAS, and the one PG whose CAS succeeds proceeds to broadcast and dump; PG 0 doesn't do the dumping.
The local variable is simply used as a reference value; what's actually updated is shouldDump_.


fduwjj commented Oct 8, 2025

Whenever you are ready to merge this PR, you can use @pytorchbot and the bot will merge the PR for you. More info can be found here: https://github.com/pytorch/pytorch/wiki/Bot-commands

@sbak5 sbak5 requested a review from fduwjj October 8, 2025 22:55
@sbak5 sbak5 force-pushed the sbak/watchdog_fix_v2.9.0 branch from 219b410 to 080f074 Compare October 8, 2025 23:42

sbak5 commented Oct 9, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 9, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…ves) and trigger FR even on any exceptions (pytorch#164591)

This PR includes a couple of changes to extend FlightRecorder dump by PyTorch watchdog

- New knobs to control FR dump as suggested in the public documentation even for watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error
  (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload progress for post-mortem analysis

Pull Request resolved: pytorch#164591
Approved by: https://github.com/fduwjj
VieEeEw added a commit to VieEeEw/pytorch that referenced this pull request Nov 5, 2025
…gered FR dump (pytorch#167023)

Summary:

We should also retry if including stack traces fails. The change was introduced in pytorch#164591

Test Plan: eyes

Differential Revision: D86248484
pytorch-bot bot pushed a commit that referenced this pull request Nov 5, 2025: [Flight Recorder] Reverted to include stack traces for dump pipe triggered FR dump (#167023)
jeanschmidt pushed a commit that referenced this pull request Nov 6, 2025 (#167023)
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025 (pytorch#167023)

ppwwyyxx commented Dec 8, 2025

The name is too general. Would be better to use something like TORCH_NCCL_DUMP_INCLUDE_STACK_TRACE


Labels

  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: dynamo
  • module: inductor
  • module: rocm (AMD GPU support for Pytorch)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • oncall: jit (Add this issue/PR to JIT oncall triage queue)
  • open source
  • release notes: inductor (aoti)
  • release notes: releng (release notes category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants