
Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions#164591

Closed
sbak5 wants to merge 6 commits into pytorch:main from sbak5:sbak/watchdog_fix_v2.9.0

Conversation


@sbak5 sbak5 commented Oct 3, 2025

This PR includes a couple of changes that extend the FlightRecorder dump performed by the PyTorch watchdog:

  • New knobs to control the FR dump, as suggested in the public documentation, now honored by the watchdog as well
    (TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
  • Trigger the flight recorder dump on exceptions that could be raised by any CUDA / host-side error
    (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
    -> Can be used as a snapshot of the workload's progress for post-mortem analysis

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela


pytorch-bot bot commented Oct 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164591

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 080f074 with merge base 3c59351:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: dynamo module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: releng release notes category release notes: inductor (aoti) labels Oct 3, 2025

linux-foundation-easycla bot commented Oct 3, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@facebook-github-bot facebook-github-bot added oncall: jit Add this issue/PR to JIT oncall triage queue module: rocm AMD GPU support for Pytorch labels Oct 3, 2025
@albanD albanD removed their request for review October 3, 2025 18:58
@sbak5 sbak5 marked this pull request as draft October 3, 2025 18:59
@sbak5 sbak5 marked this pull request as ready for review October 3, 2025 19:15

sbak5 commented Oct 3, 2025

@fduwjj This PR includes the changes we discussed about the knobs and the extra dumping on failures.

@fduwjj fduwjj left a comment

Thanks for contributing to the FR code, and thanks for your feedback. Overall the change looks reasonable, but I am a little concerned that adding more dumps might interfere with the existing timeout dump, especially since it would overwrite the existing dump file. So can we set the default value of TORCH_NCCL_EXTRA_DUMP_ON_EXEC to false, so you can turn it on directly on your end and we can gradually roll it out on the Meta side?

Comment on lines 129 to 135
// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_NCCL_INCLUDE_STACK_TRACE = {
"TORCH_NCCL_INCLUDE_STACK_TRACE"};

// Whether to include only active collectives in the Flight Recorder trace (default false)
static std::vector<std::string> TORCH_NCCL_INCLUDE_ONLY_ACTIVE = {
"TORCH_NCCL_INCLUDE_ONLY_ACTIVE"};
Contributor

Also, since FR is now a separate, generic module, can we not make it NCCL-specific? WDYT? I mean just TORCH_INCLUDE_STACK_TRACE. Ideally these two ENV values should be defined in torch/csrc/distributed/c10d/FlightRecorder.hpp.

Contributor Author

Thanks for the suggestion. I think that should happen in a separate PR, which may include some refactoring of other env vars from ProcessGroupNCCL.hpp into FlightRecorder.hpp.

Contributor Author

Moved the env vars to FlightRecorder.hpp


sbak5 commented Oct 3, 2025

Thanks for contributing to the FR code, and thanks for your feedback. Overall the change looks reasonable, but I am a little concerned that adding more dumps might interfere with the existing timeout dump, especially since it would overwrite the existing dump file. So can we set the default value of TORCH_NCCL_EXTRA_DUMP_ON_EXEC to false, so you can turn it on directly on your end and we can gradually roll it out on the Meta side?

It doesn't trigger duplicate dumping, thanks to compare_exchange (the extra dump is triggered only when this atomic CAS is successful), but I agree with setting the var to false by default. Pushed a change.

@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 7, 2025
@fduwjj fduwjj left a comment

Overall, this looks reasonable to me and I am OK to merge this change for now. But as I mentioned in the comment, there could be a potential race condition in the dump for the same rank, which you need to be careful about.

};

// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_INCLUDE_STACK_TRACE = {
Contributor

thanks!!

Comment on lines +1899 to +1900
bool dumpStackTrace = getCvarBool(TORCH_INCLUDE_STACK_TRACE, true);
bool onlyActive = getCvarBool(TORCH_INCLUDE_ONLY_ACTIVE, false);
Contributor

nit: not blocking, but usually inside PGNCCL we first check the ENV value and store it as a member, so that we don't check it every time this function is called. Unless you plan to change the value of this ENV dynamically.

bool dumpExtraOnExec_ = getCvarBool(TORCH_NCCL_EXTRA_DUMP_ON_EXEC, false);
if (dumpExtraOnExec_) {
bool should_dump_local = false;
bool succeded = shouldDump_.compare_exchange_strong(
Contributor

Well, when you have multiple PGs, compare_exchange_strong only works on one object, but the dump (writer) is global (it is a singleton because FR itself is a singleton). I know the current design is a bit hacky, and we do want to consolidate the watchdog thread and monitor thread to be per class, but that is a different story from this PR. That is why I am a bit worried that this might override the more complete dump.

@sbak5 sbak5 Oct 8, 2025

This is intended to happen only on the first PG that observes the exception, so any other PG that sees the same CUDA exception doesn't try to dump it. In the current default path, dumping happens only on PG 0.
shouldDump_ seems to be shared by all threads, so this routine prevents duplicate dumping by process groups other than the first one. All ranks try the CAS, and the one PG whose CAS succeeds proceeds to broadcast and dump; PG 0 doesn't do the dumping.
The local variable is simply used as a reference value; what's actually updated is shouldDump_.


fduwjj commented Oct 8, 2025

Whenever you are ready to merge this PR, you can use @pytorchbot and the bot will merge the PR for you. More info can be found here: https://github.com/pytorch/pytorch/wiki/Bot-commands

@sbak5 sbak5 requested a review from fduwjj October 8, 2025 22:55
@sbak5 sbak5 force-pushed the sbak/watchdog_fix_v2.9.0 branch from 219b410 to 080f074 Compare October 8, 2025 23:42

sbak5 commented Oct 9, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 9, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…ves) and trigger FR even on any exceptions (pytorch#164591)

This PR includes a couple of changes to extend FlightRecorder dump by PyTorch watchdog

- New knobs to control FR dump as suggested in the public documentation even for watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error
  (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload progress for post-mortem analysis

Pull Request resolved: pytorch#164591
Approved by: https://github.com/fduwjj
VieEeEw added a commit to VieEeEw/pytorch that referenced this pull request Nov 5, 2025
…gered FR dump (pytorch#167023)

Summary:

We should also retry if including stack traces fails. The change was introduced in pytorch#164591

Test Plan: eyes

Differential Revision: D86248484
pytorch-bot bot pushed a commit that referenced this pull request Nov 5, 2025: [Flight Recorder] Reverted to include stack traces for dump pipe triggered FR dump (#167023)
jeanschmidt pushed a commit that referenced this pull request Nov 6, 2025 (#167023)
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025 (pytorch#167023)

ppwwyyxx commented Dec 8, 2025

The name is too general. Would be better to use something like TORCH_NCCL_DUMP_INCLUDE_STACK_TRACE


Labels

  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: dynamo
  • module: inductor
  • module: rocm (AMD GPU support for Pytorch)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • oncall: jit (Add this issue/PR to JIT oncall triage queue)
  • open source
  • release notes: inductor (aoti)
  • release notes: releng (release notes category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants