Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions #164591
sbak5 wants to merge 6 commits into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164591
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 080f074 with merge base 3c59351. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@fduwjj This PR includes the changes discussed about the knobs and the extra dumping on failures.
Thanks for contributing to the FR code and thanks for your feedback. Overall the change looks reasonable, but I am a little concerned that adding more dumps might interfere with the existing timeout dump, especially since it would override the existing dump file. So can we set the default value of TORCH_NCCL_EXTRA_DUMP_ON_EXEC to false, so you can turn it on directly on your end and we can gradually roll it out on the Meta side?
```cpp
// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_NCCL_INCLUDE_STACK_TRACE = {
    "TORCH_NCCL_INCLUDE_STACK_TRACE"};

// Whether to include only active collectives in the Flight Recorder trace (default false)
static std::vector<std::string> TORCH_NCCL_INCLUDE_ONLY_ACTIVE = {
    "TORCH_NCCL_INCLUDE_ONLY_ACTIVE"};
```
Also, since FR is now a separate module and is generic, can we avoid making it NCCL-specific? WDYT? I mean just TORCH_INCLUDE_STACK_TRACE. Ideally these two env values should be defined in torch/csrc/distributed/c10d/FlightRecorder.hpp.
Thanks for the suggestion. I think that should happen in a separate PR, which may include some refactoring of other env vars from ProcessGroupNCCL.hpp to FlightRecorder.hpp.
Moved the env vars to FlightRecorder.hpp
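For reference, here is a minimal, self-contained sketch of how a vector-of-names knob like the ones above is typically resolved: the names act as aliases that are checked in order. getBoolEnv is a hypothetical stand-in for c10d's getCvarBool, and the accepted truthy spellings are assumptions rather than the actual parsing rules.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for c10d's getCvarBool: walk the alias list in order
// and return the first env var that is set, parsed as a boolean; otherwise
// return the supplied default. The truthy spellings below are an assumption.
static bool getBoolEnv(const std::vector<std::string>& aliases, bool def) {
  for (const auto& name : aliases) {
    const char* raw = std::getenv(name.c_str());
    if (raw == nullptr) {
      continue;
    }
    const std::string val(raw);
    return val == "1" || val == "true" || val == "TRUE";
  }
  return def;
}

// Knob declared the same way as in this PR: a vector, so aliases can be added later.
static std::vector<std::string> TORCH_INCLUDE_STACK_TRACE = {
    "TORCH_INCLUDE_STACK_TRACE"};

int main() {
  // Default true, matching the comment in the diff above.
  return getBoolEnv(TORCH_INCLUDE_STACK_TRACE, true) ? 0 : 1;
}
```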
It doesn't trigger duplicate dumping, thanks to the compare-exchange on the shared shouldDump_ flag (see the discussion below).
fduwjj left a comment
Overall, this looks reasonable to me and I am OK to merge this change for now. But as I mentioned in the comment, there could be a potential race condition in the dump for the same rank, which you need to be careful about.
```cpp
};

// Whether to include stack trace in the Flight Recorder trace (default true)
static std::vector<std::string> TORCH_INCLUDE_STACK_TRACE = {
```
```cpp
bool dumpStackTrace = getCvarBool(TORCH_INCLUDE_STACK_TRACE, true);
bool onlyActive = getCvarBool(TORCH_INCLUDE_ONLY_ACTIVE, false);
```
nit: not blocking. Usually inside PGNCCL we first check the env value and store it as a member, so that we don't check it every time this function is called, unless you plan to change the value of this env var dynamically.
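To illustrate the suggestion, here is a minimal sketch of the read-once-and-cache pattern, assuming the knobs only need to be read at construction time. The class, member, and helper names are illustrative, not the actual ProcessGroupNCCL members, and the real code uses getCvarBool with an alias list.

```cpp
#include <cstdlib>
#include <string>

// Sketch of the pattern suggested above: query the env knobs once in the
// constructor and cache them as members, instead of re-reading them on every
// call into the dump path.
class FlightRecorderDumpConfig {
 public:
  FlightRecorderDumpConfig()
      : dumpStackTrace_(readBoolEnv("TORCH_INCLUDE_STACK_TRACE", true)),
        onlyActive_(readBoolEnv("TORCH_INCLUDE_ONLY_ACTIVE", false)) {}

  bool dumpStackTrace() const { return dumpStackTrace_; }
  bool onlyActive() const { return onlyActive_; }

 private:
  // Hypothetical helper; truthy spellings are an assumption.
  static bool readBoolEnv(const char* name, bool def) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) {
      return def;
    }
    const std::string val(raw);
    return val == "1" || val == "true" || val == "TRUE";
  }

  const bool dumpStackTrace_;
  const bool onlyActive_;
};
```

The trade-off noted in the comment still applies: cached values cannot pick up changes made to the env var after construction.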
```cpp
bool dumpExtraOnExec_ = getCvarBool(TORCH_NCCL_EXTRA_DUMP_ON_EXEC, false);
if (dumpExtraOnExec_) {
  bool should_dump_local = false;
  bool succeded = shouldDump_.compare_exchange_strong(
```
Well, when you have multiple PGs, compare_exchange_strong only works on one object, but the dump (writer) is global (it is a singleton because FR itself is a singleton). I know the current design is a bit hacky, and we do want to consolidate the watchdog thread and monitor thread to be per class, but that is a different story from this PR. That is why I am a bit worried that having this might override the more complete dump.
This is intended to happen only on the first PG observing the exception, so any other PG that also sees the same CUDA exception doesn't try to dump it. In the current default path, dumping happens only on PG 0.
shouldDump_ seems to be shared by all threads, so this routine prevents duplicate dumping from process groups other than the first one. All PGs try the CAS, and the one that succeeds proceeds to broadcast and dump; it isn't necessarily PG 0 that does the dumping.
The local variable is simply used as the reference value for the CAS; what's actually updated is shouldDump_.
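To make the dedup logic concrete, here is a minimal, runnable sketch of the compare-exchange election described above. shouldDump and onException are stand-ins for the shared shouldDump_ flag and the watchdog's exception path, not the actual PGNCCL code.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the shouldDump_ flag shared by all process groups.
std::atomic<bool> shouldDump{false};

// Each PG's watchdog calls this when it observes the same exception.
void onException(int pgId) {
  bool expected = false;  // local variable used only as the CAS reference
  if (shouldDump.compare_exchange_strong(expected, true)) {
    // Only the first PG to flip the flag reaches this branch and dumps.
    std::printf("PG %d triggers the Flight Recorder dump\n", pgId);
  }
  // Every other PG loses the CAS (expected comes back true) and skips dumping.
}

int main() {
  std::vector<std::thread> watchdogs;
  for (int pg = 0; pg < 4; ++pg) {
    watchdogs.emplace_back(onException, pg);
  }
  for (auto& t : watchdogs) {
    t.join();
  }
  return 0;
}
```

Exactly one thread wins the CAS regardless of interleaving, so at most one dump is triggered per process, which is the behavior described in the reply above.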
Whenever you are ready to merge this PR, you can @pytorchbot and this bot will merge the PR for you. More info can be found here: https://github.com/pytorch/pytorch/wiki/Bot-commands
219b410 to 080f074 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ves) and trigger FR even on any exceptions (pytorch#164591)

This PR includes a couple of changes to extend the FlightRecorder dump by the PyTorch watchdog:
- New knobs to control the FR dump, as suggested in the public documentation, even for the watchdog (TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the Flight Recorder dump on exceptions that could be triggered by any CUDA / host-side error (TORCH_NCCL_EXTRA_DUMP_ON_EXEC); can be used as a snapshot of the workload progress for post-mortem analysis

Pull Request resolved: pytorch#164591
Approved by: https://github.com/fduwjj
…gered FR dump (pytorch#167023)

[Flight Recorder] Reverted to include stack traces for dump-pipe-triggered FR dump (pytorch#167023)

Summary: We should also retry if including stack traces failed. The change was introduced in pytorch#164591.

Test Plan: eyes

Reviewed By: fduwjj

Differential Revision: D86248484
The name is too general. It would be better to use something like …
This PR includes a couple of changes to extend the FlightRecorder dump by the PyTorch watchdog:
- New knobs to control the FR dump, as suggested in the public documentation, even for the watchdog (TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the Flight Recorder dump on exceptions that could be triggered by any CUDA / host-side error (TORCH_NCCL_EXTRA_DUMP_ON_EXEC); can be used as a snapshot of the workload progress for post-mortem analysis
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela