control_plane: add handler for WaitCounters by d4l3k · Pull Request #167871 · pytorch/pytorch

d4l3k · 2025-11-14T21:25:50Z

Summary:
This adds a DebugServer handler for WaitCounters so that all wait counters can be read live over HTTP.

To do so, we register a wait counter backend that tracks all counter values in a shared synchronized map. Creating a counter acquires the lock to add it to the global map, but at runtime updates use only atomic operations.
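
A minimal sketch of the locking scheme described above, not the code from this PR: `CounterRegistry`, `getOrCreate`, and `recordWait` are hypothetical names, and a plain `std::mutex` stands in for the synchronized map. Only the field names `total_time_us` and `max_time_us` come from the diff snippet quoted in review.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical per-counter record. The field names follow the review snippet;
// the layout itself is an assumption.
struct CounterData {
  std::atomic<int64_t> total_time_us{0};
  std::atomic<int64_t> max_time_us{0};
};

// Hypothetical registry standing in for the shared synchronized map.
class CounterRegistry {
 public:
  // Slow path: takes the lock, but only when a counter is created or looked up.
  std::shared_ptr<CounterData> getOrCreate(const std::string& key) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) {
      it = map_.emplace(key, std::make_shared<CounterData>()).first;
    }
    return it->second;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> guard(mutex_);
    return map_.size();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<CounterData>> map_;
};

// Fast path: no lock, only atomic read-modify-write operations.
inline void recordWait(CounterData& data, int64_t elapsed_us) {
  data.total_time_us.fetch_add(elapsed_us, std::memory_order_relaxed);
  int64_t prev = data.max_time_us.load(std::memory_order_relaxed);
  // compare_exchange_weak reloads prev on failure, so the loop retries until
  // the stored maximum is at least elapsed_us.
  while (prev < elapsed_us &&
         !data.max_time_us.compare_exchange_weak(
             prev, elapsed_us, std::memory_order_relaxed)) {
  }
}
```

A caller would hold on to the `shared_ptr` returned by `getOrCreate` and call `recordWait` on it for every measurement, so the mutex is never touched on the hot path.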

Test Plan:

//caffe2/test/distributed/elastic:test_control_plane

Differential Revision: D87095718

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @pragupta @msaroufim @dcci

pytorch-bot · 2025-11-14T21:25:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167871

✅ No Failures

As of commit a0736b1 with merge base 1c0bf2a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2025-11-14T21:25:58Z

The label oncall: distributed is only applicable to issues and has been removed. Please only use this label on issues.

meta-codesync · 2025-11-14T21:26:03Z

@d4l3k has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87095718.

Skylion007 · 2025-11-16T17:53:13Z

torch/csrc/distributed/c10d/control_plane/WaitCounterHandler.cpp

+          data->total_time_us.load(std::memory_order_relaxed);
+      counter_obj["max_time_us"] =
+          data->max_time_us.load(std::memory_order_relaxed);
+      j[name] = counter_obj;


I doubt the json object is trivially copyable.

Suggested change

-      j[name] = counter_obj;
+      j[name] = std::move(counter_obj);
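
To illustrate the difference the suggestion makes, here is a small sketch that counts copies and moves. `FakeJson`, `storeByCopy`, and `storeByMove` are hypothetical stand-ins, not the JSON type used in the PR; the point is only that assigning a named local copies it unless it is wrapped in `std::move`.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a JSON object that counts how it is assigned.
struct FakeJson {
  static inline int copies = 0;
  static inline int moves = 0;
  std::map<std::string, long> fields;

  FakeJson() = default;
  FakeJson(const FakeJson& other) : fields(other.fields) { ++copies; }
  FakeJson(FakeJson&& other) noexcept : fields(std::move(other.fields)) {
    ++moves;
  }
  FakeJson& operator=(const FakeJson& other) {
    fields = other.fields;
    ++copies;
    return *this;
  }
  FakeJson& operator=(FakeJson&& other) noexcept {
    fields = std::move(other.fields);
    ++moves;
    return *this;
  }
};

// Assigning the named local invokes copy assignment.
inline void storeByCopy(std::map<std::string, FakeJson>& j,
                        const std::string& name) {
  FakeJson counter_obj;
  counter_obj.fields["max_time_us"] = 42;
  j[name] = counter_obj;
}

// std::move casts the local to an rvalue, so move assignment is chosen.
inline void storeByMove(std::map<std::string, FakeJson>& j,
                        const std::string& name) {
  FakeJson counter_obj;
  counter_obj.fields["max_time_us"] = 42;
  j[name] = std::move(counter_obj);
}
```

Since `counter_obj` is not used after the assignment, moving it is safe and avoids duplicating its contents.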

Skylion007 · 2025-11-16T18:03:17Z

torch/csrc/distributed/c10d/control_plane/WaitCounterHandler.cpp

+  explicit TrackingBackend(std::string key) : key_(std::move(key)) {
+    // Get or create counter data for this key
+    getCounterDataMapHolder()->map.withLock([&](auto& map) {
+      auto it = map.find(key_);


I am wondering if there is a way to avoid looking up the key twice in C++17. There is try_emplace, but the make_shared semantics complicate that a bit, and we don't want to unconditionally create the shared pointer.

This is fine as is, though.
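
One possible way to get a single lookup without allocating unconditionally, sketched under assumptions: `CounterData`, `CounterMap`, and `getOrCreateOnce` are hypothetical names, and the `allocations` parameter exists only to make the behavior observable. The idea is to `try_emplace` a null `shared_ptr` and fill it in only when the insertion actually happened.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical payload type; the real counter data is not reproduced here.
struct CounterData {
  long total_time_us = 0;
};

using CounterMap =
    std::unordered_map<std::string, std::shared_ptr<CounterData>>;

// Returns the entry for key, allocating a CounterData only on first insertion.
inline std::shared_ptr<CounterData> getOrCreateOnce(CounterMap& map,
                                                    const std::string& key,
                                                    int& allocations) {
  // With no mapped-value arguments, try_emplace value-initializes the
  // shared_ptr (null) if the key is absent and leaves the map alone otherwise.
  auto [it, inserted] = map.try_emplace(key);
  if (inserted) {
    it->second = std::make_shared<CounterData>();
    ++allocations;
  }
  return it->second;
}
```

The trade-off is that the map briefly holds a null pointer; if `std::make_shared` threw, that null entry would remain, so callers sharing the map would need to tolerate or clean up that case. In the PR the access happens under the map's lock, which would hide the intermediate state from other threads.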

fduwjj

LGTM

meta-codesync · 2025-11-19T21:19:45Z

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D87095718.

facebook-github-bot · 2025-11-21T16:48:12Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-11-21T16:50:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Summary:
This adds a DebugServer handler for WaitCounters such that we can access all wait counters live via HTTP.

To do so we register a wait counter backend that tracks all counter values in a shared synchronized map. When creating a counter this will acquire the lock to add it to the global map but during runtime it only uses atomic operations.

Test Plan:

```
//caffe2/test/distributed/elastic:test_control_plane
```

Differential Revision: D87095718

Pull Request resolved: #167871
Approved by: https://github.com/fduwjj

pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels and removed the oncall: distributed label Nov 14, 2025

pytorch-bot bot added the oncall: distributed label Nov 14, 2025

pytorch-bot bot removed the oncall: distributed label Nov 14, 2025

meta-codesync bot added the fb-exported and meta-exported labels Nov 14, 2025

d4l3k force-pushed the export-D87095718 branch from 727b507 to 422354f on November 14, 2025 23:00

pytorch-bot bot added and removed the oncall: distributed label Nov 14, 2025

d4l3k force-pushed the export-D87095718 branch from 422354f to ea25537 on November 14, 2025 23:07

pytorch-bot bot added and removed the oncall: distributed label Nov 14, 2025

Skylion007 reviewed Nov 16, 2025

fduwjj approved these changes Nov 18, 2025

pytorch-bot bot added the ciflow/trunk label Nov 18, 2025

d4l3k force-pushed the export-D87095718 branch from ea25537 to a0736b1 on November 19, 2025 21:14

pytorchmergebot added the merging label Nov 21, 2025

pytorchmergebot added the Merged label Nov 21, 2025

pytorchmergebot closed this in 107ab1c Nov 21, 2025

pytorchmergebot removed the merging label Nov 21, 2025

d4l3k wants to merge 1 commit into pytorch:main from d4l3k:export-D87095718

5 participants
