
control_plane: add handler for WaitCounters#167871

Closed
d4l3k wants to merge 1 commit into pytorch:main from d4l3k:export-D87095718


@d4l3k
Member

@d4l3k d4l3k commented Nov 14, 2025

Summary:
This adds a DebugServer handler for WaitCounters so that all wait counters can be read live via HTTP.

To do so, we register a wait counter backend that tracks all counter values in a shared synchronized map. Creating a counter acquires the lock to add it to the global map, but at runtime updates use only atomic operations.
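
A minimal sketch of the pattern described above, not the code in this PR. The names `CounterData`, `Registry`, `TrackingCounter`, and the `total_calls` field are hypothetical; only `total_time_us` and `max_time_us` appear in the reviewed snippet. The lock is taken once at construction, and recording afterwards touches only atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical per-counter state. total_calls is an assumption for
// illustration; the other two fields mirror the snippet under review.
struct CounterData {
  std::atomic<int64_t> total_calls{0};
  std::atomic<int64_t> total_time_us{0};
  std::atomic<int64_t> max_time_us{0};
};

// Hypothetical registry: a mutex-guarded map from counter name to data.
class Registry {
 public:
  // Slow path: takes the lock, runs once per counter instance.
  std::shared_ptr<CounterData> getOrCreate(const std::string& key) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto& slot = map_[key];
    if (!slot) {
      slot = std::make_shared<CounterData>();
    }
    return slot;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::shared_ptr<CounterData>> map_;
};

inline Registry& registry() {
  static Registry instance;
  return instance;
}

class TrackingCounter {
 public:
  explicit TrackingCounter(const std::string& key)
      : data_(registry().getOrCreate(key)) {}

  // Hot path: no lock, only relaxed atomic updates.
  void record(int64_t elapsed_us) {
    assert(elapsed_us >= 0);
    data_->total_calls.fetch_add(1, std::memory_order_relaxed);
    data_->total_time_us.fetch_add(elapsed_us, std::memory_order_relaxed);
    // Atomic max: retry until our value is stored or a larger one is seen.
    int64_t prev = data_->max_time_us.load(std::memory_order_relaxed);
    while (prev < elapsed_us &&
           !data_->max_time_us.compare_exchange_weak(
               prev, elapsed_us, std::memory_order_relaxed)) {
    }
  }

  const CounterData& data() const { return *data_; }

 private:
  std::shared_ptr<CounterData> data_;
};
```

Two `TrackingCounter` instances built with the same key share one `CounterData`, so an HTTP handler could walk the map under the lock and read each field with a relaxed load.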

Test Plan:

//caffe2/test/distributed/elastic:test_control_plane

Differential Revision: D87095718

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @pragupta @msaroufim @dcci

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels and removed the oncall: distributed label Nov 14, 2025
@pytorch-bot

pytorch-bot bot commented Nov 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167871

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a0736b1 with merge base 1c0bf2a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Nov 14, 2025

The label oncall: distributed is only applicable to issues and has been removed. Please only use this label on issues.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Nov 14, 2025
@meta-codesync

meta-codesync bot commented Nov 14, 2025

@d4l3k has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87095718.

@pytorch-bot pytorch-bot bot removed the oncall: distributed label Nov 14, 2025

d4l3k added a commit to d4l3k/pytorch that referenced this pull request Nov 14, 2025
@pytorch-bot pytorch-bot bot added and removed the oncall: distributed label Nov 14, 2025

d4l3k added a commit to d4l3k/pytorch that referenced this pull request Nov 14, 2025

data->total_time_us.load(std::memory_order_relaxed);
counter_obj["max_time_us"] =
data->max_time_us.load(std::memory_order_relaxed);
j[name] = counter_obj;
Collaborator

@Skylion007 Skylion007 Nov 16, 2025

I doubt the json object is trivially copyable.

Suggested change
- j[name] = counter_obj;
+ j[name] = std::move(counter_obj);
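
To illustrate the difference the suggestion makes, here is a small sketch using a hypothetical stand-in type, `Payload`, rather than the JSON type used in the PR. It counts which assignment operator runs when a local object is stored into a map by copy versus by move. The helper names `storeByCopy` and `storeByMove` are assumptions for illustration only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a non-trivially-copyable object such as a JSON
// value. It counts how often each assignment operator is invoked.
struct Payload {
  static inline int copies = 0;
  static inline int moves = 0;

  std::string text;

  Payload() = default;
  explicit Payload(std::string t) : text(std::move(t)) {}
  Payload(const Payload&) = default;
  Payload(Payload&&) = default;

  Payload& operator=(const Payload& other) {
    text = other.text;  // duplicates the underlying buffer
    ++copies;
    return *this;
  }
  Payload& operator=(Payload&& other) noexcept {
    text = std::move(other.text);  // steals the underlying buffer
    ++moves;
    return *this;
  }
};

using PayloadMap = std::map<std::string, Payload>;

// Equivalent in shape to `j[name] = counter_obj;`
void storeByCopy(PayloadMap& out, const std::string& name, Payload& p) {
  out[name] = p;
}

// Equivalent in shape to `j[name] = std::move(counter_obj);`
void storeByMove(PayloadMap& out, const std::string& name, Payload&& p) {
  out[name] = std::move(p);
}
```

When the local object is not used again after the assignment, as in the reviewed snippet, the move form avoids one deep copy per counter.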

explicit TrackingBackend(std::string key) : key_(std::move(key)) {
// Get or create counter data for this key
getCounterDataMapHolder()->map.withLock([&](auto& map) {
auto it = map.find(key_);
Collaborator

I'm wondering if there is a way to avoid looking up the key twice in C++17. There is try_emplace, but the make_shared semantics complicate that a bit, and we don't want to unconditionally create the shared pointer.

This is fine as is though
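
One possible single-lookup shape, sketched under assumptions rather than taken from the PR: `try_emplace` with no value arguments inserts a value-initialized (null) `shared_ptr`, and the allocation happens only when the key was actually inserted. `CounterData`, `CounterMap`, and `getOrCreate` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical counter payload, for illustration only.
struct CounterData {
  long total_time_us = 0;
};

using CounterMap =
    std::unordered_map<std::string, std::shared_ptr<CounterData>>;

// Get or create the entry for `key` with a single hash lookup.
std::shared_ptr<CounterData> getOrCreate(CounterMap& map,
                                         const std::string& key) {
  // Inserts a null shared_ptr if the key is missing; otherwise leaves the
  // existing entry untouched. No CounterData is allocated at this point.
  auto [it, inserted] = map.try_emplace(key);
  if (inserted) {
    // Allocate only for new keys.
    it->second = std::make_shared<CounterData>();
  }
  return it->second;
}
```

The trade-off is that if `std::make_shared` throws, a null entry stays in the map, so the find-then-insert version in the PR is arguably simpler to reason about.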

Contributor

@fduwjj fduwjj left a comment

LGTM

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 18, 2025
@meta-codesync

meta-codesync bot commented Nov 19, 2025

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D87095718.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #167871
Approved by: https://github.com/fduwjj

5 participants