control_plane: add handler for WaitCounters #167871
d4l3k wants to merge 1 commit into pytorch:main
Dr. CI (automatically generated comment): ✅ No failures as of commit a0736b1 with merge base 1c0bf2a. Artifacts and rendered test results are at hud.pytorch.org/pr/167871.
Force-pushed from 727b507 to 422354f.
Summary: This adds a DebugServer handler for WaitCounters so that all wait counters can be read live over HTTP. To do so, we register a wait counter backend that tracks all counter values in a shared synchronized map. Creating a counter acquires the lock to add it to the global map; at runtime, updates use only atomic operations.

Test Plan:
```
//caffe2/test/distributed/elastic:test_control_plane
```

Differential Revision: D87095718
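The locking scheme described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `CounterRegistry`, `getOrCreate`, `recordWait`, and the `total_calls` field are hypothetical names, and a plain `std::mutex` stands in for whatever synchronized wrapper the real backend uses. Only `total_time_us` and `max_time_us` appear in the snippets quoted in review.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical per-counter state. Every field is atomic so that the
// hot path never needs the registry lock.
struct CounterData {
  std::atomic<int64_t> total_calls{0};    // assumed field, for illustration
  std::atomic<int64_t> total_time_us{0};
  std::atomic<int64_t> max_time_us{0};
};

// Hypothetical registry: the mutex guards only the shape of the map.
class CounterRegistry {
 public:
  // Slow path: takes the lock once, when a counter is created or looked up.
  std::shared_ptr<CounterData> getOrCreate(const std::string& key) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto& slot = map_[key];
    if (!slot) {
      slot = std::make_shared<CounterData>();
    }
    return slot;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<CounterData>> map_;
};

// Fast path: lock-free updates on data the caller already holds.
inline void recordWait(CounterData& data, int64_t elapsed_us) {
  data.total_calls.fetch_add(1, std::memory_order_relaxed);
  data.total_time_us.fetch_add(elapsed_us, std::memory_order_relaxed);
  // Raise the running maximum with a compare-and-swap loop; on failure
  // `prev` is refreshed with the current value and the check repeats.
  int64_t prev = data.max_time_us.load(std::memory_order_relaxed);
  while (prev < elapsed_us &&
         !data.max_time_us.compare_exchange_weak(
             prev, elapsed_us, std::memory_order_relaxed)) {
  }
}
```

A backend built this way would call `getOrCreate` once in its constructor and keep the returned `shared_ptr`, so each recorded wait costs only a few relaxed atomic operations.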
Force-pushed from 422354f to ea25537.
    data->total_time_us.load(std::memory_order_relaxed);
    counter_obj["max_time_us"] =
        data->max_time_us.load(std::memory_order_relaxed);
    j[name] = counter_obj;
I doubt the json object is trivially copyable.
Suggested change:
    - j[name] = counter_obj;
    + j[name] = std::move(counter_obj);
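To make the difference behind this suggestion concrete, here is a small sketch that counts copies and moves when a local object is assigned into a map slot. `Payload`, `counts`, `storeByCopy`, and `storeByMove` are hypothetical stand-ins; the sketch does not use the JSON library from the PR, it only shows which assignment operator each spelling selects.

```cpp
#include <map>
#include <string>
#include <utility>

// Tally of copy and move operations observed by Payload.
struct Counts {
  int copies = 0;
  int moves = 0;
};

inline Counts& counts() {
  static Counts c;
  return c;
}

// Hypothetical stand-in for a JSON object value; it only records
// whether it was copied or moved.
struct Payload {
  Payload() = default;
  Payload(const Payload&) { ++counts().copies; }
  Payload(Payload&&) noexcept { ++counts().moves; }
  Payload& operator=(const Payload&) {
    ++counts().copies;
    return *this;
  }
  Payload& operator=(Payload&&) noexcept {
    ++counts().moves;
    return *this;
  }
};

// Assigning a named local selects copy assignment.
inline void storeByCopy(std::map<std::string, Payload>& out,
                        const std::string& name) {
  Payload obj;
  out[name] = obj;
}

// std::move turns the local into an rvalue, selecting move assignment.
inline void storeByMove(std::map<std::string, Payload>& out,
                        const std::string& name) {
  Payload obj;
  out[name] = std::move(obj);
}
```

For a type that owns heap storage, the move path hands over the existing buffers instead of duplicating them, which is the saving the suggestion is after.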
    explicit TrackingBackend(std::string key) : key_(std::move(key)) {
      // Get or create counter data for this key
      getCounterDataMapHolder()->map.withLock([&](auto& map) {
        auto it = map.find(key_);
I wonder if there is a way to avoid looking up the key twice in C++17. There is `try_emplace`, but the `make_shared` semantics complicate that a bit, and we don't want to create the shared pointer unconditionally.
This is fine as is, though.
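One possible answer to the question above, offered as a sketch rather than as what the PR does: `try_emplace` called with only the key inserts a value-initialized (empty) `shared_ptr` when the key is absent, so the allocation can be deferred until the insertion is known to have happened. `CounterData`, `CounterMap`, and `getOrCreate` are hypothetical names here.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical counter payload, reduced to a single field.
struct CounterData {
  int64_t total_time_us = 0;
};

using CounterMap =
    std::unordered_map<std::string, std::shared_ptr<CounterData>>;

// Single hash lookup: try_emplace either finds the existing entry or
// inserts an empty shared_ptr, and reports which one happened.
inline std::shared_ptr<CounterData> getOrCreate(CounterMap& map,
                                                const std::string& key) {
  auto [it, inserted] = map.try_emplace(key);
  if (inserted) {
    // Allocate only for new keys. Caveat: if make_shared throws, the map
    // is left holding a null entry for this key.
    it->second = std::make_shared<CounterData>();
  }
  return it->second;
}
```

The trade-off is the null-entry caveat noted in the comment, which the find-then-insert form avoids; that may be a reason to keep the code as it is.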
Force-pushed from ea25537 to a0736b1.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
Pull Request resolved: #167871
Approved by: https://github.com/fduwjj
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @pragupta @msaroufim @dcci