dist/debug: add TCPStore debug page #169095

Closed
d4l3k wants to merge 4 commits into gh/d4l3k/2/base from gh/d4l3k/2/head

Conversation

@d4l3k
Member

@d4l3k d4l3k commented Nov 26, 2025

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169095

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a2ed8b7 with merge base 641cdb6 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #169096

d4l3k added 2 commits December 1, 2025 16:12
[ghstack-poisoned]
[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #169096

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #169147

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #169144

pytorchmergebot pushed a commit that referenced this pull request Dec 2, 2025
This uses `aiohttp` to run all requests concurrently. At 10k requests this cuts the latency from `15s` to `5s`; at 100k requests it takes `50s`. The 100k number is probably not representative, since I was running this on a single machine with only 4 workers.

Test plan:

patch fetch_all to do 100k requests instead
Pull Request resolved: #169096
Approved by: https://github.com/fduwjj
ghstack dependencies: #169095
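
For readers unfamiliar with the pattern, here is a minimal sketch of issuing many HTTP requests concurrently with `aiohttp` and `asyncio.gather`. It is not the code from this PR: the names `fetch_all_concurrently`, `fetch_with_aiohttp`, and the `max_in_flight` limit are hypothetical and chosen only for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def fetch_all_concurrently(
    urls: Iterable[str],
    fetch_one: Callable[[str], Awaitable[Any]],
    max_in_flight: int = 1000,  # hypothetical cap, not taken from the PR
) -> List[Any]:
    """Run fetch_one for every URL concurrently, keeping input order.

    A failed request shows up as an exception object in the result list
    instead of cancelling the other requests.
    """
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url: str) -> Any:
        # The semaphore limits how many requests are open at once.
        async with sem:
            return await fetch_one(url)

    return await asyncio.gather(
        *(bounded(u) for u in urls), return_exceptions=True
    )


async def fetch_with_aiohttp(urls: Iterable[str]) -> List[Any]:
    """Fetch every URL's body as text over one shared aiohttp session."""
    # Imported lazily so the helper above can be used without aiohttp installed.
    import aiohttp

    async with aiohttp.ClientSession() as session:

        async def fetch_one(url: str) -> str:
            async with session.get(url) as resp:
                return await resp.text()

        return await fetch_all_concurrently(urls, fetch_one)
```

Sharing one `ClientSession` lets connections be reused across requests, and the semaphore keeps a 100k-request run from opening every socket at once. Called as `asyncio.run(fetch_with_aiohttp(urls))`, it returns one entry per URL in input order.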
pytorchmergebot pushed a commit that referenced this pull request Dec 2, 2025
This adds FlightRecorder trace analysis using frtrace to the debug server.

Test plan:

<img width="2875" height="1295" alt="20251126_14h58m19s_grim" src="https://github.com/user-attachments/assets/4f285405-0f2f-4988-871f-85af1fe286b3" />
Pull Request resolved: #169144
Approved by: https://github.com/fduwjj
ghstack dependencies: #169095, #169096
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This adds a TCPStore debug page.

Test plan:

run debug server

<img width="1412" height="617" alt="20251125_17h23m00s_grim" src="https://github.com/user-attachments/assets/8557b239-c397-4d37-ae04-53a42d4096da" />

Pull Request resolved: #169095
Approved by: https://github.com/fduwjj
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This uses `aiohttp` to run all requests concurrently. At 10k requests this cuts the latency from `15s` to `5s`; at 100k requests it takes `50s`. The 100k number is probably not representative, since I was running this on a single machine with only 4 workers.

Test plan:

patch fetch_all to do 100k requests instead
Pull Request resolved: #169096
Approved by: https://github.com/fduwjj
ghstack dependencies: #169095
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This adds FlightRecorder trace analysis using frtrace to the debug server.

Test plan:

<img width="2875" height="1295" alt="20251126_14h58m19s_grim" src="https://github.com/user-attachments/assets/4f285405-0f2f-4988-871f-85af1fe286b3" />
Pull Request resolved: #169144
Approved by: https://github.com/fduwjj
ghstack dependencies: #169095, #169096
tiendatngcs pushed a commit to tiendatngcs/pytorch-Dec25 that referenced this pull request Dec 10, 2025
@github-actions github-actions bot deleted the gh/d4l3k/2/head branch January 2, 2026 02:20
3 participants