-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified
#88922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88922
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 92f310e: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
57015d9 to
31592d2
Compare
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
31592d2 to
eac44f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some unit tests for this parsing logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
hostname = param_hostname if param_hostname else ''
or
hostname = param_hostname or ''
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This endpoint could be a different host right?
like if you rendezvous with endpoint == "rank0" on rank 1 machine we wouldn't want to grab rank0 as the current address? am I misunderstanding this?
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
eac44f0 to
a544a41
Compare
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
a544a41 to
f64e446
Compare
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
f64e446 to
7ec0fa3
Compare
d4l3k
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM see internal comments
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
7ec0fa3 to
84a0ee6
Compare
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
84a0ee6 to
b2c2025
Compare
…to allow skip local IP lookup if specified (pytorch#88922) Summary: Pull Request resolved: pytorch#88922 Update dynamic renderzvous nodes to use rendezvous hostname if provided. Before: For dynamic renderzvous, it always grab the `fqdn` from socket for each node even if user specified the address. For example, https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256 ``` return _NodeDesc(socket.getfqdn(), os.getpid(), local_id) ``` Now: If user specifies the hostname, each node will respect the given hostname. For example, `socket.getfqdn(<hostname>) ` Test Plan: ``` % python -m torch.distributed.run --help usage: run.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE] [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT] [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone] [--max_restarts MAX_RESTARTS] [--monitor_interval MONITOR_INTERVAL] [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no_python] [--run_path] [--log_dir LOG_DIR] [-r % python -m torch.distributed.run --local_addr 192.168.0.84 --rdzv_backend 'c10d' training_script.py NOTE: Redirects are currently not supported in Windows or MacOs. Hello World % python -m torch.distributed.run --rdzv_backend 'c10d' training_script.py NOTE: Redirects are currently not supported in Windows or MacOs. Hello World ``` Unit tests with the 2 added tests. Reviewed By: d4l3k Differential Revision: D41204028 fbshipit-source-id: 6dbf07c10f19c44e691e7dcaa4cb3df62c735a98
|
This pull request was exported from Phabricator. Differential Revision: D41204028 |
b2c2025 to
92f310e
Compare
local_addr param to allow skip local IP lookup if specified
|
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary:
Update dynamic renderzvous nodes to use rendezvous hostname if provided.
For PR: #85300
Before:
For dynamic renderzvous, it always grab the
fqdnfrom socket for each node even if user specified the address.For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
Now:
If user specifies the hostname, each node will respect the given hostname.
For example,
socket.getfqdn(<hostname>)Test Plan: Unit tests.
Differential Revision: D41204028