Skip to content

Conversation

@cz-37
Copy link
Contributor

@cz-37 cz-37 commented Nov 12, 2022

Summary:
Update dynamic renderzvous nodes to use rendezvous hostname if provided.
For PR: #85300

Before:
For dynamic renderzvous, it always grab the fqdn from socket for each node even if user specified the address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256

return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)

Now:
If user specifies the hostname, each node will respect the given hostname.
For example, socket.getfqdn(<hostname>)

Test Plan: Unit tests.

Differential Revision: D41204028

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Nov 12, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: cz-37 (a544a41a621772cd3bda4cd8fa69d2e73d60189d)

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 12, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88922

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 92f310e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some unit tests for this parsing logic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

hostname = param_hostname if param_hostname else ''

or

hostname = param_hostname or ''

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint could be a different host right?

like if you rendezvous with endpoint == "rank0" on rank 1 machine we wouldn't want to grab rank0 as the current address? am I misunderstanding this?

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@cz-37 cz-37 requested a review from wanchaol as a code owner December 1, 2022 05:02
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM see internal comments

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

…to allow skip local IP lookup if specified (pytorch#88922)

Summary:
Pull Request resolved: pytorch#88922

Update dynamic renderzvous nodes to use rendezvous hostname if provided.

Before:
For dynamic renderzvous, it always grab the `fqdn` from socket for each node even if user specified the address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If user specifies the hostname, each node will respect the given hostname.
For example, `socket.getfqdn(<hostname>) `

Test Plan:
```
% python -m torch.distributed.run  --help
usage: run.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE] [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT] [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone] [--max_restarts MAX_RESTARTS] [--monitor_interval MONITOR_INTERVAL]
              [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no_python] [--run_path] [--log_dir LOG_DIR] [-r

% python -m torch.distributed.run  --local_addr 192.168.0.84 --rdzv_backend 'c10d'  training_script.py
NOTE: Redirects are currently not supported in Windows or MacOs.
Hello World

% python -m torch.distributed.run   --rdzv_backend 'c10d'  training_script.py
NOTE: Redirects are currently not supported in Windows or MacOs.
Hello World

```

Unit tests with the 2 added tests.

Reviewed By: d4l3k

Differential Revision: D41204028

fbshipit-source-id: 6dbf07c10f19c44e691e7dcaa4cb3df62c735a98
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41204028

@cz-37 cz-37 changed the title Update dynamic renderzvous nodes to use rendezvous hostname if provided Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified Dec 20, 2022
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 21, 2022
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants