[FSDP][1/N] Split `fully_shard` unit tests #92296

awgu · 2023-01-17T10:41:20Z

Stack from ghstack:

#92298 [FSDP][3/N] Refactor summon_full_params unit tests
#92297 [FSDP][2/N] _summon_full_params -> _unshard_params
#92296 [FSDP][1/N] Split fully_shard unit tests

This PR splits `test_fully_shard.py` into `fully_shard/test_fully_shard<...>.py`. This should improve readability and avoid some future rebase conflicts.

The only other real change is resolving a `TODO` for using `run_subtests` in the model checkpointing unit tests.

[ghstack-poisoned]

pytorch-bot · 2023-01-17T10:41:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92296


❌ 4 Failures

As of commit c5f0d89:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


awgu · 2023-01-19T21:34:21Z

test/distributed/_composable/fully_shard/test_fully_shard_model_checkpoint.py

+        E2E test of save + load with rank0_only + CPU offload for TransformerWithSharedParams
+        on the composable path.
+        """
+        self.run_subtests(


This is the only real change: I resolved a `TODO` by switching the test to `run_subtests`.
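To illustrate what moving a test onto `run_subtests` can look like, here is a minimal sketch. It assumes the helper takes a dict of option lists plus a test function and calls the function once per combination under `self.subTest`. `SubtestMixin`, `TestSaveLoad`, the option names, and `run_example` are hypothetical stand-ins for illustration, not the code from this PR or from PyTorch's test utilities.

```python
import itertools
import unittest
from typing import Any, Callable, Dict, List, Tuple


class SubtestMixin(unittest.TestCase):
    """Hypothetical stand-in for the FSDP test base class (not PyTorch's code)."""

    def run_subtests(
        self,
        subtest_config: Dict[str, List[Any]],
        test_fn: Callable[..., None],
        *test_args: Any,
        **test_kwargs: Any,
    ) -> None:
        # Expand the config into the Cartesian product of all option values.
        keys = list(subtest_config.keys())
        for values in itertools.product(*(subtest_config[k] for k in keys)):
            subtest_kwargs = dict(zip(keys, values))
            # Each combination is reported separately if it fails.
            with self.subTest(**subtest_kwargs):
                test_fn(*test_args, **test_kwargs, **subtest_kwargs)


class TestSaveLoad(SubtestMixin):
    # Records which combinations ran, so the example can be inspected.
    seen: List[Tuple[bool, bool]] = []

    def test_save_load(self) -> None:
        self.run_subtests(
            {"rank0_only": [True, False], "offload_to_cpu": [True, False]},
            self._test_save_load,
        )

    def _test_save_load(self, rank0_only: bool, offload_to_cpu: bool) -> None:
        # Placeholder body: a real test would save and reload a state dict here.
        TestSaveLoad.seen.append((rank0_only, offload_to_cpu))


def run_example() -> List[Tuple[bool, bool]]:
    """Run the test case once and return the combinations it exercised."""
    TestSaveLoad.seen.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSaveLoad)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    assert result.wasSuccessful()
    return list(TestSaveLoad.seen)
```

The appeal of this pattern is that one test method covers every option combination without a separate parametrized test per combination, which for distributed tests avoids repeating per-test setup.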

mrshenli

LGTM

awgu · 2023-01-19T21:48:16Z

@pytorchbot merge

pytorchmergebot · 2023-01-19T21:52:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2023-01-19T23:08:23Z

Merge failed

Reason: 2 mandatory check(s) failed (Rule Distributed). The first few are:


awgu · 2023-01-20T02:01:03Z

Failures look unrelated:

linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.4xlarge)

ERROR: Could not find a version that satisfies the requirement astunparse (from -r requirements.txt (line 2)) (from versions: none)
ERROR: No matching distribution found for astunparse (from -r requirements.txt (line 2))

linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)

ERROR: Could not find a version that satisfies the requirement astunparse (from -r requirements.txt (line 2)) (from versions: none)
ERROR: No matching distribution found for astunparse (from -r requirements.txt (line 2))

linux-focal-rocm5.3-py3.8 / test (default, 1, 2, linux.rocm.gpu)

======================================================================
2023-01-19T23:02:01.6644719Z ERROR [0.056s]: test_memory_snapshot (__main__.TestCudaComm)
2023-01-19T23:02:01.6645149Z ----------------------------------------------------------------------
2023-01-19T23:02:01.6645477Z Traceback (most recent call last):
2023-01-19T23:02:01.6645808Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1354, in do_open
2023-01-19T23:02:01.6646172Z     h.request(req.get_method(), req.selector, req.data, headers,
2023-01-19T23:02:01.6646531Z   File "/opt/conda/lib/python3.8/http/client.py", line 1256, in request
2023-01-19T23:02:01.6646898Z     self._send_request(method, url, body, headers, encode_chunked)
2023-01-19T23:02:01.6647263Z   File "/opt/conda/lib/python3.8/http/client.py", line 1302, in _send_request
2023-01-19T23:02:01.6647621Z     self.endheaders(body, encode_chunked=encode_chunked)
2023-01-19T23:02:01.6647979Z   File "/opt/conda/lib/python3.8/http/client.py", line 1251, in endheaders
2023-01-19T23:02:01.6648349Z     self._send_output(message_body, encode_chunked=encode_chunked)
2023-01-19T23:02:01.6648721Z   File "/opt/conda/lib/python3.8/http/client.py", line 1011, in _send_output
2023-01-19T23:02:01.6649011Z     self.send(msg)
2023-01-19T23:02:01.6649294Z   File "/opt/conda/lib/python3.8/http/client.py", line 951, in send
2023-01-19T23:02:01.6649577Z     self.connect()
2023-01-19T23:02:01.6649878Z   File "/opt/conda/lib/python3.8/http/client.py", line 1425, in connect
2023-01-19T23:02:01.6650223Z     self.sock = self._context.wrap_socket(self.sock,
2023-01-19T23:02:01.6650565Z   File "/opt/conda/lib/python3.8/ssl.py", line 500, in wrap_socket
2023-01-19T23:02:01.6650884Z     return self.sslsocket_class._create(
2023-01-19T23:02:01.6651191Z   File "/opt/conda/lib/python3.8/ssl.py", line 1040, in _create
2023-01-19T23:02:01.6651474Z     self.do_handshake()
2023-01-19T23:02:01.6651780Z   File "/opt/conda/lib/python3.8/ssl.py", line 1309, in do_handshake
2023-01-19T23:02:01.6652080Z     self._sslobj.do_handshake()
2023-01-19T23:02:01.6652490Z ConnectionResetError: [Errno 104] Connection reset by peer
2023-01-19T23:02:01.6652691Z 
2023-01-19T23:02:01.6652869Z During handling of the above exception, another exception occurred:
2023-01-19T23:02:01.6653080Z 
2023-01-19T23:02:01.6653223Z Traceback (most recent call last):
2023-01-19T23:02:01.6653571Z   File "test_cuda.py", line 4859, in test_memory_snapshot
2023-01-19T23:02:01.6653944Z     torch.cuda.memory._save_segment_usage(f.name)
2023-01-19T23:02:01.6654526Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/memory.py", line 641, in _save_segment_usage
2023-01-19T23:02:01.6654957Z     f.write(_segments(snapshot))
2023-01-19T23:02:01.6655501Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/_memory_viz.py", line 70, in segments
2023-01-19T23:02:01.6655939Z     return format_flamegraph(f.getvalue())
2023-01-19T23:02:01.6656511Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/_memory_viz.py", line 28, in format_flamegraph
2023-01-19T23:02:01.6656946Z     urllib.request.urlretrieve(
2023-01-19T23:02:01.6657355Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 247, in urlretrieve
2023-01-19T23:02:01.6657790Z     with contextlib.closing(urlopen(url, data)) as fp:
2023-01-19T23:02:01.6658213Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
2023-01-19T23:02:01.6658614Z     return opener.open(url, data, timeout)
2023-01-19T23:02:01.6659007Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 525, in open
2023-01-19T23:02:01.6659383Z     response = self._open(req, data)
2023-01-19T23:02:01.6659892Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 542, in _open
2023-01-19T23:02:01.6660400Z     result = self._call_chain(self.handle_open, protocol, protocol +
2023-01-19T23:02:01.6660857Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
2023-01-19T23:02:01.6661218Z     result = func(*args)
2023-01-19T23:02:01.6661602Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1397, in https_open
2023-01-19T23:02:01.6662043Z     return self.do_open(http.client.HTTPSConnection, req,
2023-01-19T23:02:01.6662487Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1357, in do_open
2023-01-19T23:02:01.6662856Z     raise URLError(err)
2023-01-19T23:02:01.6663233Z urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer>
2023-01-19T23:02:01.6663455Z 
2023-01-19T23:02:01.6663703Z ----------------------------------------------------------------------

awgu · 2023-01-20T02:01:15Z

@pytorchbot merge -f "unrelated failures"

pytorchmergebot · 2023-01-20T02:02:55Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


ghstack-source-id: 4224de3 Pull Request resolved: pytorch#92296

[FSDP][1/N] Split fully_shard unit tests

70303a0

[ghstack-poisoned]

pytorch-bot bot added the topic: not user facing topic category label Jan 17, 2023

This was referenced Jan 17, 2023

[FSDP][2/N] _summon_full_params -> _unshard_params #92297

Closed

[FSDP][3/N] Refactor summon_full_params unit tests #92298

Closed

awgu added the release notes: distributed (composable) label Jan 17, 2023

awgu pushed a commit to awgu/pytorch that referenced this pull request Jan 17, 2023

[FSDP][1/N] Split fully_shard unit tests

0e42bdb

ghstack-source-id: 535f26d Pull Request resolved: pytorch#92296

awgu marked this pull request as ready for review January 17, 2023 17:04

awgu requested review from H-Huang, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners January 17, 2023 17:04

awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 17, 2023

awgu mentioned this pull request Jan 19, 2023

[WIP][FSDP] Add unshard_params for fully_shard #92639

Closed


mrshenli approved these changes Jan 19, 2023


pytorchmergebot added the Merged label Jan 20, 2023

pytorchmergebot closed this in f659452 Jan 20, 2023

This was referenced Jan 20, 2023

[Reland][FSDP] Do not clean FQNs for use_orig_params=True #92662

Closed

[PT-D][Lint] Include nested directories to ufmt #92779

Closed

awgu pushed a commit to awgu/pytorch that referenced this pull request Jan 23, 2023

[FSDP][1/N] Split fully_shard unit tests

1dec265

ghstack-source-id: 4224de3 Pull Request resolved: pytorch#92296

This was referenced Jan 24, 2023

[FSDP] Fix no_sync(), use_orig_params=True, mixed precision, sharded #92874

Closed

[FSDP] Test FSDP + AC composability #92935

Closed

facebook-github-bot deleted the gh/awgu/301/head branch June 8, 2023 15:34
