@awgu (Collaborator) commented Jan 17, 2023

Stack from ghstack:

This PR splits test_fully_shard.py into fully_shard/test_fully_shard<...>.py. This should help improve readability and avoid some future rebase conflicts.

The only other real change is resolving a TODO for using run_subtests in the model checkpointing unit tests.

pytorch-bot bot commented Jan 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92296

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Failures

As of commit c5f0d89:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 17, 2023
awgu pushed a commit to awgu/pytorch that referenced this pull request Jan 17, 2023
ghstack-source-id: 535f26d
Pull Request resolved: pytorch#92296
@awgu awgu marked this pull request as ready for review January 17, 2023 17:04
@awgu awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 17, 2023
E2E test of save + load with rank0_only + CPU offload for TransformerWithSharedParams
on the composable path.
"""
self.run_subtests(
Collaborator Author

This is the only real change, where I knocked out a to-do to use run_subtests.
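
For readers unfamiliar with the pattern, here is a minimal sketch of what a `run_subtests`-style helper does: it expands a dictionary of options into every combination and calls one test function per combination. This is not the PyTorch helper itself, which lives in the internal test utilities and may differ in detail. The name `run_subtests_sketch`, the option names `use_cpu_offload` and `rank0_only`, and the `check_save_load` function are all hypothetical, and nothing distributed is modeled here.

```python
import itertools
import unittest
from typing import Any, Callable, Dict, List


def run_subtests_sketch(
    test_case: unittest.TestCase,
    subtest_config: Dict[str, List[Any]],
    test_fn: Callable[..., None],
    **fixed_kwargs: Any,
) -> int:
    """Hypothetical stand-in: call test_fn once per combination of options.

    Returns the number of combinations that were run.
    """
    keys = list(subtest_config)
    count = 0
    # Cartesian product over the option lists, e.g. 2 x 2 = 4 combinations.
    for values in itertools.product(*(subtest_config[k] for k in keys)):
        subtest_kwargs = dict(zip(keys, values))
        # subTest labels each combination so a failure reports which one broke.
        with test_case.subTest(**subtest_kwargs):
            test_fn(**fixed_kwargs, **subtest_kwargs)
        count += 1
    return count


class CheckpointSketchTest(unittest.TestCase):
    def test_save_load(self) -> None:
        # Option names are illustrative only, not taken from the PR diff.
        run_subtests_sketch(
            self,
            {"use_cpu_offload": [False, True], "rank0_only": [False, True]},
            self.check_save_load,
        )

    def check_save_load(self, use_cpu_offload: bool, rank0_only: bool) -> None:
        # Placeholder body; a real test would save and reload a state dict.
        self.assertIsInstance(use_cpu_offload, bool)
        self.assertIsInstance(rank0_only, bool)


if __name__ == "__main__":
    unittest.main()
```

The appeal of this shape is that one test method covers the whole option grid without a hand-written nested loop or a separate method per combination.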

Contributor

@mrshenli mrshenli left a comment

LGTM

@awgu (Collaborator Author) commented Jan 19, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed (Rule Distributed). The first few are:

@awgu (Collaborator Author) commented Jan 20, 2023

Failures look unrelated:

linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.4xlarge)

ERROR: Could not find a version that satisfies the requirement astunparse (from -r requirements.txt (line 2)) (from versions: none)
ERROR: No matching distribution found for astunparse (from -r requirements.txt (line 2))

linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)

ERROR: Could not find a version that satisfies the requirement astunparse (from -r requirements.txt (line 2)) (from versions: none)
ERROR: No matching distribution found for astunparse (from -r requirements.txt (line 2))

linux-focal-rocm5.3-py3.8 / test (default, 1, 2, linux.rocm.gpu)

======================================================================
2023-01-19T23:02:01.6644719Z ERROR [0.056s]: test_memory_snapshot (__main__.TestCudaComm)
2023-01-19T23:02:01.6645149Z ----------------------------------------------------------------------
2023-01-19T23:02:01.6645477Z Traceback (most recent call last):
2023-01-19T23:02:01.6645808Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1354, in do_open
2023-01-19T23:02:01.6646172Z     h.request(req.get_method(), req.selector, req.data, headers,
2023-01-19T23:02:01.6646531Z   File "/opt/conda/lib/python3.8/http/client.py", line 1256, in request
2023-01-19T23:02:01.6646898Z     self._send_request(method, url, body, headers, encode_chunked)
2023-01-19T23:02:01.6647263Z   File "/opt/conda/lib/python3.8/http/client.py", line 1302, in _send_request
2023-01-19T23:02:01.6647621Z     self.endheaders(body, encode_chunked=encode_chunked)
2023-01-19T23:02:01.6647979Z   File "/opt/conda/lib/python3.8/http/client.py", line 1251, in endheaders
2023-01-19T23:02:01.6648349Z     self._send_output(message_body, encode_chunked=encode_chunked)
2023-01-19T23:02:01.6648721Z   File "/opt/conda/lib/python3.8/http/client.py", line 1011, in _send_output
2023-01-19T23:02:01.6649011Z     self.send(msg)
2023-01-19T23:02:01.6649294Z   File "/opt/conda/lib/python3.8/http/client.py", line 951, in send
2023-01-19T23:02:01.6649577Z     self.connect()
2023-01-19T23:02:01.6649878Z   File "/opt/conda/lib/python3.8/http/client.py", line 1425, in connect
2023-01-19T23:02:01.6650223Z     self.sock = self._context.wrap_socket(self.sock,
2023-01-19T23:02:01.6650565Z   File "/opt/conda/lib/python3.8/ssl.py", line 500, in wrap_socket
2023-01-19T23:02:01.6650884Z     return self.sslsocket_class._create(
2023-01-19T23:02:01.6651191Z   File "/opt/conda/lib/python3.8/ssl.py", line 1040, in _create
2023-01-19T23:02:01.6651474Z     self.do_handshake()
2023-01-19T23:02:01.6651780Z   File "/opt/conda/lib/python3.8/ssl.py", line 1309, in do_handshake
2023-01-19T23:02:01.6652080Z     self._sslobj.do_handshake()
2023-01-19T23:02:01.6652490Z ConnectionResetError: [Errno 104] Connection reset by peer
2023-01-19T23:02:01.6652691Z 
2023-01-19T23:02:01.6652869Z During handling of the above exception, another exception occurred:
2023-01-19T23:02:01.6653080Z 
2023-01-19T23:02:01.6653223Z Traceback (most recent call last):
2023-01-19T23:02:01.6653571Z   File "test_cuda.py", line 4859, in test_memory_snapshot
2023-01-19T23:02:01.6653944Z     torch.cuda.memory._save_segment_usage(f.name)
2023-01-19T23:02:01.6654526Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/memory.py", line 641, in _save_segment_usage
2023-01-19T23:02:01.6654957Z     f.write(_segments(snapshot))
2023-01-19T23:02:01.6655501Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/_memory_viz.py", line 70, in segments
2023-01-19T23:02:01.6655939Z     return format_flamegraph(f.getvalue())
2023-01-19T23:02:01.6656511Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/_memory_viz.py", line 28, in format_flamegraph
2023-01-19T23:02:01.6656946Z     urllib.request.urlretrieve(
2023-01-19T23:02:01.6657355Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 247, in urlretrieve
2023-01-19T23:02:01.6657790Z     with contextlib.closing(urlopen(url, data)) as fp:
2023-01-19T23:02:01.6658213Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
2023-01-19T23:02:01.6658614Z     return opener.open(url, data, timeout)
2023-01-19T23:02:01.6659007Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 525, in open
2023-01-19T23:02:01.6659383Z     response = self._open(req, data)
2023-01-19T23:02:01.6659892Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 542, in _open
2023-01-19T23:02:01.6660400Z     result = self._call_chain(self.handle_open, protocol, protocol +
2023-01-19T23:02:01.6660857Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
2023-01-19T23:02:01.6661218Z     result = func(*args)
2023-01-19T23:02:01.6661602Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1397, in https_open
2023-01-19T23:02:01.6662043Z     return self.do_open(http.client.HTTPSConnection, req,
2023-01-19T23:02:01.6662487Z   File "/opt/conda/lib/python3.8/urllib/request.py", line 1357, in do_open
2023-01-19T23:02:01.6662856Z     raise URLError(err)
2023-01-19T23:02:01.6663233Z urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer>
2023-01-19T23:02:01.6663455Z 
2023-01-19T23:02:01.6663703Z ----------------------------------------------------------------------

@awgu (Collaborator Author) commented Jan 20, 2023

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

