[DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement#163941
fegin wants to merge 3 commits into gh/fegin/324/base from
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163941
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 8518ccc with merge base d140325.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
… Partial placement

A DTensor that contains a Partial placement shouldn't be checkpointed (DCP.save): the result is not correct and DCP doesn't know how to handle it. There are several APIs that are only used by checkpointing, e.g., `__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has a Partial placement.

ghstack-source-id: f0e7ac9
Pull-Request-resolved: #163941
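The guard described in this commit message can be illustrated with a minimal, torch-free sketch. Everything here is a stand-in: `Shard`, `Replicate`, `Partial`, `FakeDTensor`, and `_raise_if_partial` are hypothetical names chosen for illustration, and only the prefix of the error message comes from this PR. It is not the actual DTensor implementation.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for placement types; NOT the real torch classes.
@dataclass(frozen=True)
class Shard:
    dim: int


@dataclass(frozen=True)
class Replicate:
    pass


@dataclass(frozen=True)
class Partial:
    reduce_op: str = "sum"


class FakeDTensor:
    """Toy object that mimics only the placement check, not a real DTensor."""

    def __init__(self, placements):
        self.placements = tuple(placements)

    def _raise_if_partial(self):
        # A Partial placement means a reduction is still pending, so the
        # local data is not the final value and must not be written out.
        if any(isinstance(p, Partial) for p in self.placements):
            # Prefix taken from the PR's test; the suffix is made up here.
            raise RuntimeError(
                "Any checkpointing related operations are not supported for "
                "this toy tensor with a Partial placement."
            )

    def __create_write_items__(self, fqn, obj):
        # Checkpoint-only hook: run the guard before doing any work.
        self._raise_if_partial()
        return [(fqn, p) for p in self.placements]  # placeholder payload
```

Calling `FakeDTensor([Shard(0)]).__create_write_items__("w", None)` returns the placeholder items, while the same call on `FakeDTensor([Partial()])` raises the `RuntimeError`.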
tianyu-l left a comment:
Makes sense to me. Thanks!
This seems to need linting.
… Partial placement
ghstack-source-id: ad10e7f
Pull-Request-resolved: #163941

… Partial placement
ghstack-source-id: 6242c33
Pull-Request-resolved: #163941
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 1 job has failed: trunk / linux-jammy-rocm-py3.10 / build
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
… Partial placement (#163941)

A DTensor that contains a Partial placement shouldn't be checkpointed (DCP.save): the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g., `__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has a Partial placement.

Ideally, we want to add the following test:

```
with self.assertRaisesRegex(
    RuntimeError, "Any checkpointing related operations are not supported for"
):
    dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While the RuntimeError is indeed raised, it is raised in another thread, because DCP calls the DTensor checkpoint APIs in a separate thread, so assertRaisesRegex cannot capture it.

Pull Request resolved: #163941
Approved by: https://github.com/tianyu-l
Stack from ghstack (oldest at bottom):
A DTensor that contains a Partial placement shouldn't be checkpointed (DCP.save): the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g., `__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has a Partial placement.

Ideally, we want to add a test that wraps `dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())` in `self.assertRaisesRegex(RuntimeError, "Any checkpointing related operations are not supported for")`.

While the RuntimeError is indeed raised, it is raised in another thread, because DCP calls the DTensor checkpoint APIs in a separate thread, so assertRaisesRegex cannot capture it.
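The threading issue can be reproduced with the standard library alone. The sketch below is a simplification and does not model how DCP actually schedules its work: `run_in_worker`, `failing_hook`, and `demo` are hypothetical names, and the error message suffix is invented. It shows that an exception raised inside a plain `threading.Thread` never propagates to the caller, which is why a `try`/`except` (or `assertRaisesRegex`) around the call sees nothing.

```python
import threading


def run_in_worker(fn):
    """Run fn in a separate thread and wait for it (hypothetical helper)."""
    worker = threading.Thread(target=fn)
    worker.start()
    worker.join()  # join() does not re-raise exceptions from the worker


def failing_hook():
    # Stand-in for a checkpoint hook that rejects a Partial placement.
    raise RuntimeError(
        "Any checkpointing related operations are not supported for this tensor"
    )


def demo():
    """Return (seen_by_caller, exception types seen by threading.excepthook)."""
    seen_by_hook = []
    original_hook = threading.excepthook
    # Record uncaught worker exceptions instead of printing a traceback.
    threading.excepthook = lambda args: seen_by_hook.append(args.exc_type)
    try:
        try:
            run_in_worker(failing_hook)
            seen_by_caller = False
        except RuntimeError:
            seen_by_caller = True
    finally:
        threading.excepthook = original_hook
    return seen_by_caller, seen_by_hook
```

In this sketch the caller's `except RuntimeError` branch is never taken; the exception is only visible through `threading.excepthook`, which runs in the worker thread.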
cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @tianyu-l @XilunWu @SherlockNoMad