Skip to content

[DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement#163941

Closed
fegin wants to merge 3 commits intogh/fegin/324/basefrom
gh/fegin/324/head
Closed

[DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement#163941
fegin wants to merge 3 commits intogh/fegin/324/basefrom
gh/fegin/324/head

Conversation

@fegin
Copy link
Contributor

@fegin fegin commented Sep 26, 2025

Stack from ghstack (oldest at bottom):

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,__create_write_items__. These APIs should raise an exception if the DTensor, self, has Partial placement.

Ideally, we want to add the following test:

        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())

While we do see the RuntimeError is raised, the error was raised in another thread due to DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @tianyu-l @XilunWu @SherlockNoMad

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163941

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8518ccc with merge base d140325 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fegin added a commit that referenced this pull request Sep 26, 2025
… Partial placement

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.


ghstack-source-id: f0e7ac9
Pull-Request-resolved: #163941
@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 26, 2025
@fegin fegin requested a review from tianyu-l September 26, 2025 06:50
Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Thanks!
Seems need linting.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Sep 26, 2025
… Partial placement

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

ghstack-source-id: ad10e7f
Pull-Request-resolved: #163941
@fegin fegin added module: dtensor distributed tensor tag release notes: distributed (dtensor) release notes category labels Sep 26, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Sep 26, 2025
… Partial placement

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

ghstack-source-id: 6242c33
Pull-Request-resolved: #163941
@fegin
Copy link
Contributor Author

fegin commented Sep 26, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 26, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / build

Details for Dev Infra team Raised by workflow job

@Skylion007
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

jainapurva pushed a commit that referenced this pull request Sep 29, 2025
… Partial placement (#163941)

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

Ideally, we want to add the following test:

```
        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While we do see the RuntimeError is raised, the error was raised in another thread due to DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

Pull Request resolved: #163941
Approved by: https://github.com/tianyu-l
maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025
… Partial placement (pytorch#163941)

A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

Ideally, we want to add the following test:

```
        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While we do see the RuntimeError is raised, the error was raised in another thread due to DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

Pull Request resolved: pytorch#163941
Approved by: https://github.com/tianyu-l
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: dtensor distributed tensor tag oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (dtensor) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants