
Conversation

@wanchaol (Collaborator) commented Jan 24, 2023

Stack from ghstack (oldest at bottom):

This adds the numpy typing plugin to the mypy config so that we can
use it for DeviceMesh typing annotations.

Please see #92931 for why we need this. For example, we currently save DeviceMesh's `mesh` field as a torch.Tensor, so when we do something like:

```python
with FakeTensorMode():
    device_mesh = DeviceMesh("cuda", torch.arange(4))
```

it throws an error, because FakeTensorMode (or any TorchDispatchMode) tracks every tensor creation and interaction. DeviceMesh just wants to save an n-d array that records the mesh topology and would like to avoid interacting with subsystems like FakeTensor, so we want to support saving `mesh` as a numpy array instead.
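For context, numpy ships its mypy plugin as `numpy.typing.mypy_plugin`, which is what gets registered under `plugins` in the mypy config. Below is a rough, illustrative sketch of how the `mesh` field could then be annotated with `numpy.typing`; the class body here is an assumption for illustration, not the actual DeviceMesh implementation in this PR:

```python
import numpy as np
import numpy.typing as npt


class DeviceMesh:
    # with numpy visible to mypy, parametrized ndarray annotations such as
    # NDArray[np.int64] can be checked on the stored topology
    mesh: npt.NDArray[np.int64]

    def __init__(self, device_type: str, mesh: npt.ArrayLike) -> None:
        self.device_type = device_type
        # keep the topology as a plain numpy array rather than a torch.Tensor
        self.mesh = np.asarray(mesh, dtype=np.int64)
```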

@pytorch-bot (bot) commented Jan 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92930

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit cbbaba4:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang (Contributor) left a comment

sure? Would be nice to know what this does

@wanchaol (Collaborator, Author)

> sure? Would be nice to know what this does

Yeah, thanks! Just updated the PR description to include more concrete details. tl;dr is that we want DeviceMesh's mesh tensor field to avoid interacting with our tensor subclass systems like FakeTensorMode.

@ezyang (Contributor) commented Jan 26, 2023

Saving device mesh as a numpy array... feels wrong. Tell me about why it wants to be a tensor/ndarray in the first place?

@wanchaol (Collaborator, Author)

> Saving device mesh as a numpy array... feels wrong. Tell me about why it wants to be a tensor/ndarray in the first place?

Sure. The underlying rationale is that we want DeviceMesh to represent the cluster topology, so that we can set up communicators based on that topology. It uses the mesh concept, similar to other frameworks, where the devices are structured into an n-d array. For example, if we have two hosts and each host has four devices, we would like to save it as something like:

[[0, 1, 2, 3], 
 [4, 5, 6, 7]]

This represents the topology of the two hosts, and we want to easily know the shape/size of each dim of this n-d array, so either torch.Tensor or numpy.ndarray works as storage for the mesh. We were using torch.Tensor, but later realized it interacts with fake tensor and tracing (where we don't need it to), so I would like to switch to a numpy array to avoid those subsystem interactions.
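For illustration, with numpy as the storage the per-dimension queries on that example mesh are straightforward (a small standalone snippet, not code from this PR):

```python
import numpy as np

# 2 hosts x 4 devices per host
mesh = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7]])

print(mesh.ndim)      # 2 -> the mesh has two dimensions
print(mesh.shape)     # (2, 4)
print(mesh.shape[0])  # 2 -> number of hosts
print(mesh.shape[1])  # 4 -> devices per host
```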

@ezyang (Contributor) commented Jan 26, 2023

cc @eellison

But I feel like I would rather make a tensor subclass that denotes "don't fakify this ever"...
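As a very rough sketch of that idea (hypothetical class name; as the discussion below notes, FakeTensorMode itself would have to be taught to recognize and skip such a subclass for this to actually work):

```python
import torch


class ConstantMeshTensor(torch.Tensor):
    """Marker subclass for small bookkeeping tensors that should never be fakeified.

    By itself this is only a tag: FakeTensorMode (or similar systems) would still
    need to check for this type and leave instances alone.
    """

    @staticmethod
    def __new__(cls, data):
        # _make_subclass wraps an existing tensor's storage in the subclass
        return torch.Tensor._make_subclass(cls, torch.as_tensor(data))


mesh = ConstantMeshTensor([[0, 1, 2, 3], [4, 5, 6, 7]])
```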

@eellison (Contributor)

> But I feel like I would rather make a tensor subclass that denotes "don't fakify this ever"...

That works, and you can also do whatever non-module bookkeeping under a disable_current_modes context. This pattern is going to come up again; it does feel a bit weird to recommend numpy.
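For reference, a rough sketch of that pattern, assuming the private helper `torch.utils._python_dispatch._disable_current_modes` (an internal API, so subject to change; the factory function name is made up for illustration):

```python
import torch
from torch.utils._python_dispatch import _disable_current_modes


def make_mesh_tensor(num_devices: int) -> torch.Tensor:
    # temporarily pop any active TorchDispatchModes (e.g. FakeTensorMode)
    # so this small bookkeeping tensor is created as a plain, real tensor
    with _disable_current_modes():
        return torch.arange(num_devices)
```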

@ezyang (Contributor) commented Jan 26, 2023

I anti-recommend disable_current_modes. The subclass should be handled by FakeTensor and it shouldn't be necessary for the user to manually disable the mode.

@eellison (Contributor) commented Jan 26, 2023

> I anti-recommend disable_current_modes

Why? In general I think you'd prefer changes that don't require intrusive changes to other core classes. That also won't compose with proxy tensor tracing if they're doing deferred init or something.

API-wise, it would definitely be easier not to have to re-call disable_current_modes.

@wanchaol (Collaborator, Author) commented Jan 26, 2023

@ezyang @eellison For the tensor subclass approach or the disable mode approach, is it possible for the user to do something like the example in the summary:

with FakeTensorMode():
    device_mesh = DeviceMesh("cuda", torch.arange(4))

i.e. the user creates a plain tensor under FakeTensorMode and it will not be fakeified by the mode?

@ezyang (Contributor) commented Jan 26, 2023

No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"

@malfet (Contributor) left a comment

Hmm, don't you need to add it to https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt and to the lintrunner init config, otherwise it will be a no-op?

@wanchaol (Collaborator, Author) commented Jan 27, 2023

> No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"

@ezyang I think that would not be ideal for the user: they would need to use some special factory method, which might not be an intuitive UX. Another option: maybe we can just make DeviceMesh a subclass of Tensor? Would that make fake tensor and tracing happy?

edit: actually, I think directly making DeviceMesh a tensor subclass also isn't going to work, as we still need to pass the torch.Tensor to that tensor subclass.

@wanchaol (Collaborator, Author)

> Hmm, don't you need to add it to https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt and to the lintrunner init config, otherwise it will be a no-op?

@malfet it looks like lintrunner already has numpy in the init command; maybe I should only update requirements-ci.txt to uncomment numpy? https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L109

@malfet (Contributor) commented Jan 27, 2023

> @malfet it looks like lintrunner already has numpy in the init command; maybe I should only update requirements-ci.txt to uncomment numpy? https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L109

Sorry, I misunderstood where the plugin would be coming from. And CI preinstalls numpy, but from anaconda rather than PyPI; that's why it's commented out there.

@ezyang (Contributor) commented Jan 27, 2023

> > No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"
>
> @ezyang I think that would not be ideal for the user: they would need to use some special factory method, which might not be an intuitive UX.

Is this not what you have done if you make people allocate numpy arrays?

> Another option: maybe we can just make DeviceMesh a subclass of Tensor? Would that make fake tensor and tracing happy?

Let's rewind a sec, actually. In normal eager mode, there is no problem with fake tensor mode because people don't write code under fake tensor mode. So you can use a regular tensor for the mesh, no problem.

So the problem is when you trace through the model. If we can arrange the tracing process to not fakeify tensor computation that goes to the mesh, that's what we want.

> edit: actually, I think directly making DeviceMesh a tensor subclass also isn't going to work, as we still need to pass the torch.Tensor to that tensor subclass.

Another possibility is to add a 'mesh' or 'const' device type, which is always small and OK to always materialize...

@wanchaol (Collaborator, Author)

> Is this not what you have done if you make people allocate numpy arrays?

Yeah, this makes sense; let me think about how to make a custom factory method that provides a similar UX compared to numpy and Tensor. Regardless of the approach we take, the numpy ndarray typing is still valuable, as we want to accept either an n-d list directly (i.e. [0, 1, 2, 3]) or a tensor subclass instance/ndarray (see the sketch at the end of this comment).

> Let's rewind a sec, actually. In normal eager mode, there is no problem with fake tensor mode because people don't write code under fake tensor mode. So you can use a regular tensor for the mesh, no problem.
>
> So the problem is when you trace through the model. If we can arrange the tracing process to not fakeify tensor computation that goes to the mesh, that's what we want.

Yep, agreed that it's mostly tracing that has issues right now. But I think even eager mode would most likely want FakeTensorMode once it supports deferred_init like torchdistx, as we would still want to lazily materialize the model (after tensor parallel) when the model is too large.

> Another possibility is to add a 'mesh' or 'const' device type, which is always small and OK to always materialize...

Do you mean adding a device type to core, like device='mesh', where inside that device type we always materialize the tensor? Yeah, that would be good too if it avoids interacting with FakeTensor.
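To make the accepted-inputs point above concrete, roughly the kind of signature the ndarray typing allows (hypothetical helper name; the exact union is an assumption, not the DeviceMesh API):

```python
from typing import Sequence, Union

import numpy as np
import numpy.typing as npt
import torch

# accepted mesh inputs: a (nested) list of ranks, an ndarray, or a tensor
MeshLike = Union[Sequence[int], Sequence[Sequence[int]], npt.NDArray[np.int64], torch.Tensor]


def normalize_mesh(mesh: MeshLike) -> npt.NDArray[np.int64]:
    # normalize any accepted input form into a plain int64 ndarray
    if isinstance(mesh, torch.Tensor):
        mesh = mesh.detach().cpu().numpy()
    return np.asarray(mesh, dtype=np.int64)
```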

@wanchaol added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jan 30, 2023
@wanchaol (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed (Rule superuser). The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

@wanchaol (Collaborator, Author)

@pytorchbot merge -f "failure not related"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@clee2000 (Contributor)

@pytorchbot revert -m "causing test_doc_examples (main.TestTypeHints) to fail https://github.com/pytorch/pytorch/actions/runs/4049393005/jobs/6965869223 https://hud.pytorch.org/pytorch/pytorch/commit/5f1ac188f8dd01a81d0ddeebdbc4d22e25311b72, note for revert review: PR was forced merged after first failure, which was flaky" -c ignoredsignal

@pytorch deleted a comment from pytorch-bot bot on Jan 31, 2023
@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@wanchaol your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 31, 2023
This reverts commit 5f1ac18.

Reverted #92930 on behalf of https://github.com/clee2000 due to causing test_doc_examples (main.TestTypeHints) to fail https://github.com/pytorch/pytorch/actions/runs/4049393005/jobs/6965869223 https://hud.pytorch.org/pytorch/pytorch/commit/5f1ac188f8dd01a81d0ddeebdbc4d22e25311b72, note for revert review: PR was forced merged after first failure, which was flaky
wanchaol added a commit that referenced this pull request Feb 9, 2023
wanchaol added a commit that referenced this pull request Feb 9, 2023
wanchaol added a commit that referenced this pull request Mar 28, 2023
wanchaol added a commit that referenced this pull request Mar 28, 2023
pytorchmergebot pushed a commit that referenced this pull request Mar 29, 2023
@facebook-github-bot deleted the gh/wanchaol/250/head branch on June 8, 2023 19:08