
Conversation

@wanchaol (Collaborator) commented Jan 24, 2023

Stack from ghstack (oldest at bottom):

This adds the numpy typing plugin to the mypy config so that we can
use it for DeviceMesh typing annotations.

Please see #92931 for why we need this. For example, we currently save DeviceMesh's `mesh` field as a torch.Tensor, so when we do something like:

```python
with FakeTensorMode():
    device_mesh = DeviceMesh("cuda", torch.arange(4))
```

it throws an error, because FakeTensorMode (or any TorchDispatchMode) tracks every tensor creation and interaction. DeviceMesh just wants to save an n-d array that records the mesh topology and would like to avoid interacting with subsystems like FakeTensor, so we want to support saving `mesh` as a numpy array instead.
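For context, numpy ships its mypy plugin as `numpy.typing.mypy_plugin`, which is what gets registered under `plugins` in the mypy config. Below is a rough, illustrative sketch of how the `mesh` field could then be annotated with `numpy.typing`; the class body here is an assumption for illustration, not the actual DeviceMesh implementation in this PR:

```python
import numpy as np
import numpy.typing as npt


class DeviceMesh:
    # with numpy visible to mypy, parametrized ndarray annotations such as
    # NDArray[np.int64] can be checked on the stored topology
    mesh: npt.NDArray[np.int64]

    def __init__(self, device_type: str, mesh: npt.ArrayLike) -> None:
        self.device_type = device_type
        # keep the topology as a plain numpy array rather than a torch.Tensor
        self.mesh = np.asarray(mesh, dtype=np.int64)
```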

@pytorch-bot (bot) commented Jan 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92930

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit cbbaba4:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang (Contributor) left a comment

sure? Would be nice to know what this does

@wanchaol (Collaborator, Author)

> sure? Would be nice to know what this does

Yeah, thanks! Just updated the PR description to include more concrete details. tl;dr is that we want DeviceMesh's mesh tensor field to avoid interacting with our tensor subclass systems like FakeTensorMode.

@ezyang (Contributor) commented Jan 26, 2023

Saving device mesh as a numpy array... feels wrong. Tell me about why it wants to be a tensor/ndarray in the first place?

@wanchaol (Collaborator, Author)

> Saving device mesh as a numpy array... feels wrong. Tell me about why it wants to be a tensor/ndarray in the first place?

Sure. The underlying rationale is that we want DeviceMesh to represent the cluster topology, so that we can set up communicators based on that topology. It uses the mesh concept, similar to other frameworks, where the devices are structured into an n-d array. For example, if we have two hosts and each host has four devices, we would like to save it as something like:

[[0, 1, 2, 3], 
 [4, 5, 6, 7]]

This represents the topology of the two hosts, and we want to easily know the shape/size of each dim of this n-d array, so either torch.Tensor or numpy.ndarray works as storage for the mesh. We were using torch.Tensor, but later realized it interacts with fake tensor and tracing (where we don't need it to), so I would like to switch to a numpy array to avoid those subsystem interactions.
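For illustration, with numpy as the storage the per-dimension queries on that example mesh are straightforward (a small standalone snippet, not code from this PR):

```python
import numpy as np

# 2 hosts x 4 devices per host
mesh = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7]])

print(mesh.ndim)      # 2 -> the mesh has two dimensions
print(mesh.shape)     # (2, 4)
print(mesh.shape[0])  # 2 -> number of hosts
print(mesh.shape[1])  # 4 -> devices per host
```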

@ezyang (Contributor) commented Jan 26, 2023

cc @eellison

But I feel like I would rather make a tensor subclass that denotes "don't fakify this ever"...
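As a very rough sketch of that idea (hypothetical class name; as the discussion below notes, FakeTensorMode itself would have to be taught to recognize and skip such a subclass for this to actually work):

```python
import torch


class ConstantMeshTensor(torch.Tensor):
    """Marker subclass for small bookkeeping tensors that should never be fakeified.

    By itself this is only a tag: FakeTensorMode (or similar systems) would still
    need to check for this type and leave instances alone.
    """

    @staticmethod
    def __new__(cls, data):
        # _make_subclass wraps an existing tensor's storage in the subclass
        return torch.Tensor._make_subclass(cls, torch.as_tensor(data))


mesh = ConstantMeshTensor([[0, 1, 2, 3], [4, 5, 6, 7]])
```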

@eellison (Contributor)

> But I feel like I would rather make a tensor subclass that denotes "don't fakify this ever"...

That works, and you can also do whatever non-module bookkeeping under a disable_current_modes context. This pattern is going to come up again; it does feel a bit weird to recommend numpy.
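For reference, a rough sketch of that pattern, assuming the private helper `torch.utils._python_dispatch._disable_current_modes` (an internal API, so subject to change; the factory function name is made up for illustration):

```python
import torch
from torch.utils._python_dispatch import _disable_current_modes


def make_mesh_tensor(num_devices: int) -> torch.Tensor:
    # temporarily pop any active TorchDispatchModes (e.g. FakeTensorMode)
    # so this small bookkeeping tensor is created as a plain, real tensor
    with _disable_current_modes():
        return torch.arange(num_devices)
```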

@ezyang (Contributor) commented Jan 26, 2023

I anti-recommend disable_current_modes. The subclass should be handled by FakeTensor and it shouldn't be necessary for the user to manually disable the mode.

@eellison (Contributor) commented Jan 26, 2023

> I anti-recommend disable_current_modes

Why? In general I think you'd prefer changes that don't require intrusive changes to other core classes. That also won't compose with proxy tensor tracing if they're doing deferred init or something.

API-wise, it would definitely be easier not to have to re-call disable_current_modes.

@wanchaol (Collaborator, Author) commented Jan 26, 2023

@ezyang @eellison For the tensor subclass approach or the disable mode approach, is it possible for the user to do something like the example in the summary:

with FakeTensorMode():
    device_mesh = DeviceMesh("cuda", torch.arange(4))

i.e. the user creates a plain tensor under FakeTensorMode and it will not be fakeified by the mode?

@ezyang (Contributor) commented Jan 26, 2023

No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"

@malfet (Contributor) left a comment

Hmm, don't you need to add it to https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt and to the lintrunner init config, otherwise it will be a no-op?

@wanchaol (Collaborator, Author) commented Jan 27, 2023

> No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"

@ezyang I think that would not be ideal for the user: they would need to use some special factory method, which might not be an intuitive UX. Another option: maybe we can just make DeviceMesh a subclass of Tensor? Would that make fake tensor and tracing happy?

edit: actually, I think directly making DeviceMesh a tensor subclass also isn't going to work, as we still need to pass the torch.Tensor to that tensor subclass.

@wanchaol (Collaborator, Author)

> Hmm, don't you need to add it to https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt and to the lintrunner init config, otherwise it will be a no-op?

@malfet it looks like lintrunner already has numpy in the init command; maybe I should only update requirements-ci.txt to uncomment numpy? https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L109

@malfet (Contributor) commented Jan 27, 2023

> @malfet it looks like lintrunner already has numpy in the init command; maybe I should only update requirements-ci.txt to uncomment numpy? https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L109

Sorry, I misunderstood where the plugin would be coming from. And CI preinstalls numpy, but from anaconda rather than PyPI; that's why it's commented out there.

@ezyang (Contributor) commented Jan 27, 2023

> > No, you would have to provide special factory functions or have the user turn on a separate mode that is "these are constants, don't fakeify them"
>
> @ezyang I think that would not be ideal for the user: they would need to use some special factory method, which might not be an intuitive UX.

Is this not what you have done if you make people allocate numpy arrays?

> Another option: maybe we can just make DeviceMesh a subclass of Tensor? Would that make fake tensor and tracing happy?

Let's rewind a sec, actually. In normal eager mode, there is no problem with fake tensor mode because people don't write code under fake tensor mode. So you can use a regular tensor for the mesh, no problem.

So the problem is when you trace through the model. If we can arrange the tracing process to not fakeify tensor computation that goes to the mesh, that's what we want.

> edit: actually, I think directly making DeviceMesh a tensor subclass also isn't going to work, as we still need to pass the torch.Tensor to that tensor subclass.

Another possibility is to add a 'mesh' or 'const' device type, which is always small and OK to always materialize...

@wanchaol (Collaborator, Author)

> Is this not what you have done if you make people allocate numpy arrays?

Yeah, this makes sense; let me think about how to make a custom factory method that provides a similar UX compared to numpy and Tensor. Regardless of the approach we take, the numpy ndarray typing is still valuable, as we want to accept either an n-d list directly (i.e. [0, 1, 2, 3]) or a tensor subclass instance/ndarray (see the sketch at the end of this comment).

> Let's rewind a sec, actually. In normal eager mode, there is no problem with fake tensor mode because people don't write code under fake tensor mode. So you can use a regular tensor for the mesh, no problem.
>
> So the problem is when you trace through the model. If we can arrange the tracing process to not fakeify tensor computation that goes to the mesh, that's what we want.

Yep, agreed that it's mostly tracing that has issues right now. But I think even eager mode would most likely want FakeTensorMode once it supports deferred_init like torchdistx, as we would still want to lazily materialize the model (after tensor parallel) when the model is too large.

> Another possibility is to add a 'mesh' or 'const' device type, which is always small and OK to always materialize...

Do you mean adding a device type to core, like device='mesh', where inside that device type we always materialize the tensor? Yeah, that would be good too if it avoids interacting with FakeTensor.
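To make the accepted-inputs point above concrete, roughly the kind of signature the ndarray typing allows (hypothetical helper name; the exact union is an assumption, not the DeviceMesh API):

```python
from typing import Sequence, Union

import numpy as np
import numpy.typing as npt
import torch

# accepted mesh inputs: a (nested) list of ranks, an ndarray, or a tensor
MeshLike = Union[Sequence[int], Sequence[Sequence[int]], npt.NDArray[np.int64], torch.Tensor]


def normalize_mesh(mesh: MeshLike) -> npt.NDArray[np.int64]:
    # normalize any accepted input form into a plain int64 ndarray
    if isinstance(mesh, torch.Tensor):
        mesh = mesh.detach().cpu().numpy()
    return np.asarray(mesh, dtype=np.int64)
```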

@wanchaol added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jan 30, 2023
@wanchaol (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed (Rule superuser). The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

@wanchaol (Collaborator, Author)

@pytorchbot merge -f "failure not related"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@clee2000 (Contributor)

@pytorchbot revert -m "causing test_doc_examples (main.TestTypeHints) to fail https://github.com/pytorch/pytorch/actions/runs/4049393005/jobs/6965869223 https://hud.pytorch.org/pytorch/pytorch/commit/5f1ac188f8dd01a81d0ddeebdbc4d22e25311b72, note for revert review: PR was forced merged after first failure, which was flaky" -c ignoredsignal

@pytorch deleted a comment from pytorch-bot bot on Jan 31, 2023
@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@wanchaol your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 31, 2023
This reverts commit 5f1ac18.

Reverted #92930 on behalf of https://github.com/clee2000 due to causing test_doc_examples (main.TestTypeHints) to fail https://github.com/pytorch/pytorch/actions/runs/4049393005/jobs/6965869223 https://hud.pytorch.org/pytorch/pytorch/commit/5f1ac188f8dd01a81d0ddeebdbc4d22e25311b72, note for revert review: PR was forced merged after first failure, which was flaky
wanchaol added a commit that referenced this pull request Feb 9, 2023
wanchaol added a commit that referenced this pull request Feb 9, 2023
wanchaol added a commit that referenced this pull request Mar 28, 2023
wanchaol added a commit that referenced this pull request Mar 28, 2023
pytorchmergebot pushed a commit that referenced this pull request Mar 29, 2023
@facebook-github-bot deleted the gh/wanchaol/250/head branch on June 8, 2023 19:08