
Conversation

@wconstab (Contributor) commented Mar 21, 2022:

Differential Revision: D35032152

This adds a minimal set of python bindings for lazy tensor and the torchscript backend.

It targets the APIs that are used by the test_ts_opinfo.py test (which it also lands, from lazy_tensor_staging, where it is lazy_tensor_core/test/test_lazy.py).

We should land more python bindings obviously. I just wanted to focus on a minimal set that can also be tested, and use it to agree on how we organize the bindings, then others could easily contribute bindings on top of this infrastructure.
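
For context, a minimal sketch of the kind of usage these bindings enable; the exact calls exercised by test_ts_opinfo.py may differ, and the ts_backend.init() entry point and "lazy" device string here are assumptions:

```python
import torch
import torch._lazy
import torch._lazy.ts_backend

# Register the TorchScript backend so lazy tensors have somewhere to run.
torch._lazy.ts_backend.init()

# Ops on a lazy device are traced into a graph rather than executed eagerly.
x = torch.randn(2, 2, device="lazy")
y = x + x

# Cut the trace and hand the accumulated graph to the backend for execution.
torch._lazy.mark_step()
```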

cc @JackCaoG

@facebook-github-bot (Contributor) commented Mar 21, 2022:

💊 CI failures summary and remediations

As of commit 733cccf (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D35032152

@wconstab (Contributor Author):

@Krovatkin I assume this line is not important. (I also assume you wrote it; maybe @desertfire did?)

Contributor:

Yes, you can remove it. When core folks fixed convolution to flow through one op, @desertfire removed the graph checks, since convolutions became just another op.

@wconstab (Contributor Author):

As we land these bindings, I'm wondering if we should try to simplify by not having helper functions like 'StepMarker'.

Taking the full view of the mark_step API flow, it looks like:

1. torch._lazy.mark_step (Python code in torch/_lazy/__init__.py)
2. -> torch._C._lazy._mark_step (binding code in lazy/python/init.cpp)
3. -> StepMarker (local helper function in lazy/python/init.cpp)
4. -> finally calls the torch::lazy:: core APIs

(2) can always be squished into (3) but that can make gdb debugging more annoying. But I also wonder if we can simplify to only have 3 layers, and generally flush stuff that would be in (2) either up into python in (1) or down into the core APIs (4).
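
For reference, a rough sketch of what layers (1) and (2) look like under this scheme; the signature of _mark_step here is an approximation, not necessarily what this PR lands:

```python
# Layer (1): thin Python wrapper, roughly torch/_lazy/__init__.py
import torch._C._lazy

def mark_step(device: str = "", wait: bool = False):
    """Cut the current trace and dispatch the accumulated graph to the backend."""
    # Layer (2): the pybind11 binding, which calls the StepMarker helper (3)
    # and, through it, the torch::lazy:: core APIs (4).
    torch._C._lazy._mark_step(device, [], wait=wait)
```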

Thoughts? @Krovatkin @alanwaketan

Collaborator:

I think this is just Google's coding style: prefer wrapping a pile of code into a function for readability, not reusability.

Contributor:

I'm fine either way.

> (2) can always be squished into (3) but that can make gdb debugging more annoying.

If users don't really need a python binding for (2), I'd suggest we go for it.

@wconstab (Contributor Author):

@JackCaoG here's a great example to discuss. How do you envision the mark_step API working when torch-xla is part of torch::lazy? Do you envision your users would ultimately import torch._lazy.mark_step, once you get past the stages of migration?

Note: I did not intend to stop supporting _run_step_closures(), but I might consider landing without it until we actually add some test coverage / usage scenarios for it.

Generally, my approach would be to only land the minimum bindings/APIs we know we need and then keep adding them, rather than land a bunch up front.

Collaborator:

In my vision, torch._lazy.mark_step and xm.mark_step will do the same thing after we finish the LazyTensor class migration. They will both trigger a SyncTensors call. We will slowly redirect users to the lazy versions of the APIs and update our own tutorials.

Collaborator:

I see one small difference here: lazy:0 (or in our case xla:0) is not always the default device. In our TPU training case, xla:0 is a CPU and xla:1 is a TPU. I think we discussed this before and you guys don't want to have a GetDefaultDevice or GetCurrentDevice API. Is this still true?

Collaborator:

I remember the decision was to add the GetDefaultDevice API? It's just that we never found time to do it. Maybe we should add a comment above.

@wconstab (Contributor Author):

I think we actually do have a 'GetDefaultDevice' API for the backend, but we assume it is static after initialization time. Maybe what I didn't want to do was have 'set/get current device' and have some state changing.

I was also thinking about how to improve our configuration flags and backend init APIs. If we provide a good way to configure the backend (a python binding and then an ENV override for each flag), could you use one of those flags to set the default device type? Or do you need something more than that?
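
To make that concrete, a purely hypothetical sketch of the "python binding plus ENV override for each flag" idea; none of these names exist in the PR:

```python
import os

def get_backend_flag(name: str, default: str) -> str:
    # Hypothetical: an environment variable such as LTC_DEFAULT_DEVICE_TYPE
    # overrides the compiled-in default; otherwise the default is returned.
    return os.environ.get(f"LTC_{name.upper()}", default)

# A backend could consult such a flag at init time to pick its default device type.
default_device_type = get_backend_flag("default_device_type", "cpu")
```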

Collaborator:

This is the dummy stub we have in the staging branch:

`m.def("_ltc_get_default_device", []() { ... });`

@wconstab requested a review from @Krovatkin on March 22, 2022.

@JackCaoG (Collaborator):

I will take a look today.

Collaborator:

Will LTC provide a way for the backend to register its counters and metrics?

@wconstab (Contributor Author):

Hadn't thought about this.

Right now we treat the counters' pybindings as just 'getters' (with the exception of reset). We don't actually have any API for 'configuring' the counters, because I believe they are just configured by static macros.

I am not sure whether this system will work once we combine some in-tree and external-backend code, or whether we will have to adapt it to 'merge' the counters from the external backend with the ones from the torch::lazy:: code.
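
For illustration, a sketch of the getter-style surface described above; the module and function names are assumptions, not necessarily the exact bindings in this PR:

```python
import torch._lazy.metrics as metrics

# Counters are registered by static macros on the C++ side; Python only reads them.
for name in metrics.counter_names():
    print(name, metrics.counter_value(name))

# reset() is the one mutating call: it clears all counters.
metrics.reset()
```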

@Krovatkin (Contributor) left a comment:

:shipit:

@facebook-github-bot (Contributor):

@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator:

Nit: instead of making a separate module, what are your thoughts on making ts_backend a submodule of _lazy? I.e., _lazy.ts_backend instead of _lazy_ts_backend.

@antoniojkim (Collaborator) commented Mar 28, 2022:

In fact, this might be what you intended all along based on its usage in test_ts_opinfo.py:
https://github.com/pytorch/pytorch/pull/74508/files#diff-222fb8e63ff517f76308939550ca744de90dd6ea5706e2afd3f72db019df9b90R24
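
For comparison, the two spellings under discussion at the import site; only one would actually exist, depending on how the module is registered:

```python
# Spelling as used in test_ts_opinfo.py: ts_backend as a submodule of _lazy
import torch._lazy.ts_backend

# Alternative flat spelling the nit contrasts it with:
# import torch._lazy_ts_backend
```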

@wconstab (Contributor Author):

Yeah, it's a good question.
I tried to make it clean from the actual user API side, which is how it looks when called from test_ts_opinfo.py.

Generally the C++ bindings are flatter all over PyTorch and then get structured at the glue layer, where .py files re-export them in a proper namespace (the stuff happening in torch/_lazy/*.py), so I didn't think it mattered too much.

I might be convinced to change it, though. Wdyt @Krovatkin? One annoying thing is that I would then have to define a 'submodule' correctly in the .pyi file that provides the mypy typing, but I'm sure that's not too hard.
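
To illustrate the glue-layer pattern described here, an approximate sketch of how a flat binding module can be re-exported under a tidier namespace (file and function names are assumptions, not the exact contents of the PR):

```python
# torch/_lazy/ts_backend.py (approximate sketch)
import torch._C._lazy_ts_backend

def init():
    """Register and initialize the TorchScript lazy backend."""
    # The flat C++ binding module is re-exported here with a cleaner name,
    # so user code can write torch._lazy.ts_backend.init().
    torch._C._lazy_ts_backend._init()
```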

Collaborator:

Is the plan to expose all LTC python functions via torch._lazy? Doesn't the underscore usually imply "private", i.e. something users shouldn't call? Or will it eventually become something like torch.lazy, with the API exposed more like how it's currently done on the staging branch?

@wconstab (Contributor Author):

Yeah, torch._lazy signifies that it is not a stable API; that's our convention. It can't be torch.lazy until we (as a team) make a stronger commitment to long-term support, etc.

facebook-github-bot pushed a commit that referenced this pull request Mar 29, 2022
Summary:
This adds a minimal set of python bindings for lazy tensor and the torchscript backend.

It targets the APIs that are used by the `test_ts_opinfo.py` test (which it also lands, from lazy_tensor_staging, where it is [lazy_tensor_core/test/test_lazy.py](https://github.com/pytorch/pytorch/blob/lazy_tensor_staging/lazy_tensor_core/test/test_lazy.py)).

We should land more python bindings obviously.  I just wanted to focus on a minimal set that can also be tested, and use it to agree on how we organize the bindings, then others could easily contribute bindings on top of this infrastructure.

cc JackCaoG

Pull Request resolved: #74508

Reviewed By: pbelevich

Differential Revision: D35032152

Pulled By: wconstab

fbshipit-source-id: 526505ab355b7ad27037ece0ff814b2a4b69f1e2
@github-actions (Contributor):

Hey @wconstab.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.
