-
Notifications
You must be signed in to change notification settings - Fork 26.3k
type annotations for dataloader, dataset, sampler #39392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
type annotations for dataloader, dataset, sampler #39392
Conversation
💊 CI failures summary and remediationsAs of commit 590810a (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 73 times. |
d119a8f to
85efdfa
Compare
torch/utils/data/sampler.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the note at
pytorch/torch/utils/data/sampler.py
Lines 23 to 48 in 2b29fea
| # NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ] | |
| # | |
| # Many times we have an abstract class representing a collection/iterable of | |
| # data, e.g., `torch.utils.data.Sampler`, with its subclasses optionally | |
| # implementing a `__len__` method. In such cases, we must make sure to not | |
| # provide a default implementation, because both straightforward default | |
| # implementations have their issues: | |
| # | |
| # + `return NotImplemented`: | |
| # Calling `len(subclass_instance)` raises: | |
| # TypeError: 'NotImplementedType' object cannot be interpreted as an integer | |
| # | |
| # + `raise NotImplementedError()`: | |
| # This prevents triggering some fallback behavior. E.g., the built-in | |
| # `list(X)` tries to call `len(X)` first, and executes a different code | |
| # path if the method is not found or `NotImplemented` is returned, while | |
| # raising an `NotImplementedError` will propagate and and make the call | |
| # fail where it could have use `__iter__` to complete the call. | |
| # | |
| # Thus, the only two sensible things to do are | |
| # | |
| # + **not** provide a default `__len__`. | |
| # | |
| # + raise a `TypeError` instead, which is what Python uses when users call | |
| # a method that is not defined on an object. | |
| # (@ssnl verifies that this works on at least Python 3.7.) |
__len__. Do we need SizedSampler for all the types to check out? Scrolling through this PR it looks like SizedSampler is only used to construct the other Samplers
09b3792 to
be10252
Compare
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some questions/suggestions. The PR generally looks fine to me otherwise.
I'm not very familiar with how type hints for these files are tested in our CI. Is the test_typing in test_dataloader.py sufficient to check everything, or are there other tests somewhere?
|
Thanks for the review @zou3519. Re: tests, there is also |
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of the # type: ignores, but it doesn't look like we can do anything about it and this PR also makes the current state better than before so let's ship it
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
rgommers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too, thanks @Baranowski.
One thing I noticed is that this is still present in mypy.ini:
[mypy-torch.utils.data.dataset]
ignore_errors = True
I suspect you've run mypy directly on the file to test your changes, in which case the ignore doesn't matter. In CI the ignore does matter though, so I'd recommend to remove it in this PR (or as a follow up) if tests test_type_hints.py passes locally.
|
I was relying on test_type_hints.py, so I will need to fix up dataset.py as well |
rgommers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the couple of minor open comments.
The PR is still draft, I assume it's not anymore - looks about ready.
2118617 to
678e5cf
Compare
|
The CI failure looks unrelated so this PR should be ready now. |
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm as well
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some other internal failures that I am reading, will report back
torch/utils/data/dataset.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This triggered in one of the internal tests. index doesn't have to be an int and indeed in the documentation we have a note that we support non-integral indices/keys with custom samplers. So let's type index as Any. (Ideally we would also add a test to pytorch to type-check a custom sampler with non-integral key, but that sounds potentially annoying)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
91eb2c6 to
c672dda
Compare
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the last external change we need. I have some code ready to change some (previously incorrectly typed) internal code snippets that are using Dataloader / Dataset
torch/utils/data/dataloader.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to list sampler: Sampler here? Some internal code attempts to access dataloader.sampler and the type appears was inferred to be Optional[Sampler]. However, after __init__, we're sure that sampler is a Sampler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I verified that adding sampler: Sampler makes the problem go away, but I'm not sure if there was a reason why it wasn't here)
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@Baranowski, @ssnl, how do you feel about adding an abstract pytorch/torch/utils/data/sampler.py Line 23 in e440c37
The rationale behind this is that I'm sure there is a lot of user code out there that checks @ssnl I read pytorch/torch/utils/data/sampler.py Line 46 in e440c37
|
|
@zou3519 I don't have enough experience with Python to have a valuable opinion. I'm happy to do whatever you guys think is best. |
|
@zou3519 However the sampler doesn't necessarily have to be If we make In [4]: class A(typing.Sized, typing.Iterable):
...: def __iter__(self): yield from [1,2,3]
...:
In [5]: A()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-6234893e030b> in <module>
----> 1 A()
~/miniconda3/lib/python3.7/typing.py in __new__(cls, *args, **kwds)
816 obj = super().__new__(cls)
817 else:
--> 818 obj = super().__new__(cls, *args, **kwds)
819 return obj
820
TypeError: Can't instantiate abstract class A with abstract methods __len__I don't know if we should be adding a |
|
@ssnl thanks for the context and testing that. I'm not sure if we should raise a TypeError either. It sounds like we should keep things as is-right now (Sampler without |
Yeah that's correct. I'm working on the import and merge |
|
@Baranowski actually, could you please rebase this PR and resolve the merge conflict? I think it should just be accepting the changes to @Baranowski to provide some more transparency, we have some internal code that previously wasn't type checked but now is type checked as a result of adding the annotations. I've been working on correcting the annotations for that code and am done with that, so this should be good to go very soon after we resolve the merge conflict and wait for tests to pass |
28bd2f2 to
590810a
Compare
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #38913