type annotations for dataloader, dataset, sampler #39392

Baranowski · 2020-06-02T08:26:58Z

Fixes #38913

dr-ci · 2020-06-02T08:39:02Z

💊 CI failures summary and remediations

As of commit 590810a (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


zou3519 · 2020-06-11T19:27:50Z

torch/utils/data/sampler.py

According to the note at pytorch/torch/utils/data/sampler.py, lines 23 to 48 in 2b29fea:

# NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
#
# Many times we have an abstract class representing a collection/iterable of
# data, e.g., `torch.utils.data.Sampler`, with its subclasses optionally
# implementing a `__len__` method. In such cases, we must make sure to not
# provide a default implementation, because both straightforward default
# implementations have their issues:
#
#   + `return NotImplemented`:
#     Calling `len(subclass_instance)` raises:
#       TypeError: 'NotImplementedType' object cannot be interpreted as an integer
#
#   + `raise NotImplementedError()`:
#     This prevents triggering some fallback behavior. E.g., the built-in
#     `list(X)` tries to call `len(X)` first, and executes a different code
#     path if the method is not found or `NotImplemented` is returned, while
#     raising an `NotImplementedError` will propagate and and make the call
#     fail where it could have use `__iter__` to complete the call.
#
# Thus, the only two sensible things to do are
#
#   + **not** provide a default `__len__`.
#
#   + raise a `TypeError` instead, which is what Python uses when users call
#     a method that is not defined on an object.
#     (@ssnl verifies that this works on at least Python 3.7.)

we should not be providing a default implementation of __len__. Do we need SizedSampler for all the types to check out? Scrolling through this PR, it looks like SizedSampler is only used to construct the other Samplers.
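The behaviours the note describes can be reproduced with plain Python classes. The following is a minimal sketch, assuming CPython; the class names and the `outcome` helper are hypothetical and are not part of PyTorch.

```python
class NoLen:
    """Iterable with no __len__ at all (the note's first recommendation)."""

    def __iter__(self):
        return iter([0, 1, 2])


class ReturnsNotImplemented(NoLen):
    def __len__(self):
        return NotImplemented  # len() rejects this: it is not an integer


class RaisesNotImplementedError(NoLen):
    def __len__(self):
        raise NotImplementedError  # not a TypeError, so list() lets it propagate


class RaisesTypeError(NoLen):
    def __len__(self):
        raise TypeError("no length")  # list() falls back to __iter__


def outcome(fn, obj):
    """Return fn(obj), or the name of the exception type it raised."""
    try:
        return fn(obj)
    except Exception as exc:
        return type(exc).__name__


for cls in (NoLen, ReturnsNotImplemented, RaisesNotImplementedError, RaisesTypeError):
    print(cls.__name__, "len:", outcome(len, cls()), "list:", outcome(list, cls()))
```

Only the `NotImplementedError` variant should break `list(...)`, which is the fallback problem the note warns about.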

torch/utils/data/dataset.py

zou3519

I had some questions/suggestions. The PR generally looks fine to me otherwise.

I'm not very familiar with how type hints for these files are tested in our CI. Is the test_typing in test_dataloader.py sufficient to check everything, or are there other tests somewhere?

torch/utils/data/sampler.py

torch/utils/data/dataloader.py

Baranowski · 2020-06-16T20:34:04Z

Thanks for the review @zou3519. Re: tests, there is also test/test_type_hints.py which runs mypy.

zou3519

I'm not a big fan of the # type: ignore comments, but it doesn't look like we can do anything about them, and this PR makes the current state better than before, so let's ship it.

facebook-github-bot

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

rgommers

LGTM too, thanks @Baranowski.

One thing I noticed is that this is still present in mypy.ini:

[mypy-torch.utils.data.dataset]
ignore_errors = True

I suspect you've run mypy directly on the file to test your changes, in which case the ignore doesn't matter. In CI the ignore does matter though, so I'd recommend removing it in this PR (or as a follow-up) if test_type_hints.py passes locally.
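One way to spot leftover per-module ignores like this is to parse the config and list them. This is a rough sketch using only the standard library; the `MYPY_INI` text is a hypothetical excerpt (only the dataset section comes from the comment above), and `ignored_modules` is an illustrative helper, not part of mypy or PyTorch.

```python
import configparser

# Hypothetical excerpt; the second section is made up for contrast.
MYPY_INI = """
[mypy-torch.utils.data.dataset]
ignore_errors = True

[mypy-some.other.module]
ignore_errors = False
"""


def ignored_modules(ini_text):
    """Return module patterns whose sections set ignore_errors = True."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    prefix = "mypy-"
    return [
        section[len(prefix):]
        for section in parser.sections()
        if section.startswith(prefix)
        and parser.getboolean(section, "ignore_errors", fallback=False)
    ]


print(ignored_modules(MYPY_INI))
```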

Baranowski · 2020-06-18T11:59:09Z

I was relying on test_type_hints.py, so I will need to fix up dataset.py as well

torch/utils/data/dataloader.py

torch/utils/data/dataset.py

rgommers

LGTM modulo the couple of minor open comments.

The PR is still marked as a draft. I assume that's no longer intended, since it looks about ready.

Baranowski · 2020-06-19T13:07:54Z

The CI failure looks unrelated so this PR should be ready now.

zou3519

lgtm as well

facebook-github-bot

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zou3519

There are some other internal failures that I am reading, will report back

zou3519 · 2020-06-22T14:48:38Z

torch/utils/data/dataset.py

This triggered a failure in one of the internal tests. index doesn't have to be an int; indeed, the documentation has a note that we support non-integral indices/keys with custom samplers. So let's type index as Any. (Ideally we would also add a test to PyTorch that type-checks a custom sampler with a non-integral key, but that sounds potentially annoying.)
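To illustrate the non-integral key case, here is a small sketch with stand-in classes. `KeyedDataset` and `KeySampler` are hypothetical and do not subclass the real `torch.utils.data` types; they only mirror the shape of a map-style dataset and a custom sampler.

```python
from typing import Any, Dict, Iterator, List


class KeyedDataset:
    """Stand-in for a map-style dataset backed by a dict with string keys."""

    def __init__(self, records: Dict[str, float]) -> None:
        self.records = records

    # If index were annotated as int, a type checker would reject the
    # string keys used below even though they work at runtime.
    def __getitem__(self, index: Any) -> float:
        return self.records[index]


class KeySampler:
    """Stand-in for a custom sampler that yields string keys, not ints."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def __iter__(self) -> Iterator[str]:
        return iter(self.keys)


dataset = KeyedDataset({"a": 1.0, "b": 2.0})
samples = [dataset[key] for key in KeySampler(["b", "a"])]
print(samples)
```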

zou3519

I think this is the last external change we need. I have some code ready to change some (previously incorrectly typed) internal code snippets that use DataLoader / Dataset.

zou3519 · 2020-06-24T21:19:36Z

torch/utils/data/dataloader.py

Is it possible to list sampler: Sampler here? Some internal code attempts to access dataloader.sampler, and the type appears to have been inferred as Optional[Sampler]. However, after __init__, we're sure that sampler is a Sampler.

(I verified that adding sampler: Sampler makes the problem go away, but I'm not sure if there was a reason why it wasn't here)
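A sketch of what such a class-level declaration looks like, using hypothetical stand-ins (`Sampler`, `Loader`, `first_index`) rather than the real DataLoader code. How a particular checker infers the attribute type without the declaration may vary, so treat this only as an illustration of the suggestion.

```python
from typing import Iterator, Optional


class Sampler:
    """Stand-in for torch.utils.data.Sampler."""

    def __iter__(self) -> Iterator[int]:
        return iter(range(3))


class Loader:
    """Stand-in for DataLoader, reduced to the sampler attribute."""

    # Explicit class-level declaration: the attribute is always a Sampler
    # once __init__ has run, even though the argument may be None.
    sampler: Sampler

    def __init__(self, sampler: Optional[Sampler] = None) -> None:
        if sampler is None:
            sampler = Sampler()  # default used when the caller passes None
        self.sampler = sampler


def first_index(loader: Loader) -> int:
    # No None check is needed because the attribute is declared as Sampler.
    return next(iter(loader.sampler))


print(first_index(Loader()))
```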

facebook-github-bot

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zou3519 · 2020-06-25T14:44:23Z

@Baranowski, @ssnl, how do you feel about adding an abstract __len__ method to Sampler?

pytorch/torch/utils/data/sampler.py, line 23 in e440c37:

# NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]

The rationale behind this is that I'm sure there is a lot of user code out there that checks len(loader.sampler), and it feels wrong for that to not type-check if most of the Samplers included with PyTorch support __len__. Making Sampler inherit from typing.Sized gives Sampler a __len__ function that raises a TypeError.

@ssnl I read

pytorch/torch/utils/data/sampler.py, line 46 in e440c37:

# + raise a `TypeError` instead, which is what Python uses when users call

and I think that if we verify, for all versions of Python that we support, that the following gives something sensible, then it should be an acceptable solution.

import typing

class Foo(typing.Sized):
    pass

len(Foo())

Baranowski · 2020-06-25T17:06:15Z

@zou3519 I don't have enough experience with Python to have a valuable opinion. I'm happy to do whatever you guys think is best.

ssnl · 2020-06-25T19:45:47Z

@zou3519 However, the sampler doesn't necessarily have to be Sized (i.e., have a working __len__). Really, it is only required to be an Iterable.

If we make Sampler inherit from Sized and Iterable and not implement __len__ then abc complains:

In [4]: class A(typing.Sized, typing.Iterable):
   ...:     def __iter__(self): yield from [1,2,3]
   ...:

In [5]: A()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-6234893e030b> in <module>
----> 1 A()

~/miniconda3/lib/python3.7/typing.py in __new__(cls, *args, **kwds)
    816             obj = super().__new__(cls)
    817         else:
--> 818             obj = super().__new__(cls, *args, **kwds)
    819         return obj
    820

TypeError: Can't instantiate abstract class A with abstract methods __len__

I don't know if we should be adding a __len__ that just raises TypeError to resolve this... Maybe we should?
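For comparison, here is a minimal sketch of both variants side by side. It uses `collections.abc` directly, and the class names are hypothetical. The second class shows the "`__len__` that just raises `TypeError`" option being weighed here.

```python
from collections.abc import Iterable, Sized


class WithoutLen(Sized, Iterable):
    """Inherits the abstract __len__ from Sized and never implements it."""

    def __iter__(self):
        yield from [1, 2, 3]


class WithTypeErrorLen(Sized, Iterable):
    """Implements __len__ only to signal that no length is available."""

    def __iter__(self):
        yield from [1, 2, 3]

    def __len__(self):
        raise TypeError("this sampler has no length")


try:
    WithoutLen()
except TypeError as exc:
    print("cannot instantiate:", exc)

sampler = WithTypeErrorLen()  # instantiation works: __len__ is defined
print(list(sampler))          # list() swallows the TypeError and uses __iter__
```

The trade-off is that `isinstance(sampler, Sized)` is then true even though `len(sampler)` raises.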

zou3519 · 2020-06-30T14:11:28Z

@ssnl thanks for the context and testing that. I'm not sure if we should raise a TypeError either. It sounds like we should keep things as is right now (Sampler without a __len__ method), and mark this PR as potentially bc-breaking (it can break type-checked user code that tries to access dataloader.sampler; the workaround would be to add a # type: ignore or to use typing.cast to cast the sampler to the expected sampler type).
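The `typing.cast` workaround might look like the sketch below. All three classes are hypothetical stand-ins, not the real `torch.utils.data` classes; the point is only that the base class declares no `__len__` while a concrete subclass does.

```python
from typing import Iterator, cast


class Sampler:
    """Stand-in base class: iterable, but declares no __len__."""

    def __iter__(self) -> Iterator[int]:
        return iter(())


class SequentialSampler(Sampler):
    """Stand-in for a concrete sampler that knows its length."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.n))

    def __len__(self) -> int:
        return self.n


class Loader:
    sampler: Sampler

    def __init__(self, sampler: Sampler) -> None:
        self.sampler = sampler


loader = Loader(SequentialSampler(4))
# A checker would flag len(loader.sampler), since Sampler declares no __len__.
# cast() does nothing at runtime; it only tells the checker which type to assume.
num_samples = len(cast(SequentialSampler, loader.sampler))
print(num_samples)
```

Because `cast` is unchecked, it is only safe when the caller already knows which sampler type the loader was built with.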

Baranowski · 2020-07-04T08:27:09Z

@zou3519 @ssnl so as I understand, there is nothing more for me to do here? Is this ready to import and merge?

zou3519 · 2020-07-06T18:10:44Z

so as I understand, there is nothing more for me to do here? Is this ready to import and merge?

Yeah that's correct. I'm working on the import and merge

zou3519 · 2020-07-06T18:15:24Z

@Baranowski actually, could you please rebase this PR and resolve the merge conflict? I think it should just be accepting the changes to tools/pyi/gen_pyi.py

@Baranowski to provide some more transparency, we have some internal code that previously wasn't type checked but now is type checked as a result of adding the annotations. I've been working on correcting the annotations for that code and am done with that, so this should be good to go very soon after we resolve the merge conflict and wait for tests to pass

facebook-github-bot

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pytorchbot added the open source label Jun 2, 2020

Baranowski force-pushed the wbaranowski-dataset_typing-38913 branch 2 times, most recently from d119a8f to 85efdfa Compare June 8, 2020 06:22

Baranowski changed the title ~~[WiP] type annotations for dataloader, dataset, sampler~~ type annotations for dataloader, dataset, sampler Jun 10, 2020

Baranowski marked this pull request as ready for review June 10, 2020 05:21

Baranowski requested a review from apaszke as a code owner June 10, 2020 05:21

ngimel requested a review from zou3519 June 10, 2020 19:21

ngimel added the triaged label Jun 10, 2020

zou3519 reviewed Jun 11, 2020

jeongukjae reviewed Jun 12, 2020

torch/utils/data/dataset.py (outdated)

Baranowski force-pushed the wbaranowski-dataset_typing-38913 branch from 09b3792 to be10252 Compare June 16, 2020 07:53

zou3519 reviewed Jun 16, 2020

torch/utils/data/sampler.py (outdated)

torch/utils/data/dataloader.py (outdated)

torch/utils/data/dataloader.py (outdated)

torch/utils/data/dataloader.py (outdated)

zou3519 approved these changes Jun 17, 2020

facebook-github-bot reviewed Jun 17, 2020

rgommers reviewed Jun 18, 2020

Baranowski marked this pull request as draft June 18, 2020 11:58

rgommers reviewed Jun 18, 2020

torch/utils/data/dataloader.py (outdated)

rgommers reviewed Jun 18, 2020

torch/utils/data/dataset.py (outdated)

rgommers approved these changes Jun 18, 2020

Baranowski force-pushed the wbaranowski-dataset_typing-38913 branch from 2118617 to 678e5cf Compare June 19, 2020 07:29

Baranowski marked this pull request as ready for review June 19, 2020 13:07

zou3519 approved these changes Jun 19, 2020

facebook-github-bot reviewed Jun 19, 2020

zou3519 reviewed Jun 22, 2020

Baranowski force-pushed the wbaranowski-dataset_typing-38913 branch from 91eb2c6 to c672dda Compare June 24, 2020 08:12

zou3519 reviewed Jun 24, 2020

facebook-github-bot reviewed Jun 25, 2020

Baranowski added 8 commits July 6, 2020 22:42

Type annotations for dataset and dataloader

c10b68f

BatchSampler __len__ comment

c1a0338

Ref NOTE about lack of __len__

f447614

Fix up dataset.py

b08650a

TensorDataset.tensors: Tuple[Tensor, ...]

1e008f6

update comment

d7b3c46

Dataset.__getitem__ doesn't require int

7274967

sampler: Sampler

590810a

Baranowski force-pushed the wbaranowski-dataset_typing-38913 branch from 28bd2f2 to 590810a Compare July 6, 2020 19:44

facebook-github-bot reviewed Jul 6, 2020

facebook-github-bot closed this in 0e09511 Jul 7, 2020

jeongukjae mentioned this pull request Jul 21, 2020

inline DistributedSampler's type annotations to python module #41778

Closed
