-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[DataPipe] Separating DataPipes from Dataset into different files #73396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit e963df1 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great! Thank you
Have you tested it with import torchdata. Just want to verify if this PR introduces any BC breaking changes.
And, what do you think about the name/location of this datapipe.py file because we have a folder called datapipes at the same directory. It can be confusing for users when doing import in python. Is that possible to move this file into datapipes folder but expose the API at the same place.
I think
I have tried moving the classes into the pytorch/torch/utils/data/__init__.py Lines 1 to 23 in 0b1f3bd
|
torch/utils/data/datapipe.py
Outdated
|
|
||
| from torch.utils.data._typing import _DataPipeMeta | ||
| from torch.utils.data._utils.serialization import serialize_fn, SerializationType, deserialize_fn | ||
| from torch.utils.data.dataset import IterableDataset, T_co, UNTRACABLE_DATAFRAME_PIPES, Dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T_co and UNTRACABLE_DATAFRAME_PIPES should be created from this file rather than importing them.
And, if we do import torch.utils.data.dataset.IterableDataset and import torch.utils.data.dataset.Dataset, would circular import still happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just refactored it, though most of the import statements for DataPipes have to be changed.
I am explicitly leaving graph.py and graph_settings.py in torch/utils/data (instead of moving them into the datapipes directory) because they are needed by DataLoader. Let me know if you think those two files should be moved as well.
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
ejguan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you soooo much! This is looks awesome.
…t files" Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Differential Revision: [D34481962](https://our.internmc.facebook.com/intern/diff/D34481962) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…3396) Summary: Pull Request resolved: #73396 Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes meta-pytorch/data#213 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D34481962 Pulled By: NivekT fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
Stack from ghstack:
Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.
I have also tried to move
datapipe.pyintotorch.utils.data.datapipes, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more?Fixes meta-pytorch/data#213
Differential Revision: D34481962