Refactor push_dataset_to_hub #118
Conversation
LGTM! 🚀 Just a minor comment.
Could you also delete the `*_processor.py` files? They are still present in the project.
```python
# store the episode index
ep_dict["episode_index"] = torch.tensor([ep_idx] * num_frames, dtype=torch.int)
...
else:
    ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array]
```
Maybe add a to-do item stating that, in the future, the processing method will be limited to the video format only, since loading individual images into RAM consumes more than 130 GB. Or add a warning to notify the user that this line, `ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array]`, will use a lot of RAM.
I think there's a misunderstanding. In this line, `ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array]`, the image itself is added directly to the dictionary, not just its path in `tmp_umi_images` as in the previous version. Consequently, at the end of the loop, the `ep_dicts` variable will consume 130 GB of RAM.
Alternatively, in the else clause, we should save the images to a temporary folder and update `ep_dict["observation.image"]` to hold a list of paths from that folder. This approach lets the `Image` class from the `datasets` library convert these paths into a Hugging Face dataset without using excessive RAM (see the `datasets.Image` doc); a sketch of this approach follows below.
If the else block remains unchanged, then we should warn the user about the potential 130 GB RAM usage. However, if the else block is modified to use a temporary folder, we should retain the logging you've implemented.
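For illustration, a minimal sketch of the temporary-folder approach, assuming the `datasets` library; the folder layout, file names, and dummy frames are illustrative assumptions, not the script's actual code:

```python
from pathlib import Path

import numpy as np
from datasets import Dataset, Features, Image
from PIL import Image as PILImage

# Stand-ins for the variables in the script.
ep_idx = 0
imgs_array = np.zeros((4, 224, 224, 3), dtype=np.uint8)  # dummy frames

tmp_imgs_dir = Path("tmp_umi_images") / f"episode_{ep_idx:06d}"
tmp_imgs_dir.mkdir(parents=True, exist_ok=True)

# Write each frame to disk instead of keeping the PIL objects in RAM.
paths = []
for i, frame in enumerate(imgs_array):
    path = tmp_imgs_dir / f"frame_{i:06d}.png"
    PILImage.fromarray(frame).save(path)
    paths.append(str(path))

# datasets.Image() decodes images lazily from their paths, so building the
# Hugging Face dataset does not hold every frame in memory at once.
features = Features({"observation.image": Image()})
hf_dataset = Dataset.from_dict({"observation.image": paths}, features=features)
```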
@AdilZouitine I updated the warning message as you advised (a hypothetical sketch of such a warning is shown below).
FYI, I added the comment `# load 57MB of images in RAM (400x224x224x3 uint8)`: we have 1447 episodes, so 1447 * 57MB ≈ 82GB in RAM ^^
I think we don't care about this `video=False` setting.
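For reference, a hypothetical sketch of what such a warning could look like; the wording and the surrounding `video` flag are assumptions, not the PR's actual code:

```python
import logging

video = False  # the setting discussed in this thread

if not video:
    # load 57MB of images in RAM (400x224x224x3 uint8) per episode
    logging.warning(
        "video=False keeps all frames in RAM as PIL images: "
        "~57MB per episode x 1447 episodes ~= 82GB. "
        "Prefer video=True to avoid this."
    )
```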
force-pushed from 72bcfb9 to 98f4b19
LGTM! 🚀
What does this PR do?
- Adds `tests/scripts/save_dataset_to_safetensors.py` on 3 raw data formats: `pusht`, `aloha`, `xarm`
- Removes the `*Processor` classes and uses functions to simplify the code
- Renames `{dataset_id}_processor.py` to `{dataset_id}_{data_format}_format.py` to be more explicit (e.g. `aloha_hdf5_format.py`)
- Adds `save_dataset_to_safetensors` artifacts for unit tests to check backward compatibility

How was it tested?
Checked that this PR generates the same datasets on `pusht`, `xarm`, `aloha`:

- Saved frames to safetensors on `main` (also uncommented the extra frames tested); a sketch of such a check follows after this list.
- Generated datasets with the new code.
- Ran the unit tests (also uncommented the extra unit tests).
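For illustration, a minimal sketch of what a frames-to-safetensors backward-compatibility check can look like; the file name, keys, and stand-in tensors are assumptions, not the PR's actual test code:

```python
import torch
from safetensors.torch import load_file, save_file

# On `main`: save a few reference frames as a safetensors artifact.
old_frames = {"frame_0": torch.zeros(3, 224, 224)}  # stand-in for real frames
save_file(old_frames, "pusht_frames.safetensors")

# On the new branch: regenerate the same frames and compare against the artifact.
new_frames = {"frame_0": torch.zeros(3, 224, 224)}  # stand-in for real frames
reference = load_file("pusht_frames.safetensors")
for key, tensor in reference.items():
    torch.testing.assert_close(new_frames[key], tensor)
```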
How to checkout & try? (for the reviewer)
Before submitting
Please read the contributor guideline.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.