
Refactor push_dataset_to_hub #118

Merged

Cadene merged 10 commits into main from user/rcadene/2024_04_30_refactor_push_dataset_to_hub on Apr 30, 2024

Conversation

Cadene (Contributor) commented on Apr 30, 2024

What does this PR do?

  • Fix and run tests/scripts/save_dataset_to_safetensors.py on 3 raw data formats: pusht, aloha, xarm
  • Remove the *Processor classes and use plain functions to simplify the code (a hypothetical sketch follows this list)
  • Rename {dataset_id}_processor.py to {dataset_id}_{data_format}_format.py to be more explicit (e.g. aloha_hdf5_format.py)
  • Remove some docstrings and type annotations to simplify the code
  • Simplify the script's CLI arguments
  • Add video support, deactivated by default (to be validated in a follow-up PR)
  • Add more save_dataset_to_safetensors artifacts so unit tests can check backward compatibility
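
For context, a minimal sketch of what the function-based layout could look like; the signature, helper logic, and defaults below are assumptions, not the PR's verbatim API:

# Hypothetical sketch (not the PR's verbatim code) of a function-based format
# module such as aloha_hdf5_format.py, replacing the old *Processor classes.
from pathlib import Path

from datasets import Dataset

def from_raw_to_lerobot_format(raw_dir: Path, fps: int = 50, video: bool = False, debug: bool = False):
    """Convert raw episodes into a Hugging Face dataset plus metadata."""
    ep_dicts = []
    for ep_path in sorted(raw_dir.glob("episode_*.hdf5")):
        # ... read observations, actions, and episode indices from the file ...
        ep_dicts.append({"episode_path": [str(ep_path)]})
        if debug:
            break  # convert a single episode when debugging
    if not ep_dicts:
        raise ValueError(f"no episodes found in {raw_dir}")
    # concatenate per-episode columns into one flat table
    data = {key: sum((d[key] for d in ep_dicts), []) for key in ep_dicts[0]}
    hf_dataset = Dataset.from_dict(data)
    info = {"fps": fps, "video": video}
    return hf_dataset, info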

How was it tested?

Checked that this PR generates the same datasets on pusht, xarm, and aloha.

Saved frames to safetensors on main (after uncommenting the extra frames to test):

python tests/scripts/save_dataset_to_safetensors.py

Generated datasets on new code:

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id pusht \
--raw-format pusht_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id xarm_lift_medium \
--raw-format xarm_pkl \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id aloha_sim_insertion_scripted \
--raw-format aloha_hdf5 \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

Ran the unit tests (after uncommenting the extra unit tests):

DATA_DIR=data pytest -sx tests/test_datasets.py::test_backward_compatibility
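
The backward-compatibility test compares frames produced by the refactored code against the safetensors artifacts saved on main; a minimal sketch of that comparison (assumed shape, not the test's verbatim code):

# Minimal sketch (assumed, not the verbatim test) of the backward-compatibility
# check: artifacts saved on main are reloaded and compared key by key against
# frames produced by the refactored code.
from pathlib import Path

import torch
from safetensors.torch import load_file

def assert_frames_unchanged(artifact_path: Path, new_frame: dict):
    old_frame = load_file(str(artifact_path))  # tensors saved on main
    for key, old_tensor in old_frame.items():
        assert torch.equal(old_tensor, new_frame[key]), f"mismatch on '{key}'"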

How to check out & try? (for the reviewer)

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id pusht \
--raw-format pusht_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id xarm_lift_medium \
--raw-format xarm_pkl \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id aloha_sim_insertion_scripted \
--raw-format aloha_hdf5 \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id umi_cup_in_the_wild \
--raw-format umi_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1
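
The 0/1 values above suggest booleans passed as integers; a hypothetical sketch of how such a CLI could be wired with argparse (the actual push_dataset_to_hub.py may differ):

# Hypothetical argparse wiring for the flags used above; not the script's
# verbatim code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default="data")
parser.add_argument("--dataset-id", type=str, required=True)
parser.add_argument("--raw-format", type=str, required=True)  # e.g. pusht_zarr, xarm_pkl, aloha_hdf5, umi_zarr
parser.add_argument("--community-id", type=str, default="lerobot")
parser.add_argument("--revision", type=str, default="v1.2")
# booleans passed as 0/1 integers rather than store_true flags
parser.add_argument("--dry-run", type=int, default=0)
parser.add_argument("--save-to-disk", type=int, default=0)
parser.add_argument("--save-tests-to-disk", type=int, default=0)
parser.add_argument("--debug", type=int, default=0)
args = parser.parse_args()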

Before submitting

Please read the contributor guideline.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Cadene changed the title from "Refactor push_dataset_to_hub.py" to "Refactor push_dataset_to_hub" on Apr 30, 2024
aliberts added the dataset and 🔄 Refactor labels on Apr 30, 2024
AdilZouitine (Collaborator) left a comment

LGTM! 🚀 Just a minor comment.
Could you also delete the *_processor.py files? They are still present in the project.

Comment thread: lerobot/common/datasets/push_dataset_to_hub/aloha_hdf5_format.py

# store the episode index
ep_dict["episode_index"] = torch.tensor([ep_idx] * num_frames, dtype=torch.int)
else:
    ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array]
AdilZouitine (Collaborator) commented on Apr 30, 2024:

Maybe add a to-do item stating that, in the future, the processing method will be limited to the video format only, since loading individual images into RAM consumes more than 130 GB. Or add a warning to notify the user that this line, ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array], will use a lot of RAM.

Cadene (Author) replied:

Good idea ;) Done

AdilZouitine (Collaborator) commented on Apr 30, 2024:

I think there's a misunderstanding. In this line: ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array], the image itself is directly added to the dictionary, not just the path in 'tmp_umi_images' as in the previous version. Consequently, at the end of the loop, the ep_dicts variable will consume 130 GB of RAM.

Alternatively, in the else clause, we should save the images to a temporary folder and set ep_dict["observation.image"] to the list of paths in that folder. This lets the Image feature from the datasets library convert the paths into a Hugging Face dataset without excessive RAM usage (see the datasets.Image documentation).

If the else block remains unchanged, then we should warn the user about the potential 130 GB of RAM usage. However, if the else block is modified to use a temporary folder, we should retain the logging you've implemented. A sketch of the temporary-folder approach follows.
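
For illustration, a minimal sketch of that suggestion, assuming hypothetical names (imgs_array, tmp_imgs_dir) rather than the PR's actual code:

# Sketch of the reviewer's suggestion (hypothetical names, not the PR's code):
# store image paths instead of decoded arrays and let datasets.Image() load
# frames lazily, keeping the RAM footprint small.
import tempfile
from pathlib import Path

import numpy as np
from datasets import Dataset, Features, Image
from PIL import Image as PILImage

imgs_array = np.zeros((3, 224, 224, 3), dtype=np.uint8)  # stand-in for real frames
tmp_imgs_dir = Path(tempfile.mkdtemp(prefix="tmp_images_"))

img_paths = []
for i, x in enumerate(imgs_array):
    path = tmp_imgs_dir / f"frame_{i:06d}.png"
    PILImage.fromarray(x).save(path)
    img_paths.append(str(path))

# only paths are kept in memory; frames are decoded on access
ep_dict = {"observation.image": img_paths}
hf_dataset = Dataset.from_dict(ep_dict, features=Features({"observation.image": Image()}))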

Cadene (Author) replied:

@AdilZouitine I updated the warning message as you advised.

FYI, I added the comment "# load 57MB of images in RAM (400x224x224x3 uint8)", and we have 1447 episodes, so 1447 * 57 MB ≈ 82 GB in RAM ^^

I think we don't need to worry too much about this video=False setting.
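
For reference, the arithmetic in that comment holds up; a quick sanity computation (not code from the PR):

# Quick sanity check of the RAM estimate quoted above (uint8, uncompressed).
num_frames, h, w, c = 400, 224, 224, 3
per_episode_mb = num_frames * h * w * c / 1024**2  # ~57.4 MB per episode
total_gb = per_episode_mb * 1447 / 1024            # ~81 GB across 1447 episodes
print(f"{per_episode_mb:.1f} MB/episode -> {total_gb:.1f} GB total")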

Comment thread: lerobot/scripts/push_dataset_to_hub.py
Cadene force-pushed the user/rcadene/2024_04_30_refactor_push_dataset_to_hub branch from 72bcfb9 to 98f4b19 on April 30, 2024, 11:16
Cadene self-assigned this on Apr 30, 2024
Cadene marked this pull request as ready for review on April 30, 2024, 11:21
AdilZouitine (Collaborator) commented on Apr 30, 2024:

LGTM! 🚀

Cadene merged commit e4e739f into main on Apr 30, 2024
Cadene deleted the user/rcadene/2024_04_30_refactor_push_dataset_to_hub branch on April 30, 2024, 12:25
menhguin pushed a commit to menhguin/lerobot that referenced this pull request on Feb 9, 2025
Kalcy-U referenced this pull request in Kalcy-U/lerobot on May 13, 2025
ZoreAnuj pushed a commit to luckyrobots/lerobot that referenced this pull request on Jul 29, 2025

Labels

dataset (Issues regarding data inputs, processing, or datasets), 🔄 Refactor

3 participants