
Refactor push_dataset_to_hub #118

Merged

Cadene merged 10 commits into main from user/rcadene/2024_04_30_refactor_push_dataset_to_hub on Apr 30, 2024

Conversation

Cadene (Contributor) commented on Apr 30, 2024

What does this PR do?

  • Fix and run tests/scripts/save_dataset_to_safetensors.py on 3 raw data formats: pusht, aloha, xarm
  • Remove the *Processor classes and use plain functions to simplify the code (a hypothetical sketch follows this list)
  • Rename {dataset_id}_processor.py to {dataset_id}_{data_format}_format.py to be more explicit (e.g. aloha_hdf5_format.py)
  • Remove some docstrings and type annotations to simplify the code
  • Simplify the script's CLI arguments
  • Add video support, deactivated by default (to be validated in a follow-up PR)
  • Add more save_dataset_to_safetensors artifacts so unit tests can check backward compatibility
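
For context, a minimal sketch of what the function-based layout could look like; the signature, helper logic, and defaults below are assumptions, not the PR's verbatim API:

# Hypothetical sketch (not the PR's verbatim code) of a function-based format
# module such as aloha_hdf5_format.py, replacing the old *Processor classes.
from pathlib import Path

from datasets import Dataset

def from_raw_to_lerobot_format(raw_dir: Path, fps: int = 50, video: bool = False, debug: bool = False):
    """Convert raw episodes into a Hugging Face dataset plus metadata."""
    ep_dicts = []
    for ep_path in sorted(raw_dir.glob("episode_*.hdf5")):
        # ... read observations, actions, and episode indices from the file ...
        ep_dicts.append({"episode_path": [str(ep_path)]})
        if debug:
            break  # convert a single episode when debugging
    if not ep_dicts:
        raise ValueError(f"no episodes found in {raw_dir}")
    # concatenate per-episode columns into one flat table
    data = {key: sum((d[key] for d in ep_dicts), []) for key in ep_dicts[0]}
    hf_dataset = Dataset.from_dict(data)
    info = {"fps": fps, "video": video}
    return hf_dataset, info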

How was it tested?

Checked that this PR generates the same datasets on pusht, xarm, and aloha.

Saved frames to safetensors on main (after uncommenting the extra frames to test):

python tests/scripts/save_dataset_to_safetensors.py

Generated datasets on new code:

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id pusht \
--raw-format pusht_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id xarm_lift_medium \
--raw-format xarm_pkl \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id aloha_sim_insertion_scripted \
--raw-format aloha_hdf5 \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 1

Ran the unit tests (after uncommenting the extra unit tests):

DATA_DIR=data pytest -sx tests/test_datasets.py::test_backward_compatibility
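
The backward-compatibility test compares frames produced by the refactored code against the safetensors artifacts saved on main; a minimal sketch of that comparison (assumed shape, not the test's verbatim code):

# Minimal sketch (assumed, not the verbatim test) of the backward-compatibility
# check: artifacts saved on main are reloaded and compared key by key against
# frames produced by the refactored code.
from pathlib import Path

import torch
from safetensors.torch import load_file

def assert_frames_unchanged(artifact_path: Path, new_frame: dict):
    old_frame = load_file(str(artifact_path))  # tensors saved on main
    for key, old_tensor in old_frame.items():
        assert torch.equal(old_tensor, new_frame[key]), f"mismatch on '{key}'"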

How to check out & try? (for the reviewer)

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id pusht \
--raw-format pusht_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id xarm_lift_medium \
--raw-format xarm_pkl \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id aloha_sim_insertion_scripted \
--raw-format aloha_hdf5 \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1

python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--dataset-id umi_cup_in_the_wild \
--raw-format umi_zarr \
--community-id lerobot \
--revision v1.2 \
--dry-run 1 \
--save-to-disk 1 \
--save-tests-to-disk 0 \
--debug 1
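
The 0/1 values above suggest booleans passed as integers; a hypothetical sketch of how such a CLI could be wired with argparse (the actual push_dataset_to_hub.py may differ):

# Hypothetical argparse wiring for the flags used above; not the script's
# verbatim code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default="data")
parser.add_argument("--dataset-id", type=str, required=True)
parser.add_argument("--raw-format", type=str, required=True)  # e.g. pusht_zarr, xarm_pkl, aloha_hdf5, umi_zarr
parser.add_argument("--community-id", type=str, default="lerobot")
parser.add_argument("--revision", type=str, default="v1.2")
# booleans passed as 0/1 integers rather than store_true flags
parser.add_argument("--dry-run", type=int, default=0)
parser.add_argument("--save-to-disk", type=int, default=0)
parser.add_argument("--save-tests-to-disk", type=int, default=0)
parser.add_argument("--debug", type=int, default=0)
args = parser.parse_args()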

Before submitting

Please read the contributor guideline.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Cadene changed the title from "Refactor push_dataset_to_hub.py" to "Refactor push_dataset_to_hub" on Apr 30, 2024
aliberts added the dataset and 🔄 Refactor labels on Apr 30, 2024
AdilZouitine (Collaborator) left a comment

LGTM! 🚀 Just a minor comment.
Could you also delete the *_processor.py files? They are still present in the project.

Comment thread: lerobot/common/datasets/push_dataset_to_hub/aloha_hdf5_format.py

# store the episode index
ep_dict["episode_index"] = torch.tensor([ep_idx] * num_frames, dtype=torch.int)
else:
    ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array]
AdilZouitine (Collaborator) commented on Apr 30, 2024:

Maybe add a to-do item stating that, in the future, the processing method will be limited to the video format only, since loading individual images into RAM consumes more than 130 GB. Or add a warning to notify the user that this line, ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array], will use a lot of RAM.

Cadene (Author) replied:

Good idea ;) Done

AdilZouitine (Collaborator) commented on Apr 30, 2024:

I think there's a misunderstanding. In this line: ep_dict["observation.image"] = [PILImage.fromarray(x) for x in imgs_array], the image itself is directly added to the dictionary, not just the path in 'tmp_umi_images' as in the previous version. Consequently, at the end of the loop, the ep_dicts variable will consume 130 GB of RAM.

Alternatively, in the else clause, we should save the images to a temporary folder and set ep_dict["observation.image"] to the list of paths in that folder. This lets the Image feature from the datasets library convert the paths into a Hugging Face dataset without excessive RAM usage (see the datasets.Image documentation).

If the else block remains unchanged, then we should warn the user about the potential 130 GB of RAM usage. However, if the else block is modified to use a temporary folder, we should retain the logging you've implemented. A sketch of the temporary-folder approach follows.
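
For illustration, a minimal sketch of that suggestion, assuming hypothetical names (imgs_array, tmp_imgs_dir) rather than the PR's actual code:

# Sketch of the reviewer's suggestion (hypothetical names, not the PR's code):
# store image paths instead of decoded arrays and let datasets.Image() load
# frames lazily, keeping the RAM footprint small.
import tempfile
from pathlib import Path

import numpy as np
from datasets import Dataset, Features, Image
from PIL import Image as PILImage

imgs_array = np.zeros((3, 224, 224, 3), dtype=np.uint8)  # stand-in for real frames
tmp_imgs_dir = Path(tempfile.mkdtemp(prefix="tmp_images_"))

img_paths = []
for i, x in enumerate(imgs_array):
    path = tmp_imgs_dir / f"frame_{i:06d}.png"
    PILImage.fromarray(x).save(path)
    img_paths.append(str(path))

# only paths are kept in memory; frames are decoded on access
ep_dict = {"observation.image": img_paths}
hf_dataset = Dataset.from_dict(ep_dict, features=Features({"observation.image": Image()}))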

Cadene (Author) replied:

@AdilZouitine I updated the warning message as you advised.

FYI, I added the comment "# load 57MB of images in RAM (400x224x224x3 uint8)", and we have 1447 episodes, so 1447 * 57 MB ≈ 82 GB in RAM ^^

I think we don't need to worry too much about this video=False setting.
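
For reference, the arithmetic in that comment holds up; a quick sanity computation (not code from the PR):

# Quick sanity check of the RAM estimate quoted above (uint8, uncompressed).
num_frames, h, w, c = 400, 224, 224, 3
per_episode_mb = num_frames * h * w * c / 1024**2  # ~57.4 MB per episode
total_gb = per_episode_mb * 1447 / 1024            # ~81 GB across 1447 episodes
print(f"{per_episode_mb:.1f} MB/episode -> {total_gb:.1f} GB total")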

Comment thread: lerobot/scripts/push_dataset_to_hub.py
Cadene force-pushed the user/rcadene/2024_04_30_refactor_push_dataset_to_hub branch from 72bcfb9 to 98f4b19 on April 30, 2024, 11:16
Cadene self-assigned this on Apr 30, 2024
Cadene marked this pull request as ready for review on April 30, 2024, 11:21
AdilZouitine (Collaborator) commented on Apr 30, 2024:

LGTM! 🚀

Cadene merged commit e4e739f into main on Apr 30, 2024
Cadene deleted the user/rcadene/2024_04_30_refactor_push_dataset_to_hub branch on April 30, 2024, 12:25
menhguin pushed a commit to menhguin/lerobot that referenced this pull request on Feb 9, 2025
Kalcy-U referenced this pull request in Kalcy-U/lerobot on May 13, 2025
ZoreAnuj pushed a commit to luckyrobots/lerobot that referenced this pull request on Jul 29, 2025

Labels

dataset (Issues regarding data inputs, processing, or datasets), 🔄 Refactor

3 participants