
I have large .h5 files of high-resolution images (~300 MB each, 200 images per .h5 file) and need to load samples in Python. The current setup uses a separate dataset for each sample:

data_group.create_dataset(k_name, data=knth)
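Roughly, the full write pattern looks like the toy sketch below (the group name, key pattern, and reduced resolution are placeholders for illustration; the real grids are much larger):

import h5py
import numpy as np

# Toy version of the current layout: one dataset per sample.
resolution = (8, 8, 8, 8, 8)  # reduced; the real samples are shaped (2, 32, 32, 32, 32, 32)

with h5py.File("src.h5", "w") as f:
    data_group = f.create_group("data")
    for i in range(200):
        k_name = f"k{i}"
        knth = np.random.rand(2, *resolution).astype("float32")  # placeholder data
        data_group.create_dataset(k_name, data=knth)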

And the dataset is simply accessed as

with h5py.File("src.h5", "r") as f:
    sample = load_data(f)
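where load_data is, in essence, a helper along these lines (simplified stand-in; the group and key names match the sketch above):

import h5py

def load_data(f, keys=None):
    # Read each requested per-sample dataset fully into memory as a NumPy array.
    group = f["data"]
    if keys is None:
        keys = list(group.keys())
    return [group[k][()] for k in keys]

with h5py.File("src.h5", "r") as f:
    batch = load_data(f, keys=["k0", "k1"])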

Chunking seems to help only marginally (when loading two samples at a time), and the chunks significantly exceed the recommended size of 1 MB.

data_group.create_dataset("knth", data=data, chunks=(2, *resolution))

The main goal is to speed up access (the data is used for machine-learning training); writing time is not a concern.

I am aware that the sheer amount of data is the limiting factor here, but is there any setting within h5py that can help? If not, is there an alternative to h5py better suited to this scenario?

Thanks!

  • Can you tell us more about the "need to load samples"? I expect the h5py reads to be bound by the storage device bandwidth here. Storage devices are generally slow. You can profile it to be sure. Commented Mar 7 at 16:31
  • @JérômeRichard Sure. Samples are randomly accessed and loaded in batches used for training, then discarded. The entire dataset cannot fit in memory, but a possibility there would be smarter sampling than purely random. The "raw" disk read speed is around 5.0 GB/s, which is faster than reading from HDF5 at ~1.5 GB/s. I can do some proper profiling (a rough timing sketch is shown after these comments). Commented Mar 7 at 17:07
  • But what are those samples? Are they a sub-slice? How big are they (shape)? How many are there per image on average? How many samples are there in total? This matters for performance. Commented Mar 8 at 11:58
  • It's generally not a good idea to set chunks significantly larger than 1 MB. To answer your question, please add details about the dataset "knth". Are all images stored in "knth"? What are the shape and chunk size (or resolution)? If you are slicing images, it's not clear (to me) why you set chunks=(2, *resolution). Commented Mar 8 at 19:20
  • Thanks for your answers. knth is a single multidimensional grid with shape (2, 32, 32, 32, 32, 32). A single "image" is much larger than 1 MB, which is why chunking does not help. The entire dataset is made up of multiple files with ~200 such samples each. Commented Mar 18 at 15:42
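A rough version of the profiling mentioned above (file layout as in the sketches earlier; block size is arbitrary, and the OS page cache should be cleared, or different files used, between the two passes for a fair comparison):

import time
import h5py

# Pass 1: raw sequential read of the whole file in large blocks.
t0 = time.perf_counter()
nbytes = 0
with open("src.h5", "rb") as f:
    while True:
        block = f.read(64 * 1024 * 1024)
        if not block:
            break
        nbytes += len(block)
raw_gbps = nbytes / (time.perf_counter() - t0) / 1e9

# Pass 2: read every per-sample dataset through h5py.
t0 = time.perf_counter()
total = 0
with h5py.File("src.h5", "r") as f:
    for name in f["data"]:
        total += f["data"][name][()].nbytes
h5_gbps = total / (time.perf_counter() - t0) / 1e9

print(f"raw file read: {raw_gbps:.2f} GB/s")
print(f"h5py read:     {h5_gbps:.2f} GB/s")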
