
I have large .h5 files of high-resolution images (~300 MB each, 200 images per .h5 file) and need to load samples in Python. The current setup uses a separate dataset for each sample:

data_group.create_dataset(k_name, data=knth)
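Roughly, the full write pattern looks like the toy sketch below (the group name, key pattern, and reduced resolution are placeholders for illustration; the real grids are much larger):

import h5py
import numpy as np

# Toy version of the current layout: one dataset per sample.
resolution = (8, 8, 8, 8, 8)  # reduced; the real samples are shaped (2, 32, 32, 32, 32, 32)

with h5py.File("src.h5", "w") as f:
    data_group = f.create_group("data")
    for i in range(200):
        k_name = f"k{i}"
        knth = np.random.rand(2, *resolution).astype("float32")  # placeholder data
        data_group.create_dataset(k_name, data=knth)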

And the dataset is simply accessed as

with h5py.File("src.h5", "r") as f:
    sample = load_data(f)
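where load_data is, in essence, a helper along these lines (simplified stand-in; the group and key names match the sketch above):

import h5py

def load_data(f, keys=None):
    # Read each requested per-sample dataset fully into memory as a NumPy array.
    group = f["data"]
    if keys is None:
        keys = list(group.keys())
    return [group[k][()] for k in keys]

with h5py.File("src.h5", "r") as f:
    batch = load_data(f, keys=["k0", "k1"])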

Chunking seems to help only marginally (when loading two samples at a time), and the chunks significantly exceed the recommended size of 1 MB.

data_group.create_dataset("knth", data=data, chunks=(2, *resolution))

The main goal is to speed up access (the data is used for machine-learning training); writing time is not a concern.

I am aware that the sheer amount of data is the limiting factor here, but is there any setting within h5py that can help? If not, is there an alternative to h5py better suited to this scenario?

Thanks!

  • Can you tell us more about the "need to load samples"? I expect the h5py reads to be bound by the storage device bandwidth here. Storage devices are generally slow. You can profile it to be sure. Commented Mar 7 at 16:31
  • @JérômeRichard Sure. Samples are randomly accessed and loaded in batches used for training, then discarded. The entire dataset cannot fit in memory, but a possibility there would be smarter sampling than purely random. The "raw" disk read speed is around 5.0 GB/s, which is faster than reading from HDF5 at ~1.5 GB/s. I can do some proper profiling (a rough timing sketch is shown after these comments). Commented Mar 7 at 17:07
  • But what are those samples? Are they a sub-slice? How big are they (shape)? How many are there per image on average? How many samples are there in total? This matters for performance. Commented Mar 8 at 11:58
  • It's generally not a good idea to set chunks significantly larger than 1 MB. To answer your question, please add details about the dataset "knth". Are all images stored in "knth"? What are the shape and chunk size (or resolution)? If you are slicing images, it's not clear (to me) why you set chunks=(2, *resolution). Commented Mar 8 at 19:20
  • Thanks for your answers. knth is a single multidimensional grid with shape (2, 32, 32, 32, 32, 32). A single "image" is much larger than 1 MB, which is why chunking does not help. The entire dataset is made up of multiple files with ~200 such samples each. Commented Mar 18 at 15:42
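A rough version of the profiling mentioned above (file layout as in the sketches earlier; block size is arbitrary, and the OS page cache should be cleared, or different files used, between the two passes for a fair comparison):

import time
import h5py

# Pass 1: raw sequential read of the whole file in large blocks.
t0 = time.perf_counter()
nbytes = 0
with open("src.h5", "rb") as f:
    while True:
        block = f.read(64 * 1024 * 1024)
        if not block:
            break
        nbytes += len(block)
raw_gbps = nbytes / (time.perf_counter() - t0) / 1e9

# Pass 2: read every per-sample dataset through h5py.
t0 = time.perf_counter()
total = 0
with h5py.File("src.h5", "r") as f:
    for name in f["data"]:
        total += f["data"][name][()].nbytes
h5_gbps = total / (time.perf_counter() - t0) / 1e9

print(f"raw file read: {raw_gbps:.2f} GB/s")
print(f"h5py read:     {h5_gbps:.2f} GB/s")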
