I have large .h5 files of high-resolution images (~300 MB each, 200 images per .h5 file) and need to load samples in Python. The current setup uses a separate dataset for each sample:
data_group.create_dataset(k_name, data=knth)
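(For a self-contained picture, a minimal sketch of that layout is below; the group/sample names, shape, dtype, and sample count are placeholders, not the actual pipeline.)

import h5py
import numpy as np

# Minimal write-side sketch of the one-dataset-per-sample layout.
# Group name, sample names, shape, dtype, and sample count are placeholders.
resolution = (8, 8, 8, 8, 8)                # tiny stand-in; the real images are far larger
with h5py.File("src.h5", "w") as f:
    data_group = f.create_group("samples")
    for i in range(3):                       # a few samples for illustration
        knth = np.random.rand(2, *resolution).astype(np.float32)
        data_group.create_dataset(f"k{i:03d}", data=knth)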
And the dataset is simply accessed as
with h5py.File("src.h5", "r") as f:
    sample = load_data(f)
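(load_data is not shown here; as a stand-in, assume it simply reads one per-sample dataset fully into memory, along these lines.)

import numpy as np

# Placeholder for load_data (not defined in the question): read one whole
# per-sample dataset into RAM. Group/sample names mirror the sketch above
# and are assumptions.
def load_data(f, name="k000"):
    return np.asarray(f["samples"][name])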
Chunking seems to help only marginally (when loading two samples at a time), and the chunks significantly exceed the recommended size of 1 MB.
data_group.create_dataset("knth", data=data, chunks=(2, *resolution))
The main goal is to speed up access (as it's used in machine learning); writing time is not a concern.
I am aware that the sheer amount of data is the limiting factor here, but is there any setting within h5py that can help? If not, is there an alternative to h5py suited to this scenario?
Thanks!
"knth". Are all images stored in"knth"? What is theshapeandchunksize(orresolution)? If you are slicing images, it's not clear (to me) why you setchunks=(2, *resolution).knthis a single multidimensional grid with shape of(2, 32, 32, 32, 32, 32). A single "image" is much larger than 1mb, hence why chunking does not help. The entire dataset is made of multiple files with ~200 of such samples.