Datasets: speechcolab/gigaspeech

This PR converts the dataset to Parquet in order to:
- enable the Dataset Viewer
- make the dataset easy to load and stream with pandas, Dask, PySpark, Daft, etc., and soon ESPNet 3
- make the dataset compatible with the recent `datasets` v4, which dropped support for loading scripts
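
For the pandas route mentioned in the list above, here is a minimal sketch of what loading could look like once the Parquet files are on the main branch. The `<config>/<split>-*.parquet` layout and the `train` split name are assumptions (this page only lists commits such as "Add 's' config data files"), and `parquet_pattern` and `read_first_shard` are hypothetical helper names. Access to the repo may also require being logged in to the Hub.

```python
REPO = "speechcolab/gigaspeech"


def parquet_pattern(config, split="train"):
    """Build a glob for the Parquet shards of one config.

    ASSUMPTION: shards live at <config>/<split>-*.parquet; check the
    repo's file listing for the actual layout.
    """
    return f"datasets/{REPO}/{config}/{split}-*.parquet"


def read_first_shard(config="s", split="train"):
    """Read only the first matching shard into a pandas DataFrame."""
    # Imported lazily so the path helper above works without these packages.
    import pandas as pd
    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()
    shards = sorted(fs.glob(parquet_pattern(config, split)))
    if not shards:
        raise FileNotFoundError(f"no shards match {parquet_pattern(config, split)}")
    # pandas resolves hf:// paths through fsspec when huggingface_hub is installed.
    return pd.read_parquet(f"hf://{shards[0]}")
```

Reading a single shard keeps memory use bounded; the larger configs are split across many files, so loading all of them into one DataFrame is unlikely to be practical.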

Streaming is especially useful given the size of the dataset. It also enables distributed streaming (e.g. for multi-node model training).
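
As a rough illustration of distributed streaming, the sketch below assigns each training process its own portion of the stream with `split_dataset_by_node` from the `datasets` library. The `RANK` and `WORLD_SIZE` environment variables follow the torchrun convention and are an assumption about the launcher; the `train` split name and the helper names `node_info_from_env` and `stream_shard` are also assumptions made for illustration.

```python
import os


def node_info_from_env(env=os.environ):
    """Return (rank, world_size), falling back to a single process.

    ASSUMPTION: the launcher exports RANK and WORLD_SIZE, as torchrun does.
    """
    return int(env.get("RANK", 0)), int(env.get("WORLD_SIZE", 1))


def stream_shard(config="s", split="train"):
    """Open the dataset in streaming mode and keep only this node's share."""
    # Imported lazily so the env helper above works without `datasets`.
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    rank, world_size = node_info_from_env()
    # streaming=True returns an IterableDataset; nothing is downloaded up front.
    ds = load_dataset("speechcolab/gigaspeech", config, split=split, streaming=True)
    return split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

Each process would then iterate over `stream_shard()` directly, or call `.take(n)` on the result to peek at a few examples.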

Note that the original data will still be available at https://huggingface.co/datasets/speechcolab/gigaspeech/tree/13eadc735ff81c0e0537276f729f2f391e594bb8/data (since datasets are git repos 😉 )
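
Pipelines that depend on the original files could pin that revision explicitly. The following is a sketch only: it parses the URL above and passes the pieces to `huggingface_hub.snapshot_download`. `parse_tree_url` and `download_original_data` are hypothetical helpers, and the parser assumes the revision is a commit hash (a branch name containing slashes would not parse correctly).

```python
from urllib.parse import urlparse

ORIGINAL_DATA_URL = (
    "https://huggingface.co/datasets/speechcolab/gigaspeech/tree/"
    "13eadc735ff81c0e0537276f729f2f391e594bb8/data"
)


def parse_tree_url(url):
    """Split a Hub dataset 'tree' URL into (repo_id, revision, subfolder)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected shape: datasets/<namespace>/<name>/tree/<revision>/<subfolder...>
    if len(parts) < 5 or parts[0] != "datasets" or parts[3] != "tree":
        raise ValueError(f"not a dataset tree URL: {url}")
    return "/".join(parts[1:3]), parts[4], "/".join(parts[5:])


def download_original_data(url=ORIGINAL_DATA_URL):
    """Download the files under the URL's subfolder at the pinned revision."""
    # Imported lazily so the URL parser works without huggingface_hub.
    from huggingface_hub import snapshot_download

    repo_id, revision, subfolder = parse_tree_url(url)
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        # Restrict the download to the subfolder named in the URL, if any.
        allow_patterns=f"{subfolder}/*" if subfolder else None,
    )
```

Calling `download_original_data()` would fetch everything under `data/` at that commit, which is likely a very large download, so narrowing `allow_patterns` to the needed files is advisable.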

Please merge the PR if it looks good to you!

Convert dataset to Parquet (2bd58e5c)
Add 's' config data files (123988df)
Add 'm' config data files (03765fb2)
Add 'l' config data files (part 00000-of-00002) (ce150bf0)
Add 'l' config data files (part 00001-of-00002) (ce43e1b5)
Add 'xl' config data files (part 00000-of-00006) (05152cc7)
Add 'xl' config data files (part 00001-of-00006) (6a912c19)
Add 'xl' config data files (part 00002-of-00006) (1f7db0e9)
Add 'xl' config data files (part 00003-of-00006) (f8bc57eb)
Add 'xl' config data files (part 00004-of-00006) (37371e88)
Add 'xl' config data files (part 00005-of-00006) (dfd71048)
Add 'dev' config data files (9be79356)
Add 'test' config data files (c8f71994)
Delete loading script (5f5decbd)
Delete old data files (d701e6f9)

speechio (SpeechColab org), about 15 hours ago

Hi Quentin, thank you for the great work.

As one of the authors of GigaSpeech, I'm wondering if we can make this upgrade more of an 'incremental transition' instead of a hard swap.
We could keep the original tarballs and CSVs available for a while, since hundreds of labs probably have their pipelines hardcoded to those resources. At the same time, we'd offer the .parquet files as a modern and convenient alternative.
With this dual approach, we can battle-test the new format and fix any bugs based on feedback without risking total dataset downtime. Since so many people in the speech community depend on HuggingFace, it's safer to have a reliable fallback in place, at least for a while.

Best,
Jiayu

lhoestq, about 4 hours ago

Actually, I can keep the current structure and put the Parquet files in a separate folder. How does that sound? :)

Ready to merge

This branch is ready to get merged automatically.
