Convert dataset to Parquet
Hi @yfyeung @mycui @chenxie95 @sedrick66 @Jihuai @cbsjtu01 @AndreasXi :)
This PR converts the dataset to Parquet to
- enable the Dataset Viewer
- make the dataset easy to load and stream with pandas, dask, pyspark, daft, etc., and soon ESPNet 3
- make the dataset compatible with the recent datasets V4, which dropped support for loading scripts
Streaming is especially useful given the size of the dataset. It also enables distributed streaming (e.g. for multi-node model training).
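To illustrate the distributed streaming idea, here is a minimal sketch of how Parquet shards could be divided among training nodes so that each node streams only its own files. The helper `shards_for_rank` and the shard file names are hypothetical and not part of this PR. In practice the datasets library offers `split_dataset_by_node` in `datasets.distributed` for this purpose.

```python
from typing import List


def shards_for_rank(shards: List[str], rank: int, world_size: int) -> List[str]:
    """Return the shards that worker `rank` should stream (hypothetical helper)."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Round-robin assignment: worker r gets shards r, r + world_size, ...
    return shards[rank::world_size]


# Hypothetical shard names, for illustration only.
shards = [f"train-{i:05d}.parquet" for i in range(8)]
for rank in range(3):
    print(rank, shards_for_rank(shards, rank, world_size=3))
```

Each shard is assigned to exactly one node, so no node downloads data that another node is already reading.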
Note that the original data will still be available at https://huggingface.co/datasets/speechcolab/gigaspeech/tree/13eadc735ff81c0e0537276f729f2f391e594bb8/data (since datasets are git repos).
Please merge the PR if it looks good to you!
Hi Quentin, thank you for the great work.
As one of the authors of GigaSpeech, I'm wondering if we can make this upgrade more of an 'incremental transition' instead of a hard swap.
We could keep the original tarballs and CSVs available for a while, since hundreds of labs probably have their pipelines hardcoded to those resources. At the same time, we'd offer the .parquet files as a modern and convenient alternative.
With this dual approach, we can battle-test the new format and fix any bugs based on feedback without risking a total "dataset down-time". Since everyone in the speech community depends on HuggingFace, it's safer to have a reliable fallback sitting there just in case, at least for a while.
best
Jiayu
Actually, I can keep the current structure and put the Parquet files in a separate folder. How does that sound? :)