Languages:
English

Convert dataset to Parquet

#20
by lhoestq HF Staff - opened

Hi @yfyeung @mycui @chenxie95 @sedrick66 @Jihuai @cbsjtu01 @AndreasXi :)

This PR converts the dataset to Parquet to

  1. enable the Dataset Viewer
  2. make the dataset easy to load and stream with pandas, dask, pyspark, daft etc. and soon ESPNet 3
  3. make the dataset compatible with the recent `datasets` v4, which dropped support for loading scripts
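
As a rough illustration of point 2, here is a minimal sketch of reading a single Parquet file from the Hub with pandas. It assumes `pandas`, a Parquet engine such as `pyarrow`, and `huggingface_hub` (which provides the `hf://` protocol) are installed. The helper names `hf_parquet_url` and `read_columns` are made up for this example, and the actual file paths depend on the final layout of this PR.

```python
def hf_parquet_url(repo_id, path_in_repo):
    """Build an hf:// URL for a file in a Hub dataset repository."""
    return f"hf://datasets/{repo_id}/{path_in_repo.lstrip('/')}"


def read_columns(repo_id, path_in_repo, columns=None):
    """Read one Parquet file from the Hub, optionally restricted to some columns."""
    # Imported lazily so the URL helper above works without pandas installed.
    # Resolving hf:// URLs additionally requires the huggingface_hub package.
    import pandas as pd

    return pd.read_parquet(hf_parquet_url(repo_id, path_in_repo), columns=columns)
```

Calling `read_columns("speechcolab/gigaspeech", path)` with the path of one Parquet shard in the repository would return a DataFrame. If the repository is gated, you need to be logged in first (e.g. with `huggingface-cli login`).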

Streaming is especially useful given the size of the dataset. It also enables distributed streaming (e.g. for multi-node model training).
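
For reference, a minimal sketch of what streaming could look like with the `datasets` library once the Parquet files are in place. The helper names `open_stream` and `take` are hypothetical, and the config name passed by the caller (for example `"xs"`) is an assumption about how the converted dataset will be organized. `split_dataset_by_node` is the `datasets` utility for giving each node its own share of a dataset.

```python
from itertools import islice


def open_stream(repo_id, config_name, split="train", rank=None, world_size=None):
    """Open a split in streaming mode; optionally keep only this node's share."""
    # Imported lazily so `take` below can be used without `datasets` installed.
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    # streaming=True returns an iterable dataset that fetches data on the fly
    # instead of downloading everything up front.
    ds = load_dataset(repo_id, config_name, split=split, streaming=True)
    if rank is not None and world_size is not None:
        # In multi-node training, each process passes its own rank.
        ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    return ds


def take(examples, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))
```

For example, `take(open_stream("speechcolab/gigaspeech", "xs"), 3)` would fetch only the first three examples, and passing `rank` and `world_size` restricts the stream to one node's portion. Access to a gated repository requires being logged in.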

Note that the original data will still be available at https://huggingface.co/datasets/speechcolab/gigaspeech/tree/13eadc735ff81c0e0537276f729f2f391e594bb8/data (since datasets are git repos 😉)

Please merge the PR if it looks good to you !

SpeechColab org

Hi Quentin, thank you for the great work.

As one of the authors of GigaSpeech, I'm wondering if we can make this upgrade more of an 'incremental transition' instead of a hard swap.
We could keep the original tarballs and CSVs available for a while, since hundreds of labs probably have their pipelines hardcoded to those resources. At the same time, we'd offer the .parquet files as a modern and convenient alternative.
With this dual approach, we can battle-test the new format and fix any bugs based on feedback without risking a total "dataset down-time". Since everyone in the speech community depends on HuggingFace, it's safer to have a reliable fallback sitting there just in case, at least for a while.

best
Jiayu

Actually, I can keep the current structure and put the Parquet files in a separate folder. How does that sound? :)

Ready to merge
This branch is ready to get merged automatically.
