Newest 'huggingface-datasets' Questions

2 votes

1 answer

83 views

Why does my system message content contain "image": None when mapping conversation dataset?

I'm creating a conversation dataset for an image classification task where the system message should contain only text, and the user message contains both text and an image. However, after mapping my ...

GauravGiri

21

asked Oct 1 at 18:05

2 votes

1 answer

3k views

TorchCodec error when loading audio dataset with 🤗datasets [closed]

I’m trying to use the audio dataset Sunbird/urban-noise-uganda-61k with 🤗datasets. After loading the dataset, when I try to access an entry like this: dataset = load_dataset("Sunbird/urban-noise-...

Pranav Nataraj

21

asked Sep 10 at 15:32

0 votes

0 answers

67 views

Why is LeRobot’s policy ignoring additional camera streams despite custom `input_features`?

I'm using LeRobot to train an SO101 arm policy with 3 video streams (front, above, gripper) and a state vector. The dataset can be found at this link. I created a custom JSON config (the train_config....

Aaron Serpilin

31

asked Jul 29 at 13:44

0 votes

0 answers

73 views

Hugging Face: applying a transformation to nested datasets without loading them into memory

I am trying to apply the transformation below to prepare my datasets for fine-tuning with Unsloth and Hugging Face. It requires the dataset to be in the following format. def convert_to_conversation(sample): ...

SoraHeart

428

asked Jul 4 at 11:27

0 votes

1 answer

119 views

Problem When Using Datasets to Open JSONL

I am trying to open a JSONL format file using the datasets library. Here is my code: from datasets import load_dataset path = "./testdata.jsonl" ...

bluebingoSu

1

asked Jun 28 at 18:55

0 votes

0 answers

36 views

How to load "Royc30ne/emnist-byclass" from hugging-face using load_dataset

I've been trying to load Royc30ne/emnist-byclass from hugging-face using the method load_dataset provided by hugging-face/datasets library but failed. First I tried this, which is a common way to load ...

Lumiat

1

asked May 22 at 3:13

0 votes

0 answers

70 views

How to upgrade datasets beyond 2.14.4 in Google Colab?

I am looking at the PyPI site for datasets and it says the latest version of datasets is 3.6.0. However, when I am working in Google Colab and I do: !pip install -U datasets then it says 2.14.4 is my ...

Greg Werner

3

asked May 15 at 16:44

0 votes

1 answer

47 views

How can I convert a Sequence(Image) to an Array4D without going through Sequence(Sequence(Sequence(Sequence()))?

I have a huggingface dataset with a column ImageData that has the feature descriptor s.features={'images': Sequence(feature=Image(mode=None, decode=True, id=None), length=16, id=None)}. I need to ...

LudvigH

4,944

asked May 13 at 14:36

0 votes

1 answer

91 views

Import onnxruntime then load_dataset "causes ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found": why + how to fix?

Running import onnxruntime as ort from datasets import load_dataset yields the error: (env-312) dernoncourt@pc:~/test$ python SE--test_importpb.py Traceback (most recent call last): File "/...

Franck Dernoncourt

84.8k

asked Apr 30 at 2:13

2 votes

0 answers

23 views

How to Download Images Referenced in a Dataset JSON Split (e.g., image_caption/textcaps)?

I'm trying to work with a dataset from HuggingFace, specifically the image_caption/textcaps split. The JSON file I downloaded lists image filenames (e.g., train/011e7e629fb9ae7b.jpg), but the actual ...

Tixtor 710

81

asked Apr 28 at 9:14

0 votes

1 answer

52 views

Cannot run PyVista/VTK inside a Huggingface multiprocessing map()

The following code crashes with a forking error. It says objc[81151]: +[NSResponder initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore ...

LudvigH

4,944

asked Apr 15 at 5:34

0 votes

0 answers

122 views

IterableDataset not supported on GRPOTrainer

The following program crashes upon execution from datasets import IterableDataset, Dataset from trl import GRPOConfig, GRPOTrainer prompts = ["Hi", "Hello"] def data_generator(): ...

PMM

376

asked Feb 24 at 5:08

0 votes

0 answers

43 views

How to make Hugging Face's .map() method map a dataset chunk by chunk?

I am currently trying to train a Hugging Face model with a local dataset. I am also using Hugging Face's datasets library to load my local data with the Dataset class and the .from_json() method. ...

pips

21

asked Jan 12 at 15:10

0 votes

1 answer

373 views

HuggingFace Dataset: Load datasets with different set of columns

This is how I load my train and test datasets with HF: dataset = {name.replace('/', '.'): f'{name}/*.parquet' for name in ["train", "test"]} dataset = load_dataset("parquet"...

Ford O.

1,538

asked Dec 18, 2024 at 8:20

0 votes

1 answer

138 views

unexpected transformer's dataset structure after set_transform or with_transform

I am using the feature extractor from ViT as explained here, and I noticed a weird behaviour I cannot fully understand. After loading the dataset as in that colab notebook, I see: ds['train'].features ...

hamagust

940

asked Dec 1, 2024 at 14:07

0 votes

1 answer

72 views

Can't iterate over dataset (AttributeError: module 'numpy' has no attribute 'complex'.)

I'm using: Windows, Python 3.10.0, datasets==2.21.0, numpy==1.24.4. I tried to iterate over a dataset I just downloaded: from datasets import load_dataset dataset = load_dataset("jacktol/atc-...

user3668129

4,900

asked Nov 6, 2024 at 9:07

2 votes

2 answers

1k views

multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset

I’m encountering an error while trying to load and process the GAIR/MathPile dataset using the Hugging Face datasets library. The error seems to occur during type casting in pyarrow within a ...

Charlie Parker

6,236

asked Sep 23, 2024 at 0:17

0 votes

1 answer

78 views

How to resolve OOM when .map concatenates the sharded parts?

When using the Dataset.map function: dataset.map(myfunc, num_proc=16, keep_in_memory=False, cache_file_name='parts.arrow', batch_size=16, writer_batch_size=16 ) Due to the size of my ...

alvas

123k

asked Sep 4, 2024 at 1:01

0 votes

1 answer

231 views

Audio File Won't Download Properly From Huggingface Streaming Dataset

I'm running this code to stream the huggingface dataset mozilla-foundation/common_voice_17_0: language = "en" buffer_size = 100 streaming_dataset = load_dataset("mozilla-foundation/...

Bobby Miller

1

asked Aug 27, 2024 at 20:11

1 vote

0 answers

78 views

Iterating a Huggingface Dataset from disk using Generator seems broken. How to do it properly?

I have a strange behavior in HuggingFace Datasets. My minimal reproduction is as below. # main.py import datasets import numpy as np generator = np.random.default_rng(0) X = np.arange(1000) ds = ...

LudvigH

4,944

asked Aug 26, 2024 at 9:55

0 votes

0 answers

194 views

HuggingFace: Efficient Large-Scale Embedding Extraction for DNA Sequences Using Transformers

I have a very large dataframe (60+ million rows) of DNA sequences, and I would like to use a transformer model to extract embeddings for these rows. Basically, this involves tokenizing first, then I ...

youtube

504

asked Aug 6, 2024 at 6:09

0 votes

1 answer

142 views

Using hugging_face load_dataset in VSCode

I am trying to load a training dataset in my VS Code notebook but keep getting an error. This happens exclusively in VS Code, since when I run the same notebook in Colab there is no problem in loading....

Amrat nisa

1

asked Aug 1, 2024 at 7:32

1 vote

1 answer

156 views

Why do I get an exception when attempting automatic processing by the Hugging Face parquet-converter?

What file structure should I use on the Hugging Face Hub, if I have a /train.zip archive with PNG image files and a /metadata.csv file with annotations for them, so that the parquet-converter bot can ...

Artyom Ionash

468

asked Jul 19, 2024 at 9:26

0 votes

1 answer

167 views

How do I successfully set and retrieve metadata information for a HuggingfaceDataset on the Huggingface Hub?

I have a number of datasets, which I create from a dictionary like so: info = DatasetInfo( description="my happy lil dataset", version="0.0.1", homepage="...

MadDanWithABox

103

asked Jul 17, 2024 at 13:23

2 votes

1 answer

3k views

ImportError: cannot import name 'CommitInfo' from 'huggingface_hub'

I am encountering an ImportError when running a Python script that imports CommitInfo from the huggingface_hub package. The error message is as follows: ImportError: cannot import name 'CommitInfo' ...

Ohm

2,512

asked Jul 11, 2024 at 7:53

1 vote

0 answers

392 views

datasets package from pip causing a segfault on MacOS?

I'm using pip version 24.1.2 and Python 3.12.4. The installation seemingly goes fine. However, when importing the package, like in the line from datasets import load_dataset I'll see zsh: ...

ryanjackson

83

asked Jul 8, 2024 at 5:35

0 votes

1 answer

218 views

List all available dataset names contained in a Hugging Face datasets dataset

I want to know which datasets are included in e.g. this collection of huggingface datasets: https://huggingface.co/datasets/autogluon/chronos_datasets "m4_daily" and "weatherbench_daily"...

ivegotaquestion

733

asked Jul 5, 2024 at 8:32

1 vote

1 answer

5k views

How to choose dataset_text_field in SFTTrainer hugging face for my LLM model

Note: I'm a newbie to LLMs. Background of my problem: I am trying to train an LLM using Llama 3 on a Stack Overflow C language dataset. LLM - meta-llama/Meta-Llama-3-8B Dataset - Mxode/StackOverflow-QA-C-Language-...

Bhargav

4,911

asked Jun 30, 2024 at 9:25

0 votes

1 answer

107 views

FileNotFoundError when loading SQuAD dataset with datasets library

I am trying to load the SQuAD dataset using the datasets library in Python, but I am encountering a FileNotFoundError. Here is the code I am using: from datasets import load_dataset dataset = ...

Rtttt

1

asked Jun 23, 2024 at 9:39

0 votes

1 answer

81 views

How to recreate the "view" features of common voice v11 in HuggingFace?

The Common Voice v11 on HuggingFace has some amazing View features! They include a dropdown button to select the language, and columns with the dataset features, such as client_id, audio, sentence, ...

Michel Mesquita

803

asked Jun 18, 2024 at 10:54

0 votes

1 answer

339 views

dataset map() hang and TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'

I encountered a "hang" issue when using HF datasets' map(). I saw that using it with num_proc=os.cpu_count() should help, however when trying this I experienced the following error with the ...

SarahK

1

asked Jun 3, 2024 at 9:49

0 votes

1 answer

86 views

How can I merge multiple columns into an array with datasets.Dataset.from_csv()?

I have a CSV file with N columns. If I call datasets.Dataset.from_csv(path), I get N features, each of them an int value. However, I want to say: column-0 to column-4 is feature1, the rest is ...

Wang

8,436

asked Jun 2, 2024 at 0:21

0 votes

1 answer

409 views

Type error while creating a custom dataset using Hugging Face datasets

To generate a custom dataset: from datasets import Dataset,ClassLabel,Value features = ({ "sentence1": Value("string"), # String type for sentence1 "sentence2": Value("...

user269867

4,092

asked May 29, 2024 at 14:46

0 votes

1 answer

220 views

Loading huggingface dataset from in-memory text

I have in-memory text in JSON format, and I am trying to load a dataset (HuggingFace) directly from that in-memory text. If I save it to a file, I can load the dataset using huggingface load_dataset: ...

Noam Gershi

97

asked May 22, 2024 at 22:36

1 vote

0 answers

278 views

How to split a Hugging Face dataset in streaming mode without loading it into memory?

I'm working with Hugging Face datasets and I need to split a dataset into training and validation sets. My main requirement is that the dataset should be processed in streaming mode, as I don't want ...

Charlie Parker

6,236

asked May 17, 2024 at 17:18

1 vote

0 answers

726 views

How to apply .map() function and keep it as an iterator for a Hugging Face Dataset, in Streaming Mode without loading it to memory?

I'm currently working with the Hugging Face datasets library and need to apply transformations to multiple datasets (such as ds_khan and ds_mathematica) using the .map() function, but in a way that ...

Charlie Parker

6,236

asked May 17, 2024 at 5:14

1 vote

1 answer

846 views

Hugging Face Datasets .map not working as expected

I'm running a function over a dataset, but when I compute this, I seem to replace my existing dataset rather than adding to it. What is going wrong? dataset_c = Dataset.from_pandas(df_all[0:100]) ...

disruptive

6,026

asked May 13, 2024 at 14:36

1 vote

2 answers

1k views

Getting a pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21 when trying to create Hugging Face database

I am getting the below error when trying to modify, chunk and resave a Huggingface Dataset. I was wondering if anyone might be able to help? Traceback (most recent call last): File "C:\Users\...

Connor Davidson

98

asked May 7, 2024 at 13:14

0 votes

1 answer

1k views

How to drop rows with empty values in Huggingface dataset?

After I have loaded a huggingface dataset: download_config = DownloadConfig() dataset = load_dataset (hf_dataset_name, download_config=download_config) dataset_split = dataset ['train'] Let's say if ...

Hoo

93

asked Apr 25, 2024 at 16:12

0 votes

0 answers

61 views

python - ImportError: cannot import name '_is_imported_module' from 'dill._dill'

I installed the datasets package into a Python virtual environment. When I try to import it by running from datasets import load_dataset, I get this error: "ImportError: cannot import name '...

Ju Chen

1

asked Apr 19, 2024 at 4:11

-1 votes

1 answer

218 views

How to ensure that escapes for the double quotes `'\"'` inside the 'user content' of training datasets are not removed?

1. Objective: The objective is to ensure the training data keeps the format needed for model training with the SFTTrainer. The SFTTrainer has a parameter train_dataset=dataset, that ...

Thomas Suedbroecker

471

asked Apr 11, 2024 at 16:57

1 vote

0 answers

354 views

How to train Hugging Face Model On Multiple Datasets?

I am trying to fine-tune a model on two datasets. Following the example on the Hugging Face website, I have my model training on the Yelp Review dataset, but I also want to train my model on the ...

Bigbob556677

2,198

asked Apr 9, 2024 at 22:28

0 votes

1 answer

642 views

Error when calling Hugging Face load_dataset("glue", "mrpc")

I'm following the huggingface tutorial here and it's giving me a strange error. When I run the following code: from datasets import load_dataset from transformers import AutoTokenizer, ...

Ameen Izhac

133

asked Apr 8, 2024 at 19:21

2 votes

3 answers

4k views

How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?

I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows: pip install huggingface_hub[hf_transfer] huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --...

Franck Dernoncourt

84.8k

asked Apr 5, 2024 at 0:18

1 vote

1 answer

1k views

How to augment dataset by adding rows via huggingface datasets?

I have a dataset with 113287 train rows. Each 'caption' field is however an array with multiple strings. I would like to flatmap this array and add new rows. The documentation for datasets states that ...

Jotschi

3,682

asked Mar 13, 2024 at 23:11

0 votes

0 answers

477 views

When trying to import the hugging face package "datasets" I get an attribute error from PyArrow

I have tried to start environments with several different Python versions and installed pyarrow in different versions. Nothing worked. Where can it be coming from? AttributeError ...

adigianv

1

asked Mar 3, 2024 at 10:07

0 votes

1 answer

818 views

Huggingface load_dataset messes up the structure of the dataset

Following https://huggingface.co/docs/datasets/en/loading#json I am trying to load this dataset https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json ...

user25004

2,108

asked Feb 16, 2024 at 16:53

2 votes

1 answer

3k views

How to randomly sample very large pyArrow dataset

I have a very large arrow dataset (181GB, 30m rows) from the huggingface framework I've been using. I want to randomly sample with replacement 100 rows (20 times), but after looking around, I cannot ...

youtube

504

asked Feb 16, 2024 at 6:09

1 vote

1 answer

2k views

Is there any way to download only a partition of the whole dataset from Hugging Face?

I am trying to fine-tune a facebook/wav2vec2 model on Automatic Speech Recognition (ASR) with the Common Voice dataset, but I stumbled upon an issue that my disk space is not enough to hold this large ...

Philip

13

asked Feb 13, 2024 at 9:47

0 votes

1 answer

183 views

NameError: name 'Path' is not defined when using HF.Dataset.from_generator

train_json_files = glob(paths.TRAIN_JSON_FOLDER + "*.json") from pathlib import Path def get_gt_string_and_xy(filepath: Union[str, os.PathLike]) -> Dict[str, str]: """ ...

Shadowpulse

17

asked Jan 22, 2024 at 18:14
