275 questions
2
votes
1
answer
83
views
Why does my system message content contain "image": None when mapping conversation dataset?
I'm creating a conversation dataset for an image classification task where the system message should contain only text, and the user message contains both text and an image. However, after mapping my ...
2
votes
1
answer
3k
views
TorchCodec error when loading audio dataset with 🤗datasets [closed]
I’m trying to use the audio dataset Sunbird/urban-noise-uganda-61k with 🤗datasets.
After loading the dataset, when I try to access an entry like this:
dataset = load_dataset("Sunbird/urban-noise-...
0
votes
0
answers
67
views
Why is LeRobot’s policy ignoring additional camera streams despite custom `input_features`?
I'm using LeRobot to train an SO101 arm policy with 3 video streams (front, above, gripper) and a state vector. The dataset can be found at this link.
I created a custom JSON config (the train_config....
0
votes
0
answers
73
views
Hugging Face: applying a transformation to nested datasets without loading into memory
I am trying to apply the transformation below to prepare my datasets for fine-tuning with Unsloth and Hugging Face. It requires the dataset to be in the following format.
def convert_to_conversation(sample):
...
0
votes
1
answer
119
views
Problem When Using Datasets to Open JSONL
I am trying to open a JSONL format file using the datasets library. Here is my code:
from datasets import load_dataset
path = "./testdata.jsonl"
...
0
votes
0
answers
36
views
How to load "Royc30ne/emnist-byclass" from Hugging Face using load_dataset
I've been trying to load Royc30ne/emnist-byclass from Hugging Face using the load_dataset method provided by the huggingface/datasets library, but it fails.
First I tried this, which is a common way to load ...
0
votes
0
answers
70
views
How to upgrade datasets beyond 2.14.4 in Google Colab?
I am looking at the PyPI page for datasets, and it says the latest version of datasets is 3.6.0. However, when I am working in Google Colab and I do:
!pip install -U datasets
then it says 2.14.4 is my ...
0
votes
1
answer
47
views
How can I convert a Sequence(Image) to an Array4D without going through Sequence(Sequence(Sequence(Sequence())))?
I have a Hugging Face dataset with a column ImageData that has the feature descriptor s.features={'images': Sequence(feature=Image(mode=None, decode=True, id=None), length=16, id=None)}.
I need to ...
0
votes
1
answer
91
views
Import onnxruntime then load_dataset "causes ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found": why + how to fix?
Running
import onnxruntime as ort
from datasets import load_dataset
yields the error:
(env-312) dernoncourt@pc:~/test$ python SE--test_importpb.py
Traceback (most recent call last):
File "/...
2
votes
0
answers
23
views
How to Download Images Referenced in a Dataset JSON Split (e.g., image_caption/textcaps)?
I'm trying to work with a dataset from HuggingFace, specifically the image_caption/textcaps split.
The JSON file I downloaded lists image filenames (e.g., train/011e7e629fb9ae7b.jpg), but the actual ...
0
votes
1
answer
52
views
Cannot run PyVista/VTK inside a Huggingface multiprocessing map()
The following code crashes with a forking error. It says objc[81151]: +[NSResponder initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore ...
0
votes
0
answers
122
views
IterableDataset not supported on GRPOTrainer
The following program crashes upon execution
from datasets import IterableDataset, Dataset
from trl import GRPOConfig, GRPOTrainer
prompts = ["Hi", "Hello"]
def data_generator():
...
0
votes
0
answers
43
views
How to make Hugging Face's .map() method map a dataset chunk by chunk?
I am currently trying to train a Hugging Face model with a local dataset. I am also using Hugging Face's datasets library to load my local data with the Dataset class and the .from_json() method. ...
0
votes
1
answer
373
views
HuggingFace Dataset: Load datasets with different set of columns
This is how I load my train and test datasets with HF:
dataset = {name.replace('/', '.'): f'{name}/*.parquet' for name in ["train", "test"]}
dataset = load_dataset("parquet...
0
votes
1
answer
138
views
unexpected transformer's dataset structure after set_transform or with_transform
I am using the feature extractor from ViT as explained here.
I noticed some weird behaviour that I cannot fully understand.
After loading the dataset as in that colab notebook, I see:
ds['train'].features
...
0
votes
1
answer
72
views
Can't iterate over dataset (AttributeError: module 'numpy' has no attribute 'complex'.)
I'm using:
Windows
Python version 3.10.0
datasets==2.21.0
numpy==1.24.4
I tried to iterate over the dataset I just downloaded:
from datasets import load_dataset
dataset = load_dataset("jacktol/atc-...
2
votes
2
answers
1k
views
multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset
I’m encountering an error while trying to load and process the GAIR/MathPile dataset using the Hugging Face datasets library. The error seems to occur during type casting in pyarrow within a ...
0
votes
1
answer
78
views
How to resolve OOM when .map concatenates the sharded parts?
When using the Dataset.map function:
dataset.map(myfunc, num_proc=16,
keep_in_memory=False,
cache_file_name='parts.arrow',
batch_size=16, writer_batch_size=16
)
Due to the size of my ...
0
votes
1
answer
231
views
Audio File Won't Download Properly From Huggingface Streaming Dataset
I'm running this code to stream the huggingface dataset mozilla-foundation/common_voice_17_0:
language = "en"
buffer_size = 100
streaming_dataset = load_dataset("mozilla-foundation/...
1
vote
0
answers
78
views
Iterating a Huggingface Dataset from disk using Generator seems broken. How to do it properly?
I am seeing strange behavior in Hugging Face Datasets. My minimal reproduction is below.
# main.py
import datasets
import numpy as np
generator = np.random.default_rng(0)
X = np.arange(1000)
ds = ...
0
votes
0
answers
194
views
HuggingFace: Efficient Large-Scale Embedding Extraction for DNA Sequences Using Transformers
I have a very large dataframe (60+ million rows), and I would like to use a transformer model to extract embeddings for these rows (DNA sequences). Basically, this involves tokenizing first, then I ...
0
votes
1
answer
142
views
Using hugging_face load_dataset in VSCode
I am trying to load a training dataset in my VS Code notebook but keep getting an error. This happens exclusively in VS Code, since when I run the same notebook in Colab there is no problem in loading....
1
vote
1
answer
156
views
Why do I get an exception when attempting automatic processing by the Hugging Face parquet-converter?
What file structure should I use on the Hugging Face Hub, if I have a /train.zip archive with PNG image files and a /metadata.csv file with annotations for them, so that the parquet-converter bot can ...
0
votes
1
answer
167
views
How do I successfully set and retrieve metadata information for a HuggingfaceDataset on the Huggingface Hub?
I have a number of datasets, which I create from a dictionary like so:
info = DatasetInfo(
description="my happy lil dataset",
version="0.0.1",
homepage=...
2
votes
1
answer
3k
views
ImportError: cannot import name 'CommitInfo' from 'huggingface_hub'
I am encountering an ImportError when running a Python script that imports CommitInfo from the huggingface_hub package. The error message is as follows:
ImportError: cannot import name 'CommitInfo' ...
1
vote
0
answers
392
views
datasets package from pip causing a segfault on MacOS?
I'm using pip version 24.1.2 and Python 3.12.4. The installation seemingly goes fine. However, when importing the package, like in the line
from datasets import load_dataset
I'll see
zsh: ...
0
votes
1
answer
218
views
List all available dataset names contained in a Hugging Face datasets dataset
I want to know which datasets are included in e.g. this collection of Hugging Face datasets:
https://huggingface.co/datasets/autogluon/chronos_datasets
"m4_daily" and "weatherbench_daily...
1
vote
1
answer
5k
views
How to choose dataset_text_field in Hugging Face SFTTrainer for my LLM
Note: Newbie to LLMs
Background of my problem
I am trying to train an LLM using Llama 3 on a Stack Overflow C language dataset.
LLM - meta-llama/Meta-Llama-3-8B
Dataset - Mxode/StackOverflow-QA-C-Language-...
0
votes
1
answer
107
views
FileNotFoundError when loading SQuAD dataset with datasets library
I am trying to load the SQuAD dataset using the datasets library in Python, but I am encountering a FileNotFoundError. Here is the code I am using:
from datasets import load_dataset
dataset = ...
0
votes
1
answer
81
views
How to recreate the "view" features of common voice v11 in HuggingFace?
The Common Voice v11 on HuggingFace has some amazing View features! They include a dropdown button to select the language, and columns with the dataset features, such as client_id, audio, sentence, ...
0
votes
1
answer
339
views
dataset map() hang and TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
I encountered a "hang" issue when using an HF dataset's map(). I saw that using it with num_proc=os.cpu_count() should help; however, when trying this I experienced the following error with the ...
0
votes
1
answer
86
views
How can I merge multiple columns into an array with datasets.Dataset.from_csv()?
I have a CSV file with N columns. If I call datasets.Dataset.from_csv(path), I get N features, each of which is an int value.
However, I want to say: column-0 to column-4 is feature1, the rest is ...
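The column grouping described in this question can be expressed as a per-row function. What follows is a minimal sketch, not the accepted answer: the column names `column-0` ... `column-6`, the output name `feature2`, and the helper `merge_columns` are all hypothetical, since the real CSV header is not shown. It is demonstrated on a plain dict so it runs without the datasets library.

```python
# Hypothetical column names; the question does not show the real CSV header.
FEATURE1_COLUMNS = [f"column-{i}" for i in range(5)]


def merge_columns(row):
    """Pack column-0..column-4 into one list and all other columns into another."""
    feature1 = [row[name] for name in FEATURE1_COLUMNS]
    rest = [value for name, value in row.items() if name not in FEATURE1_COLUMNS]
    # "feature2" is an assumed name for the remaining columns.
    return {"feature1": feature1, "feature2": rest}


# Plain-dict stand-in for one CSV row with seven int columns.
example = {f"column-{i}": i * 10 for i in range(7)}
merged = merge_columns(example)
print(merged)
```

With the datasets library, a function of this shape could be applied after loading, for example `ds.map(merge_columns, remove_columns=ds.column_names)`, so that only the merged features remain.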
0
votes
1
answer
409
views
Type error while creating a custom dataset using Hugging Face datasets
To generate a custom dataset:
from datasets import Dataset,ClassLabel,Value
features = ({
"sentence1": Value("string"), # String type for sentence1
"sentence2": Value(...
0
votes
1
answer
220
views
Loading huggingface dataset from in-memory text
I have in-memory text in JSON format, and I am trying to load a dataset (Hugging Face) directly from that in-memory text.
If I save it to a file, I can load the dataset using Hugging Face load_dataset:
...
1
vote
0
answers
278
views
How to split a Hugging Face dataset in streaming mode without loading it into memory?
I'm working with Hugging Face datasets and I need to split a dataset into training and validation sets. My main requirement is that the dataset should be processed in streaming mode, as I don't want ...
1
vote
0
answers
726
views
How to apply .map() function and keep it as an iterator for a Hugging Face Dataset, in Streaming Mode without loading it to memory?
I'm currently working with the Hugging Face datasets library and need to apply transformations to multiple datasets (such as ds_khan and ds_mathematica) using the .map() function, but in a way that ...
1
vote
1
answer
846
views
Hugging Face Datasets .map not working as expected
I'm running a function over a dataset, but when I compute this, I seem to replace my existing dataset rather than adding to it. What is going wrong?
dataset_c = Dataset.from_pandas(df_all[0:100])
...
1
vote
2
answers
1k
views
Getting a pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21 when trying to create Hugging Face database
I am getting the below error when trying to modify, chunk and resave a Huggingface Dataset.
I was wondering if anyone might be able to help?
Traceback (most recent call last):
File "C:\Users\...
0
votes
1
answer
1k
views
How to drop rows with empty values in Huggingface dataset?
After I have loaded a huggingface dataset
download_config = DownloadConfig()
dataset = load_dataset(hf_dataset_name, download_config=download_config)
dataset_split = dataset['train']
Let's say if ...
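One way to approach the row-dropping question above is a predicate that decides whether a row is kept. This is a minimal sketch, assuming "empty" means `None` or an empty string; the helper name `has_no_empty_values` and the sample rows are hypothetical, and the check is shown on plain dicts so it runs without the library.

```python
def has_no_empty_values(row):
    """Return True when no field in the row is None or an empty string."""
    return all(value is not None and value != "" for value in row.values())


# Hypothetical rows standing in for examples from dataset_split.
rows = [
    {"text": "hello", "label": 1},
    {"text": "", "label": 0},
    {"text": "world", "label": None},
]
kept = [row for row in rows if has_no_empty_values(row)]
print(kept)
```

With the datasets library, the same predicate could be passed to `Dataset.filter`, for example `dataset_split.filter(has_no_empty_values)`.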
0
votes
0
answers
61
views
python - ImportError: cannot import name '_is_imported_module' from 'dill._dill'
I installed the datasets package into a Python virtual environment. When I try to import it by running
from datasets import load_dataset, I get this error: "ImportError: cannot import name '...
-1
votes
1
answer
218
views
How can I keep the escapes for double quotes `'\"'` inside the 'user content' of training datasets from being removed?
1. Objective
The objective is to ensure the training data keeps the format needed for model training.
I am using SFTTrainer for model training. SFTTrainer has a parameter train_dataset=dataset, that ...
1
vote
0
answers
354
views
How to train a Hugging Face model on multiple datasets?
I am trying to fine-tune a model on two datasets. Following the example on the Hugging Face website, I have my model training on the Yelp Review dataset, but I also want to train my model on the ...
0
votes
1
answer
642
views
Error when calling Hugging Face load_dataset("glue", "mrpc")
I'm following the huggingface tutorial here and it's giving me a strange error. When I run the following code:
from datasets import load_dataset
from transformers import AutoTokenizer, ...
2
votes
3
answers
4k
views
How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?
I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:
pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --...
1
vote
1
answer
1k
views
How to augment dataset by adding rows via huggingface datasets?
I have a dataset with 113287 train rows. Each 'caption' field is however an array with multiple strings. I would like to flatmap this array and add new rows.
The documentation for datasets states that ...
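The flat-map described in this question can be written as a batched function that returns more rows than it receives. This is a minimal sketch under stated assumptions: only the `caption` column comes from the question, while `image_id` and the helper name `explode_captions` are hypothetical. It is shown on a plain dict of lists, which is the shape a batched function receives.

```python
def explode_captions(batch):
    """Emit one output row per caption string, copying the other columns."""
    out = {key: [] for key in batch}
    for i, captions in enumerate(batch["caption"]):
        for caption in captions:
            for key in batch:
                # Replace the caption list with a single string; copy other fields.
                out[key].append(caption if key == "caption" else batch[key][i])
    return out


# Hypothetical batch of two rows; "image_id" is an assumed extra column.
batch = {"image_id": [1, 2], "caption": [["a cat", "a kitten"], ["a dog"]]}
exploded = explode_captions(batch)
print(exploded)
```

With the datasets library, a function like this should be usable as `ds.map(explode_captions, batched=True, remove_columns=ds.column_names)`, since batched functions are allowed to change the number of rows as long as every returned column has the same length.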
0
votes
0
answers
477
views
When trying to import the hugging face package "datasets" I get an attribute error from PyArrow
I have tried to start environments with several different Python versions and installed pyarrow in different versions. Nothing worked. Where could it be coming from?
AttributeError ...
0
votes
1
answer
818
views
Huggingface load_dataset messes up the structure of the dataset
Following https://huggingface.co/docs/datasets/en/loading#json
I am trying to load this dataset https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json
...
2
votes
1
answer
3k
views
How to randomly sample very large pyArrow dataset
I have a very large arrow dataset (181GB, 30m rows) from the huggingface framework I've been using. I want to randomly sample with replacement 100 rows (20 times), but after looking around, I cannot ...
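One possible approach to the sampling described here is to draw row indices rather than rows, so nothing large is read until the indices are used. This is a minimal sketch using only the standard library; the helper name `sample_index_sets` and the seed are hypothetical, and the 30 million row count is taken from the question.

```python
import random


def sample_index_sets(num_rows, sample_size=100, repeats=20, seed=0):
    """Draw `repeats` independent index lists, each sampled with replacement."""
    rng = random.Random(seed)
    # random.choices samples with replacement; range() avoids building a big list.
    return [rng.choices(range(num_rows), k=sample_size) for _ in range(repeats)]


index_sets = sample_index_sets(num_rows=30_000_000)
print(len(index_sets), len(index_sets[0]))
```

With the datasets library, each index list could then be passed to `Dataset.select`, for example `ds.select(index_sets[0])`, which should fetch only the requested rows.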
1
vote
1
answer
2k
views
Is there any way to download only a partition of the whole dataset from Hugging Face?
I am trying to finetune a facebook/wav2vec2 model on Automatic Speech Recognition (ASR) with the Common Voice dataset, but I ran into the issue that my disk space is not enough to hold this large ...
0
votes
1
answer
183
views
NameError: name 'Path' is not defined when using HF.Dataset.from_generator
train_json_files = glob(paths.TRAIN_JSON_FOLDER + "*.json")
from pathlib import Path
def get_gt_string_and_xy(filepath: Union[str, os.PathLike]) -> Dict[str, str]:
"""
...