2 votes
1 answer
83 views

I'm creating a conversation dataset for an image classification task where the system message should contain only text, and the user message contains both text and an image. However, after mapping my ...
GauravGiri
2 votes
1 answer
3k views

I’m trying to use the audio dataset Sunbird/urban-noise-uganda-61k with 🤗datasets. After loading the dataset, when I try to access an entry like this: dataset = load_dataset("Sunbird/urban-noise-...
Pranav Nataraj
0 votes
0 answers
67 views

I'm using LeRobot to train a SO101 arm policy with 3 video streams (front, above, gripper) and a state vector. The dataset can be found at this link. I created a custom JSON config (the train_config....
Aaron Serpilin
0 votes
0 answers
73 views

I am trying to apply the transformation below to prepare my datasets for fine-tuning with Unsloth and Hugging Face. It requires the dataset to be in the following format. def convert_to_conversation(sample): ...
SoraHeart
0 votes
1 answer
119 views

Problem when using Datasets to open JSONL: I am trying to open a JSONL-format file using the datasets library. Here is my code: from datasets import load_dataset path = "./testdata.jsonl" ...
bluebingoSu
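For the JSONL question above, a minimal sketch of the file shape that `load_dataset("json", ...)` expects — one complete JSON object per line. The file name, keys, and values here are hypothetical stand-ins:

```python
import json
import os
import tempfile

# Hypothetical rows; every line must be one self-contained JSON object.
rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
path = os.path.join(tempfile.mkdtemp(), "testdata.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# With the datasets library such a file is typically loaded as:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files=path, split="train")
# Here we only verify the file round-trips line by line with the stdlib.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

A file that fails this line-by-line round-trip (e.g. one JSON array spanning the whole file) is a common cause of `load_dataset("json", ...)` errors.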
0 votes
0 answers
36 views

I've been trying to load Royc30ne/emnist-byclass from Hugging Face using the load_dataset method provided by the huggingface/datasets library, but failed. First I tried this, which is a common way to load ...
Lumiat
0 votes
0 answers
70 views

I am looking at the PyPI page for datasets, and it says the latest version of datasets is 3.6.0. However, when I am working in Google Colab and I do !pip install -U datasets, it says 2.14.4 is my ...
Greg Werner
0 votes
1 answer
47 views

I have a huggingface dataset with a column ImageData whose feature descriptor is features={'images': Sequence(feature=Image(mode=None, decode=True, id=None), length=16, id=None)}. I need to ...
LudvigH
0 votes
1 answer
91 views

Running import onnxruntime as ort from datasets import load_dataset yields the error: (env-312) dernoncourt@pc:~/test$ python SE--test_importpb.py Traceback (most recent call last): File "/...
Franck Dernoncourt
2 votes
0 answers
23 views

I'm trying to work with a dataset from HuggingFace, specifically the image_caption/textcaps split. The JSON file I downloaded lists image filenames (e.g., train/011e7e629fb9ae7b.jpg), but the actual ...
Tixtor 710
0 votes
1 answer
52 views

The following code crashes with a forking error. It says objc[81151]: +[NSResponder initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore ...
LudvigH
0 votes
0 answers
122 views

The following program crashes upon execution from datasets import IterableDataset, Dataset from trl import GRPOConfig, GRPOTrainer prompts = ["Hi", "Hello"] def data_generator(): ...
PMM
0 votes
0 answers
43 views

I am currently trying to train a Hugging Face model with a local dataset. I am also using Hugging Face's datasets library to load my local data with the Dataset class and the .from_json() method. ...
pips
0 votes
1 answer
373 views

This is how I load my train and test datasets with HF: dataset = {name.replace('/', '.'): f'{name}/*.parquet' for name in ["train", "test"]} dataset = load_dataset("parquet&...
Ford O.
0 votes
1 answer
138 views

I am using the feature extractor from ViT like explained here. And noticed a weird behaviour I cannot fully understand. After loading the dataset as in that colab notebook, I see: ds['train'].features ...
hamagust
0 votes
1 answer
72 views

I'm using: Windows, Python version 3.10.0, datasets==2.21.0, numpy==1.24.4. I tried to iterate over a dataset I just downloaded: from datasets import load_dataset dataset = load_dataset("jacktol/atc-...
user3668129
2 votes
2 answers
1k views

I’m encountering an error while trying to load and process the GAIR/MathPile dataset using the Hugging Face datasets library. The error seems to occur during type casting in pyarrow within a ...
Charlie Parker
0 votes
1 answer
78 views

When using the Dataset.map function: dataset.map(myfunc, num_proc=16, keep_in_memory=False, cache_file_name='parts.arrow', batch_size=16, writer_batch_size=16 ) Due to the size of my ...
alvas
0 votes
1 answer
231 views

I'm running this code to stream the huggingface dataset mozilla-foundation/common_voice_17_0: language = "en" buffer_size = 100 streaming_dataset = load_dataset("mozilla-foundation/...
Bobby Miller
1 vote
0 answers
78 views

I have a strange behavior in HuggingFace Datasets. My minimal reproduction is as below. # main.py import datasets import numpy as np generator = np.random.default_rng(0) X = np.arange(1000) ds = ...
LudvigH
0 votes
0 answers
194 views

I have a very large dataframe (60+ million rows) that I would like to use a transformer model to grab the embeddings for these rows (DNA sequences). Basically, this involves tokenizing first, then I ...
youtube
0 votes
1 answer
142 views

I am trying to load a training dataset in my VS Code notebook but keep getting an error. This happens exclusively in VS Code, since when I run the same notebook in Colab there is no problem in loading....
Amrat nisa
1 vote
1 answer
156 views

What file structure should I use on the Hugging Face Hub if I have a /train.zip archive with PNG image files and a /metadata.csv file with annotations for them, so that the parquet-converter bot can ...
Artyom Ionash
0 votes
1 answer
167 views

I have a number of datasets, which I create from a dictionary like so: info = DatasetInfo( description="my happy lil dataset", version="0.0.1", homepage=&...
MadDanWithABox
2 votes
1 answer
3k views

I am encountering an ImportError when running a Python script that imports CommitInfo from the huggingface_hub package. The error message is as follows: ImportError: cannot import name 'CommitInfo' ...
Ohm
1 vote
0 answers
392 views

I'm using pip version 24.1.2 and Python 3.12.4. The installation seemingly goes fine. However, when importing the package, like in the line from datasets import load_dataset I'll see zsh: ...
ryanjackson
0 votes
1 answer
218 views

I want to know which datasets are included in e.g. this collection of huggingface datasets: https://huggingface.co/datasets/autogluon/chronos_datasets "m4_daily" and "weatherbench_daily&...
ivegotaquestion
1 vote
1 answer
5k views

Note: newbie to LLMs. Background of my problem: I am trying to train an LLM using Llama 3 on a Stack Overflow C-language dataset. LLM - meta-llama/Meta-Llama-3-8B. Dataset - Mxode/StackOverflow-QA-C-Language-...
Bhargav
0 votes
1 answer
107 views

I am trying to load the SQuAD dataset using the datasets library in Python, but I am encountering a FileNotFoundError. Here is the code I am using: from datasets import load_dataset dataset = ...
Rtttt
0 votes
1 answer
81 views

The Common Voice v11 on HuggingFace has some amazing View features! They include a dropdown button to select the language, and columns with the dataset features, such as client_id, audio, sentence, ...
Michel Mesquita
0 votes
1 answer
339 views

I encountered a "hang" issue when using the HF datasets map(). I read that using num_proc=os.cpu_count() should help; however, when trying this I experienced the following error with the ...
SarahK
0 votes
1 answer
86 views

I have a CSV file with N columns. If I do datasets.Dataset.from_csv(path), it produces N features, each an int value. However, I want to say: column 0 to column 4 is feature1, the rest is ...
Wang
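For the column-grouping question above, a hedged sketch of the kind of row transform one could pass to Dataset.map() after from_csv. The column names col0..colN are hypothetical stand-ins for the real CSV header:

```python
# Fold the first five CSV columns into feature1 and the rest into
# feature2. With the datasets library this would be applied as:
#   ds = ds.map(group_columns, remove_columns=ds.column_names)
def group_columns(example, n_first=5):
    # Sort keys "col0", "col1", ... numerically so column order is stable.
    cols = sorted(example, key=lambda k: int(k[3:]))
    feature1 = [example[k] for k in cols[:n_first]]
    feature2 = [example[k] for k in cols[n_first:]]
    return {"feature1": feature1, "feature2": feature2}
```

The transform is plain Python, so it can be unit-tested on a dict before being handed to map().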
0 votes
1 answer
409 views

To generate a custom dataset: from datasets import Dataset,ClassLabel,Value features = ({ "sentence1": Value("string"), # String type for sentence1 "sentence2": Value(&...
user269867
0 votes
1 answer
220 views

I have in-memory text in JSON format, and I am trying to load a HuggingFace dataset directly from the in-memory text. If I save it into a file, I can load the dataset using huggingface load_dataset: ...
Noam Gershi
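For the in-memory JSON question above, a minimal sketch assuming column-oriented JSON text (the keys and values are hypothetical stand-ins):

```python
import json

# Hypothetical JSON text held in memory: one list of values per column.
text = '{"sentence": ["good movie", "bad movie"], "label": [1, 0]}'
columns = json.loads(text)

# With the datasets library, a column dict can be wrapped directly,
# with no round-trip through a file:
#   from datasets import Dataset
#   ds = Dataset.from_dict(columns)
```

If the in-memory text is instead a list of row objects, it would first need to be pivoted into per-column lists before Dataset.from_dict can take it.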
1 vote
0 answers
278 views

I'm working with Hugging Face datasets and I need to split a dataset into training and validation sets. My main requirement is that the dataset should be processed in streaming mode, as I don't want ...
Charlie Parker
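For the streaming train/validation split above, the streaming side of the datasets library exposes take(n) and skip(n) on IterableDataset; a sketch of the same idea with plain iterators (the helper name and sizes are hypothetical):

```python
from itertools import islice

def split_stream(make_stream, n_val):
    """Split a re-creatable stream into a validation head and training tail.

    Mirrors what one would write with the datasets streaming API:
        val_ds   = ds.take(n_val)
        train_ds = ds.skip(n_val)
    Nothing beyond the validation head is materialised.
    """
    val = list(islice(make_stream(), n_val))  # first n_val examples
    def train():
        it = make_stream()
        for _ in range(n_val):               # skip the validation head
            next(it, None)
        yield from it                        # stream the rest lazily
    return val, train()
```

The stream is passed as a factory (make_stream) because a consumed iterator cannot be rewound; with load_dataset(..., streaming=True) the take/skip pair avoids that issue natively.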
1 vote
0 answers
726 views

I'm currently working with the Hugging Face datasets library and need to apply transformations to multiple datasets (such as ds_khan and ds_mathematica) using the .map() function, but in a way that ...
Charlie Parker
1 vote
1 answer
846 views

I'm running a function over a dataset, but when I compute this, I seem to replace my existing dataset rather than adding to it. What is going wrong? dataset_c = Dataset.from_pandas(df_all[0:100]) ...
disruptive
1 vote
2 answers
1k views

I am getting the below error when trying to modify, chunk and resave a Huggingface Dataset. I was wondering if anyone might be able to help? Traceback (most recent call last): File "C:\Users\...
Connor Davidson
0 votes
1 answer
1k views

After I have loaded a huggingface dataset download_config = DownloadConfig() dataset = load_dataset (hf_dataset_name, download_config=download_config) dataset_split = dataset ['train'] Let say if ...
Hoo
0 votes
0 answers
61 views

Installed datasets package into python virtual environment. When I try to import it, running, from datasets import load_dataset, I get this error, "ImportError: cannot import name '...
Ju Chen
-1 votes
1 answer
218 views

1. Objective: ensure the training data keeps the format needed for model training, using the SFTTrainer. The SFTTrainer has a parameter train_dataset=dataset, that ...
Thomas Suedbroecker
1 vote
0 answers
354 views

I am trying to fine tune a model based on two datasets, following the example on the Hugging Face website, I have my model training on the Yelp Review dataset, but I also want to train my model on the ...
Bigbob556677
0 votes
1 answer
642 views

I'm following the huggingface tutorial here and it's giving me a strange error. When I run the following code: from datasets import load_dataset from transformers import AutoTokenizer, ...
Ameen Izhac
2 votes
3 answers
4k views

I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows: pip install huggingface_hub[hf_transfer] huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --...
Franck Dernoncourt
1 vote
1 answer
1k views

I have a dataset with 113287 train rows. Each 'caption' field is however an array with multiple strings. I would like to flatmap this array and add new rows. The documentation for datasets states that ...
Jotschi
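For the flat-map question above, a hedged sketch of the key property of Dataset.map(batched=True): a batched function may return a different number of rows than it received, so each caption in the array becomes its own row. Column names "id" and "caption" follow the question; "id" is a hypothetical companion column:

```python
# With the datasets library this would be applied as:
#   ds = ds.map(explode_captions, batched=True,
#               remove_columns=ds.column_names)
# remove_columns is needed because the row count changes.
def explode_captions(batch):
    out = {"id": [], "caption": []}
    for row_id, captions in zip(batch["id"], batch["caption"]):
        for caption in captions:          # one output row per caption
            out["id"].append(row_id)
            out["caption"].append(caption)
    return out
```

Because the function operates on plain column dicts, it can be checked in isolation before touching the 113,287-row dataset.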
0 votes
0 answers
477 views

I have tried to start environments with several different Python versions and installed pyarrow in different versions. Nothing worked. Where can it be coming from? AttributeError ...
adigianv
0 votes
1 answer
818 views

Following https://huggingface.co/docs/datasets/en/loading#json I am trying to load this dataset https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json ...
user25004
2 votes
1 answer
3k views

I have a very large arrow dataset (181GB, 30m rows) from the huggingface framework I've been using. I want to randomly sample with replacement 100 rows (20 times), but after looking around, I cannot ...
youtube
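For the sampling question above, a sketch of drawing 20 independent with-replacement samples of 100 row indices, then selecting only those rows; Dataset.select reads just the requested rows from the arrow file rather than loading all 30M. The seed is arbitrary and n_rows comes from the question:

```python
import random

n_rows = 30_000_000            # dataset size quoted in the question
rng = random.Random(0)         # fixed seed for reproducibility

# 20 index lists of 100 rows each, drawn with replacement.
index_sets = [rng.choices(range(n_rows), k=100) for _ in range(20)]

# With the datasets library each sample would then be materialised as:
#   for indices in index_sets:
#       sample = ds.select(indices)   # ds: the loaded arrow-backed Dataset
```

rng.choices samples with replacement, so duplicate indices within one sample are possible, which matches the stated requirement.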
1 vote
1 answer
2k views

I am trying to finetune a facebook/wav2vec2 model on Automatic Speech Recognition (ASR) with the Common Voice dataset, but I stumbled upon an issue: my disk space is not enough to hold this large ...
Philip
0 votes
1 answer
183 views

train_json_files = glob(paths.TRAIN_JSON_FOLDER + "*.json") from pathlib import Path def get_gt_string_and_xy(filepath: Union[str, os.PathLike]) -> Dict[str, str]: """ ...
Shadowpulse