Newest 'data-preprocessing' Questions

0 votes

0 answers

26 views

Assistance with Data Processing Insurance Premiums

I have been set a task by my manager to predict insurance premiums based on categories such as job description, number of people employed and turnover. I am comparing K-Nearest ...

Red_bull

19

asked Jul 29 at 13:57

0 votes

0 answers

27 views

Unable to create Kafka topics for MongoDB connector

I am trying to use the official MongoDB Kafka connector to create topics automatically while creating a connector using the SQL command CREATE SOURCE CONNECTOR logistics_n WITH ( 'connector.class' = 'com....

Roll no1

1,423

asked Jul 11 at 9:46

0 votes

1 answer

58 views

Multivalued column cannot be transformed

I'm working with the Stack Overflow 2024 survey. In the CSV file there are several multivalued variables (separated by ;). I want to apply one-hot encoding to the variables Employment and LanguageAdmire by ...

Lev

843

asked Jun 3 at 10:42
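
One way to picture the encoding asked about above is the plain-Python sketch below. It is only an illustration under stated assumptions: the sample values are hypothetical (not taken from the survey file), and the helper name `one_hot_multivalued` is made up for this example. In pandas, `Series.str.get_dummies(sep=";")` may be a more direct route.

```python
def one_hot_multivalued(values, sep=";"):
    """One-hot encode strings that each hold several sep-separated categories."""
    # Split every cell into a set of trimmed, non-empty category names.
    split_rows = [
        {part.strip() for part in cell.split(sep) if part.strip()}
        for cell in values
    ]
    # The sorted union of all categories gives a stable column order.
    categories = sorted(set().union(*split_rows))
    # One row of 0/1 flags per input cell, one flag per category.
    rows = [[1 if cat in row else 0 for cat in categories] for row in split_rows]
    return categories, rows


# Hypothetical sample values; the real survey column may look different.
employment = ["Employed, full-time;Student", "Student", "Retired"]
columns, encoded = one_hot_multivalued(employment)
```

Each cell can set several flags at once, which is why a single-label encoder applied to the raw string does not fit this kind of column.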

0 votes

0 answers

21 views

NaN Values After Applying IterativeImputer and Inverse Transforming LabelEncoded Data

I am using IterativeImputer from sklearn.impute to fill missing values in my dataset. One of my columns, Education_Level, is a categorical feature, so I first applied LabelEncoder to convert it into ...

Mahdi Mashayekhi

1

asked Mar 31 at 17:17

0 votes

0 answers

18 views

Does Modifying an Attribute of a Custom Dataset Affect Both Subsets After random_split in PyTorch?

I am working on a binary classification task using an audio dataset, which is already divided into training and testing sets. However, I also need a validation set, so I split the training set into ...

GauravGiri

21

asked Mar 1 at 4:52

0 votes

2 answers

54 views

Combining multiple dataframes with same number of rows and different columns in R [duplicate]

I'm trying to combine several (>2) dataframes with the same rows and different columns in R. For example, I have 4 dataframes: df1 <- data.frame( x = c("A1", "A2", "A3&...

Karina

asked Feb 28 at 17:30

0 votes

1 answer

49 views

Is there a way to set the data_min and the data_max in MinMaxScaler()?

I'm currently using MinMaxScaler() on my dataset. However, because my dataset is large, I'm doing a first pass in batches to compute the min and max values for my scaler. I'm using ...

Saffy

13

asked Feb 5 at 21:57
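
The two-pass idea described in this question (first collect the minimum and maximum over batches, then scale) can be sketched in plain Python as follows. The function names and the toy batches are assumptions made for illustration; this is not scikit-learn's implementation. scikit-learn's `MinMaxScaler` also offers `partial_fit`, which may cover the same need without setting the bounds by hand.

```python
def running_min_max(batches):
    """Track the per-column minimum and maximum across an iterable of batches."""
    col_min, col_max = None, None
    for batch in batches:
        for row in batch:
            if col_min is None:
                # The first row seen initialises both bounds.
                col_min, col_max = list(row), list(row)
                continue
            col_min = [min(a, b) for a, b in zip(col_min, row)]
            col_max = [max(a, b) for a, b in zip(col_max, row)]
    return col_min, col_max


def min_max_scale(row, col_min, col_max):
    """Scale one row to [0, 1]; constant columns map to 0.0."""
    return [
        (x - lo) / (hi - lo) if hi != lo else 0.0
        for x, lo, hi in zip(row, col_min, col_max)
    ]


# Hypothetical data: two batches of two-column rows.
batches = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 20.0]]]
col_min, col_max = running_min_max(batches)
scaled = min_max_scale([3.0, 20.0], col_min, col_max)
```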

0 votes

0 answers

69 views

Downloading MIT-BIH NSR & SCD Holter Databases from PhysioNet in Python

I am working on a deep learning project to forecast Sudden Cardiac Death (SCD) using ECG data from PhysioNet. Specifically, I need to download and preprocess the following databases: MIT-BIH Normal ...

lipano marte

1

asked Jan 31 at 15:41

0 votes

0 answers

18 views

How to combine columns with nested lists with each other using pandas? [duplicate]

I'm working on a pandas DataFrame that contains columns with lists and am currently trying the explode method, but I'm not getting the desired output. Instead, it produces a Cartesian product, combining all ...

buzzo

1

asked Jan 14 at 14:58

0 votes

0 answers

54 views

How can I batch process multiple .npy files in Python for motion capture data preprocessing?

I am working on a project where I need to preprocess multiple motion capture files stored in .npy format. I am able to load and preprocess individual files, but I am facing difficulties when trying to ...

Mathletes Choreo

1

asked Dec 12, 2024 at 9:51

2 votes

0 answers

66 views

Kernel died when I run: dataset = Dataset.from_dict(data_dict)

I am fine-tuning the SAM model for my dataset containing train_images and train_masks. I am able to create the dict, but when calling the last command, i.e. to load the dataset from the dict, the kernel dies. It happened ...

Sanju

21

asked Dec 11, 2024 at 10:39

0 votes

1 answer

539 views

How to create a scaler applying log transformation and MinMaxScaler in sklearn

I want to apply log() to my DataFrame and MinMaxScaler() together. I want the output to be a pandas DataFrame() with indexes and columns from the original data. I want to use the parameters used to ...

Guilherme Parreira

1,071

asked Nov 7, 2024 at 18:41

0 votes

1 answer

70 views

Varying embedding dim due to changing padding in batch size

I want to train a simple neural network, which has embedding_dim as a parameter: class BoolQNN(nn.Module): def __init__(self, embedding_dim): super(BoolQNN, self).__init__() self....

samuel gast

392

asked Oct 18, 2024 at 15:54

0 votes

0 answers

72 views

Input file specified two times

I am using shell commands in Jupyter with Python. When I try to prepare a dataset, sorting by column with case sensitivity fails. The line is like this: !head -n 5 $...

md Almus Fuad

1

asked Sep 19, 2024 at 14:38

-1 votes

1 answer

191 views

Capitalized words in sentiment analysis

I'm currently working with customer review data on products from Sephora. My task is to classify the reviews by sentiment: negative, neutral, positive. A common technique of text preprocessing is to ...

read data

3

asked Aug 30, 2024 at 13:49

2 votes

1 answer

74 views

How to Skip over Consecutive Delimiters in Preprocessing Data

I'm trying to clean up an ASCII dataset with inconsistent spacing (e.g. dataset = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 2 1 1 1 1 1 1 1]) but so far what I've ...

Daedalus

21

asked Aug 27, 2024 at 18:19
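
For whitespace-delimited text like the example in this question, a minimal sketch is shown below. The sample strings are made up for illustration, and the comma case is an added assumption for non-whitespace delimiters. It relies on the documented behaviour of `str.split()` with no argument, which treats any run of whitespace as a single separator.

```python
import re

# Hypothetical line with inconsistent spacing between fields.
raw = "1  1 1   4 2    1"

# str.split() with no argument collapses runs of whitespace and
# ignores leading/trailing whitespace, so no empty tokens appear.
values = [int(token) for token in raw.split()]

# For another delimiter (a comma here, as an assumption), split on one
# or more consecutive delimiters and drop any empty edge tokens.
csv_like = "1,,2,,,3"
parts = [p for p in re.split(r",+", csv_like) if p]
```

By contrast, `raw.split(" ")` with an explicit space keeps an empty string for every extra space, which is a common source of the problem described.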

1 vote

0 answers

27 views

How can I transform the categorical data entered by the user using Target Encoding?

When fitting the model in Google Colab there doesn't seem to be any problem. However, when I try to create an interface using Streamlit and pickle, the target encoder doesn't work and I am unable to solve ...

user25546188

11

asked Aug 22, 2024 at 22:19

0 votes

1 answer

81 views

ModuleNotFoundError: No module named 'datachain.lib'; 'datachain' is not a package

Why am I encountering the ModuleNotFoundError for the datachain.lib module? Are there any additional steps I need to take to properly use the datachain package in my project? I'm working on a Python ...

Rashid mehmood

29

asked Aug 7, 2024 at 9:53

0 votes

0 answers

56 views

How can I preprocess a feature that contains a list of number codes?

I have to preprocess a feature which is basically a list of number codes encoded as a string, and I want to encode it such that the output is an array of frequencies of each of these numbers. The ...

AKHIL GOPIKUMAR

1

asked Jul 27, 2024 at 15:30
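
The frequency encoding described here can be sketched with `collections.Counter`. The comma separator, the fixed vocabulary, and the function name `code_frequencies` are all assumptions, since the excerpt does not show the actual string format.

```python
from collections import Counter


def code_frequencies(encoded, vocabulary, sep=","):
    """Count how often each code in `vocabulary` occurs in the encoded string."""
    # Split the string, trim each token, and skip empty tokens.
    counts = Counter(tok.strip() for tok in encoded.split(sep) if tok.strip())
    # A fixed vocabulary keeps the output length and column order stable;
    # Counter returns 0 for codes that never occur.
    return [counts[code] for code in vocabulary]


# Hypothetical vocabulary and input string.
vocabulary = ["101", "205", "330"]
freqs = code_frequencies("101, 205, 101, 330, 101", vocabulary)
```

Codes outside the vocabulary are ignored in this sketch; whether that is acceptable depends on the dataset.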

0 votes

1 answer

81 views

How to combine 3 annotated datasets into one file for further processing?

I have a dataset annotated by three people, so now I have three files. The dataset is about tweet annotation. How can I combine these files into one for further processing? The data set is an ...

ZAIN UL ABIDIN QADRI

1

asked Jul 4, 2024 at 7:05

1 vote

2 answers

707 views

How can I create a custom sigmoid function?

I am trying to build a custom sigmoid-shaped function because I want to scale my data during preprocessing. Basically, the goal is to obtain a sigmoid shaped function that outputs from 0 to 1 and only ...

cercio

89

asked Jun 25, 2024 at 10:32
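
A sigmoid-shaped scaling function with outputs between 0 and 1 could look like the sketch below. The `midpoint` and `steepness` parameters are assumptions added for illustration; the truncated excerpt does not state which extra constraints the asker needs.

```python
import math


def scaled_sigmoid(x, midpoint=0.0, steepness=1.0):
    """Logistic curve that returns 0.5 at `midpoint`; outputs lie in [0, 1]."""
    z = steepness * (x - midpoint)
    # Branch on the sign of z so math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A larger `steepness` makes the transition around `midpoint` sharper, while shifting `midpoint` moves where the output crosses 0.5.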

1 vote

1 answer

775 views

What is the alternative for keras.layers.DenseFeatures in TensorFlow 2.16.+

I am using a feature-column dataset in my code. In TensorFlow 2.16.1 and later there is no keras.layers.DenseFeatures class to prepare the input layer for the DNN. What is the ...

shahramy

13

asked Jun 22, 2024 at 1:26

1 vote

0 answers

90 views

How do I ensure unique non-overlapping values in each column?

I have the following input: data = { 'Group_A': ['0&1', '1&5', '0&5', '1&7', '3&8', '4&8', '3&5', '4&4'], 'Group_B': ['1&0', '5&7', '0&5'...

deepcurious

19

asked Jun 13, 2024 at 7:16

1 vote

3 answers

124 views

Classification for multi row observation: Long format to Wide format always efficient?

I have a table of observations, or rather 'grouped' observations, where each group represents a deal and each row represents a product. But the prediction is to be done at the deal level. Below is ...

Salih

399

asked May 31, 2024 at 10:43

0 votes

1 answer

868 views

SageMaker Processing Job permission denied to save csv file under /opt/ml/processing/<folder>

I am working on a project involving Step Functions with SageMaker. I have an existing Step Function that I need to integrate SageMaker into, and I tried adding steps such as processing, model training,...

Gwenda Thomas

101

asked May 29, 2024 at 19:34

-2 votes

1 answer

33 views

How should different numbers of explanatory variables be handled by row in machine learning?

I have a problem while making a predictive model, so I'm leaving a question. I'm trying to create a predictive model using machine learning methodologies such as random forest, xgboost, etc. At this ...

최성렬

1

asked May 28, 2024 at 11:21

-4 votes

1 answer

64 views

Is there an Excel function to assign a binary result to a predefined data cell?

Sorry for the title, I know it might be rather broad and not very informative. I am facing a problem regarding the analysis of a data set. The participants of my experiments were randomly assigned ...

taboulet

1

asked May 20, 2024 at 14:53

0 votes

0 answers

95 views

Deduplication of text for a large corpus

I have a large csv file with about 7000 rows (files) with text entries consisting of the following columns in bold: filename title text author year 0 latin_xmls\10.xml De facto Ungarie ...

Phil

1

asked May 7, 2024 at 16:27

0 votes

1 answer

382 views

Filtering Pandas DataFrame by Substring Match at Start of Strings [duplicate]

I'm trying to filter out rows in which the value of a specific column starts with a given substring. I have a pandas.DataFrame as shown below (simplified): price DRUG_CODE 123 A12D958 234 B564F3C ... ... I'm ...

Warren Chen

51

asked May 3, 2024 at 10:02

0 votes

0 answers

28 views

Mock testing in Python shows an error "no module found named path.csv"

class TestFilterDF(unittest.TestCase): @patch('plugins.qa_plugins.preprocessing.read_df') def test_filter_df(self, read_df_mock): # Mocking read_df function to return a DataFrame ...

Uplabdhi Khare

1

asked Apr 30, 2024 at 12:59

0 votes

1 answer

35 views

Sklearn Column Transformer not working for mixed data types

from sklearn.compose import ColumnTransformer from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder from sklearn.pipeline import Pipeline from sklearn.model_selection import ...

s213439

1

asked Apr 30, 2024 at 8:20

0 votes

1 answer

164 views

Machine Learning model dependent on one feature

I am training a model for crop yield prediction having a self-constructed dataset of 6 features and 2000 records. However, the dataset is biased and I am not getting accurate results. I have tried ...

Muhammad Bilal

9

asked Apr 28, 2024 at 9:27

0 votes

0 answers

41 views

Failed to convert a NumPy array to a Tensor for LSTM

I'm trying to run an LSTM model where the data is separated into a few columns in CSV files, and I'm trying to prepare data from these CSVs. I'm getting the error ValueError: Failed to convert a NumPy array to a ...

Athul Srinivas

36

asked Apr 25, 2024 at 15:22

0 votes

1 answer

49 views

Why do I get "inf" values after replacing NaN values in my dataset with the mean or median?

I'm using Python and I have a dataset containing NaN values. To clean up these data, I replaced the NaN values with the mean or median of each column using the fillna() function from pandas. However, ...

AI enthousiast

19

asked Apr 24, 2024 at 11:45

1 vote

1 answer

2k views

Why is my GPU not being used despite having turned it on in Kaggle?

I've uploaded a dataset on Kaggle (approx. 73 GB), and I'm trying to preprocess this data for model training purposes. This dataset has a large number of missing values, which I am trying to interpolate ...

54m4gr4

13

asked Apr 23, 2024 at 12:54

0 votes

0 answers

142 views

ValueError: could not convert string to float: 'M'

I'm trying to make an ANN in Python to predict something from a dataset (in this case diabetes), and I'm struggling to figure out how to solve this error. Here is the full code: import pandas as pd ...

nyura45

1

asked Apr 23, 2024 at 8:01

0 votes

1 answer

638 views

Issue when padding and packing sequences in LSTM networks using PyTorch

I'm trying to make a simple LSTM neural network. I've got time series data which I am splitting into sequences and batches using PyTorch's Dataset and DataLoader. To account for the variable lengths ...

D Danne

17

asked Apr 18, 2024 at 19:35

0 votes

0 answers

88 views

How to pass arguments (labels) to the map function map_func, callable from tf.data.Dataset.map() in Python

There is a known method for creating a dataset. The code snippet was borrowed from: https://www.tensorflow.org/tutorials/audio/simple_audio #Gather data from files ''' .....some code I see no need to paste, ...

Hell576

1

asked Apr 13, 2024 at 10:35

0 votes

0 answers

61 views

TypeError: Cannot do positional indexing on RangeIndex with these indexers of type DataFrame

I'm new to Python, so I'm sorry if this is a basic one. However, after I ran the code, I got this: TypeError: cannot do positional indexing on RangeIndex with these indexers [ Year Average of PM ...

Sofia

1

asked Apr 6, 2024 at 10:14

0 votes

1 answer

66 views

Using tft.scale_to_gaussian for preprocessing a dataset without using other tensorflow operations

I'm working on a project where I have a set of long-tailed data that I want to transform into a Gaussian distribution. I'm looking to achieve something similar to scikit-learn's PowerTransformer, but ...

umut

1

asked Mar 26, 2024 at 18:00

0 votes

1 answer

107 views

Feature Scaling with MinMaxScaler()

I have 31 features to be input into an ML algorithm. Of these, 22 features already have values in the range 0 to 1. The remaining 9 features vary between 0 and 750. My doubt is if I choose to apply ...

rekha

7

asked Mar 19, 2024 at 5:40

0 votes

1 answer

35 views

How to separate items in a dataset in Python?

I scraped reviews from a website, and the pros and cons are separate from each other. I scraped them as a list because it looked like the best solution for not having the same review with user, date ...

averzeo

1

asked Mar 18, 2024 at 17:40

1 vote

1 answer

38 views

Using sklearn where the label is a combination of multiple inputs [closed]

I'm performing data analysis on a dataset whose categorical labels are interrelated. My labels track experimental conditions. In my case, labels track concentrations of combinations of two chemicals ...

WoolyThomas

47

asked Mar 12, 2024 at 22:39

12 votes

2 answers

81 views

Issues with Data Preprocessing and Changing Type of DataFrame Columns

I defined the student_sub_set dataframe as below: # select the subset of characteristics for the regression student_sub_set = student[['acad_lang_home', 'absent_freq','tired_freq','sex', ...

Narges Ghanbari

433

asked Mar 10, 2024 at 23:36

0 votes

0 answers

100 views

Sklearn inverse_transformation does not work as expected, any alternatives?

from sklearn.preprocessing import MinMaxScaler values = df[['Close']] #values is floats ranging from 0.06 to 190.08 sc = MinMaxScaler() scaled_values = sc.fit_transform(values) descaled_values = sc....

haintaki

11

asked Mar 8, 2024 at 0:43

0 votes

1 answer

71 views

How to Extract Specific Data from a Text File Using Python

I have a data file that has a geometrical combination as the heading, followed by the related data generated by the software. The data file has the following structure. The data file starts from ...

Mad0731

1

asked Feb 29, 2024 at 20:53

0 votes

0 answers

64 views

Is there a faster method to process a pandas list of string values?

There are approximately 13000 values for a given column. The function below takes a list of strings as input and does the NER tagging for each word in the list. On average there ...

srinivas muralidharan

39

asked Feb 14, 2024 at 10:38

1 vote

0 answers

878 views

How do LlamaIndex and LangChain Differ in Terms of Data Preprocessing for LLM Applications?

I've been exploring frameworks to integrate large language models (LLMs) into my applications, specifically focusing on data preprocessing, ingestion, and query capabilities. I've come across both ...

Arrmlet

121

asked Feb 5, 2024 at 14:50

0 votes

0 answers

93 views

Worse performance with increased direct_num_workers when running preprocessing of DLRM with Apache Beam

I am trying to run the preprocessing tasks of DLRM with Apache Beam: https://github.com/tensorflow/models/tree/master/official/recommendation/ranking/preprocessing. The dataset is Criteo Kaggle 10GB ...

Eric

1

asked Jan 29, 2024 at 8:36

2 votes

1 answer

8k views

How to Train a YOLO Model with Locally Downloaded Open Images Dataset?

I have downloaded the Open Images dataset to train a YOLO (You Only Look Once) model for a computer vision project. However, I am facing some challenges and I am seeking guidance on how to proceed. ...

Ameer Hamzah

31

asked Jan 21, 2024 at 14:33
