495 questions
0
votes
0
answers
26
views
Assistance with Data Processing Insurance Premiums
I have been set a task by my manager to try and predict insurance premiums based on some categories such as job description, number of people employed and turnover. I am comparing K-Nearest ...
0
votes
0
answers
27
views
Unable to create Kafka topics for MongoDB connector
I am trying to use the official MongoDB Kafka connector to create topics automatically while creating the connector with a SQL command:
CREATE SOURCE CONNECTOR logistics_n WITH (
'connector.class' = 'com....
0
votes
1
answer
58
views
Multivalued column cannot be transformed
I'm working with the Stack Overflow 2024 survey. In the CSV file there are several multivalued variables (separated by ;). I want to apply one-hot encoding to the variables Employment and LanguageAdmire by ...
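One possible approach to a semicolon-separated column like this is pandas' `Series.str.get_dummies`. The following is only a minimal sketch: the rows are invented for illustration, and only the column name and the `;` separator come from the excerpt.

```python
import pandas as pd

# Hypothetical rows; the real survey values may differ.
df = pd.DataFrame({
    "Employment": [
        "Employed, full-time;Student, part-time",
        "Student, part-time",
        "Employed, full-time",
    ]
})

# Split each cell on ';' and build one indicator column per distinct value.
employment_dummies = df["Employment"].str.get_dummies(sep=";")

# Prefix the indicator columns and join them back onto the original frame.
encoded = df.join(employment_dummies.add_prefix("Employment_"))
print(encoded.filter(like="Employment_"))
```

The same call could be repeated for LanguageAdmire, assuming it uses the same separator.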
0
votes
0
answers
21
views
NaN Values After Applying IterativeImputer and Inverse Transforming LabelEncoded Data
I am using IterativeImputer from sklearn.impute to fill missing values in my dataset. One of my columns, Education_Level, is a categorical feature, so I first applied LabelEncoder to convert it into ...
0
votes
0
answers
18
views
Does Modifying an Attribute of a Custom Dataset Affect Both Subsets After random_split in PyTorch?
I am working on a binary classification task using an audio dataset, which is already divided into training and testing sets. However, I also need a validation set, so I split the training set into ...
0
votes
2
answers
54
views
Combining multiple dataframes with same number of rows and different columns in R [duplicate]
I'm trying to combine several (>2) dataframes with the same rows and different columns in R.
For example, I have 4 dataframes:
df1 <- data.frame(
x = c("A1", "A2", "A3"...
0
votes
1
answer
49
views
Is there a way to set the data_min and the data_max in MinMaxScaler()?
I'm currently using MinMaxScaler() on my dataset. However, because my dataset is large, I'm doing a first pass in batches to compute the min and max values for my scaler. I'm using ...
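For a batched first pass like the one described, scikit-learn's `MinMaxScaler.partial_fit` keeps a running per-feature minimum and maximum, which may avoid setting `data_min_` and `data_max_` by hand. A minimal sketch, assuming small in-memory arrays as stand-ins for the real batches:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical batches standing in for chunks of a large dataset.
batches = [
    np.array([[1.0, 10.0], [2.0, 20.0]]),
    np.array([[0.0, 50.0], [4.0, 30.0]]),
]

scaler = MinMaxScaler()
for batch in batches:
    # Each call updates the running per-feature min and max.
    scaler.partial_fit(batch)

print(scaler.data_min_, scaler.data_max_)

# A second pass can now transform each batch with the global statistics.
scaled = scaler.transform(batches[0])
```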
0
votes
0
answers
69
views
Downloading MIT-BIH NSR & SCD Holter Databases from PhysioNet in Python
I am working on a deep learning project to forecast Sudden Cardiac Death (SCD) using ECG data from PhysioNet. Specifically, I need to download and preprocess the following databases:
MIT-BIH Normal ...
0
votes
0
answers
18
views
How to combine columns with nested lists with each other using pandas? [duplicate]
I'm working on a pandas DataFrame that contains columns with lists and am currently trying the explode method, but I'm not getting the desired output. Instead, it produces a Cartesian product, combining all ...
0
votes
0
answers
54
views
How can I batch process multiple .npy files in Python for motion capture data preprocessing?
I am working on a project where I need to preprocess multiple motion capture files stored in .npy format. I am able to load and preprocess individual files, but I am facing difficulties when trying to ...
2
votes
0
answers
66
views
Kernel dies when I run: dataset = Dataset.from_dict(data_dict)
I am fine-tuning a SAM model on my dataset containing train_images and train_masks. I am able to create the dict, but when I call the last command, i.e. loading the dataset from the dict, the kernel dies. It happened ...
0
votes
1
answer
539
views
How to create a scaler applying log transformation and MinMaxScaler in sklearn
I want to apply log() to my DataFrame and MinMaxScaler() together.
I want the output to be a pandas DataFrame() with indexes and columns from the original data.
I want to use the parameters used to ...
0
votes
1
answer
70
views
Varying embedding dim due to changing padding in batch size
I want to train a simple neural network, which has embedding_dim as a parameter:
class BoolQNN(nn.Module):
    def __init__(self, embedding_dim):
        super(BoolQNN, self).__init__()
        self....
0
votes
0
answers
72
views
Input file specified two times
I am using shell commands in Jupyter with Python. While preparing a dataset, I cannot get a case-sensitive sort by column to work.
The line is like this:
!head -n 5 $...
-1
votes
1
answer
191
views
Capitalized words in sentiment analysis
I'm currently working with customer review data on products from Sephora. My task is to classify the reviews by sentiment: negative, neutral, positive.
A common technique of text preprocessing is to ...
2
votes
1
answer
74
views
How to Skip over Consecutive Delimiters in Preprocessing Data
I'm trying to clean up an ASCII dataset with inconsistent spacing (ex.
dataset =
[1 1 1 1 1 1 1 1
1 1 1 1 1 1 4
2 1 1 1 1 1 1 1])
but so far what I've ...
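The excerpt does not say which language is in use. Assuming Python, one hedged sketch relies on `str.split()` with no argument, which treats any run of whitespace as a single delimiter. The uneven spacing below is invented to mimic the inconsistency described:

```python
# Hypothetical raw text with irregular spacing between values.
raw = """1 1   1 1 1  1 1 1
1  1 1 1 1 1     4
2 1 1   1 1 1 1 1"""

# split() with no argument collapses consecutive whitespace,
# so repeated spaces never produce empty fields.
rows = [
    [int(token) for token in line.split()]
    for line in raw.splitlines()
    if line.strip()
]
print(rows)
```

Rows keep their own lengths here (the second row has seven values), so any padding or validation would be a separate step.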
1
vote
0
answers
27
views
How can I transform the categorical data entered by the user using target encoding?
When fitting the model in Google Colab there doesn't seem to be any problem. However, when I try to create an interface using Streamlit and pickle, the target encoder doesn't work and I am unable to solve ...
0
votes
1
answer
81
views
ModuleNotFoundError: No module named 'datachain.lib'; 'datachain' is not a package
Why am I encountering the ModuleNotFoundError for the datachain.lib module?
Are there any additional steps I need to take to properly use the datachain package in my project?
I'm working on a Python ...
0
votes
0
answers
56
views
How can I preprocess a feature that contains a list of number codes?
I have to preprocess a feature which is basically a list of number codes encoded as a string, and I want to encode it such that the output is an array of frequencies of each of these numbers. The ...
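A minimal sketch of turning such a string into a frequency array, assuming the codes are comma-separated and the set of possible codes is known up front. Neither detail is stated in the excerpt, and the function and variable names are hypothetical.

```python
from collections import Counter


def code_frequencies(codes: str, vocabulary: list[int]) -> list[int]:
    """Count how often each code in `vocabulary` appears in a comma-separated string."""
    counts = Counter(int(token) for token in codes.split(",") if token.strip())
    # Codes absent from the string get a count of 0.
    return [counts[code] for code in vocabulary]


vocabulary = [3, 7, 12]  # hypothetical list of known codes
frequencies = code_frequencies("3, 7, 3, 12, 3", vocabulary)
print(frequencies)
```

Fixing the vocabulary order keeps every output array the same length, which most estimators expect.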
0
votes
1
answer
81
views
How to combine 3 annotated datasets into one file for further processing?
I have a dataset annotated by three people, so now I have three files. The dataset is about tweet annotation. How can I combine these files into one for further processing? The dataset is an ...
1
vote
2
answers
707
views
How can I create a custom sigmoid function?
I am trying to build a custom sigmoid-shaped function because I want to scale my data during preprocessing. Basically, the goal is to obtain a sigmoid shaped function that outputs from 0 to 1 and only ...
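The excerpt is cut off before the full requirements, so the following is only a generic sketch of a parameterized logistic curve. The `midpoint` and `steepness` parameters are assumptions for illustration, not taken from the question.

```python
import math


def scaled_sigmoid(x: float, midpoint: float = 0.0, steepness: float = 1.0) -> float:
    """Logistic curve with outputs between 0 and 1 that equals 0.5 at `midpoint`."""
    z = steepness * (x - midpoint)
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(scaled_sigmoid(5.0, midpoint=5.0))
```

A larger `steepness` makes the transition around `midpoint` sharper; shifting `midpoint` moves where the curve crosses 0.5.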
1
vote
1
answer
775
views
What is the alternative for keras.layers.DenseFeatures in TensorFlow 2.16.+
I am using a feature-column dataset in my code. In TensorFlow 2.16.1 and later there is no keras.layers.DenseFeatures class to prepare the input layer for the DNN. What is the ...
1
vote
0
answers
90
views
How do I ensure unique non-overlapping values in each column?
I have the following input:
data = {
    'Group_A': ['0&1', '1&5', '0&5', '1&7', '3&8', '4&8', '3&5', '4&4'],
    'Group_B': ['1&0', '5&7', '0&5'...
1
vote
3
answers
124
views
Classification for multi row observation: Long format to Wide format always efficient?
I have a table of observations, or rather 'grouped' observations, where each group represents a deal and each row represents a product. But the prediction is to be done at the deal level. Below is ...
0
votes
1
answer
868
views
SageMaker Processing Job permission denied to save csv file under /opt/ml/processing/<folder>
I am working on a project involving Step Functions with SageMaker. I have an existing Step Function that I need to integrate SageMaker into, and I tried adding steps such as processing, model training,...
-2
votes
1
answer
33
views
How should different numbers of explanatory variables be handled by row in machine learning?
I have a problem while making a predictive model, so I'm leaving a question.
I'm trying to create a predictive model using machine learning methodologies such as random forest, xgboost, etc.
At this ...
-4
votes
1
answer
64
views
Is there an Excel function to assign a binary result to a predefined data cell?
Sorry for the title, I know it might be rather broad and not very informative. I am facing a problem regarding the analysis of a data set. The participants of my experiments were randomly assigned ...
0
votes
0
answers
95
views
Deduplication of text for a large corpus
I have a large csv file with about 7000 rows (files) with text entries consisting of the following columns in bold:
filename title text author year
0 latin_xmls\10.xml De facto Ungarie ...
0
votes
1
answer
382
views
Filtering Pandas DataFrame by Substring Match at Start of Strings [duplicate]
I'm trying to filter out rows in which the value in a specific column starts with a given substring.
I have a pandas.DataFrame as shown below (simplified):
price  DRUG_CODE
123    A12D958
234    B564F3C
...    ...
I'm ...
0
votes
0
answers
28
views
Mock testing in Python shows an error "no module found named path.csv"
class TestFilterDF(unittest.TestCase):
    @patch('plugins.qa_plugins.preprocessing.read_df')
    def test_filter_df(self, read_df_mock):
        # Mocking read_df function to return a DataFrame ...
0
votes
1
answer
35
views
Sklearn Column Transformer not working for mixed data types
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ...
0
votes
1
answer
164
views
Machine Learning model dependent on one feature
I am training a model for crop yield prediction having a self-constructed dataset of 6 features and 2000 records.
However, the dataset is biased and I am not getting accurate results. I have tried ...
0
votes
0
answers
41
views
Failed to convert a NumPy array to a Tensor for LSTM
I'm trying to run an LSTM model where the data is split across a few columns in CSV files, and I'm trying to prepare the data from those CSVs.
I get the error:
ValueError: Failed to convert a NumPy array to a ...
0
votes
1
answer
49
views
Why do I get "inf" values after replacing NaN values in my dataset with the mean or median?
I'm using Python and I have a dataset containing NaN values. To clean up these data, I replaced the NaN values with the mean or median of each column using the fillna() function from pandas. However, ...
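The excerpt does not show the data, so the cause is uncertain. One common explanation, assumed here, is that the column already contains `inf`: `fillna` only targets NaN, and the mean of a column containing `inf` is itself `inf`. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical column containing both a NaN and an inf.
df = pd.DataFrame({"x": [1.0, np.nan, np.inf, 3.0]})

# Naive imputation: the mean is inf, so the NaN is replaced by inf as well.
naive = df["x"].fillna(df["x"].mean())

# Treat +/-inf as missing first, then impute with the mean of the finite values.
cleaned = df["x"].replace([np.inf, -np.inf], np.nan)
cleaned = cleaned.fillna(cleaned.mean())
print(cleaned.tolist())
```

Checking `np.isinf(df.select_dtypes("number")).sum()` before imputing is one way to confirm whether this assumption applies to a given dataset.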
1
vote
1
answer
2k
views
Why is my GPU not being used despite having turned it on in Kaggle?
I've uploaded a dataset on Kaggle (approx. 73 GB), and I'm trying to preprocess this data for model training purposes. This dataset has a large number of missing values, which I am trying to interpolate ...
0
votes
0
answers
142
views
ValueError: could not convert string to float: 'M'
I'm trying to make an ANN in Python to predict something from a dataset (in this case diabetes), and I'm struggling to figure out how to solve this error.
Here is the full code:
import pandas as pd
...
0
votes
1
answer
638
views
Issue when padding and packing sequences in LSTM networks using PyTorch
I'm trying to make a simple LSTM neural network. I've got time series data which I am splitting into sequences and batches using PyTorch's Dataset and DataLoader. To account for the variable lengths ...
0
votes
0
answers
88
views
How to pass arguments (labels) to the map function map_func, callable from tf.data.Dataset.map() in Python
There is a known method for creating a dataset:
The code snippet was borrowed from: https://www.tensorflow.org/tutorials/audio/simple_audio
#Gather data from files
'''
.....some code I see no need to paste, ...
0
votes
0
answers
61
views
TypeError: Cannot do positional indexing on RangeIndex with these indexers of type DataFrame
I'm new to Python, so I'm sorry if this is a basic question. However, after I ran the code, I got this:
TypeError: cannot do positional indexing on RangeIndex with these indexers [ Year Average of PM ...
0
votes
1
answer
66
views
Using tft.scale_to_gaussian for preprocessing a dataset without using other tensorflow operations
I'm working on a project where I have a set of longtail data that I want to transform into a Gaussian distribution. I'm looking to achieve something similar to scikit-learn's PowerTransformer, but ...
0
votes
1
answer
107
views
Feature Scaling with MinMaxScaler()
I have 31 features to be input into an ML algorithm. Of these, 22 feature values are already in the range 0 to 1. The remaining 9 features vary between 0 and 750. My question is: if I choose to apply ...
0
votes
1
answer
35
views
How to separate items in dataset in python?
I scraped reviews from a website, and the pros and cons are listed separately from each other. I scraped them as a list because it looked like the best solution for not having the same review with user, date ...
1
vote
1
answer
38
views
Using sklearn where the label is a combination of multiple inputs [closed]
I'm performing data analysis on a dataset whose categorical labels are interrelated.
My labels track experimental conditions.
In my case, labels track concentrations of combinations of two chemicals ...
12
votes
2
answers
81
views
Issues with Data Preprocessing and Changing Type of DataFrame Columns
I defined student_sub_set dataframe as below:
# select the subset of characteristics for the regression
student_sub_set = student[['acad_lang_home', 'absent_freq','tired_freq','sex',
...
0
votes
0
answers
100
views
Sklearn inverse_transformation does not work as expected, any alternatives?
from sklearn.preprocessing import MinMaxScaler
values = df[['Close']] #values is floats ranging from 0.06 to 190.08
sc = MinMaxScaler()
scaled_values = sc.fit_transform(values)
descaled_values = sc....
0
votes
1
answer
71
views
How to Extract specific data from Text File Using Python
I have a data file that has a geometrical combination as the heading and the following related data generated from the software. The data file has the following structure.
The data file starts from ...
0
votes
0
answers
64
views
Is there a faster method to process pandas list of string values
There are approximately 13,000 values in a given column. The function below takes a list of strings as input and does NER tagging for each word in the list. On average there ...
1
vote
0
answers
878
views
How do LlamaIndex and LangChain Differ in Terms of Data Preprocessing for LLM Applications?
I've been exploring frameworks to integrate large language models (LLMs) into my applications, specifically focusing on data preprocessing, ingestion, and query capabilities. I've come across both ...
0
votes
0
answers
93
views
Worse performance with increased direct_num_workers when running preprocessing of DLRM with Apache Beam
I am trying to run the DLRM preprocessing tasks with Apache Beam: https://github.com/tensorflow/models/tree/master/official/recommendation/ranking/preprocessing. The dataset is Criteo Kaggle 10GB ...
2
votes
1
answer
8k
views
How to Train a YOLO Model with Locally Downloaded Open Images Dataset?
I have downloaded the Open Images dataset to train a YOLO (You Only Look Once) model for a computer vision project. However, I am facing some challenges and I am seeking guidance on how to proceed. ...