Data normalization to use as input in ML algorithms

I would like some suggestions on the best way to normalize my dataset for use as ML input.

My dataset looks like this:

        -------------------------------------------------------------------------
        |   date   | holiday | weekday |  type  | max_temp | min_temp |   qty   |
        -------------------------------------------------------------------------
     1  | 01/31/22 |    0    |   tue   | casual |   35.25  |  23.44   |  1,358  |
     2  | 07/02/21 |    1    |   mon   | member |   34.33  |   7.29   |  1,358  |
     3  | 03/12/20 |    0    |   sat   | casual |   12.21  |   2.18   |  1,358  |
    ...
     n

I'm using Python to clean the data, and I intend to fit linear regression, random forest, and XGBoost models on this dataset to predict the last column (qty).

Any suggestions for the best practice to prepare my data?

asked Sep 15, 2022 at 23:19

Curiel


From previous experience, I know it's possible to turn all the categorical data into dummy variables, but is that really the best practice? I'm thinking of creating extra columns filled with 0 and 1 for the different categories, but I'm not sure about increasing the size of the dataset.

– Curiel, Sep 15, 2022 at 23:22

The question is too broad. I recommend the book Hands-On Machine Learning by Aurélien Géron. Regards.

– Luis Alejandro Vargas Ramos, Sep 15, 2022 at 23:57

After searching, I found that a good approach is to convert the categorical data into dummy variables with "pd.get_dummies()". I will try that.

– Curiel, Sep 16, 2022 at 0:59

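For anyone following along, here is a minimal sketch of what the "pd.get_dummies()" approach might look like on a frame shaped like the one in the question. The three rows are copied from the sample table; the choice of which columns to encode is an assumption, not something stated in the thread.

```python
import pandas as pd

# Toy frame with the same columns as the sample in the question
# (the date column is left out to keep the sketch short).
df = pd.DataFrame({
    "holiday": [0, 1, 0],
    "weekday": ["tue", "mon", "sat"],
    "type": ["casual", "member", "casual"],
    "max_temp": [35.25, 34.33, 12.21],
    "min_temp": [23.44, 7.29, 2.18],
    "qty": [1358, 1358, 1358],
})

# One 0/1 column per category; the numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["weekday", "type"], dtype=int)
print(encoded.columns.tolist())
```

For linear regression it may be worth passing `drop_first=True` so that the dummy columns of one feature are not perfectly collinear with the intercept. The extra columns only grow with the number of distinct categories, so for weekday and type the size increase stays small.
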
There are specific classes for that in scikit-learn. Have a look at OrdinalEncoder, OneHotEncoder, and LabelEncoder. Each one has a specific use. Regards.

– Luis Alejandro Vargas Ramos, Sep 16, 2022 at 1:48

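To illustrate the difference between the three encoders mentioned above, here is a rough sketch on the two categorical columns from the question. The weekday ordering passed to `OrdinalEncoder` is an assumption made for the example; whether an order is meaningful for a given feature is a modeling decision.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({
    "weekday": ["tue", "mon", "sat"],
    "type": ["casual", "member", "casual"],
})

# OneHotEncoder: one binary column per category, no order implied.
onehot = OneHotEncoder(handle_unknown="ignore")
X_onehot = onehot.fit_transform(X).toarray()

# OrdinalEncoder: one integer per category, following an explicit order
# (the Monday-first order here is an assumption for illustration).
ordinal = OrdinalEncoder(
    categories=[["mon", "tue", "wed", "thu", "fri", "sat", "sun"]]
)
weekday_codes = ordinal.fit_transform(X[["weekday"]])

# LabelEncoder: intended for a 1-D target column, not for input features.
type_labels = LabelEncoder().fit_transform(X["type"])
```

Unlike "pd.get_dummies()", a fitted `OneHotEncoder` remembers the categories it saw, so it produces the same columns when applied later to new data.
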
Thank you @LuisAlejandroVargasRamos. For my dataset, I think the most suitable approach is to turn the categorical data into dummies with OneHotEncoder. I tried that in my notebook and it looks good. Next I will check whether the ML models run OK with that approach.

– Curiel, Sep 16, 2022 at 2:29
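One possible way to check that the models run with one-hot encoded inputs is to wrap the encoding and the model in a single scikit-learn pipeline. The sketch below is only an illustration: the rows and qty values are made up, the column split is an assumption, and XGBoost is left out to avoid the extra dependency (its `XGBRegressor` could be slotted in as another model).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows in the shape of the question's table (illustrative values only).
df = pd.DataFrame({
    "holiday": [0, 1, 0, 0, 1, 0],
    "weekday": ["tue", "mon", "sat", "sun", "fri", "wed"],
    "type": ["casual", "member", "casual", "member", "member", "casual"],
    "max_temp": [35.25, 34.33, 12.21, 20.10, 28.40, 15.00],
    "min_temp": [23.44, 7.29, 2.18, 10.50, 18.20, 5.30],
    "qty": [1358, 980, 410, 1120, 1505, 630],
})
X, y = df.drop(columns="qty"), df["qty"]

# One-hot encode the categorical columns, standardize the temperatures,
# and pass the already-binary holiday flag through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weekday", "type"]),
        ("num", StandardScaler(), ["max_temp", "min_temp"]),
    ],
    remainder="passthrough",
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X, y)
    predictions[name] = pipe.predict(X)
```

Tree-based models such as random forests and XGBoost are insensitive to monotonic rescaling of a feature, so the `StandardScaler` step mainly matters for regularized or gradient-based linear models. In a real run the data would be split into training and test sets before fitting.
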
