1. Model
- A model can be understood as a function (a mapping rule) whose parameters are determined from training data. Once the parameters are determined, the model can be used to evaluate unknown data.
- The input data used to fit the model is called training data.
- Each attribute of a sample is called a feature (commonly denoted x), and the target output value of each sample is called a label (commonly denoted y).

2. Regression analysis
- Regression analysis is a statistical process for estimating the relationships between variables.
- It is used to explain the relationship between the independent variable X and the dependent variable Y, that is, how the dependent variable Y changes as the independent variable X changes.
2.1 Linear Regression
- Linear regression is a type of regression analysis in which the relationship between the independent variable X and the dependent variable Y is linear (the graph is a straight line or flat surface; the highest power of each independent variable is 1).
- Linear regression outputs a continuous value.
The linear regression function can be written as follows, where the x values are the independent variables (with x0 fixed at 1) and the w values are the corresponding weights (w0 is the intercept):

y = w0*x0 + w1*x1 + ... + wn*xn = wᵀx
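As a quick sketch, this weighted sum is just a dot product; the weights and feature values below are made up purely for illustration:

```python
import numpy as np

# hypothetical weights and features for illustration; w0 is the intercept,
# paired with the constant x0 = 1
w = np.array([3.0, 2.0, 0.5])   # w0, w1, w2
x = np.array([1.0, 4.0, 6.0])   # x0 = 1, x1, x2

y = np.dot(w, x)  # w0*x0 + w1*x1 + w2*x2
print(y)  # 3.0 + 8.0 + 3.0 = 14.0
```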

2.2 Fitting
- Fitting refers to constructing a mathematical function (an algorithm) that matches real data well.
- From the machine-learning perspective, linear regression constructs the linear function that best matches the target values.
- From the spatial perspective, the function's line (or surface) should be as close as possible to all data points, i.e. the sum of the distances from the points to the line, measured parallel to the y-axis, should be minimized.
- A loss function is a function of the error, used to measure the difference between the model's predicted values and the true values.
- The goal of machine learning is to minimize the value of the loss function.
- The loss function takes the model parameters w as its independent variables, and there are usually infinitely many possible combinations of parameter values. Our goal is to find, among all these combinations, the one that minimizes the value of the loss function.
In linear regression, the least squares method is used to define the loss function:

L(w) = Σ (yᵢ - ŷᵢ)²

where yᵢ is the true value and ŷᵢ = wᵀxᵢ is the predicted value for sample i.
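A minimal numeric sketch of this loss (the true values and predictions below are toy numbers, unrelated to the example further down):

```python
import numpy as np

# toy true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

# least-squares loss: sum of squared errors over all samples
loss = np.sum((y_true - y_pred) ** 2)
print(loss)  # 0.25 + 0.25 + 1.0 = 1.5
```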

3. Regression Model Evaluation
- After building a regression model, we can evaluate it with metrics such as MSE, RMSE, MAE, and R².
- MSE (Mean Squared Error): the mean of the squared errors over all samples, where the error is the difference between the true value and the predicted value.
- RMSE (Root Mean Squared Error): the square root of the MSE.
- MAE (Mean Absolute Error): the mean of the absolute values of the errors over all samples.
- R²: the coefficient of determination, a score of how well the model fits the data; higher values indicate a better fit.
- On the training set, R² takes values in [0, 1].
- On the test set, R² ranges from negative infinity to 1. In the ideal case, every predicted value equals the true value and R² equals 1.
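All four metrics are easy to compute by hand with NumPy; the toy values below are illustrative only:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # toy true values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # toy predictions

err = y_true - y_pred
mse = np.mean(err ** 2)       # mean squared error
rmse = np.sqrt(mse)           # root mean squared error
mae = np.mean(np.abs(err))    # mean absolute error
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
print(mse, rmse, mae, r2)
```

scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities.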
4. Simple linear regression
- When there is only one independent variable, it is called simple linear regression.
Using the Boston housing price dataset as an example, predict MEDV (median house price) from RM (average number of rooms). Note that load_boston was removed in scikit-learn 1.2, so this example requires an older scikit-learn version:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error, mean_absolute_error

boston = load_boston()  # load the Boston house price dataset
# concatenate the feature data and the target values
data = np.concatenate([boston.data, boston.target.reshape(-1, 1)], axis=1)
feature_names = boston.feature_names.tolist()
feature_names.append("MEDV")  # add "MEDV" (house price) as the column name for the target
df = pd.DataFrame(data, columns=feature_names)
df.sample(5)  # randomly sample 5 rows for a quick look at the data
x, y = boston.data[:, 5].reshape(-1, 1), boston.target  # RM (average number of rooms) as x, house price as y
# split into training and test sets; test_size sets the test-set proportion,
# random_state seeds the split so the same sequence of random numbers is generated
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
lr = LinearRegression()  # instantiate the linear regression class
lr.fit(x_train, y_train)  # train the model on the training set
print("model weight:", lr.coef_)
print("intercept:", lr.intercept_)
y_hat = lr.predict(x_test)  # the fitted weights and intercept determine the function; predict on the test set
print("actual values:", y_test[:5])
print("predicted values:", y_hat[:5])
print("mean squared error (MSE):", mean_squared_error(y_test, y_hat))
print("root mean squared error (RMSE):", np.sqrt(mean_squared_error(y_test, y_hat)))
print("mean absolute error (MAE):", mean_absolute_error(y_test, y_hat))
print("training set R^2:", lr.score(x_train, y_train))
print("test set R^2:", lr.score(x_test, y_test))
For the above example, the output is:

model weight: [9.31294923]
intercept: -36.180992646339185
actual values: [22.6 50.  23.   8.3 21.2]
predicted values: [22.7979148  21.70829974 23.17043277 13.63397276 21.85730693]
mean squared error (MSE): 43.472041677202206
root mean squared error (RMSE): 6.593333123481795
mean absolute error (MAE): 4.212526305455822
training set R^2: 0.48752067939343646
test set R^2: 0.46790005431367815
5. Multiple linear regression
- When there are multiple independent variables, it is called multiple linear regression.
Using the Boston housing price dataset again, predict MEDV (median house price) from all of the features:
x, y = boston.data, boston.target  # all features as x, house price as y
# split into training and test sets; test_size sets the test-set proportion,
# random_state seeds the split so the same sequence of random numbers is generated
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
lr = LinearRegression()  # instantiate the linear regression class
lr.fit(x_train, y_train)  # train the model on the training set
print("model weight:", lr.coef_)
print("intercept:", lr.intercept_)
y_hat = lr.predict(x_test)  # predict on the test set
print("actual values:", y_test[:5])
print("predicted values:", y_hat[:5])
print("mean squared error (MSE):", mean_squared_error(y_test, y_hat))
print("root mean squared error (RMSE):", np.sqrt(mean_squared_error(y_test, y_hat)))
print("mean absolute error (MAE):", mean_absolute_error(y_test, y_hat))
print("training set R^2:", lr.score(x_train, y_train))
print("test set R^2:", lr.score(x_test, y_test))
For the above example, the output is:

model weight: [-1.17735289e-01  4.40174969e-02 -5.76814314e-03  2.39341594e+00
 -1.55894211e+01  3.76896770e+00 -7.03517828e-03 -1.43495641e+00
  2.40081086e-01 -1.12972810e-02 -9.85546732e-01  8.44443453e-03
 -4.99116797e-01]
intercept: 36.933255457118925
actual values: [22.6 50.  23.   8.3 21.2]
predicted values: [24.95233283 23.61699724 29.20588553 11.96070515 21.33362042]
mean squared error (MSE): 29.78224509230234
root mean squared error (RMSE): 5.457311159564052
mean absolute error (MAE): 3.668330148135715
training set R^2: 0.7697699488741149
test set R^2: 0.6354638433202132
From the simple and multiple linear regression examples above, we can see that:
- Simple and multiple linear regression are solved in the same way; the only difference is the data (multiple linear regression has multiple independent variables).
- For the Boston housing price dataset, combining all of the features predicts better than using a single feature alone.
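That the same machinery handles one feature or many can also be seen from the closed-form least-squares solution, sketched below on synthetic data (the data and weights are made up; LinearRegression solves an equivalent least-squares problem internally):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: 100 samples, 3 features, known weights and intercept
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0  # intercept 4.0, no noise

# prepend a column of ones so the intercept is learned as w0;
# the same code works unchanged for a single feature or for many
X1 = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)  # recovers [4.0, 2.0, -1.0, 0.5] since the data has no noise
```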