1. Basic concepts:
A Support Vector Machine finds a hyperplane that separates the training samples into two classes, so SVM is used for binary classification.
If there exists a hyperplane that separates all the samples without error, the data is called linearly separable. If a few samples cannot be separated correctly, the data is linearly inseparable.
1.1 Dataset:
$$T = \{(x_1, y_1), \dots, (x_N, y_N)\}$$
where $x \in \mathbb{R}^m$ is an m-dimensional feature vector and $y \in \{-1, +1\}$ is the label.
Assume the optimal separating hyperplane we are looking for is
$$w^* \cdot x + b^* = 0$$
where $^*$ denotes the optimal parameters.
We substitute a new sample into the left-hand side of the equation: a value greater than 0 indicates the positive class, and a value less than 0 indicates the negative class.
1.2 Functional margin:
The functional margin $\hat{\gamma}$ measures the distance between a sample point and the separating hyperplane:
$$\hat{\gamma}_i = y_i (w \cdot x_i + b)$$
$y_i$ supplies the sign, because the label can only be $+1$ or $-1$.
Because $|w^* \cdot x + b^*|$ represents the relative distance to the hyperplane, multiplying it by the sign of the label tells us whether the classification is correct:
If $\hat{\gamma} > 0$, the classification is correct, and larger values indicate higher confidence.
If $\hat{\gamma} < 0$, the sign of $w^* \cdot x + b^*$ differs from the sign of the label, i.e. the sample is misclassified.
1.3 Geometric margin:
$$\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$$
where $\|w\|$ is the L2 norm of $w$.
What we need to do is find a hyperplane that separates the training set into two classes while maximizing the geometric margin.
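A minimal sketch of both margins in code (NumPy; the names `w`, `b`, `x_i`, `y_i` are illustrative and not tied to any particular implementation):

```python
import numpy as np

def margins(w, b, x_i, y_i):
    """Functional and geometric margin of one sample w.r.t. the hyperplane w.x + b = 0."""
    functional = y_i * (np.dot(w, x_i) + b)      # hat{gamma}_i = y_i (w . x_i + b)
    geometric = functional / np.linalg.norm(w)   # gamma_i = hat{gamma}_i / ||w|| (L2 norm)
    return functional, geometric

# A correctly classified sample has positive margins:
w, b = np.array([2.0, -1.0]), 0.5
print(margins(w, b, np.array([1.0, 0.0]), +1))   # (2.5, ~1.118)
```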
2. The SMO algorithm:
The basic idea of SVM determines the objective function we want to optimize, namely maximizing the geometric margin. During optimization we take the reciprocal, so the problem becomes minimizing the objective function.
Next, the objective function is optimized with the Lagrange multiplier method, which requires introducing a new set of parameters $\alpha$:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i$$
The original problem is a min-max problem:
$$\min_{w, b} \max_{\alpha} L(w, b, \alpha)$$
If the conditions for duality are satisfied, it can be transformed into the Lagrangian dual problem, i.e. a max-min problem:
$$\max_{\alpha} \min_{w, b} L(w, b, \alpha)$$
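As a brief aside (the full derivation is in [1]): the inner minimization over $w$ and $b$ is solved by setting the partial derivatives of $L$ to zero, which is also why $g(x)$ in the SMO steps below can be written as a sum over $\alpha_i y_i K(x_i, x)$. A sketch of those two stationarity conditions:

```latex
% Setting the partial derivatives of L(w, b, alpha) to zero:
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0
    \quad\Rightarrow\quad w = \sum_{i=1}^{N} \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0
    \quad\Rightarrow\quad \sum_{i=1}^{N} \alpha_i y_i = 0
```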
For the derivation of these formulas, I recommend Li Hang's "Statistical Learning Methods" [1]. The process is fairly involved, so I won't reproduce it in detail here.
2.1 Algorithm Process
1. Initialize the $\alpha$ parameter set. The number of $\alpha$ values equals the number of training samples; each $\alpha_i$ corresponds to one pair $(x_i, y_i)$.
2. Select two multipliers $\alpha_1$ and $\alpha_2$, and record their indices $id_1$ and $id_2$ (use the selection strategy in section 2.2, or random selection).
3. Use $\alpha_1$ and $\alpha_2$ to compute the errors $E_1$ and $E_2$:
   $$g(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b, \qquad E_i = g(x_i) - y_i$$
   where $K(x_i, x)$ is the kernel function. Note that computing $g(x)$ uses all of $\alpha_1, \dots, \alpha_N$.
4. Compute the clipping range $[L, H]$. Here $y_1, y_2$ are the labels of the samples $x_1, x_2$ corresponding to $\alpha_1, \alpha_2$.
   If $y_1 \ne y_2$:
   $$L = \max(0, \alpha_2^{old} - \alpha_1^{old}), \qquad H = \min(C, C + \alpha_2^{old} - \alpha_1^{old})$$
   If $y_1 = y_2$:
   $$L = \max(0, \alpha_2^{old} + \alpha_1^{old} - C), \qquad H = \min(C, \alpha_2^{old} + \alpha_1^{old})$$
   $C$ is a hyperparameter, the penalty parameter; the larger $C$ is, the heavier the penalty for misclassification.
5. Compute $\eta$:
   $$\eta = K_{11} + K_{22} - 2K_{12}, \qquad K_{ij} = K(x_i, x_j)$$
   Note that $\eta$ may be 0 here, so guard against division by zero.
6. Compute $\alpha_2^{new}$, clip it to the range $[L, H]$, and then compute $\alpha_1^{new}$:
   $$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta}$$
   $$\alpha_2^{new} = \mathrm{clip}(\alpha_2^{new}, L, H)$$
   $$\alpha_1^{new} = \alpha_1^{old} + y_1 y_2 (\alpha_2^{old} - \alpha_2^{new})$$
7. Update $b$:
   $$b_1^{new} = -E_1 - y_1 K_{11}(\alpha_1^{new} - \alpha_1^{old}) - y_2 K_{21}(\alpha_2^{new} - \alpha_2^{old}) + b^{old}$$
   $$b_2^{new} = -E_2 - y_1 K_{12}(\alpha_1^{new} - \alpha_1^{old}) - y_2 K_{22}(\alpha_2^{new} - \alpha_2^{old}) + b^{old}$$
   $$b = (b_1^{new} + b_2^{new}) / 2$$
   If $\alpha_1^{new}$ and $\alpha_2^{new}$ are both inside $(0, C)$, then $b_1^{new} = b_2^{new}$. If either $\alpha$ lies on the boundary (0 or $C$), any value between $b_1^{new}$ and $b_2^{new}$ is valid, so we take the midpoint. That is why the update formula is $b = (b_1^{new} + b_2^{new})/2$.
8. Return to step 2 and repeat until convergence (a code sketch of a single update step follows this list).
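To make the steps concrete, here is a minimal sketch of one SMO update in NumPy. It assumes a precomputed kernel matrix `K`, the current `alpha` array and bias `b`, and follows the formulas above; all names are illustrative and this is not the exact code from the linked repository.

```python
import numpy as np

def smo_step(i1, i2, alpha, b, y, K, C, eps=1e-12):
    """One SMO update of alpha[i1], alpha[i2] and b; returns the new (alpha, b)."""
    # Step 3: errors E_i = g(x_i) - y_i, with g(x) = sum_j alpha_j y_j K(x_j, x) + b
    E1 = np.sum(alpha * y * K[:, i1]) + b - y[i1]
    E2 = np.sum(alpha * y * K[:, i2]) + b - y[i2]

    # Step 4: clipping range [L, H]
    if y[i1] != y[i2]:
        L = max(0.0, alpha[i2] - alpha[i1])
        H = min(C, C + alpha[i2] - alpha[i1])
    else:
        L = max(0.0, alpha[i2] + alpha[i1] - C)
        H = min(C, alpha[i2] + alpha[i1])
    if L >= H:
        return alpha, b                      # nothing to optimize for this pair

    # Step 5: eta = K11 + K22 - 2*K12, guarding against division by zero
    eta = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
    if eta <= eps:
        return alpha, b

    # Step 6: update and clip alpha_2, then update alpha_1
    a1_old, a2_old = alpha[i1], alpha[i2]
    a2_new = np.clip(a2_old + y[i2] * (E1 - E2) / eta, L, H)
    a1_new = a1_old + y[i1] * y[i2] * (a2_old - a2_new)

    # Step 7: update b via the midpoint of b1_new and b2_new
    b1 = -E1 - y[i1] * K[i1, i1] * (a1_new - a1_old) - y[i2] * K[i2, i1] * (a2_new - a2_old) + b
    b2 = -E2 - y[i1] * K[i1, i2] * (a1_new - a1_old) - y[i2] * K[i2, i2] * (a2_new - a2_old) + b
    b_new = (b1 + b2) / 2.0

    alpha = alpha.copy()
    alpha[i1], alpha[i2] = a1_new, a2_new
    return alpha, b_new
```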
2.2 Selection Strategy
2.2.1 Selection of $\alpha_1$:
KKT conditions:
$$\alpha_i = 0 \Rightarrow y_i g(x_i) \ge 1 \quad (a)$$
$$0 < \alpha_i < C \Rightarrow y_i g(x_i) = 1 \quad (b)$$
$$\alpha_i = C \Rightarrow y_i g(x_i) \le 1 \quad (c)$$
Since our $\alpha_i$ values are always clipped to the range $[0, C]$, a KKT violation can only fall into cases (a) and (c).
For the two cases (a) and (c), the error $E$ is:
$$E_i = g(x_i) - y_i$$
We can simplify:
$$y_i g(x_i) - 1 = y_i (g(x_i) - y_i) = y_i E_i$$
$y_i$ is $\pm 1$, so $y_i^2 = 1$ and the expression simplifies as above.
So the above KKT conditions can be written as:
$$y_i E_i \ge 0, \; \alpha_i = 0 \quad (a)$$
$$y_i E_i \le 0, \; \alpha_i = C \quad (c)$$
There are two situations that violate the KKT conditions:
Violation of (a): $y_i E_i > 0$ and $\alpha_i > 0$.
Violation of (c): $y_i E_i < 0$ and $\alpha_i < C$.
Why is case (b) not listed?
Case (b) corresponds to $y_i E_i = 0$, which already satisfies the KKT conditions, and $\alpha_i$ itself is always kept inside $[0, C]$; so we only need to consider the two situations where $y_i E_i \ne 0$.
The KKT conditions as stated are too strict, so a tolerance can be introduced:
When $|y_i E_i| < tol$, we treat it as $y_i E_i = 0$; that is, the violation checks compare $y_i E_i$ against $tol$ rather than against exactly 0.
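A minimal sketch of this $\alpha_1$ selection check, assuming a cached error array `E` (names are illustrative):

```python
import numpy as np

def violates_kkt(i, alpha, E, y, C, tol=1e-3):
    """True if sample i violates the simplified KKT conditions above (within tolerance tol)."""
    r = y[i] * E[i]                         # y_i * E_i = y_i * g(x_i) - 1
    # Violation of (a): y_i E_i > 0 while alpha_i > 0 (it should be 0)
    # Violation of (c): y_i E_i < 0 while alpha_i < C (it should be C)
    return (r > tol and alpha[i] > 0) or (r < -tol and alpha[i] < C)

def alpha1_candidates(alpha, E, y, C, tol=1e-3):
    """Indices of all samples that violate KKT -- the candidates for alpha_1."""
    return [i for i in range(len(alpha)) if violates_kkt(i, alpha, E, y, C, tol)]
```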
2.2.2 Selection of $\alpha_2$:
$\alpha_2$ is chosen so that its update produces as large a change as possible:
$$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta}$$
We can see that the change in $\alpha_2$ depends on $|E_1 - E_2|$, so we select the sample with the largest $|E_1 - E_2|$. In the implementation we can maintain an array of $E$ values to reduce the cost of each selection.
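A minimal sketch of this heuristic over a cached error array `E` (illustrative names again):

```python
import numpy as np

def pick_alpha2(i1, E):
    """Choose alpha_2 as the index maximizing |E_1 - E_2|, never reusing i1 itself."""
    diffs = np.abs(np.asarray(E, dtype=float) - E[i1])
    diffs[i1] = -np.inf                     # exclude alpha_1's own index
    return int(np.argmax(diffs))
```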
2.3 Prediction:
When making predictions, we use $\mathrm{sign}(g(x))$:
$$\mathrm{sign}(x) = \begin{cases} -1 & x < 0 \\ +1 & x > 0 \end{cases}$$
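A minimal prediction sketch (the `kernel` argument and other names are illustrative):

```python
import numpy as np

def predict(x, X_train, y_train, alpha, b, kernel):
    """Predict the label of x as sign(g(x)), with g(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    g = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b
    return 1 if g > 0 else -1

# Example kernel: a plain linear kernel K(u, v) = u . v
linear = lambda u, v: float(np.dot(u, v))
```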
3. Code implementation:
Simple code implementation and testing:
GitHub.
4. Reference
[1] Li Hang. Statistical Learning Methods, 2nd Edition.