Gradient Descent Optimization
SKKU Data Mining Lab
Hojin Yang
Index
Gradient Descent Method – batch, mini-batch, and stochastic variants
Problem cases of GD
Gradient Descent Optimization – Momentum, Adagrad, RMSprop, Adam
Intro

Data (Experience):

X     Y
2     4
3     6
4     7.5
5     10
3.2   6.5
10    20
11    23
20    40

Hypothesis (Task): h_θ(x) = θx
Loss function (performance measure): J(θ) = (1/(2·8)) Σ (h_θ(x) − y)²

[Plot: J(θ) versus θ]
Gradient Descent Method
A first-order iterative optimization algorithm for finding the minimum of a loss function. It takes steps proportional to the negative of the gradient of the function at the current point:

θ := θ − η · ∇_θ J(θ)

η : learning rate
J(θ) : loss function
∇_θ J(θ) : gradient of J with respect to θ

[Plot: J(θ) versus θ, with steps descending toward the minimum]
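As a minimal sketch (not from the slides), one update of this rule on a toy quadratic loss J(θ) = θ²:

theta = 3.0           # initial parameter
lr = 0.1              # learning rate (eta)
grad = 2 * theta      # dJ/dtheta for J(theta) = theta**2
theta = theta - lr * grad
print(theta)          # 3.0 - 0.1 * 6.0 = 2.4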
Gradient Descent Method
• Batch gradient descent: use all m examples in each iteration
• Stochastic gradient descent: use 1 example in each iteration
• Mini-batch gradient descent: use b examples in each iteration

For the data above (m = 8), the loss and its derivative are

J(θ) = (1/(2·8)) Σ (θx − y)²
J′(θ) = (1/8) Σ (θx − y) · x

Batch gradient descent uses all 8 examples in each update θ := θ − η · J′(θ):

θ := θ − η · (1/8) · {(2θ − 4)·2 + (3θ − 6)·3 + ⋯ + (20θ − 40)·20}

Stochastic gradient descent uses a single example (x, y), randomly selected at each iteration, so J′(θ) = (θx − y) · x. For example, with (x, y) = (3.2, 6.5):

θ := θ − η · (3.2θ − 6.5) · 3.2

Mini-batch gradient descent uses b randomly selected examples at each iteration, J′(θ) = (1/b) Σ (θx − y) · x. For example, with b = 2 and the examples (4, 7.5) and (3.2, 6.5):

θ := θ − η · (1/2) · {(4θ − 7.5)·4 + (3.2θ − 6.5)·3.2}
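As a rough sketch (not part of the deck), here are all three variants fitting h_θ(x) = θx on the table above with NumPy; the learning rate and iteration count are illustrative:

import numpy as np

X = np.array([2, 3, 4, 5, 3.2, 10, 11, 20])
Y = np.array([4, 6, 7.5, 10, 6.5, 20, 23, 40])

def grad(theta, x, y):
    # J'(theta) = mean over the chosen examples of (theta*x - y) * x
    return np.mean((theta * x - y) * x)

theta = 0.0
for _ in range(100):                          # batch: all m = 8 examples
    theta -= 0.001 * grad(theta, X, Y)
print("batch:", theta)

theta = 0.0
for _ in range(100):                          # stochastic: 1 random example
    i = np.random.randint(len(X))
    theta -= 0.001 * grad(theta, X[i:i+1], Y[i:i+1])
print("SGD:", theta)

theta = 0.0
for _ in range(100):                          # mini-batch: b = 2 random examples
    idx = np.random.choice(len(X), 2, replace=False)
    theta -= 0.001 * grad(theta, X[idx], Y[idx])
print("mini-batch:", theta)

All three estimates approach θ ≈ 2, the slope that fits the table.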
Gradient Descent Method
[Plots over successive iterations of J(θ) = (1/(2·8)) Σ (θx − y)² versus θ on the data above: first the smooth path of batch gradient descent, then the noisier path of stochastic gradient descent toward the minimum.]
https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/GD-v-SGD.png?x64257
https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0410.png
Gradient Descent Method
With m training examples, the cost per iteration is:
Batch: 𝒪(m)
Mini-batch (with batch size k): 𝒪(k)
Stochastic: 𝒪(1)
Python class (SGD):

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # Vanilla update: param := param - lr * gradient
        for key in params.keys():
            params[key] -= self.lr * grads[key]
Test function: w = x²/20 + y², learning rate = 0.95, iter = 30
Source: https://github.com/WegraLee
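A hypothetical driver for the SGD class above on this test function; the analytic gradient and the starting point (−7.0, 2.0) are assumptions for illustration:

import numpy as np

def w_grads(params):
    # Analytic gradient of w = x**2 / 20 + y**2
    return {'x': params['x'] / 10.0, 'y': 2.0 * params['y']}

opt = SGD(lr=0.95)                                   # learning rate from the slide
params = {'x': np.array(-7.0), 'y': np.array(2.0)}   # assumed starting point
for _ in range(30):                                  # iter = 30 as on the slide
    opt.update(params, w_grads(params))
print(params)  # zig-zags along y, crawls along x toward the minimum at (0, 0)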
Gradient Descent Problem

data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30

h_θ(x) = x₁θ₁ + x₂θ₂

J(θ) = (1/3) Σ (h_θ(x) − y)²
     = (1/3) { (1·θ₁ + 100·θ₂ − 10)² + (2·θ₁ + 200·θ₂ − 20)² + (3·θ₁ + 300·θ₂ − 30)² }
     = (1/3) { 14·θ₁² + ⋯ + 140000·θ₂² + ⋯ }

(since 1² + 2² + 3² = 14 and 100² + 200² + 300² = 140000, the loss is about 10,000 times more curved along θ₂ than along θ₁)

[Plots: J(θ) versus θ₁ (flat) and J(θ) versus θ₂ (steep)]
Each update θ := θ − η · ∇_θ J(θ) moves a parameter by −1 · slope · learning rate:

iter 1: slope along θ₁: −0.1 → step 0.1·η     slope along θ₂: −10 → step 10·η
iter 2: slope along θ₁: −0.05 → step 0.05·η   slope along θ₂: 15 → step −15·η

The flat direction (θ₁) crawls toward the minimum while the steep direction (θ₂) overshoots it and oscillates, flipping sign from one iteration to the next.
[Contour plots of w = x²/20 + y²: with learning rate 0.95 the path zig-zags along y but converges; with learning rate 1.01 it diverges.]
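A quick numeric check of the two learning rates (a sketch; the starting point is an assumption):

def run(lr, steps=30, x=-7.0, y=2.0):
    for _ in range(steps):
        x -= lr * (x / 10.0)   # dw/dx = x/10
        y -= lr * (2.0 * y)    # dw/dy = 2y
    return x, y

print(run(0.95))  # y oscillates but shrinks: |1 - 0.95*2| = 0.9 < 1
print(run(1.01))  # y blows up: |1 - 1.01*2| = 1.02 > 1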
Feature Scaling

data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30

Before scaling: 1 ≤ X1 ≤ 3, 100 ≤ X2 ≤ 300
After scaling:  0 ≤ X1 ≤ 1, 0 ≤ X2 ≤ 1

https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
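One way to get these ranges is min-max scaling; a sketch (the slide does not name the exact method):

import numpy as np

X = np.array([[1, 100],
              [2, 200],
              [3, 300]], dtype=float)   # columns X1, X2
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)  # both columns now lie in [0, 1]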
Gradient Descent Optimization
Momentum (inertia)
Main idea:
- Remember the movement in the past
- Reflect it in the current movement

Offset effect: past + current = (opposing directions partially cancel)
Accelerate effect: past + current = (aligned directions add up)
Momentum (inertia)
Keeps a proportion of the previous movement (γ : usually about 0.9). Consistent with the code below, the update is:

v := γ·v − η · ∇_θ J(θ)
θ := θ + v
Revisiting the earlier example (each step is −1 · slope · learning rate):

iter 1: slopes −10 and −0.1 → steps 10·η and 0.1·η (stored as the velocity v)

iter 2 (vanilla GD): slopes 15 and −0.05 → steps −15·η and 0.05·η

iter 2 with momentum:
before adding the past step:  −15·η and 0.05·η
0.9 × past step:              9·η and 0.09·η
after adding the past step:   −6·η and 0.14·η

Offset effect: the oscillating steep direction shrinks (−15·η → −6·η).
Accelerate effect: the consistently signed flat direction grows (0.05·η → 0.14·η).

Momentum (inertia)
Because of the accumulated velocity, we can expect the parameters to move out of a shallow local minimum toward a better one. (Avoiding Local Minima. Picture from http://www.yaldex.com.)
Needs twice the memory (stores v alongside θ).
Python class (Momentum):

import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # Initialize one velocity per parameter, starting at zero
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            # v := gamma * v - lr * gradient;  theta := theta + v
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]

Source: https://github.com/WegraLee
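A hypothetical check that the class reproduces the slide's two-iteration walkthrough (with η = 1 the velocities equal the step sizes shown above):

import numpy as np

opt = Momentum(lr=1.0, momentum=0.9)
params = {'a': np.array(0.0), 'b': np.array(0.0)}

opt.update(params, {'a': np.array(-10.0), 'b': np.array(-0.1)})
print(opt.v)   # a: 10.0, b: 0.1  -> iter 1 steps 10*eta and 0.1*eta

opt.update(params, {'a': np.array(15.0), 'b': np.array(-0.05)})
print(opt.v)   # a: -6.0, b: 0.14 -> 0.9 * past step + current step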
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
Adagrad (Adaptive Gradient)
Main idea:
- Increase the learning rate of variables that have not changed much so far
- Decrease the learning rate of variables that have changed a lot so far

θ := θ − η · ∇_θ J(θ)   —   the fixed learning rate η becomes adaptive!

[Plots: J(θ) versus θ₁ and J(θ) versus θ₂, as in the problem case above]
Adagrad (Adaptive Gradient)
Accumulates the square of the gradient; consistent with the code below, the update is:

G := G + (∇_θ J(θ))²
θ := θ − (η / (√G + ε)) · ∇_θ J(θ)

As the accumulated value G increases, the effective learning rate decreases.
Revisiting the example once more (the amount subtracted is slope · learning rate):

iter 1 (vanilla GD): slopes −10 and −0.1 → subtract −10·η and −0.1·η

iter 1 (Adagrad): cache₂ = 10², cache₁ = 0.1²
−10 · (η / √cache₂) = −η
−0.1 · (η / √cache₁) = −η
Both parameters move by the same magnitude despite slopes that differ by a factor of 100.

iter 2 (after update): slopes 0.3 and −0.08
cache₂ = 10² + 0.3², cache₁ = 0.1² + 0.08²
steps 0.3 · (η / √cache₂) and −0.08 · (η / √cache₁)
Python class (AdaGrad):

import numpy as np

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            # Initialize the accumulated squared gradient to zero
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

Source: https://github.com/WegraLee
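A hypothetical check of the walkthrough's first iteration: both parameters move by about η regardless of slope magnitude:

import numpy as np

opt = AdaGrad(lr=1.0)
params = {'a': np.array(0.0), 'b': np.array(0.0)}
opt.update(params, {'a': np.array(-10.0), 'b': np.array(-0.1)})
print(params)  # both ~ +1.0: step = -lr * slope / sqrt(slope**2) = -lr * sign(slope)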
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
RMSProp
- The accumulator G, which Adagrad builds by summing squared gradients, is replaced with an exponential average.
- This maintains the relative size differences between the variables' recent amounts of change without letting G grow indefinitely:

G := γ·G + (1 − γ) · (∇_θ J(θ))²
θ := θ − (η / (√G + ε)) · ∇_θ J(θ)

https://www.google.co.kr/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&ved=0ahUKEwi0uszPs7_YAhVFybwKHcWRDfYQjhwIBQ&url=https%3A%2F%2Finsidehpc.com%2F2015%2F06%2Fpodcast-geoffrey-hinton-on-the-rise-of-deep-learning%2F&psig=AOvVaw1Tpp31PE1Bg2r8cpN4KDUn&ust=1515192917829215
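The deck shows no Python code for RMSProp; here is a sketch in the same style as the classes above, with an assumed decay rate of 0.99:

import numpy as np

class RMSprop:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            # Exponential moving average of squared gradients instead of
            # AdaGrad's unbounded sum, so h cannot grow without limit.
            self.h[key] = self.decay_rate * self.h[key] + \
                          (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)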
Adam (Adaptive Moment Estimation)
Hybrid of Momentum and RMSprop.

Momentum part — exponential average of previous slopes:
m := β₁·m + (1 − β₁) · ∇_θ J(θ)

RMSprop part — exponential average of previous squared slopes:
v := β₂·v + (1 − β₂) · (∇_θ J(θ))²

In Adam, m and v are initialized to 0, so early in training m_t and v_t are biased toward 0; Adam therefore applies a correction to make them unbiased. Expanding m_t and v_t as sums and taking the expectation of both sides gives the corrections

m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ),

and the computation proceeds with m̂_t in place of the gradient and v̂_t in place of G_t.

(β₁, β₂ : usually about 0.9 and 0.999)
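The deck gives no NumPy code for Adam either; here is a sketch of the standard update with bias correction, in the same style as the classes above (an illustration, not the author's code):

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.t = 0
        self.m = None   # exponential average of gradients (Momentum part)
        self.v = None   # exponential average of squared gradients (RMSprop part)

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.t += 1
        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            # m and v start at 0 and are biased toward 0 early on;
            # dividing by (1 - beta**t) makes the estimates unbiased.
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-7)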
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
Comparison of Adam to other optimization algorithms training a multilayer perceptron:
https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/05/Comparison-of-Adam-to-Other-Optimization-Algorithms-Training-a-Multilayer-Perceptron.png

No single optimizer is best in every case; use Adam in most cases.