Gradient Descent Optimization
SKKU Data Mining Lab
Hojin Yang
Index
Gradient Descent Method – batch, mini-batch, and stochastic variants
Problem cases of GD
Gradient Descent Optimization – Momentum, Adagrad, RMSprop, Adam
Intro

Data (Experience):

X     Y
2     4
3     6
4     7.5
5     10
3.2   6.5
10    20
11    23
20    40

Hypothesis (Task): h_θ(x) = θx
Loss function (performance measure): J(θ) = (1/(2·8)) Σ (h_θ(x) − y)²

[Plot: J(θ) versus θ]
Gradient Descent Method
A first-order iterative optimization algorithm for finding the minimum of a loss function. It takes steps proportional to the negative of the gradient of the function at the current point:

θ := θ − η · ∇_θ J(θ)

η : learning rate
J(θ) : loss function
∇_θ J(θ) : gradient of J with respect to θ

[Plot: J(θ) versus θ, with steps descending toward the minimum]
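As a minimal sketch (not from the slides), one update of this rule on a toy quadratic loss J(θ) = θ²:

theta = 3.0           # initial parameter
lr = 0.1              # learning rate (eta)
grad = 2 * theta      # dJ/dtheta for J(theta) = theta**2
theta = theta - lr * grad
print(theta)          # 3.0 - 0.1 * 6.0 = 2.4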
Gradient Descent Method
• Batch gradient descent: use all m examples in each iteration
• Stochastic gradient descent: use 1 example in each iteration
• Mini-batch gradient descent: use b examples in each iteration

For the data above (m = 8), the loss and its derivative are

J(θ) = (1/(2·8)) Σ (θx − y)²
J′(θ) = (1/8) Σ (θx − y) · x

Batch gradient descent uses all 8 examples in each update θ := θ − η · J′(θ):

θ := θ − η · (1/8) · {(2θ − 4)·2 + (3θ − 6)·3 + ⋯ + (20θ − 40)·20}

Stochastic gradient descent uses a single example (x, y), randomly selected at each iteration, so J′(θ) = (θx − y) · x. For example, with (x, y) = (3.2, 6.5):

θ := θ − η · (3.2θ − 6.5) · 3.2

Mini-batch gradient descent uses b randomly selected examples at each iteration, J′(θ) = (1/b) Σ (θx − y) · x. For example, with b = 2 and the examples (4, 7.5) and (3.2, 6.5):

θ := θ − η · (1/2) · {(4θ − 7.5)·4 + (3.2θ − 6.5)·3.2}
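As a rough sketch (not part of the deck), here are all three variants fitting h_θ(x) = θx on the table above with NumPy; the learning rate and iteration count are illustrative:

import numpy as np

X = np.array([2, 3, 4, 5, 3.2, 10, 11, 20])
Y = np.array([4, 6, 7.5, 10, 6.5, 20, 23, 40])

def grad(theta, x, y):
    # J'(theta) = mean over the chosen examples of (theta*x - y) * x
    return np.mean((theta * x - y) * x)

theta = 0.0
for _ in range(100):                          # batch: all m = 8 examples
    theta -= 0.001 * grad(theta, X, Y)
print("batch:", theta)

theta = 0.0
for _ in range(100):                          # stochastic: 1 random example
    i = np.random.randint(len(X))
    theta -= 0.001 * grad(theta, X[i:i+1], Y[i:i+1])
print("SGD:", theta)

theta = 0.0
for _ in range(100):                          # mini-batch: b = 2 random examples
    idx = np.random.choice(len(X), 2, replace=False)
    theta -= 0.001 * grad(theta, X[idx], Y[idx])
print("mini-batch:", theta)

All three estimates approach θ ≈ 2, the slope that fits the table.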
Gradient Descent Method
[Plots over successive iterations of J(θ) = (1/(2·8)) Σ (θx − y)² versus θ on the data above: first the smooth path of batch gradient descent, then the noisier path of stochastic gradient descent toward the minimum.]
https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/GD-v-SGD.png?x64257
https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0410.png
Gradient Descent Method
With m training examples, the cost per iteration is:
Batch: 𝒪(m)
Mini-batch (with batch size k): 𝒪(k)
Stochastic: 𝒪(1)
Python class (SGD):

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # Vanilla update: param := param - lr * gradient
        for key in params.keys():
            params[key] -= self.lr * grads[key]
Test function: w = x²/20 + y², learning rate = 0.95, iter = 30
Source: https://github.com/WegraLee
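A hypothetical driver for the SGD class above on this test function; the analytic gradient and the starting point (−7.0, 2.0) are assumptions for illustration:

import numpy as np

def w_grads(params):
    # Analytic gradient of w = x**2 / 20 + y**2
    return {'x': params['x'] / 10.0, 'y': 2.0 * params['y']}

opt = SGD(lr=0.95)                                   # learning rate from the slide
params = {'x': np.array(-7.0), 'y': np.array(2.0)}   # assumed starting point
for _ in range(30):                                  # iter = 30 as on the slide
    opt.update(params, w_grads(params))
print(params)  # zig-zags along y, crawls along x toward the minimum at (0, 0)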
Gradient Descent Problem

data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30

h_θ(x) = x₁θ₁ + x₂θ₂

J(θ) = (1/3) Σ (h_θ(x) − y)²
     = (1/3) { (1·θ₁ + 100·θ₂ − 10)² + (2·θ₁ + 200·θ₂ − 20)² + (3·θ₁ + 300·θ₂ − 30)² }
     = (1/3) { 14·θ₁² + ⋯ + 140000·θ₂² + ⋯ }

(since 1² + 2² + 3² = 14 and 100² + 200² + 300² = 140000, the loss is about 10,000 times more curved along θ₂ than along θ₁)

[Plots: J(θ) versus θ₁ (flat) and J(θ) versus θ₂ (steep)]
Each update θ := θ − η · ∇_θ J(θ) moves a parameter by −1 · slope · learning rate:

iter 1: slope along θ₁: −0.1 → step 0.1·η     slope along θ₂: −10 → step 10·η
iter 2: slope along θ₁: −0.05 → step 0.05·η   slope along θ₂: 15 → step −15·η

The flat direction (θ₁) crawls toward the minimum while the steep direction (θ₂) overshoots it and oscillates, flipping sign from one iteration to the next.
[Contour plots of w = x²/20 + y²: with learning rate 0.95 the path zig-zags along y but converges; with learning rate 1.01 it diverges.]
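A quick numeric check of the two learning rates (a sketch; the starting point is an assumption):

def run(lr, steps=30, x=-7.0, y=2.0):
    for _ in range(steps):
        x -= lr * (x / 10.0)   # dw/dx = x/10
        y -= lr * (2.0 * y)    # dw/dy = 2y
    return x, y

print(run(0.95))  # y oscillates but shrinks: |1 - 0.95*2| = 0.9 < 1
print(run(1.01))  # y blows up: |1 - 1.01*2| = 1.02 > 1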
Feature Scaling

data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30

Before scaling: 1 ≤ X1 ≤ 3, 100 ≤ X2 ≤ 300
After scaling:  0 ≤ X1 ≤ 1, 0 ≤ X2 ≤ 1

https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
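One way to get these ranges is min-max scaling; a sketch (the slide does not name the exact method):

import numpy as np

X = np.array([[1, 100],
              [2, 200],
              [3, 300]], dtype=float)   # columns X1, X2
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)  # both columns now lie in [0, 1]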
Gradient Descent Optimization
Momentum (inertia)
Main idea:
- Remember the movement in the past
- Reflect it in the current movement

Offset effect: past + current = (opposing directions partially cancel)
Accelerate effect: past + current = (aligned directions add up)
Momentum (inertia)
Keeps a proportion of the previous movement (γ : usually about 0.9). Consistent with the code below, the update is:

v := γ·v − η · ∇_θ J(θ)
θ := θ + v
Revisiting the earlier example (each step is −1 · slope · learning rate):

iter 1: slopes −10 and −0.1 → steps 10·η and 0.1·η (stored as the velocity v)

iter 2 (vanilla GD): slopes 15 and −0.05 → steps −15·η and 0.05·η

iter 2 with momentum:
before adding the past step:  −15·η and 0.05·η
0.9 × past step:              9·η and 0.09·η
after adding the past step:   −6·η and 0.14·η

Offset effect: the oscillating steep direction shrinks (−15·η → −6·η).
Accelerate effect: the consistently signed flat direction grows (0.05·η → 0.14·η).

Momentum (inertia)
Because of the accumulated velocity, we can expect the parameters to move out of a shallow local minimum toward a better one. (Avoiding Local Minima. Picture from http://www.yaldex.com.)
Needs twice the memory (stores v alongside θ).
Python class (Momentum):

import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # Initialize one velocity per parameter, starting at zero
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            # v := gamma * v - lr * gradient;  theta := theta + v
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]

Source: https://github.com/WegraLee
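A hypothetical check that the class reproduces the slide's two-iteration walkthrough (with η = 1 the velocities equal the step sizes shown above):

import numpy as np

opt = Momentum(lr=1.0, momentum=0.9)
params = {'a': np.array(0.0), 'b': np.array(0.0)}

opt.update(params, {'a': np.array(-10.0), 'b': np.array(-0.1)})
print(opt.v)   # a: 10.0, b: 0.1  -> iter 1 steps 10*eta and 0.1*eta

opt.update(params, {'a': np.array(15.0), 'b': np.array(-0.05)})
print(opt.v)   # a: -6.0, b: 0.14 -> 0.9 * past step + current step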
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
Adagrad (Adaptive Gradient)
Main idea:
- Increase the learning rate of variables that have not changed much so far
- Decrease the learning rate of variables that have changed a lot so far

θ := θ − η · ∇_θ J(θ)   —   the fixed learning rate η becomes adaptive!

[Plots: J(θ) versus θ₁ and J(θ) versus θ₂, as in the problem case above]
Adagrad (Adaptive Gradient)
Accumulates the square of the gradient; consistent with the code below, the update is:

G := G + (∇_θ J(θ))²
θ := θ − (η / (√G + ε)) · ∇_θ J(θ)

As the accumulated value G increases, the effective learning rate decreases.
Revisiting the example once more (the amount subtracted is slope · learning rate):

iter 1 (vanilla GD): slopes −10 and −0.1 → subtract −10·η and −0.1·η

iter 1 (Adagrad): cache₂ = 10², cache₁ = 0.1²
−10 · (η / √cache₂) = −η
−0.1 · (η / √cache₁) = −η
Both parameters move by the same magnitude despite slopes that differ by a factor of 100.

iter 2 (after update): slopes 0.3 and −0.08
cache₂ = 10² + 0.3², cache₁ = 0.1² + 0.08²
steps 0.3 · (η / √cache₂) and −0.08 · (η / √cache₁)
Python class (AdaGrad):

import numpy as np

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            # Initialize the accumulated squared gradient to zero
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

Source: https://github.com/WegraLee
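A hypothetical check of the walkthrough's first iteration: both parameters move by about η regardless of slope magnitude:

import numpy as np

opt = AdaGrad(lr=1.0)
params = {'a': np.array(0.0), 'b': np.array(0.0)}
opt.update(params, {'a': np.array(-10.0), 'b': np.array(-0.1)})
print(params)  # both ~ +1.0: step = -lr * slope / sqrt(slope**2) = -lr * sign(slope)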
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
RMSProp
- The accumulator G, which Adagrad builds by summing squared gradients, is replaced with an exponential average.
- This maintains the relative size differences between the variables' recent amounts of change without letting G grow indefinitely:

G := γ·G + (1 − γ) · (∇_θ J(θ))²
θ := θ − (η / (√G + ε)) · ∇_θ J(θ)

https://www.google.co.kr/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&ved=0ahUKEwi0uszPs7_YAhVFybwKHcWRDfYQjhwIBQ&url=https%3A%2F%2Finsidehpc.com%2F2015%2F06%2Fpodcast-geoffrey-hinton-on-the-rise-of-deep-learning%2F&psig=AOvVaw1Tpp31PE1Bg2r8cpN4KDUn&ust=1515192917829215
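The deck shows no Python code for RMSProp; here is a sketch in the same style as the classes above, with an assumed decay rate of 0.99:

import numpy as np

class RMSprop:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            # Exponential moving average of squared gradients instead of
            # AdaGrad's unbounded sum, so h cannot grow without limit.
            self.h[key] = self.decay_rate * self.h[key] + \
                          (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)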
Adam (Adaptive Moment Estimation)
Hybrid of Momentum and RMSprop.

Momentum part — exponential average of previous slopes:
m := β₁·m + (1 − β₁) · ∇_θ J(θ)

RMSprop part — exponential average of previous squared slopes:
v := β₂·v + (1 − β₂) · (∇_θ J(θ))²

In Adam, m and v are initialized to 0, so early in training m_t and v_t are biased toward 0; Adam therefore applies a correction to make them unbiased. Expanding m_t and v_t as sums and taking the expectation of both sides gives the corrections

m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ),

and the computation proceeds with m̂_t in place of the gradient and v̂_t in place of G_t.

(β₁, β₂ : usually about 0.9 and 0.999)
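The deck gives no NumPy code for Adam either; here is a sketch of the standard update with bias correction, in the same style as the classes above (an illustration, not the author's code):

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.t = 0
        self.m = None   # exponential average of gradients (Momentum part)
        self.v = None   # exponential average of squared gradients (RMSprop part)

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.t += 1
        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            # m and v start at 0 and are biased toward 0 early on;
            # dividing by (1 - beta**t) makes the estimates unbiased.
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-7)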
TensorFlow:

W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
Comparison of Adam to other optimization algorithms training a multilayer perceptron:
https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/05/Comparison-of-Adam-to-Other-Optimization-Algorithms-Training-a-Multilayer-Perceptron.png

No single optimizer is best in every case; use Adam in most cases.