The document introduces PyTorch, covering autograd, logistic classifiers, loss functions, backpropagation, and data parallelism, with an emphasis on using GPUs efficiently. It explains concepts such as the chain rule and gradient descent, and works through practical examples of computing gradients with matrices. It also shows how to implement data parallelism in PyTorch to speed up training across multiple GPUs.
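To make the summary concrete, here is a minimal, illustrative sketch of the autograd workflow the document describes: tensors are created with `requires_grad=True`, a scalar loss is computed, and `backward()` applies the chain rule through the computation graph to populate matrix-shaped gradients. The specific shapes and variable names below are assumptions for illustration, not taken from the document.

```python
import torch

# Illustrative autograd sketch: compute gradients of a scalar z
# with respect to matrix-valued inputs via the chain rule.
x = torch.randn(3, 4, requires_grad=True)   # input matrix
w = torch.randn(4, 2, requires_grad=True)   # weight matrix
b = torch.zeros(2, requires_grad=True)      # bias vector

z = (x @ w + b).sum()   # scalar output, so backward() needs no argument
z.backward()            # autograd applies the chain rule through the graph

print(x.grad.shape)     # gradients match the shapes of their tensors: (3, 4)
print(w.grad.shape)     # (4, 2)
```

Likewise, a hedged sketch of the data-parallelism idea mentioned above: wrapping a model in `torch.nn.DataParallel` splits each input batch across the available GPUs and gathers the outputs. The placeholder model and batch size here are assumptions chosen only to make the example runnable.

```python
import torch
import torch.nn as nn

# Illustrative data-parallelism sketch with a placeholder model.
model = nn.Linear(10, 2)                 # stand-in for a real network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # replicate the module on each GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = torch.randn(64, 10, device=device)
outputs = model(inputs)                  # batch is scattered across GPUs, outputs gathered
print(outputs.shape)                     # torch.Size([64, 2])
```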