Commit e51e5e7

[optim] Add general documentation on our algorithm defaults (#95391) (#95516)
I added a section + table under Algorithms:
https://docs-preview.pytorch.org/95391/optim.html?highlight=optim#module-torch.optim

Pull Request resolved: #95391
Approved by: https://github.com/albanD

1 parent 91739a0 commit e51e5e7

File tree

1 file changed: +43 -0 lines changed

docs/source/optim.rst

@@ -129,6 +129,49 @@ Algorithms
    Rprop
    SGD

Many of our algorithms have various implementations optimized for performance,
readability and/or generality, so we attempt to default to the generally fastest
implementation for the current device if no particular implementation has been
specified by the user.

We have 3 major categories of implementations: for-loop, foreach (multi-tensor), and
fused. The most straightforward implementations are for-loops over the parameters with
big chunks of computation. For-looping is usually slower than our foreach
implementations, which combine parameters into a multi-tensor and run the big chunks
of computation all at once, thereby saving many sequential kernel calls. A few of our
optimizers have even faster fused implementations, which fuse the big chunks of
computation into one kernel. We can think of foreach implementations as fusing
horizontally and fused implementations as fusing vertically on top of that.

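For illustration only (not part of this diff), here is a minimal sketch of what
fusing horizontally means, using the private ``torch._foreach_*`` ops that the
foreach implementations build on:

.. code-block:: python

    import torch

    # A toy list of parameters and their gradients.
    params = [torch.randn(3, 3) for _ in range(10)]
    grads = [torch.randn_like(p) for p in params]
    lr = 0.1

    # for-loop: one kernel launch per parameter tensor.
    for p, g in zip(params, grads):
        p.add_(g, alpha=-lr)

    # foreach: a single grouped call updates all parameters at once,
    # saving the per-tensor kernel launch overhead.
    torch._foreach_add_(params, grads, alpha=-lr)
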
In general, the performance ordering of the 3 implementations is fused > foreach > for-loop.
So when applicable, we default to foreach over for-loop. Applicable means the foreach
implementation is available, the user has not specified any implementation-specific kwargs
(e.g., fused, foreach, differentiable), and all tensors are native and on CUDA. Note that
while fused should be even faster than foreach, the implementations are newer and we would
like to give them more bake-in time before flipping the switch everywhere. You are welcome
to try them out though!

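As a short sketch of how these kwargs are used (which kwargs each optimizer
accepts varies; see the table below for which optimizers have foreach and fused
implementations):

.. code-block:: python

    import torch

    model = torch.nn.Linear(8, 8).cuda()  # assumes a CUDA device is available

    # Default: with native CUDA tensors and no kwargs specified,
    # the foreach implementation is selected automatically.
    opt = torch.optim.Adam(model.parameters())

    # Opt in to the fused implementation explicitly.
    opt_fused = torch.optim.Adam(model.parameters(), fused=True)

    # Force the single-tensor for-loop implementation.
    opt_forloop = torch.optim.Adam(model.parameters(), foreach=False)
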
Below is a table showing the available and default implementations of each algorithm:

.. csv-table::
    :header: "Algorithm", "Default", "Has foreach?", "Has fused?"
    :widths: 25, 25, 25, 25
    :delim: ;

    :class:`Adadelta`;foreach;yes;no
    :class:`Adagrad`;foreach;yes;no
    :class:`Adam`;foreach;yes;yes
    :class:`AdamW`;foreach;yes;yes
    :class:`SparseAdam`;for-loop;no;no
    :class:`Adamax`;foreach;yes;no
    :class:`ASGD`;foreach;yes;no
    :class:`LBFGS`;for-loop;no;no
    :class:`NAdam`;foreach;yes;no
    :class:`RAdam`;foreach;yes;no
    :class:`RMSprop`;foreach;yes;no
    :class:`Rprop`;foreach;yes;no
    :class:`SGD`;foreach;yes;no

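Reading the table: for example, SGD has a foreach implementation but no fused one,
so in this sketch ``foreach=True`` is the fastest explicit knob available for it:

.. code-block:: python

    import torch

    model = torch.nn.Linear(8, 8).cuda()  # assumes a CUDA device is available
    opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, foreach=True)
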
How to adjust learning rate
---------------------------
