Closed
Labels: module: tf32 (Related to tf32 data format), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Since the release of Ampere GPUs, pytorch has been using tf32 by default for matrix multiplications. It provides much better performance at the expense of somewhat lower accuracy. Nvidia has conducted a lot of experiments showing that the convergence behavior of a wide variety of networks does not change when tf32 is used instead of regular fp32.
However, pytorch is also used for workloads other than deep learning, and for non-standard deep learning workloads; for those, the use of tf32 for matrix multiplication has resulted in a lot of confusion and sometimes bad results.
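To make the trade-off concrete, here is a minimal sketch (not from the issue itself) comparing tf32 and fp32 matmul accuracy against a float64 reference, using the existing `torch.backends.cuda.matmul.allow_tf32` flag. It assumes a CUDA GPU, and the exact error magnitudes will vary with the hardware and inputs:

```python
import torch

# Hypothetical demo (not from the issue): compare tf32 vs. fp32 matmul
# accuracy against a float64 reference. Assumes a CUDA GPU; the tf32
# flag only changes results on Ampere or newer hardware.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = a.double() @ b.double()  # float64 reference result

torch.backends.cuda.matmul.allow_tf32 = True   # the default this issue debates
tf32_err = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False  # force full-precision fp32 matmul
fp32_err = (a @ b - ref).abs().max().item()

print(f"max abs error vs fp64 reference: tf32={tf32_err:.2e}, fp32={fp32_err:.2e}")
```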
Going forward, we have a few options (a short sketch after this list shows how opting back in would look under option 4):

1) Do nothing. Pros: most users get speedups. Cons: issues and confusion will continue.
2) Leave the default as is, but add a warning to the first call using tf32; explicitly setting `allow_tf32` would silence this warning (proposed here). Pros: results will no longer be a surprise, and users will be able to get the expected results without resorting to documentation. Cons: pytorch doesn't warn for normal operations, so this would create a bad precedent.
3) Use tf32 in nn layers (linear, rnn, etc.), but disallow it for raw matmul operations (proposed here). Pros: networks using `nn` operations will get the speed-up, while people experimenting with pure matmul operations will still get exact results. Cons: it is really confusing when a linear layer that calls `matmul` under the hood produces different results than a `matmul` call. Also, if someone defines a network using `matmul`s and not `nn.Linear` or `nn.functional.linear`, they won't see the speed-up.
4) Disable tf32 matmul by default, possibly with a warning that it can be enabled. Pros: no surprises, accurate results. Cons: users who are currently getting speed-ups will have to manually enable tf32 or see reduced performance. If there's a warning, same con as option 2).
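For illustration, under option 4) users who want the tf32 speed-up back would opt in explicitly with the flags that already exist in pytorch today; only their defaults would change under this proposal. A minimal sketch:

```python
import torch

# Sketch of the user-facing opt-in under option 4): these flags already
# exist in pytorch; the proposal only changes their default values.
torch.backends.cuda.matmul.allow_tf32 = True  # re-enable tf32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # tf32 for cuDNN (convolutions etc.)
```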
In my conversations with power users, they lean towards option 4); we should use this issue to find an acceptable solution.
cc @zasdfgbnm, @ptrblck, @CrisHY1995, @ssnl, @stas00, @t-vi, @csarofeen, @wjablonski-work, please copy anyone else who is interested in this discussion.