RFC: Should matmuls use tf32 by default? #67384

@ngimel

Description

Since the release of Ampere GPUs, PyTorch has used TF32 by default for float32 matrix multiplications on Ampere hardware. TF32 provides much better performance at the expense of somewhat lower accuracy, and NVIDIA has conducted extensive experiments showing that the convergence behavior of a wide variety of networks does not change when TF32 is used instead of regular fp32.
However, PyTorch is also used for non-deep-learning workloads and for non-standard deep learning workloads, where the use of TF32 for matrix multiplication has caused a lot of confusion and sometimes bad results.
Going forward, we have a few options:
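To make the accuracy trade-off concrete, here is a small self-contained sketch (plain Python, no GPU or PyTorch required) that simulates TF32's 10-bit mantissa by truncating float32 values and compares a dot product against full precision. Truncation slightly overstates the error of real Ampere hardware, which rounds, so treat the numbers as illustrative only:

```python
import struct

def to_tf32(x):
    """Round-trip a float through a simulated TF32 value.

    TF32 keeps float32's 8-bit exponent but only 10 mantissa bits.
    We simulate this by zeroing the low 13 bits of the 23-bit
    float32 mantissa (a truncation; hardware rounds to nearest).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # drop the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot(xs, ys):
    return sum(a * b for a, b in zip(xs, ys))

# Arbitrary non-dyadic inputs so the mantissa truncation actually bites.
xs = [1.0 + i / 7.0 for i in range(64)]
ys = [2.0 - i / 11.0 for i in range(64)]

exact = dot(xs, ys)
tf32ish = dot([to_tf32(a) for a in xs], [to_tf32(b) for b in ys])
rel_err = abs(exact - tf32ish) / abs(exact)
print(f"relative error: {rel_err:.2e}")
```

The relative error lands around 1e-3 to 1e-4 (TF32's unit roundoff is 2^-11 ≈ 5e-4), versus roughly 1e-7 for fp32: negligible for most training runs, but visible to anyone expecting fp32-exact matmul results.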

  1. Do nothing. Pros: most users get speedups. Cons: issues and confusion will continue.
  2. Leave the default as is, but add a warning to the first call that uses TF32; explicitly setting allow_tf32 would silence the warning (proposed here). Pros: results will no longer be a surprise, and users will be able to get the expected results without resorting to documentation. Cons: PyTorch doesn't warn for normal operations, so this would set a bad precedent.
  3. Use TF32 in nn layers (linear, RNN, etc.) but disallow it for raw matmul operations (proposed here). Pros: networks built from nn operations still get the speedup, while people experimenting with pure matmul operations get exact results. Cons: it is really confusing when a linear layer, which calls matmul under the hood, produces different results than a direct matmul call. Also, anyone who defines a network with matmuls instead of nn.Linear or nn.functional.linear won't see the speedup.
  4. Disable TF32 matmul by default, possibly with a warning that it can be enabled. Pros: no surprises, accurate results. Cons: users who currently get speedups will have to enable TF32 manually or see reduced performance; if there's a warning, it has the same downside as option 2.
    In my conversations with power users, they lean towards option 4; we should use this issue to find an acceptable solution.
    cc @zasdfgbnm @ptrblck @CrisHY1995 @ssnl @stas00 @t-vi @csarofeen @wjablonski-work, please copy anyone else who is interested in this discussion.
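For reference, the switch being debated is the existing allow_tf32 flag. A minimal sketch of how a user opts out (or back in) today, guarded with a try/except so it also runs where torch is not installed:

```python
try:
    import torch

    # The matmul flag is the one under discussion in this RFC;
    # cuDNN convolutions are controlled by a separate, independent flag.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

    matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
except ImportError:
    matmul_tf32 = None  # torch unavailable in this environment
```

Under options 2 and 4, setting either flag explicitly is also what would silence the proposed warning.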

Metadata

    Labels

    module: tf32 (Related to the tf32 data format)
    triaged (This issue has been looked at by a team member, triaged, and prioritized into an appropriate module)
