With small max_norm values such as 0.1, 1, or 2, the comparison fails on CUDA with a large error (for example, Half with max_norm=0.1 fails with an error of ~0.16), but the same comparison passes on CPU.
Relevant lines: https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L1674-L1706
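
For reference, here is a rough pure-Python sketch of what I would expect max_norm renormalization to produce on any device, assuming this is the `max_norm` option of `nn.Embedding`: each looked-up row whose norm exceeds max_norm is rescaled to have norm max_norm, and all other rows are left alone. The helper name `renorm_rows` and the `eps` guard are my own assumptions for illustration, not the actual kernel code.

```python
def renorm_rows(weight, indices, max_norm, norm_type=2.0, eps=1e-7):
    """Hypothetical reference: rescale looked-up rows whose norm exceeds max_norm."""
    out = [list(row) for row in weight]  # copy so the input is not modified
    for i in set(indices):  # each looked-up row is renormalized at most once
        norm = sum(abs(x) ** norm_type for x in out[i]) ** (1.0 / norm_type)
        if norm > max_norm:
            # eps is an assumed guard against division by zero
            scale = max_norm / (norm + eps)
            out[i] = [x * scale for x in out[i]]
    return out


weight = [[3.0, 4.0], [0.03, 0.04], [6.0, 8.0]]
result = renorm_rows(weight, indices=[0, 1], max_norm=0.1)

# Row 0 (norm 5) is scaled down to norm ~0.1.
# Row 1 (norm 0.05) is already within max_norm and stays as-is.
# Row 2 is never looked up and stays as-is.
for row in result:
    print(row, sum(x * x for x in row) ** 0.5)
```

An error of ~0.16 is larger than max_norm=0.1 itself, so it seems unlikely to be explained by Half precision alone.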
Discovered while working on #7959.
cc @adamlerer