[PyTorch] EmbeddingBag comparison vs Embedding fails w/ small max_norm on CUDA

With small `max_norm` values such as `0.1`, `1`, or `2`, the comparison between `EmbeddingBag` and `Embedding` fails on CUDA with a large error (for example, Half with `max_norm=0.1` fails with an error of ~`0.16`), but passes on CPU.

Relevant lines: https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L1674-L1706 
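
A rough standalone sketch of the kind of comparison involved, in case it helps with reproducing. This is not the code from `test_nn.py`: the helper name `max_abs_diff`, the table size, the indices, and the seed are all made up for illustration, and it assumes `mode="sum"` with a 2D index tensor so that each row is one bag.

```python
import torch
import torch.nn as nn


def max_abs_diff(max_norm, device="cpu", dtype=torch.float32, seed=0):
    """Largest absolute difference between EmbeddingBag(mode="sum") and
    Embedding followed by a sum over each bag (hypothetical helper)."""
    torch.manual_seed(seed)
    num_embeddings, dim = 10, 4  # hypothetical sizes, not taken from the test
    weight = torch.randn(num_embeddings, dim)

    emb = nn.Embedding(num_embeddings, dim, max_norm=max_norm)
    bag = nn.EmbeddingBag(num_embeddings, dim, max_norm=max_norm, mode="sum")

    # Start both modules from identical weights; max_norm renormalizes
    # the looked-up rows in place during the forward pass.
    with torch.no_grad():
        emb.weight.copy_(weight)
        bag.weight.copy_(weight)
    emb = emb.to(device=device, dtype=dtype)
    bag = bag.to(device=device, dtype=dtype)

    # Two bags of three distinct indices each (hypothetical values).
    indices = torch.tensor([[1, 2, 4], [5, 3, 9]], device=device)
    with torch.no_grad():
        expected = emb(indices).sum(dim=1)
        actual = bag(indices)
    return (expected.float() - actual.float()).abs().max().item()


# Sweep the max_norm values mentioned above; Half is only tried on CUDA here.
configs = [("cpu", torch.float32)]
if torch.cuda.is_available():
    configs += [("cuda", torch.float32), ("cuda", torch.half)]
for device, dtype in configs:
    for max_norm in (0.1, 1.0, 2.0):
        diff = max_abs_diff(max_norm, device=device, dtype=dtype)
        print(device, dtype, max_norm, diff)
```

A large printed difference on the CUDA rows next to a near-zero one on CPU would match the behavior described above. Whether this sketch triggers it depends on the PyTorch build.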

Discovered while working on #7959.

cc @adamlerer