Precision of sparse float embeddings differs from dense embeddings on CPU #20022

@nairbv

Description

🐛 Bug

In test/test_nn.py we skip the 'backward' check for low-precision types (float, half) because the precision is often too low to get reliable results on large embeddings. The same test does not fail for dense embeddings. There is a limit to how much precision we can expect from float and half, but it would be preferable for the sparse and dense paths to be consistent, or for the source of the difference to be clearer.
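One plausible source of the gap (an assumption, not confirmed from the PyTorch source): the sparse and dense backward paths may accumulate per-index gradient contributions in different orders, and float32 summation is order-sensitive. A minimal stdlib-only sketch, unrelated to the actual PyTorch implementation:

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest float32."""
    return struct.unpack('f', struct.pack('f', x))[0]

# A bag's gradient is a sum of contributions; in float32 the result
# depends on accumulation order. Deterministic example: one large value
# plus many small ones.
vals = [1e8] + [2.0] * 100

def sum_f32(seq):
    acc = 0.0
    for v in seq:
        acc = f32(acc + v)
    return acc

a = sum_f32(vals)            # big value first: each 2.0 is rounded away
b = sum_f32(reversed(vals))  # small values first: they survive
print(a, b)  # 100000000.0 100000200.0
```

If the two implementations reduce in different orders, results like `a` and `b` both sit within float32 tolerance of the true sum, yet differ from each other by more than a strict elementwise comparison allows.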

To Reproduce

Steps to reproduce the behavior:

In test/test_nn.py, run the following with test_backward=True and dtype=torch.float:
self._test_EmbeddingBag(False, 'sum', True, test_backward=test_backward, dtype=dtype)

Run it a number of times and it will occasionally fail. With the third parameter (sparse) set to False, we don't see failures.
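For a standalone measurement outside the test suite, a hedged sketch along these lines compares float32 sparse and dense EmbeddingBag gradients against a float64 dense reference (the sizes and random inputs here are illustrative, not the ones used by _test_EmbeddingBag):

```python
import torch

torch.manual_seed(0)
num_embeddings, dim, n = 1000, 64, 5000
weight64 = torch.randn(num_embeddings, dim, dtype=torch.float64)
idx = torch.randint(0, num_embeddings, (n,))
offsets = torch.arange(0, n, 10)  # 500 bags of 10 indices each
grad_out64 = torch.randn(len(offsets), dim, dtype=torch.float64)

def backward_grad(dtype, sparse):
    """Gradient w.r.t. the weight for one forward/backward pass."""
    w = weight64.to(dtype).clone().requires_grad_(True)
    out = torch.nn.functional.embedding_bag(
        idx, w, offsets, mode='sum', sparse=sparse)
    out.backward(grad_out64.to(dtype))
    g = w.grad
    return (g.to_dense() if g.is_sparse else g).to(torch.float64)

ref = backward_grad(torch.float64, sparse=False)
err_dense = (backward_grad(torch.float32, False) - ref).abs().max().item()
err_sparse = (backward_grad(torch.float32, True) - ref).abs().max().item()
print(f"dense max err:  {err_dense:.3e}")
print(f"sparse max err: {err_sparse:.3e}")
```

Running this repeatedly with fresh seeds should show whether the sparse path's deviation from the float64 reference is systematically larger than the dense path's, or merely differently distributed.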

Expected behavior

Limitations on precision are consistent between sparse and dense implementations of Embedding/EmbeddingBag.

Environment

[bvaughan@devgpu005.ash6 ~/repos/pytorch] python ./collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
CMake version: version 3.12.2

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Tesla M40
GPU 1: Tesla M40
GPU 2: Tesla M40
GPU 3: Tesla M40
GPU 4: Tesla M40
GPU 5: Tesla M40
GPU 6: Tesla M40
GPU 7: Tesla M40

Nvidia driver version: 396.69
cuDNN version: /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip3] numpy==1.15.4
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0a0+3900816
[pip3] torchvision==0.2.1
[conda] magma-cuda92 2.4.0 1 pytorch
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl-service 1.1.2 py37h90e4bf4_5
[conda] mkl_fft 1.0.4 py37h4414c95_1
[conda] mkl_random 1.0.1 py37h4414c95_1
[conda] mkldnn 0.16.1 0 mingfeima
[conda] torch 1.0.0a0+aaf6e36
[conda] torch 1.1.0a0+0676ba0
[conda] torch 1.0.0a0+c2f1811
[conda] torch 1.0.0a0+e387d94
[conda] torch 1.0.0a0+298b775
[conda] torch 1.0.0a0+8de9564
[conda] torch 1.0.0a0+b15242f
[conda] torch 1.0.0a0+df022f8
[conda] torch 1.0.0a0+9c20546
[conda] torch 1.0.0a0+35a24a9
[conda] torch 1.0.0a0+d4f9dbf
[conda] torch 1.0.0a0+4a4cc13
[conda] torch 1.0.0a0+e03136f
[conda] torch 1.0.0a0+c715fcc
[conda] torch 1.0.0a0+b8da44d
[conda] torch 1.0.0a0+5c51f65
[conda] torch 1.1.0a0+227c4e9
[conda] torch 1.0.0a0+66a0447
[conda] torch 1.0.0a0+fb8745e
[conda] torch 1.0.0a0+a7445ad
[conda] torch 1.0.0a0+6e0c5a8
[conda] torch 1.1.0a0+71bdfe8
[conda] torch 1.1.0a0+3900816
[conda] torch 1.0.0a0+607094c
[conda] torch 1.0.0a0+3ff7071
[conda] torch 1.0.0a0+24c43e2
[conda] torchvision 0.2.1

Additional context

Encountered while working on #19695.

Metadata


    Labels

    module: nn (Related to torch.nn)
    module: numerical-reproducibility
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
