
Conversation

@martinraison (Contributor) commented Feb 27, 2018

As discussed with @cpuhrsch, this should speed up the CPU version of EmbeddingBag in a few cases. The main idea is to avoid creating a large intermediate tensor during the forward pass, using the new index_select_add_ operation (which fuses index_select and index_add_).
I also slightly optimized the backward pass, replacing many divisions with a few divisions plus multiplications (computing each reciprocal once and multiplying by it).

Benchmark results on one CPU, for the forward pass only:

  • Original code

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         5.349s         1.048s
10000  x 1000       14.371s         1.269s
100000 x 100         5.385s         1.074s
100000 x 1000       18.947s         1.254s
  • New code

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         4.921s         1.037s
10000  x 1000        9.256s         1.288s
100000 x 100         4.929s         1.073s
100000 x 1000       14.210s         1.440s

The benchmark code is the same as in #4856 except that I removed the backward pass and the dense tests.

I also tried including the backward pass and multiple CPUs. Perhaps surprisingly, I didn't see any significant change in that scenario, although in my original "real life" scenario, the new code makes overall training about 30% faster. I haven't yet found what's different in the benchmark setup.

  • Original code

$ NUMEXPR_NUM_THREADS=8 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 taskset -c 0-7 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         7.875s         2.946s
10000  x 1000       18.310s         3.715s
100000 x 100         8.180s         3.122s
100000 x 1000       21.451s         3.562s
  • New code

$ NUMEXPR_NUM_THREADS=8 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 taskset -c 0-7 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         7.846s         3.003s
10000  x 1000       18.330s         3.671s
100000 x 100         8.129s         3.065s
100000 x 1000       21.457s         3.548s

Off-topic: we may want to add support for empty bags at some point, if we can do it without significant overhead.

@colesbury (Member)

I like the idea of speeding up EmbeddingBag, but I'd rather not expose a new public indexing operation. Can we move it inside EmbeddingBag.cpp?

@martinraison (Contributor, Author)

@colesbury If we move the implementation inside EmbeddingBag, I assume all the calls to select, cadd, etc. will have to go through a dynamic dispatch layer to figure out the tensor type (because the code isn't generic anymore). Since the number of such calls is O(index_size), that seems a bit wasteful. I haven't measured the impact, though; it just felt less elegant overall. Happy to give it a try if I can find the time.

@martinraison (Contributor, Author) commented Mar 14, 2018

Update: I tried moving the logic to EmbeddingBag.cpp with the following function:

static void index_select_add(const Tensor &select_indices,
                             const Tensor &add_indices,
                             const Tensor &src,
                             Tensor &output) {
  auto add_indices_data = add_indices.data<int64_t>();
  auto select_indices_data = select_indices.data<int64_t>();
  auto numel = add_indices.numel();
  for (int64_t i = 0; i < numel; i++) {
    // Each iteration creates two temporary row Tensors and goes through
    // dynamic dispatch for the indexing and the in-place add.
    output[add_indices_data[i]] += src[select_indices_data[i]];
  }
}

However, this slows things down dramatically (the resulting code is slower than even the original code by a factor of 2 or 3). Am I missing something, or is it just the cost of the dynamic wrapping? Is there an easy way to speed this up? (I saw that there are "accessors", but they don't seem to allow iterating over tensor slices easily.)

@martinraison (Contributor, Author)

Update 2: I managed to get a faster version by calling the BLAS primitives directly (as we already do for the backward pass). The implementation is a bit more specialized, but that seems like a good trade-off in this case. In fact it's even faster than the first version I posted; here are the new timings for the forward pass, with a rough sketch of the axpy-based loop after them:

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         1.391s         1.015s
10000  x 1000        5.569s         1.291s
100000 x 100         1.788s         1.068s
100000 x 1000       11.023s         1.304s
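For illustration, a minimal sketch of the BLAS-backed loop, assuming float weights and contiguous row-major data. It calls THFloatBlas_axpy (y += a*x over n elements) directly, whereas the merged code goes through a templated axpy wrapper; the function name is illustrative:

#include <TH/THBlas.h>

static void index_select_add_blas(const int64_t *select_indices,
                                  const int64_t *add_indices,
                                  float *src,      // weight data
                                  float *output,   // output data
                                  int64_t numel,   // number of indices
                                  int64_t ddim) {  // embedding dimension
  for (int64_t i = 0; i < numel; i++) {
    // output row += 1.0f * weight row: ddim contiguous elements in one
    // vectorized BLAS call instead of per-element Tensor operations.
    THFloatBlas_axpy(ddim, 1.0f,
                     src + ddim * select_indices[i], 1,
                     output + ddim * add_indices[i], 1);
  }
}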


@colesbury (Member) left a review

lgtm

-    THDoubleBlas_axpy(ddim, (double)scale, gd + ddim * source, 1,
-                      igwd + ddim * index, 1);
+    axpy<double>(ddim, (double)scale, gd + ddim * source, 1,
+                 igwd + ddim * index, 1);
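The axpy<double> in the hunk above suggests a small templated wrapper over the TH BLAS calls. A plausible sketch (an assumption, since the wrapper itself is not shown in this excerpt):

#include <TH/THBlas.h>

// Pick the TH BLAS axpy matching the scalar type at compile time.
template <typename T>
void axpy(int64_t n, T a, T *x, int64_t incx, T *y, int64_t incy);

template <>
void axpy<float>(int64_t n, float a, float *x, int64_t incx,
                 float *y, int64_t incy) {
  // Dispatch to the float TH BLAS primitive.
  THFloatBlas_axpy(n, a, x, incx, y, incy);
}

template <>
void axpy<double>(int64_t n, double a, double *x, int64_t incx,
                  double *y, int64_t incy) {
  // Dispatch to the double TH BLAS primitive.
  THDoubleBlas_axpy(n, a, x, incx, y, incy);
}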


@martinraison (Contributor, Author)
@colesbury The test failure seems unrelated to my changes. Should we rerun?

@zou3519 (Contributor) commented Mar 15, 2018
@martinraison yes, that test is a flaky one.

@zou3519 (Contributor) commented Mar 15, 2018

@pytorchbot retest this please

@soumith merged commit c40b99f into pytorch:master on Mar 15, 2018