
Conversation

@martinraison (Contributor) commented Feb 27, 2018

As discussed with @cpuhrsch, this should speed up the CPU version of EmbeddingBag in a few cases. The main idea is to avoid creating a large intermediate tensor during the forward pass, using the new index_select_add_ operation (which fuses index_select and index_add_).
I also slightly optimized the backward pass, replacing many divisions with a few divisions plus multiplications (computing each reciprocal once and multiplying by it).

Benchmark results on one CPU, for the forward pass only:

  • Original code

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         5.349s         1.048s
10000  x 1000       14.371s         1.269s
100000 x 100         5.385s         1.074s
100000 x 1000       18.947s         1.254s
  • New code

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         4.921s         1.037s
10000  x 1000        9.256s         1.288s
100000 x 100         4.929s         1.073s
100000 x 1000       14.210s         1.440s

The benchmark code is the same as in #4856 except that I removed the backward pass and the dense tests.

I also tried including the backward pass and multiple CPUs. Perhaps surprisingly, I didn't see any significant change in that scenario, although in my original "real life" scenario, the new code makes overall training about 30% faster. I haven't yet found what's different in the benchmark setup.

  • Original code

$ NUMEXPR_NUM_THREADS=8 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 taskset -c 0-7 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         7.875s         2.946s
10000  x 1000       18.310s         3.715s
100000 x 100         8.180s         3.122s
100000 x 1000       21.451s         3.562s
  • New code

$ NUMEXPR_NUM_THREADS=8 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 taskset -c 0-7 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         7.846s         3.003s
10000  x 1000       18.330s         3.671s
100000 x 100         8.129s         3.065s
100000 x 1000       21.457s         3.548s

Off-topic: we may want to add support for empty bags at some point, if we can do it without significant overhead.

@colesbury (Member)

I like the idea of speeding up EmbeddingBag, but I'd rather not expose a new public indexing operation. Can we move it inside EmbeddingBag.cpp?

@martinraison (Contributor, Author)

@colesbury If we move the implementation inside EmbeddingBag, I assume all the calls to select, cadd, etc. will have to go through a dynamic dispatch layer to figure out the tensor type (because the code isn't generic anymore). Since the number of such calls is O(index_size), that seems a bit wasteful. I haven't measured the impact, though; it just felt less elegant overall. Happy to give it a try if I can find the time.

@martinraison (Contributor, Author) commented Mar 14, 2018

Update: I tried moving the logic to EmbeddingBag.cpp with the following function:

static void index_select_add(const Tensor &select_indices,
                             const Tensor &add_indices,
                             const Tensor &src,
                             Tensor &output) {
  auto add_indices_data = add_indices.data<int64_t>();
  auto select_indices_data = select_indices.data<int64_t>();
  auto numel = add_indices.numel();
  for (int64_t i = 0; i < numel; i++) {
    // Each iteration creates two temporary row Tensors and goes through
    // dynamic dispatch for the indexing and the in-place add.
    output[add_indices_data[i]] += src[select_indices_data[i]];
  }
}

However, this slows things down dramatically (the resulting code is slower than even the original code by a factor of 2 or 3). Am I missing something, or is it just the cost of the dynamic wrapping? Is there an easy way to speed this up? (I saw that there are "accessors", but they don't seem to allow iterating over tensor slices easily.)

@martinraison (Contributor, Author)

Update 2: I managed to get a faster version by calling the BLAS primitives directly (as we already do for the backward pass). The implementation is a bit more specialized, but that seems like a good trade-off in this case. In fact it's even faster than the first version I posted; here are the new timings for the forward pass, with a rough sketch of the axpy-based loop after them:

$ NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 taskset -c 0 python benchmark.py

runs: 10000     number of indices: 2000 maximum number of bags: 200     maximum bag size: 30

dimension        cpu sparse    cuda sparse
10000  x 100         1.391s         1.015s
10000  x 1000        5.569s         1.291s
100000 x 100         1.788s         1.068s
100000 x 1000       11.023s         1.304s
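For illustration, a minimal sketch of the BLAS-backed loop, assuming float weights and contiguous row-major data. It calls THFloatBlas_axpy (y += a*x over n elements) directly, whereas the merged code goes through a templated axpy wrapper; the function name is illustrative:

#include <TH/THBlas.h>

static void index_select_add_blas(const int64_t *select_indices,
                                  const int64_t *add_indices,
                                  float *src,      // weight data
                                  float *output,   // output data
                                  int64_t numel,   // number of indices
                                  int64_t ddim) {  // embedding dimension
  for (int64_t i = 0; i < numel; i++) {
    // output row += 1.0f * weight row: ddim contiguous elements in one
    // vectorized BLAS call instead of per-element Tensor operations.
    THFloatBlas_axpy(ddim, 1.0f,
                     src + ddim * select_indices[i], 1,
                     output + ddim * add_indices[i], 1);
  }
}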


@colesbury (Member) left a review

lgtm

-    THDoubleBlas_axpy(ddim, (double)scale, gd + ddim * source, 1,
-                      igwd + ddim * index, 1);
+    axpy<double>(ddim, (double)scale, gd + ddim * source, 1,
+                 igwd + ddim * index, 1);
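The axpy<double> in the hunk above suggests a small templated wrapper over the TH BLAS calls. A plausible sketch (an assumption, since the wrapper itself is not shown in this excerpt):

#include <TH/THBlas.h>

// Pick the TH BLAS axpy matching the scalar type at compile time.
template <typename T>
void axpy(int64_t n, T a, T *x, int64_t incx, T *y, int64_t incy);

template <>
void axpy<float>(int64_t n, float a, float *x, int64_t incx,
                 float *y, int64_t incy) {
  // Dispatch to the float TH BLAS primitive.
  THFloatBlas_axpy(n, a, x, incx, y, incy);
}

template <>
void axpy<double>(int64_t n, double a, double *x, int64_t incx,
                  double *y, int64_t incy) {
  // Dispatch to the double TH BLAS primitive.
  THDoubleBlas_axpy(n, a, x, incx, y, incy);
}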


@martinraison (Contributor, Author)
@colesbury The test failure seems unrelated to my changes. Should we rerun?

@zou3519 (Contributor) commented Mar 15, 2018
@martinraison yes, that test is a flaky one.

@zou3519 (Contributor) commented Mar 15, 2018

@pytorchbot retest this please

@soumith merged commit c40b99f into pytorch:master on Mar 15, 2018