speed up CPU EmbeddingBag (indexSelectAdd op) #5433
Conversation
I like the idea of speeding up EmbeddingBag, but I'd rather not expose a new public indexing operation. Can we move it inside EmbeddingBag.cpp?
@colesbury If we move the implementation inside EmbeddingBag, I assume all the calls to …
Update: I tried moving the logic to EmbeddingBag.cpp with the following function:

```cpp
static void index_select_add(const Tensor &select_indices,
                             const Tensor &add_indices,
                             const Tensor &src,
                             Tensor &output) {
  auto add_indices_data = add_indices.data<int64_t>();
  auto select_indices_data = select_indices.data<int64_t>();
  auto numel = add_indices.numel();
  for (int64_t i = 0; i < numel; i++) {
    output[add_indices_data[i]] += src[select_indices_data[i]];
  }
}
```

However, this slows things down dramatically: the resulting code is slower than even the original code by a factor of 2 or 3. Am I missing anything, or is it just because of the dynamic wrapping? Is there an easy way to speed this up? (I saw that there are "accessors", but they don't seem to allow iterating over tensor slices easily.)
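For comparison, here is a minimal sketch of the same loop written against raw data pointers. Each `output[i]` / `src[j]` in the version above constructs a temporary Tensor view and goes through dispatch on every iteration; a raw-pointer loop does neither. This assumes a float dtype and contiguous 2-D `src`/`output`, and is illustrative rather than the code that was merged:

```cpp
#include <ATen/ATen.h>

// Sketch: same accumulation as index_select_add above, but on raw
// pointers, so there are no per-element Tensor temporaries or dispatch.
// Assumes float dtype and contiguous 2-D src/output (not checked here).
static void index_select_add_raw(const at::Tensor &select_indices,
                                 const at::Tensor &add_indices,
                                 const at::Tensor &src,
                                 at::Tensor &output) {
  auto add_indices_data = add_indices.data<int64_t>();
  auto select_indices_data = select_indices.data<int64_t>();
  auto src_data = src.data<float>();
  auto output_data = output.data<float>();
  auto numel = add_indices.numel();
  int64_t ddim = src.size(1);  // embedding dimension
  for (int64_t i = 0; i < numel; i++) {
    const float *src_row = src_data + ddim * select_indices_data[i];
    float *out_row = output_data + ddim * add_indices_data[i];
    for (int64_t d = 0; d < ddim; d++) {
      out_row[d] += src_row[d];  // plain scalar loop, no Tensor machinery
    }
  }
}
```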
Update 2: I managed to get a faster version by calling the BLAS primitives directly (like we already do for the backward pass). The implementation is a bit more specialized, but it seems like a good trade-off in this case. In fact it's even faster than the first version I posted; here's the new timing I have for the forward pass:
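A hedged sketch of what such an axpy-based forward loop looks like, under the same assumptions as above (float dtype, contiguous 2-D tensors) and using the `THFloatBlas_axpy` entry point that the backward pass already calls; the merged code may differ in its details:

```cpp
#include <ATen/ATen.h>
#include <TH/THBlas.h>  // THFloatBlas_axpy

// Sketch: accumulate each selected row of `src` into its output row with
// a single axpy call (y := a*x + y) instead of a scalar inner loop.
static void index_select_add_blas(const at::Tensor &select_indices,
                                  const at::Tensor &add_indices,
                                  const at::Tensor &src,
                                  at::Tensor &output) {
  auto add_indices_data = add_indices.data<int64_t>();
  auto select_indices_data = select_indices.data<int64_t>();
  auto src_data = src.data<float>();
  auto output_data = output.data<float>();
  auto numel = add_indices.numel();
  int64_t ddim = src.size(1);
  for (int64_t i = 0; i < numel; i++) {
    THFloatBlas_axpy(ddim, 1.0f,
                     src_data + ddim * select_indices_data[i], 1,
                     output_data + ddim * add_indices_data[i], 1);
  }
}
```

Because axpy processes a whole row per call, the per-element loop disappears and the BLAS library is free to vectorize it.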
aten/src/ATen/TensorUtils.h (outdated):

```cpp
void checkSameNumel(CheckedFrom c, const TensorGeometryArg& t1, const TensorGeometryArg& t2);
void checkAllSameNumel(CheckedFrom c, ArrayRef<TensorArg> tensors);
void checkScalarType(CheckedFrom c, const TensorArg& t, ScalarType s);
void checkScalarTypes(CheckedFrom c, const TensorArg& t, std::initializer_list<ScalarType> l);
```
```cpp
Tensor indices = indices__.contiguous();
Tensor offsets = offsets__.contiguous();
auto weight_arg = TensorArg(weight, "weight", 1);
checkScalarTypes("embedding_bag", weight_arg, {kFloat, kDouble});
```
colesbury left a comment:
lgtm
```diff
-THDoubleBlas_axpy(ddim, (double)scale, gd + ddim * source, 1,
-                  igwd + ddim * index, 1);
+axpy<double>(ddim, (double)scale, gd + ddim * source, 1,
+             igwd + ddim * index, 1);
```
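The diff suggests the type-specific TH calls were replaced by a templated `axpy<T>`. A hypothetical sketch of such a wrapper, assuming it simply forwards to the TH BLAS entry points (the actual declaration in the codebase may differ):

```cpp
#include <TH/THBlas.h>

// Hypothetical sketch: a thin templated axpy that forwards to the
// type-specific TH BLAS functions, so callers can be written generically.
template <typename T>
void axpy(int64_t n, T a, T *x, int64_t incx, T *y, int64_t incy);

template <>
void axpy<float>(int64_t n, float a, float *x, int64_t incx,
                 float *y, int64_t incy) {
  THFloatBlas_axpy(n, a, x, incx, y, incy);
}

template <>
void axpy<double>(int64_t n, double a, double *x, int64_t incx,
                  double *y, int64_t incy) {
  THDoubleBlas_axpy(n, a, x, incx, y, incy);
}
```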
@colesbury The test failure seems unrelated to my changes. Should we rerun?
@martinraison yes, that test is a flaky one.
@pytorchbot retest this please
As discussed with @cpuhrsch, this should speed up the CPU version of EmbeddingBag in a few cases. The main idea is to avoid creating a large intermediate tensor during the forward pass, using the new `index_select_add_` operation (which fuses `index_select` and `index_add_`). I also slightly optimized the backward pass by replacing a bunch of divisions with a few divisions and a bunch of multiplications.
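To make the two ideas concrete, here is a hedged sketch; the function and variable names (`embedding_bag_sum_unfused`, `offset2bag`, `scale_grad_row`) are illustrative, not the PR's actual code:

```cpp
#include <ATen/ATen.h>

// (1) Forward pass, "sum" mode, unfused: materializes a large
// [num_indices, embedding_dim] intermediate before reducing it into the
// per-bag output. The fused index_select_add_ accumulates each selected
// row of `weight` directly into its bag row, so this temporary is never
// allocated. `offset2bag` maps each position in `indices` to its bag.
at::Tensor embedding_bag_sum_unfused(const at::Tensor &weight,
                                     const at::Tensor &indices,
                                     const at::Tensor &offset2bag,
                                     int64_t num_bags) {
  auto output = weight.new_zeros({num_bags, weight.size(1)});
  auto selected = weight.index_select(0, indices);  // big temporary
  output.index_add_(0, offset2bag, selected);
  return output;
}

// (2) Backward pass, "mean" mode scaling: compute the reciprocal once per
// bag and multiply, instead of dividing every element.
void scale_grad_row(double *grad_row, const double *grad_output_row,
                    int64_t ddim, int64_t bag_size) {
  double inv_count = 1.0 / bag_size;  // one division per bag
  for (int64_t d = 0; d < ddim; d++) {
    grad_row[d] = grad_output_row[d] * inv_count;  // multiplications only
  }
}
```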
Benchmark results on one CPU, for the forward pass only:
The benchmark code is the same as in #4856 except that I removed the backward pass and the dense tests.
I also tried benchmarks that include the backward pass and multiple CPUs. Perhaps surprisingly, I didn't see any significant change in those scenarios, although in my original "real-life" scenario the new code makes overall training about 30% faster. I haven't yet found what's different in the benchmark setup.
Off-topic: we may want to add support for empty bags at some point, if we can do it without significant overhead.