
Conversation

@weiyangfb (Contributor) commented Oct 30, 2018

  • `sparse_mask(D, S)` is useful for implementing the backward of `sparse_addmm()`
  • the previous `sparse_mask(D, S)` CPU kernel was not parallelized
  • this PR speeds up the CPU kernel in two separate cases:
    • `D.dim == S.sparse_dim`: simply parallelize the kernel
    • `D.dim > S.sparse_dim`: simply use CUDA kernel implementation
  • performance:

`D.dim == S.sparse_dim`
```
>>> nnz = 100000
>>> dims = [1000, 1000]
>>> I = torch.cat([torch.randint(0, dims[0], size=(nnz,)),
               torch.randint(0, dims[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz)
>>> size = torch.Size(dims)

>>> S = torch.sparse_coo_tensor(I, V, size).coalesce()
>>> D = torch.randn(dims)

>>> %timeit D.sparse_mask(S)

======= before change =======
6.4 ms ± 684 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

======= after change =======
333 µs ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

`D.dim > S.sparse_dim`
```
>>> nnz = 100000
>>> dims = [1000, 1000, 2, 2]
>>> I = torch.cat([torch.randint(0, dims[0], size=(nnz,)),
               torch.randint(0, dims[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz, dims[2], dims[3])
>>> size = torch.Size(dims)

>>> S = torch.sparse_coo_tensor(I, V, size).coalesce()
>>> D = torch.randn(dims)
>>> %timeit D.sparse_mask(S)

======= before change =======
495 ms ± 41.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

======= after change =======
594 µs ± 68.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
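For context, `sparse_mask(D, S)` returns a sparse tensor that keeps S's indices but takes its values from the dense tensor D. Below is a minimal, unoptimized Python sketch of that semantics, assuming a coalesced S; the helper name `sparse_mask_reference` is illustrative, not part of the API:

```python
import torch

def sparse_mask_reference(D, S):
    # Keep S's sparsity pattern, but read the values out of the dense D.
    S = S.coalesce()
    I = S.indices()  # shape: (sparse_dim, nnz)
    # Advanced indexing with one index tensor per sparse dim gathers one
    # value (or one dense sub-tensor, when D.dim > S.sparse_dim) per nonzero.
    V = D[tuple(I[d] for d in range(I.size(0)))]
    return torch.sparse_coo_tensor(I, V, S.size())
```

This covers both benchmarked cases: when `D.dim == S.sparse_dim` each gathered value is a scalar, and when `D.dim > S.sparse_dim` it is a dense sub-tensor (here of shape `dims[2:]`).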

@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb (Contributor, Author) commented

How can I further improve this PR? cc @ssnl @ezyang

@ezyang (Contributor) commented Nov 5, 2018

> D.dim > S.sparse_dim: simply use CUDA kernel implementation

I'm not sure what you mean by this.

@ezyang (Contributor) left a comment

Good work, thank you.

@weiyangfb (Contributor, Author) commented Nov 5, 2018

@ezyang I meant that for the `D.dim > S.sparse_dim` case, I just copied & pasted from:

```cpp
LongTensor indices = at::zeros({mask._nnz()}, mask_indices.options());
for (int64_t d = 0; d < mask.sparse_dim(); d++) {
  indices.mul_(mask.size(d));
  // This used to use a buffer but I deoptimized it
  indices.add_(mask_indices.select(0, d));
}
std::vector<int64_t> view_size(1 + mask.dense_dim());
view_size[0] = -1;
for (int64_t d = 0; d < mask.dense_dim(); d++) {
  view_size[d + 1] = mask.size(mask.sparse_dim() + d);
}
Tensor t_view = t.view(view_size);
// TODO: Re-audit this; it used to be an indexSelect directly into r_values
at::index_select_out(r_values, t_view, 0, indices);
return r;
```

I feel bad about duplicating the code. Let me know if this is not OK.
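The trick in the quoted C++ is to fold each sparse coordinate into a single row-major linear index, then perform one index_select on a view of the dense tensor whose first dimension spans all sparse dims. A rough Python analogue, just to illustrate the idea (`gather_dense_values` is a hypothetical name, not part of the codebase):

```python
import torch

def gather_dense_values(D, S):
    # Mirror of the mul_/add_ loop above: fold the per-dim coordinates
    # into one row-major linear index over the sparse dims.
    S = S.coalesce()
    idx = S.indices()  # shape: (sparse_dim, nnz)
    flat = torch.zeros(idx.size(1), dtype=torch.long)
    for d in range(S.sparse_dim()):
        flat.mul_(S.size(d)).add_(idx[d])
    # Reshape D so its first dim enumerates all sparse coordinates,
    # then gather the masked values in a single index_select.
    view = D.reshape(-1, *D.shape[S.sparse_dim():])
    return view.index_select(0, flat)  # shape: (nnz, *dense_dims)
```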

@ezyang (Contributor) commented Nov 7, 2018

It's OK, I wouldn't block the patch on it.

@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb force-pushed the sparse_mask_parallelize_cpu branch from 5920e0a to 7e8d4e1 on November 7, 2018 at 22:31
@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb force-pushed the sparse_mask_parallelize_cpu branch from 7e8d4e1 to e12531f on November 7, 2018 at 23:15
@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 8, 2018
Pull Request resolved: pytorch/pytorch#13290

Differential Revision: D12878336

Pulled By: weiyangfb

fbshipit-source-id: 10b5981af382f7c6095a42c0fee7297d6438ce37
@ezyang added the merged label Jun 25, 2019