
Conversation

@weiyangfb (Contributor) commented Oct 30, 2018

  • `sparse_mask(D, S)` is useful for implementing the backward of `sparse_addmm()`
  • the previous `sparse_mask(D, S)` CPU kernel was not parallelized
  • this PR speeds up the CPU kernel in two separate cases:
    • `D.dim == S.sparse_dim`: simply parallelize the kernel
    • `D.dim > S.sparse_dim`: simply use CUDA kernel implementation
  • performance:

`D.dim == S.sparse_dim`
```
>>> nnz = 100000
>>> dims = [1000, 1000]
>>> I = torch.cat([torch.randint(0, dims[0], size=(nnz,)),
               torch.randint(0, dims[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz)
>>> size = torch.Size(dims)

>>> S = torch.sparse_coo_tensor(I, V, size).coalesce()
>>> D = torch.randn(dims)

>>> %timeit D.sparse_mask(S)

======= before change =======
6.4 ms ± 684 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

======= after change =======
333 µs ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

`D.dim > S.sparse_dim`
```
>>> nnz = 100000
>>> dims = [1000, 1000, 2, 2]
>>> I = torch.cat([torch.randint(0, dims[0], size=(nnz,)),
               torch.randint(0, dims[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz, dims[2], dims[3])
>>> size = torch.Size(dims)

>>> S = torch.sparse_coo_tensor(I, V, size).coalesce()
>>> D = torch.randn(dims)
>>> %timeit D.sparse_mask(S)

======= before change =======
495 ms ± 41.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

======= after change =======
594 µs ± 68.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
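For context, `sparse_mask(D, S)` returns a sparse tensor that keeps S's indices but takes its values from the dense tensor D. Below is a minimal, unoptimized Python sketch of that semantics, assuming a coalesced S; the helper name `sparse_mask_reference` is illustrative, not part of the API:

```python
import torch

def sparse_mask_reference(D, S):
    # Keep S's sparsity pattern, but read the values out of the dense D.
    S = S.coalesce()
    I = S.indices()  # shape: (sparse_dim, nnz)
    # Advanced indexing with one index tensor per sparse dim gathers one
    # value (or one dense sub-tensor, when D.dim > S.sparse_dim) per nonzero.
    V = D[tuple(I[d] for d in range(I.size(0)))]
    return torch.sparse_coo_tensor(I, V, S.size())
```

This covers both benchmarked cases: when `D.dim == S.sparse_dim` each gathered value is a scalar, and when `D.dim > S.sparse_dim` it is a dense sub-tensor (here of shape `dims[2:]`).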

@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb (Contributor, Author) commented

How can I further improve this PR? cc @ssnl @ezyang

@ezyang (Contributor) commented Nov 5, 2018

> D.dim > S.sparse_dim: simply use CUDA kernel implementation

I'm not sure what you mean by this.

@ezyang (Contributor) left a comment

Good work, thank you.

@weiyangfb (Contributor, Author) commented Nov 5, 2018

@ezyang I meant that for the `D.dim > S.sparse_dim` case, I just copied & pasted from:

```cpp
LongTensor indices = at::zeros({mask._nnz()}, mask_indices.options());
for (int64_t d = 0; d < mask.sparse_dim(); d++) {
  indices.mul_(mask.size(d));
  // This used to use a buffer but I deoptimized it
  indices.add_(mask_indices.select(0, d));
}
std::vector<int64_t> view_size(1 + mask.dense_dim());
view_size[0] = -1;
for (int64_t d = 0; d < mask.dense_dim(); d++) {
  view_size[d + 1] = mask.size(mask.sparse_dim() + d);
}
Tensor t_view = t.view(view_size);
// TODO: Re-audit this; it used to be an indexSelect directly into r_values
at::index_select_out(r_values, t_view, 0, indices);
return r;
```

I feel bad about duplicating the code. Let me know if this is not OK.
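The trick in the quoted C++ is to fold each sparse coordinate into a single row-major linear index, then perform one index_select on a view of the dense tensor whose first dimension spans all sparse dims. A rough Python analogue, just to illustrate the idea (`gather_dense_values` is a hypothetical name, not part of the codebase):

```python
import torch

def gather_dense_values(D, S):
    # Mirror of the mul_/add_ loop above: fold the per-dim coordinates
    # into one row-major linear index over the sparse dims.
    S = S.coalesce()
    idx = S.indices()  # shape: (sparse_dim, nnz)
    flat = torch.zeros(idx.size(1), dtype=torch.long)
    for d in range(S.sparse_dim()):
        flat.mul_(S.size(d)).add_(idx[d])
    # Reshape D so its first dim enumerates all sparse coordinates,
    # then gather the masked values in a single index_select.
    view = D.reshape(-1, *D.shape[S.sparse_dim():])
    return view.index_select(0, flat)  # shape: (nnz, *dense_dims)
```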

@ezyang (Contributor) commented Nov 7, 2018

It's OK, I wouldn't block the patch on it.

@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb force-pushed the sparse_mask_parallelize_cpu branch from 5920e0a to 7e8d4e1 on November 7, 2018 at 22:31
@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb force-pushed the sparse_mask_parallelize_cpu branch from 7e8d4e1 to e12531f on November 7, 2018 at 23:15
@facebook-github-bot left a comment

@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 8, 2018
Pull Request resolved: pytorch/pytorch#13290

Differential Revision: D12878336

Pulled By: weiyangfb

fbshipit-source-id: 10b5981af382f7c6095a42c0fee7297d6438ce37
@ezyang added the merged label Jun 25, 2019