
Conversation

@ngimel (Collaborator) commented Jan 18, 2019

Partial fix for #15804; this covers only the case without `dim`.
For @jcjohnson's benchmarking script I'm getting the following results on V100:
Before:

```
Running with N = 10000, M = 10000
cuda (no inverse): 0.98 ms
cpu (no inverse): 0.96 ms
cuda (with inverse): 1.07 ms
cpu (with inverse): 1.76 ms

Running with N = 10000, M = 100000
cuda (no inverse): 0.76 ms
cpu (no inverse): 1.53 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 3.02 ms

Running with N = 100000, M = 100000
cuda (no inverse): 1.28 ms
cpu (no inverse): 11.22 ms
cuda (with inverse): 69.76 ms
cpu (with inverse): 20.28 ms

Running with N = 100000, M = 1000000
cuda (no inverse): 0.78 ms
cpu (no inverse): 18.78 ms
cuda (with inverse): 133.45 ms
cpu (with inverse): 34.09 ms

Running with N = 500000, M = 500000
cuda (no inverse): 1.43 ms
cpu (no inverse): 61.13 ms
cuda (with inverse): 3315.18 ms
cpu (with inverse): 104.57 ms

Running with N = 500000, M = 5000000
cuda (no inverse): 0.86 ms
cpu (no inverse): 96.44 ms
cuda (with inverse): 5209.93 ms
cpu (with inverse): 176.10 ms
```

After:

```
Running with N = 10000, M = 10000
cuda (no inverse): 1.04 ms
cpu (no inverse): 0.94 ms
cuda (with inverse): 0.64 ms
cpu (with inverse): 1.76 ms

Running with N = 10000, M = 100000
cuda (no inverse): 0.77 ms
cpu (no inverse): 1.55 ms
cuda (with inverse): 0.58 ms
cpu (with inverse): 2.79 ms

Running with N = 100000, M = 100000
cuda (no inverse): 1.30 ms
cpu (no inverse): 14.15 ms
cuda (with inverse): 1.63 ms
cpu (with inverse): 20.90 ms

Running with N = 100000, M = 1000000
cuda (no inverse): 0.82 ms
cpu (no inverse): 18.63 ms
cuda (with inverse): 0.61 ms
cpu (with inverse): 33.52 ms

Running with N = 500000, M = 500000
cuda (no inverse): 1.51 ms
cpu (no inverse): 59.81 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 110.69 ms

Running with N = 500000, M = 5000000
cuda (no inverse): 0.92 ms
cpu (no inverse): 104.26 ms
cuda (with inverse): 0.84 ms
cpu (with inverse): 187.12 ms
```
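
For context, a minimal sketch of how such a benchmark could be run (this is not the exact script referenced above; the `bench` helper and the iteration count are assumptions):

```
import time
import torch

def bench(a, return_inverse, iters=20):
    # Hypothetical timing helper: one warm-up call, then an average over several runs.
    a.unique(return_inverse=return_inverse)
    if a.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a.unique(return_inverse=return_inverse)
    if a.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000.0  # milliseconds per call

for N in [10000, 100000, 500000]:
    for M in [N, 10 * N]:
        print('Running with N = %d, M = %d' % (N, M))
        a_cpu = torch.randint(M, size=(N,))
        a_gpu = a_cpu.cuda()
        print('cuda (no inverse): %.2f ms' % bench(a_gpu, False))
        print('cpu (no inverse): %.2f ms' % bench(a_cpu, False))
        print('cuda (with inverse): %.2f ms' % bench(a_gpu, True))
        print('cpu (with inverse): %.2f ms' % bench(a_cpu, True))
```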

@soumith (Contributor) commented Jan 18, 2019

Before I land this, could you please locally check correctness against the CPU as well?

The unit tests only cover small_3d test cases.

@soumith added this to the 1.0.1 milestone Jan 18, 2019
@jcjohnson (Contributor) commented:

This is amazing -- thank you so much @ngimel for the quick fix!

@ngimel (Collaborator, Author) commented Jan 18, 2019

@soumith I tested the cases from Justin's benchmark, all clean. Do you want some tests added to the test suite?

```
import torch

# Compare CPU and CUDA results of unique(return_inverse=True) for the benchmark sizes;
# both printed differences should be 0.
for N in [10000, 100000, 500000]:
    for M in [N, 10 * N]:
        print('Running with N = %d, M = %d' % (N, M))
        a = torch.randint(M, size=(N,))
        unique, indices = a.unique(return_inverse=True)
        a = a.cuda()
        unique_gpu, indices_gpu = a.unique(return_inverse=True)
        unique = unique.cuda()
        indices = indices.cuda()
        print((unique - unique_gpu).abs().max(), (indices - indices_gpu).abs().max())
```

@soumith (Contributor) commented Jan 18, 2019

@ngimel your testing sounds good, and there are smoke tests with small_3d matrices in test_torch / test_cuda which are hopefully sufficient. If you see anything worth adding as a unit test, do send a follow-up PR; I'll land this now.
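
For reference, a follow-up unit test along those lines might look roughly like this (a sketch only, not code from this PR; the test name and sizes are hypothetical):

```
import torch

def test_unique_cuda_matches_cpu():
    # Hypothetical test: CUDA unique()/return_inverse should match the CPU results
    # on inputs much larger than the small_3d smoke tests.
    for N in [10000, 100000]:
        a = torch.randint(10 * N, size=(N,))
        unique_cpu, inverse_cpu = a.unique(return_inverse=True)
        unique_gpu, inverse_gpu = a.cuda().unique(return_inverse=True)
        assert torch.equal(unique_cpu, unique_gpu.cpu())
        assert torch.equal(inverse_cpu, inverse_gpu.cpu())
```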

@facebook-github-bot (Contributor) left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jan 18, 2019
Pull Request resolved: pytorch/pytorch#16145

Differential Revision: D13738821

Pulled By: soumith

fbshipit-source-id: 0811fb4ade47e3b466cebbc124e3f3333a986749
@gkioxari commented Jan 23, 2019

@ngimel @jcjohnson Is it possible to have the unique_index returned as well, when the appropriate flag is turned on? This would be similar functionality to np.unique().
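
For reference, the NumPy behavior being asked about: `return_index` gives, for each unique value, the index of its first occurrence in the input. This is just an illustration, not PyTorch code:

```
import numpy as np

a = np.array([3, 1, 3, 2, 1])
# return_index: position of the first occurrence of each unique value in a.
# return_inverse: for each element of a, its index into the unique array.
unique, index, inverse = np.unique(a, return_index=True, return_inverse=True)
print(unique)   # [1 2 3]
print(index)    # [1 3 0]
print(inverse)  # [2 0 2 1 0]
```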

@ngimel (Collaborator, Author) commented Jan 23, 2019

Yes, it's possible; my original commit was returning unique_index instead of inverse_index, so just restore the thrust::unique_by_key call with appropriate handling of the options? 883edac

@ngimel deleted the unique branch January 23, 2019 23:11
soumith pushed a commit that referenced this pull request Jan 24, 2019
@soumith added the cherry-picked label (This PR was cherry-picked onto a release branch from master) Jan 24, 2019
soumith pushed a commit that referenced this pull request Jan 29, 2019
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
facebook-github-bot pushed a commit to pytorch/translate that referenced this pull request May 17, 2019
Summary:
Pull Request resolved: #537

pytorch/pytorch#8899 had added CUDA support for `torch.unique()`

pytorch/pytorch#16145 has some timing stats that could be relevant

 ---

Experiment results: https://fb.quip.com/olQOA853j0mb
Words per second (`gpu-unique_wps_avg_vs_base`): 1.046x
Total train time (`gpu-unique_total_train_time_vs_base`; excl ar_AR-fr_XX): 0.987x

Even though the reduction in train time is pretty minimal (probably overshadowed by random variance, scheduling delay, etc.), WPS does seem to be ~5% faster, so might as well land this.

Training time for ar_AR-fr_XX increased significantly, but that's because it trained for many more updates (`gpu-unique_num_updates_avg_vs_base`) and also ended up with +1.43 BLEU. I think this is probably just an anomaly.

Reviewed By: akinh, jmp84

Differential Revision: D15073468

fbshipit-source-id: c2dba562b6d4fb4d15d2a56d03ce6a6e3ddff07d