Unique cuda support #8899
Conversation
I don't see a synchronization call (`torch.cuda.synchronize()`) before the timer is stopped, so the GPU timing may be inaccurate.
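A minimal sketch of the timing pattern being suggested (illustrative only; the tensor here is a placeholder, and a CUDA device is assumed):

```python
import time

import torch

y = torch.randint(1, 10, (1000000,), device="cuda")

start = time.time()
output, inverse = y.unique(sorted=True, return_inverse=True)
torch.cuda.synchronize()  # wait for all queued CUDA kernels to finish...
stop = time.time()        # ...before reading the clock
print("gpu:", stop - start, "s")
```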
@ngimel Thanks for the reminder. I've updated the code and the results.
ROCm build OOMed:
ezyang left a comment:
Nice! The inverse-index algorithm is not "great", but it seems to be good enough.
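For context, an inverse-index scheme along these lines can be sketched with high-level ops. This is an illustrative reconstruction, not the PR's actual CUDA kernel; `torch.searchsorted` stands in for the per-element binary search a GPU implementation would run:

```python
import torch

def unique_with_inverse(x):
    # Sort, then keep the first element of each run of equal values.
    sorted_x, _ = torch.sort(x)
    keep = torch.ones_like(sorted_x, dtype=torch.bool)
    keep[1:] = sorted_x[1:] != sorted_x[:-1]
    output = sorted_x[keep]
    # Inverse index: binary-search each original element into the
    # unique list -- O(n log u) total work, simple and parallel-friendly.
    inverse = torch.searchsorted(output, x)
    return output, inverse

x = torch.tensor([1, 3, 1, 4, 9, 4])
print(unique_with_inverse(x))  # (tensor([1, 3, 4, 9]), tensor([0, 1, 0, 2, 3, 2]))
```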
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@yueyericardo If you want to unblock this for merge, I'd advise preprocessing out the implementation when building for ROCm (so we'll have `unique` for CUDA but not ROCm).
@ezyang I've implemented your advice. Thank you.
Great, thanks!
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Add CUDA support for `unique`.
Below is a simple benchmark on a tensor of 1M `int` values; the GPU path is roughly 3x faster than the CPU path.
Performance:
```
cpu: 0.05040597915649414 s
x: tensor([1, 3, 1, ..., 4, 9, 4])
x output: tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])
x inverse: tensor([0, 2, 0, ..., 3, 8, 3])
gpu: 0.015192985534667969 s
y: tensor([1, 3, 1, ..., 4, 9, 4], device='cuda:0')
y output: tensor([1, 2, 3, 4, 5, 6, 7, 8, 9], device='cuda:0')
y inverse: tensor([0, 2, 0, ..., 3, 8, 3], device='cuda:0')
```
Code:
```python
import time

import torch

# 1M random integers in [1, 10)
x = torch.randint(1, 10, (1000000,), dtype=torch.long)
device = torch.device("cuda")
y = x.to(device)

# Time the CPU path.
start = time.time()
output, inverse = x.unique(sorted=True, return_inverse=True)
stop = time.time()
print('cpu:', stop - start, 's')
print('x:', x)
print('x output:', output)
print('x inverse:', inverse)

# Time the GPU path; synchronize so all queued CUDA work finishes
# before the clock is read.
start = time.time()
output1, inverse1 = y.unique(sorted=True, return_inverse=True)
torch.cuda.synchronize()
stop = time.time()
print('gpu:', stop - start, 's')
print('y:', y)
print('y output:', output1)
print('y inverse:', inverse1)
```
Closes pytorch/pytorch#8899
Reviewed By: SsnL
Differential Revision: D8677655
Pulled By: ezyang
fbshipit-source-id: 09df3f0602f235c5d36c7a6e7e1d89dbf82570bb
Summary:
Pull Request resolved: #537

pytorch/pytorch#8899 added CUDA support for `torch.unique()`; pytorch/pytorch#16145 has some timing stats that may be relevant.

Experiment results: https://fb.quip.com/olQOA853j0mb

Words per second (`gpu-unique_wps_avg_vs_base`): 1.046x
Total train time (`gpu-unique_total_train_time_vs_base`, excluding ar_AR-fr_XX): 0.987x

Even though the train-time reduction is minimal (probably overshadowed by random variance, scheduling delay, etc.), WPS does seem to be ~5% faster, so we might as well land this. Training time for ar_AR-fr_XX increased significantly, but that's because it trained for many more updates (`gpu-unique_num_updates_avg_vs_base`) and also ended up with +1.43 BLEU; this is probably just an anomaly.

Reviewed By: akinh, jmp84
Differential Revision: D15073468
fbshipit-source-id: c2dba562b6d4fb4d15d2a56d03ce6a6e3ddff07d