Port mse_loss to ATen #26529
Conversation
Can you please add benchmarks with … as well? Also, the GPU numbers look concerning to me; I will check the code carefully for the reason for the slowdown.
The following is the performance data. How to run: … Before: … After: …
I will also look into the reason for the GPU performance degradation. Thanks!
@VitalyFedyunin, do you have any findings about the GPU degradation? Using the same inputs, I measured the compute time of torch.sub; it is about 70% of the original mse_loss compute time, so perhaps there is a large overhead from porting sub to ATen (#8919). Should we compare the ATen sub with the THC sub for large input sizes?
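For reference, a minimal sketch of how such a single-op GPU timing can be done, in the same style as the test script later in this thread (synchronizing before reading the clock; the sizes and iteration count are arbitrary):

```python
import time
import torch

a = torch.randn(128, 10000, device='cuda')
b = torch.randn(128, 10000, device='cuda')

# Time torch.sub alone; synchronize so the measurement covers kernel execution,
# not just the asynchronous launch.
torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    out = torch.sub(a, b)
torch.cuda.synchronize()
print('torch.sub avg time: %.3f ms' % ((time.time() - start) / 1000 * 1000))
```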
You need to fuse operations to get comparable performance (CPU and CUDA). |
VitalyFedyunin
left a comment
You need to fuse operations together to get similar/better performance.
aten/src/ATen/native/Loss.cpp
Outdated
As an example, this is 2 separate CPU loops or CUDA kernels: one for sub, another for pow.
This CUDA loop does sub and pow inside the same kernel.
aten/src/THNN/generic/MSECriterion.c
Outdated
This CPU implementation does sub and pow inside a single loop.
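To illustrate the fusion point in Python terms (the ported operator is the real F.mse_loss; the unfused composition is only a stand-in to show the extra passes over memory):

```python
import torch
import torch.nn.functional as F

a = torch.randn(1000000)
b = torch.randn(1000000)

# Unfused composition: `a - b` writes a temporary, `** 2` makes a second pass,
# `.mean()` a third. Each step is its own CPU loop / CUDA kernel.
unfused = ((a - b) ** 2).mean()

# Per the review comments above, the ported kernel does sub and pow inside one
# TensorIterator loop, so no intermediate tensor is materialized before the reduction.
fused = F.mse_loss(a, b, reduction='mean')

print(torch.allclose(unfused, fused))
```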
@VitalyFedyunin @zou3519, the backward of l1_loss and mse_loss also has performance overhead because it calls sub and mul in two loops. There are two ways to fuse them: use TensorIterator to implement a ternary kernel (sub_mul), or just use parallel_for. Which one is acceptable to you? Thanks!
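A small sketch of the math such a fused backward has to produce, checked against autograd (sub_mul is only the proposed kernel name from the comment above, not an existing op; reduction='mean' is assumed):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, requires_grad=True)
t = torch.randn(8)

loss = F.mse_loss(x, t, reduction='mean')
loss.backward()

# A fused "sub and mul in one loop" backward computes, element-wise:
#   grad_input = (2 / N) * (x - t) * grad_output
# grad_output is 1.0 here because backward() was called on the scalar loss.
manual = 2.0 / x.numel() * (x.detach() - t) * 1.0
print(torch.allclose(x.grad, manual))
```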
aten/src/ATen/native/Loss.cpp
Outdated
pytorch/torch/nn/functional.py
Lines 2183 to 2205 in 8fb756d
```python
def mse_loss(input, target, size_average=None, reduce=None, reduction='mean'):
    # type: (Tensor, Tensor, Optional[bool], Optional[bool], str) -> Tensor
    r"""mse_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor
    Measures the element-wise mean squared error.
    See :class:`~torch.nn.MSELoss` for details.
    """
    if not (target.size() == input.size()):
        warnings.warn("Using a target size ({}) that is different to the input size ({}). "
                      "This will likely lead to incorrect results due to broadcasting. "
                      "Please ensure they have the same size.".format(target.size(), input.size()),
                      stacklevel=2)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    if target.requires_grad:
        ret = (input - target) ** 2
        if reduction != 'none':
            ret = torch.mean(ret) if reduction == 'mean' else torch.sum(ret)
    else:
        expanded_input, expanded_target = torch.broadcast_tensors(input, target)
        ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
    return ret
```
I think we can delete the broadcast_tensors call now after this change.
The following is the performance after fusing operators: with the fused operators, the overhead is reduced to 0.02 ms for the GPU forward pass, while there is a performance improvement for the other cases.
aten/src/ATen/native/Loss.cpp
Outdated
@zou3519, there is a problem for mse_loss backward if we delete broadcast_tensors: for input.size = (2, 1) and target.size = (2, 10), the normalization number should be 20, but we get 2 there. So I did not remove the broadcast_tensors call for now.
Oh, I see... It's because we have a custom backward implementation for it. That's fine for now then, we can work on improving mse_loss separately after the port, thank you for pointing that out.
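For concreteness, a short sketch of the normalization issue discussed above, using the shapes from the comment (everything else is plain PyTorch broadcasting):

```python
import torch

input = torch.zeros(2, 1)
target = torch.zeros(2, 10)

# broadcast_tensors expands both tensors to shape (2, 10), i.e. 20 elements.
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
print(expanded_input.shape, expanded_input.numel())  # torch.Size([2, 10]) 20

# With reduction='mean' the sum of squared errors must be divided by 20
# (the broadcast element count), not by input.numel() == 2, which is why the
# explicit broadcast_tensors call before mse_loss still matters for the
# custom backward.
manual_mean = ((expanded_input - expanded_target) ** 2).mean()  # divides by 20
```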
@VitalyFedyunin, please help review.
@VitalyFedyunin, can you find someone to review the code? I have rebased it too many times. 😞
Summary: This is a port of the TH `SmoothL1Criterion` to ATen using TensorIterator. The forward implementation has been placed in BinaryOpsKernel.cpp/.cu while the backward version was added to PointwiseOpsKernel.cpp/.cu. CPU performance has improved for both forward & backward path. With CUDA the performance of the forward pass has slightly degraded compared to the TH implementation (see benchmark results). ### Questions: 1. Is the storage location of the implementation ok (I followed #26529) or should we create a separate .cpp/.h file pair for each operator implementation (e.g. to keep things together)? 2. The GPU forward-pass now seems to take consistently longer than the old version. Any ideas what we could try to bring it on par with the old impl? ## WITH patch benchmark result: ``` CPU warmup 1000 took 0.00018124299822375178 CPU warmup 10000 took 0.00021713999740313739 CPU warmup 100000 took 0.0016273759974865243 CPU warmup TOTAL time 0.0020758909959113225 CPU forward 1000 took 6.229899736354128e-05 CPU forward 10000 took 0.00013340599980438128 CPU forward 100000 took 0.0008730469999136403 CPU forward 1000000 took 0.011010036003426649 CPU forward 10000000 took 0.11133221499767387 CPU forward 100000000 took 1.0425375220002024 CPU forward TOTAL time 1.1660894790038583 CPU for- & backward 1000 took 0.0002662249971763231 CPU for- & backward 10000 took 0.00023712700203759596 CPU for- & backward 100000 took 0.002531945996452123 CPU for- & backward 1000000 took 0.010394354998425115 CPU for- & backward 10000000 took 0.23814761800167616 CPU for- & backward 100000000 took 1.2651235049997922 CPU for- & backward TOTAL time 1.516897434994462 GPU warmup 1000 took 0.00020941899856552482 GPU warmup 10000 took 8.128300396492705e-05 GPU warmup 100000 took 8.551499922759831e-05 GPU warmup TOTAL time 0.0004199420000077225 GPU forward 1000 took 7.060499774524942e-05 GPU forward 10000 took 7.116600318113342e-05 GPU forward 100000 took 9.825800225371495e-05 GPU forward 1000000 took 0.000499356996442657 GPU forward 10000000 took 0.002032470001722686 GPU forward 100000000 took 0.018638986002770253 GPU forward TOTAL time 0.02148268099699635 GPU for- & backward 1000 took 0.00035967300209449604 GPU for- & backward 10000 took 0.00032710300001781434 GPU for- & backward 100000 took 0.0003689270015456714 GPU for- & backward 1000000 took 0.0007732619997113943 GPU for- & backward 10000000 took 0.02127284000016516 GPU for- & backward 100000000 took 0.2022330649997457 GPU for- & backward TOTAL time 0.2254496300010942 ``` ## WITHOUT patch benchmark result: ``` CPU warmup 1000 took 0.00011545199959073216 CPU warmup 10000 took 0.00016227000014623627 CPU warmup 100000 took 0.0013456509987008758 CPU warmup TOTAL time 0.001648657998885028 CPU forward 1000 took 2.627600042615086e-05 CPU forward 10000 took 0.00015939700097078457 CPU forward 100000 took 0.001139313004387077 CPU forward 1000000 took 0.013769682998827193 CPU forward 10000000 took 0.13163026500114938 CPU forward 100000000 took 1.321879123999679 CPU forward TOTAL time 1.4687001089987461 CPU for- & backward 1000 took 0.0002569290008977987 CPU for- & backward 10000 took 0.00033315900509478524 CPU for- & backward 100000 took 0.0016096779945655726 CPU for- & backward 1000000 took 0.014474845003860537 CPU for- & backward 10000000 took 0.1564881520025665 CPU for- & backward 100000000 took 1.5787935900007142 CPU for- & backward TOTAL time 1.7521004869995522 GPU warmup 1000 took 0.00025611399905756116 GPU warmup 10000 took 0.00014123699656920508 GPU warmup 100000 took 
0.00012580600014189258 GPU warmup TOTAL time 0.0005591579974861816 GPU forward 1000 took 0.00031183200189843774 GPU forward 10000 took 0.00011483799607958645 GPU forward 100000 took 0.00010807999933604151 GPU forward 1000000 took 0.0007842139966669492 GPU forward 10000000 took 0.0017624700049054809 GPU forward 100000000 took 0.01519905700115487 GPU forward TOTAL time 0.018341148999752477 GPU for- & backward 1000 took 0.00047569099842803553 GPU for- & backward 10000 took 0.0003539700046530925 GPU for- & backward 100000 took 0.000808880002296064 GPU for- & backward 1000000 took 0.001639469999645371 GPU for- & backward 10000000 took 0.021154599002329633 GPU for- & backward 100000000 took 0.19268552300491137 GPU for- & backward TOTAL time 0.2172460189976846 ``` ### Code used for perforrmance testing ``` import torch import torch.nn.functional as F import torch.nn as nn from timeit import default_timer torch.manual_seed(0) cpu = torch.device('cpu') gpu = torch.device('cuda') loss_fn = F.smooth_l1_loss def run_benchmark(name, depth, require_grad, device, fn): total_start = default_timer() y = None a = None for i in range(3, 3 + depth): start = default_timer() n = 10 ** i a = torch.rand(n, requires_grad=require_grad, device=device) b = torch.rand(n, device=device) y = fn(a, b) y.cpu() # get result (potentially wait for gpu) if a.grad is not None: a.grad.cpu() end = default_timer() print('{} {} took {}'.format(name, n, end-start)) total_end = default_timer() print('{} TOTAL time {}'.format(name, total_end-total_start)) def fwd_only(a, b): out = loss_fn(a, b) return out def fwd_bck(a, b): out = loss_fn(a, b) out.backward() return out def sanity_check(name, device): print('{} Operator sanity check:'.format(name)) a = torch.randn(16, requires_grad=True, device=device) b = torch.randn(16, device=device) * 2 out = loss_fn(a, b) print('out', out) out.backward() print(a.grad) print('double backward') loss = loss_fn(a, b) loss2 = torch.autograd.grad(loss, a, create_graph=True) z = loss2[0].sum() print(z) z.backward() print('ok') print() print('PyTorch version:', torch.__version__) sanity_check('CPU', cpu) if torch.cuda.is_available(): sanity_check('GPU', gpu) print() run_benchmark('CPU warmup', 3, False, cpu, fwd_only) run_benchmark('CPU forward', 6, False, cpu, fwd_only) run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck) print() if torch.cuda.is_available(): run_benchmark('GPU warmup', 3, False, gpu, fwd_only) run_benchmark('GPU forward', 6, False, gpu, fwd_only) run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck) ``` Pull Request resolved: #27962 Differential Revision: D18061942 Pulled By: ezyang fbshipit-source-id: 0d1fc528b59d47d4773b03240c3368db021cb9db
@VitalyFedyunin, please help review the code, thanks!
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks good. Can you please update the PR description with the most recent benchmark numbers?
Benchmark numbers are updated, thanks!
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
VitalyFedyunin, this PR ports mse_loss to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.MSELoss(reduction='sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

# get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backward avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
**Test Device:** CPU: skx-8180, GPU: Tesla P40.
### Performance:
**Before:**
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.12 (ms); backward avg time is 0.21 (ms).
reduction='sum'
input size(128, 100) forward time is 0.09 (ms); backward avg time is 0.15 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.20 (ms).
CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).
reduction='sum'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).
OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.41 (ms); backward avg time is 1.66 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.44 (ms); backward avg time is 1.68 (ms).
```
**After:**
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.13 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).
reduction='sum'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).
CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.14 (ms); backward avg time is 0.30 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.30 (ms).
OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.85 (ms); backward avg time is 1.27 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.83 (ms); backward avg time is 1.27 (ms).
```
Pull Request resolved: pytorch/pytorch#26529
Differential Revision: D18225144
Pulled By: VitalyFedyunin
fbshipit-source-id: ce837a297c70398a3ffa22f26ee9e812cf60d128
@VitalyFedyunin merged this pull request in c8771f5.