
Conversation

@XiaobingSuper (Collaborator) commented Sep 20, 2019

@VitalyFedyunin, this PR ports mse_loss to ATen.

Test script:

import torch
import torch.nn as nn
import time

def _time():
    # Synchronize first so GPU timings measure finished kernels (CUDA launches are asynchronous).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.MSELoss(reduction = 'sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

#get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))

Test Device: CPU: skx-8180, GPU: Tesla P40.

Performance:

Before:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.12 (ms); backward avg time is 0.21 (ms).
reduction='sum'
input size(128, 100) forward time is 0.09 (ms); backward avg time is 0.15 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.20 (ms).

CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).
reduction='sum'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.41 (ms); backward avg time is 1.66 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.44 (ms); backward avg time is 1.68 (ms).

After:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.13 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

reduction='sum'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.14 (ms); backward avg time is 0.30 (ms).

reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.30 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.85 (ms); backward avg time is 1.27 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.83 (ms); backward avg time is 1.27 (ms).

@pytorchbot pytorchbot added module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Sep 20, 2019
@VitalyFedyunin VitalyFedyunin self-requested a review September 20, 2019 15:20
@VitalyFedyunin (Contributor)

Can you please add benchmarks with OMP_NUM_THREADS=1?

Also, the GPU numbers look concerning to me; I will check the code carefully for the reason for the slowdown.

@VitalyFedyunin (Contributor)

#24598 Migrate mse_loss from the TH to Aten (CUDA)
#24599 Migrate mse_loss_backward from the TH to Aten (CUDA)
#24732 Migrate mse_loss from the TH to Aten (CPU)
#24733 Migrate mse_loss_backward from the TH to Aten (CPU)

@XiaobingSuper (Collaborator, Author)

The following is the performance data for OMP_NUM_THREADS=1; the run script is:

num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

# Pin the OpenMP threads to physical cores and keep memory on NUMA node 0.
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script

How to run: ./run.sh 1 mse_cpu.py

Before:

reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.52 (ms); backward avg time is 1.83 (ms).

reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.53 (ms); backward avg time is 1.85 (ms).

After:

reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.04 (ms); backward avg time is 1.68 (ms).

reduction='sum'
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.04 (ms); backward avg time is 1.69 (ms).

@XiaobingSuper (Collaborator, Author)

I will also look into the reason for the GPU performance degradation. Thanks!

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, do you have any findings about the GPU degradation? Using the same inputs, I tested the torch.sub compute time; it is about 70% of the original mse_loss compute time, so perhaps there is a big overhead from porting sub to ATen (#8919). Should we compare the ATen sub with the THC sub for large input sizes?

@VitalyFedyunin (Contributor)

You need to fuse operations to get comparable performance (CPU and CUDA).

@VitalyFedyunin (Contributor) left a comment

You need to fuse operations together to get similar/better performance.

Review comment (Contributor):

As an example, this is two separate CPU loops or CUDA kernels: one for sub, another for pow.

Review comment (Contributor):

This CUDA loop does sub and pow inside the same kernel.

Review comment (Contributor):

This CPU implementation does sub and pow inside a single loop.
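
For illustration only (not code from this PR), here is a minimal Python sketch of the behaviour the review comments above describe; the tensor names and sizes are arbitrary. The unfused form runs one element-wise pass for sub and another for pow, while the fused mse_loss kernel does both in a single loop, with the reduction applied on top of that result.

```
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(128, 10000, device=device)
y = torch.randn(128, 10000, device=device)

# Unfused: (x - y) writes a temporary tensor, then pow/mean read it again,
# i.e. two element-wise passes (two CPU loops or two CUDA kernels).
unfused = (x - y).pow(2).mean()

# Fused: the mse_loss kernel computes the squared difference inside a single
# loop/kernel; the 'mean' reduction is then applied to that result.
fused = F.mse_loss(x, y, reduction='mean')

assert torch.allclose(unfused, fused)
```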

@yf225 yf225 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Sep 25, 2019
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin @zou3519, the backward of L1_loss and mse_loss also has performance overhead because it calls sub and mul in two loops. There are two methods to fuse them: use TensorIterator to implement a ternary kernel (sub_mul), or just use parallel_for. Which one is acceptable to you? Thanks!
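
For reference, a small Python sketch (plain eager PyTorch, not this PR's kernels) of what the mse_loss backward has to produce; the separate sub and mul below are the element-wise passes that a fused sub_mul kernel or a single parallel_for loop would collapse into one. The helper name is made up for the example.

```
import torch
import torch.nn.functional as F

def mse_backward_reference(grad_output, input, target, reduction='mean'):
    # d/d(input) of (input - target)**2 is 2 * (input - target); the sub and
    # the following mul are the separate loops a fused kernel would combine.
    grad = 2.0 * (input - target) * grad_output
    if reduction == 'mean':
        grad = grad / input.numel()
    return grad

x = torch.randn(128, 100, requires_grad=True)
t = torch.randn(128, 100)
F.mse_loss(x, t, reduction='mean').backward()
assert torch.allclose(x.grad, mse_backward_reference(torch.ones(()), x, t))
```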

Review comment (Contributor):

pytorch/torch/nn/functional.py

Lines 2183 to 2205 in 8fb756d

def mse_loss(input, target, size_average=None, reduce=None, reduction='mean'):
    # type: (Tensor, Tensor, Optional[bool], Optional[bool], str) -> Tensor
    r"""mse_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor
    Measures the element-wise mean squared error.
    See :class:`~torch.nn.MSELoss` for details.
    """
    if not (target.size() == input.size()):
        warnings.warn("Using a target size ({}) that is different to the input size ({}). "
                      "This will likely lead to incorrect results due to broadcasting. "
                      "Please ensure they have the same size.".format(target.size(), input.size()),
                      stacklevel=2)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    if target.requires_grad:
        ret = (input - target) ** 2
        if reduction != 'none':
            ret = torch.mean(ret) if reduction == 'mean' else torch.sum(ret)
    else:
        expanded_input, expanded_target = torch.broadcast_tensors(input, target)
        ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
    return ret

I think we can delete the broadcast_tensors call now after this change.

@XiaobingSuper (Collaborator, Author) commented Sep 27, 2019

The following is the performance after fusing operators:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).
reduction='sum'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.13 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

CPU:
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.15 (ms).
reduction='sum'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.12 (ms); backward avg time is 0.14 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.85 (ms); backward avg time is 1.26 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.83 (ms); backward avg time is 1.27 (ms).

After fusing the operators, the remaining overhead for the GPU forward pass is down to about 0.02 ms, and the other cases show a performance improvement.

Review comment (Collaborator, Author):

@zou3519, there is a problem with mse_backward if we delete broadcast_tensors: for input.size = (2, 1) and target.size = (2, 10), the normalization number should be 20, but we get 2 there. So I did not remove the broadcast_tensors call for now.
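
A small repro of the normalization issue, using the shapes from the comment above (illustration only, not code from the PR):

```
import torch

inp = torch.randn(2, 1, requires_grad=True)
tgt = torch.randn(2, 10)

# With broadcast_tensors the 'mean' reduction averages over 20 elements...
bi, bt = torch.broadcast_tensors(inp, tgt)
print(bi.numel())   # 20

# ...but the un-expanded input only has 2 elements, which is where the wrong
# normalization factor (2 instead of 20) would come from.
print(inp.numel())  # 2
```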

Review comment (Contributor):

Oh, I see... It's because we have a custom backward implementation for it. That's fine for now then, we can work on improving mse_loss separately after the port, thank you for pointing that out.

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin @zou3519

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, please help review.

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, can you find some people to review the code? I have rebased it too many times. 😞

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

facebook-github-bot pushed a commit that referenced this pull request Oct 23, 2019
Summary:
This is a port of the TH `SmoothL1Criterion` to ATen using TensorIterator. The forward implementation has been placed in BinaryOpsKernel.cpp/.cu, while the backward version was added to PointwiseOpsKernel.cpp/.cu. CPU performance has improved for both the forward and backward paths. With CUDA, the performance of the forward pass has slightly degraded compared to the TH implementation (see benchmark results).

### Questions:
1. Is the storage location of the implementation ok (I followed #26529) or should we create a separate .cpp/.h file pair for each operator implementation (e.g. to keep things together)?
2. The GPU forward-pass now seems to take consistently longer than the old version. Any ideas what we could try to bring it on par with the old impl?

## WITH patch benchmark result:
```
CPU warmup 1000 took 0.00018124299822375178
CPU warmup 10000 took 0.00021713999740313739
CPU warmup 100000 took 0.0016273759974865243
CPU warmup TOTAL time 0.0020758909959113225
CPU forward 1000 took 6.229899736354128e-05
CPU forward 10000 took 0.00013340599980438128
CPU forward 100000 took 0.0008730469999136403
CPU forward 1000000 took 0.011010036003426649
CPU forward 10000000 took 0.11133221499767387
CPU forward 100000000 took 1.0425375220002024
CPU forward TOTAL time 1.1660894790038583
CPU for- & backward 1000 took 0.0002662249971763231
CPU for- & backward 10000 took 0.00023712700203759596
CPU for- & backward 100000 took 0.002531945996452123
CPU for- & backward 1000000 took 0.010394354998425115
CPU for- & backward 10000000 took 0.23814761800167616
CPU for- & backward 100000000 took 1.2651235049997922
CPU for- & backward TOTAL time 1.516897434994462

GPU warmup 1000 took 0.00020941899856552482
GPU warmup 10000 took 8.128300396492705e-05
GPU warmup 100000 took 8.551499922759831e-05
GPU warmup TOTAL time 0.0004199420000077225
GPU forward 1000 took 7.060499774524942e-05
GPU forward 10000 took 7.116600318113342e-05
GPU forward 100000 took 9.825800225371495e-05
GPU forward 1000000 took 0.000499356996442657
GPU forward 10000000 took 0.002032470001722686
GPU forward 100000000 took 0.018638986002770253
GPU forward TOTAL time 0.02148268099699635
GPU for- & backward 1000 took 0.00035967300209449604
GPU for- & backward 10000 took 0.00032710300001781434
GPU for- & backward 100000 took 0.0003689270015456714
GPU for- & backward 1000000 took 0.0007732619997113943
GPU for- & backward 10000000 took 0.02127284000016516
GPU for- & backward 100000000 took 0.2022330649997457
GPU for- & backward TOTAL time 0.2254496300010942
```

## WITHOUT patch benchmark result:
```
CPU warmup 1000 took 0.00011545199959073216
CPU warmup 10000 took 0.00016227000014623627
CPU warmup 100000 took 0.0013456509987008758
CPU warmup TOTAL time 0.001648657998885028
CPU forward 1000 took 2.627600042615086e-05
CPU forward 10000 took 0.00015939700097078457
CPU forward 100000 took 0.001139313004387077
CPU forward 1000000 took 0.013769682998827193
CPU forward 10000000 took 0.13163026500114938
CPU forward 100000000 took 1.321879123999679
CPU forward TOTAL time 1.4687001089987461
CPU for- & backward 1000 took 0.0002569290008977987
CPU for- & backward 10000 took 0.00033315900509478524
CPU for- & backward 100000 took 0.0016096779945655726
CPU for- & backward 1000000 took 0.014474845003860537
CPU for- & backward 10000000 took 0.1564881520025665
CPU for- & backward 100000000 took 1.5787935900007142
CPU for- & backward TOTAL time 1.7521004869995522

GPU warmup 1000 took 0.00025611399905756116
GPU warmup 10000 took 0.00014123699656920508
GPU warmup 100000 took 0.00012580600014189258
GPU warmup TOTAL time 0.0005591579974861816
GPU forward 1000 took 0.00031183200189843774
GPU forward 10000 took 0.00011483799607958645
GPU forward 100000 took 0.00010807999933604151
GPU forward 1000000 took 0.0007842139966669492
GPU forward 10000000 took 0.0017624700049054809
GPU forward 100000000 took 0.01519905700115487
GPU forward TOTAL time 0.018341148999752477
GPU for- & backward 1000 took 0.00047569099842803553
GPU for- & backward 10000 took 0.0003539700046530925
GPU for- & backward 100000 took 0.000808880002296064
GPU for- & backward 1000000 took 0.001639469999645371
GPU for- & backward 10000000 took 0.021154599002329633
GPU for- & backward 100000000 took 0.19268552300491137
GPU for- & backward TOTAL time 0.2172460189976846
```

### Code used for performance testing
```
import torch
import torch.nn.functional as F
import torch.nn as nn

from timeit import default_timer

torch.manual_seed(0)
cpu = torch.device('cpu')
gpu = torch.device('cuda')

loss_fn = F.smooth_l1_loss

def run_benchmark(name, depth, require_grad, device, fn):
    total_start = default_timer()
    y = None
    a = None
    for i in range(3, 3 + depth):
        start = default_timer()
        n = 10 ** i
        a = torch.rand(n, requires_grad=require_grad, device=device)
        b = torch.rand(n, device=device)
        y = fn(a, b)
        y.cpu() # get result (potentially wait for gpu)
        if a.grad is not None:
            a.grad.cpu()
        end = default_timer()
        print('{} {} took {}'.format(name, n, end-start))
    total_end = default_timer()
    print('{} TOTAL time {}'.format(name, total_end-total_start))

def fwd_only(a, b):
    out = loss_fn(a, b)
    return out

def fwd_bck(a, b):
    out = loss_fn(a, b)
    out.backward()
    return out

def sanity_check(name, device):
    print('{} Operator sanity check:'.format(name))
    a = torch.randn(16, requires_grad=True, device=device)
    b = torch.randn(16, device=device) * 2
    out = loss_fn(a, b)
    print('out', out)
    out.backward()
    print(a.grad)
    print('double backward')
    loss = loss_fn(a, b)
    loss2 = torch.autograd.grad(loss, a, create_graph=True)
    z = loss2[0].sum()
    print(z)
    z.backward()
    print('ok')
    print()

print('PyTorch version:', torch.__version__)
sanity_check('CPU', cpu)
if torch.cuda.is_available():
    sanity_check('GPU', gpu)
print()

run_benchmark('CPU warmup', 3, False, cpu, fwd_only)
run_benchmark('CPU forward', 6, False, cpu, fwd_only)
run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck)
print()

if torch.cuda.is_available():
    run_benchmark('GPU warmup', 3, False, gpu, fwd_only)
    run_benchmark('GPU forward', 6, False, gpu, fwd_only)
    run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck)
```
Pull Request resolved: #27962

Differential Revision: D18061942

Pulled By: ezyang

fbshipit-source-id: 0d1fc528b59d47d4773b03240c3368db021cb9db
zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 23, 2019
@XiaobingSuper XiaobingSuper reopened this Oct 28, 2019
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, please help review the code, thanks!

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor)

Looks good; can you please update the PR description with the most recent benchmark numbers?

@XiaobingSuper (Collaborator, Author)

> Looks good; can you please update the PR description with the most recent benchmark numbers?

Benchmark numbers are updated, thanks!

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 31, 2019
Pull Request resolved: pytorch/pytorch#26529

Differential Revision: D18225144

Pulled By: VitalyFedyunin

fbshipit-source-id: ce837a297c70398a3ffa22f26ee9e812cf60d128
@facebook-github-bot (Contributor)

@VitalyFedyunin merged this pull request in c8771f5.

@XiaobingSuper XiaobingSuper deleted the mse branch November 1, 2019 05:53
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020