
Conversation

@XiaobingSuper (Collaborator) commented Sep 20, 2019

@VitalyFedyunin, this PR ports mse_loss to ATen.

Test script:

import torch
import torch.nn as nn
import time

def _time():
    # Synchronize first so GPU timings measure finished kernels (CUDA launches are asynchronous).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.MSELoss(reduction = 'sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

#get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))

Test Device: CPU: skx-8180, GPU: Tesla P40.

Performance:

Before:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.12 (ms); backward avg time is 0.21 (ms).
reduction='sum'
input size(128, 100) forward time is 0.09 (ms); backward avg time is 0.15 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.20 (ms).

CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).
reduction='sum'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 3.49 (ms); backward avg time is 3.23 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.41 (ms); backward avg time is 1.66 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.44 (ms); backward avg time is 1.68 (ms).

After:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.13 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

reduction='sum'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

CPU:
OMP_NUM_THREADS=56
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.14 (ms); backward avg time is 0.30 (ms).

reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.30 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.85 (ms); backward avg time is 1.27 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.83 (ms); backward avg time is 1.27 (ms).

@pytorchbot pytorchbot added module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Sep 20, 2019
@VitalyFedyunin VitalyFedyunin self-requested a review September 20, 2019 15:20
@VitalyFedyunin (Contributor)

Can you please add benchmarks with OMP_NUM_THREADS=1?

Also, the GPU numbers look concerning to me; I will check the code carefully for the reason for the slowdown.

@VitalyFedyunin (Contributor)

#24598 Migrate mse_loss from the TH to Aten (CUDA)
#24599 Migrate mse_loss_backward from the TH to Aten (CUDA)
#24732 Migrate mse_loss from the TH to Aten (CPU)
#24733 Migrate mse_loss_backward from the TH to Aten (CPU)

@XiaobingSuper (Collaborator, Author)

The following is the performance data for OMP_NUM_THREADS=1; the run script is:

num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

# Pin the OpenMP threads to physical cores and keep memory on NUMA node 0.
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script

How to run: ./run.sh 1 mse_cpu.py

Before:

reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.52 (ms); backward avg time is 1.83 (ms).

reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.53 (ms); backward avg time is 1.85 (ms).

After:

reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.04 (ms); backward avg time is 1.68 (ms).

reduction='sum'
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 1.04 (ms); backward avg time is 1.69 (ms).

@XiaobingSuper (Collaborator, Author)

I will also look into the reason for the GPU performance degradation. Thanks!

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, do you have any findings about the GPU degradation? Using the same inputs, I tested the torch.sub compute time; it is about 70% of the original mse_loss compute time, so perhaps there is a big overhead from porting sub to ATen (#8919). Should we compare the ATen sub with the THC sub for large input sizes?

@VitalyFedyunin (Contributor)

You need to fuse operations to get comparable performance (CPU and CUDA).

@VitalyFedyunin (Contributor) left a comment

You need to fuse operations together to get similar/better performance.

Review comment (Contributor):

As an example, this is two separate CPU loops or CUDA kernels: one for sub, another for pow.

Review comment (Contributor):

This CUDA loop does sub and pow inside the same kernel.

Review comment (Contributor):

This CPU implementation does sub and pow inside a single loop.
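
For illustration only (not code from this PR), here is a minimal Python sketch of the behaviour the review comments above describe; the tensor names and sizes are arbitrary. The unfused form runs one element-wise pass for sub and another for pow, while the fused mse_loss kernel does both in a single loop, with the reduction applied on top of that result.

```
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(128, 10000, device=device)
y = torch.randn(128, 10000, device=device)

# Unfused: (x - y) writes a temporary tensor, then pow/mean read it again,
# i.e. two element-wise passes (two CPU loops or two CUDA kernels).
unfused = (x - y).pow(2).mean()

# Fused: the mse_loss kernel computes the squared difference inside a single
# loop/kernel; the 'mean' reduction is then applied to that result.
fused = F.mse_loss(x, y, reduction='mean')

assert torch.allclose(unfused, fused)
```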

@yf225 yf225 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Sep 25, 2019
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin @zou3519, the backward of L1_loss and mse_loss also has performance overhead because it calls sub and mul in two loops. There are two methods to fuse them: use TensorIterator to implement a ternary kernel (sub_mul), or just use parallel_for. Which one is acceptable to you? Thanks!
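
For reference, a small Python sketch (plain eager PyTorch, not this PR's kernels) of what the mse_loss backward has to produce; the separate sub and mul below are the element-wise passes that a fused sub_mul kernel or a single parallel_for loop would collapse into one. The helper name is made up for the example.

```
import torch
import torch.nn.functional as F

def mse_backward_reference(grad_output, input, target, reduction='mean'):
    # d/d(input) of (input - target)**2 is 2 * (input - target); the sub and
    # the following mul are the separate loops a fused kernel would combine.
    grad = 2.0 * (input - target) * grad_output
    if reduction == 'mean':
        grad = grad / input.numel()
    return grad

x = torch.randn(128, 100, requires_grad=True)
t = torch.randn(128, 100)
F.mse_loss(x, t, reduction='mean').backward()
assert torch.allclose(x.grad, mse_backward_reference(torch.ones(()), x, t))
```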

Review comment (Contributor):

pytorch/torch/nn/functional.py

Lines 2183 to 2205 in 8fb756d

def mse_loss(input, target, size_average=None, reduce=None, reduction='mean'):
    # type: (Tensor, Tensor, Optional[bool], Optional[bool], str) -> Tensor
    r"""mse_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor
    Measures the element-wise mean squared error.
    See :class:`~torch.nn.MSELoss` for details.
    """
    if not (target.size() == input.size()):
        warnings.warn("Using a target size ({}) that is different to the input size ({}). "
                      "This will likely lead to incorrect results due to broadcasting. "
                      "Please ensure they have the same size.".format(target.size(), input.size()),
                      stacklevel=2)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    if target.requires_grad:
        ret = (input - target) ** 2
        if reduction != 'none':
            ret = torch.mean(ret) if reduction == 'mean' else torch.sum(ret)
    else:
        expanded_input, expanded_target = torch.broadcast_tensors(input, target)
        ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
    return ret

I think we can delete the broadcast_tensors call now after this change.

@XiaobingSuper (Collaborator, Author) commented Sep 27, 2019

The following is the performance after fusing operators:

GPU:
reduction='mean'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).
reduction='sum'
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.13 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.20 (ms).

CPU:
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 0.15 (ms).
reduction='sum'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.12 (ms); backward avg time is 0.14 (ms).

OMP_NUM_THREADS=1
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.85 (ms); backward avg time is 1.26 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.83 (ms); backward avg time is 1.27 (ms).

After fusing the operators, the remaining overhead for the GPU forward pass is down to about 0.02 ms, and the other cases show a performance improvement.

Review comment (Collaborator, Author):

@zou3519, there is a problem with mse_backward if we delete broadcast_tensors: for input.size = (2, 1) and target.size = (2, 10), the normalization number should be 20, but we get 2 there. So I did not remove the broadcast_tensors call for now.
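
A small repro of the normalization issue, using the shapes from the comment above (illustration only, not code from the PR):

```
import torch

inp = torch.randn(2, 1, requires_grad=True)
tgt = torch.randn(2, 10)

# With broadcast_tensors the 'mean' reduction averages over 20 elements...
bi, bt = torch.broadcast_tensors(inp, tgt)
print(bi.numel())   # 20

# ...but the un-expanded input only has 2 elements, which is where the wrong
# normalization factor (2 instead of 20) would come from.
print(inp.numel())  # 2
```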

Review comment (Contributor):

Oh, I see... It's because we have a custom backward implementation for it. That's fine for now then, we can work on improving mse_loss separately after the port, thank you for pointing that out.

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin @zou3519

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, please help review.

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, can you find some people to review the code? I have rebased it too many times. 😞

@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin

facebook-github-bot pushed a commit that referenced this pull request Oct 23, 2019
Summary:
This is a port of the TH `SmoothL1Criterion` to ATen using TensorIterator. The forward implementation has been placed in BinaryOpsKernel.cpp/.cu, while the backward version was added to PointwiseOpsKernel.cpp/.cu. CPU performance has improved for both the forward and backward paths. With CUDA, the performance of the forward pass has slightly degraded compared to the TH implementation (see benchmark results).

### Questions:
1. Is the storage location of the implementation ok (I followed #26529) or should we create a separate .cpp/.h file pair for each operator implementation (e.g. to keep things together)?
2. The GPU forward-pass now seems to take consistently longer than the old version. Any ideas what we could try to bring it on par with the old impl?

## WITH patch benchmark result:
```
CPU warmup 1000 took 0.00018124299822375178
CPU warmup 10000 took 0.00021713999740313739
CPU warmup 100000 took 0.0016273759974865243
CPU warmup TOTAL time 0.0020758909959113225
CPU forward 1000 took 6.229899736354128e-05
CPU forward 10000 took 0.00013340599980438128
CPU forward 100000 took 0.0008730469999136403
CPU forward 1000000 took 0.011010036003426649
CPU forward 10000000 took 0.11133221499767387
CPU forward 100000000 took 1.0425375220002024
CPU forward TOTAL time 1.1660894790038583
CPU for- & backward 1000 took 0.0002662249971763231
CPU for- & backward 10000 took 0.00023712700203759596
CPU for- & backward 100000 took 0.002531945996452123
CPU for- & backward 1000000 took 0.010394354998425115
CPU for- & backward 10000000 took 0.23814761800167616
CPU for- & backward 100000000 took 1.2651235049997922
CPU for- & backward TOTAL time 1.516897434994462

GPU warmup 1000 took 0.00020941899856552482
GPU warmup 10000 took 8.128300396492705e-05
GPU warmup 100000 took 8.551499922759831e-05
GPU warmup TOTAL time 0.0004199420000077225
GPU forward 1000 took 7.060499774524942e-05
GPU forward 10000 took 7.116600318113342e-05
GPU forward 100000 took 9.825800225371495e-05
GPU forward 1000000 took 0.000499356996442657
GPU forward 10000000 took 0.002032470001722686
GPU forward 100000000 took 0.018638986002770253
GPU forward TOTAL time 0.02148268099699635
GPU for- & backward 1000 took 0.00035967300209449604
GPU for- & backward 10000 took 0.00032710300001781434
GPU for- & backward 100000 took 0.0003689270015456714
GPU for- & backward 1000000 took 0.0007732619997113943
GPU for- & backward 10000000 took 0.02127284000016516
GPU for- & backward 100000000 took 0.2022330649997457
GPU for- & backward TOTAL time 0.2254496300010942
```

## WITHOUT patch benchmark result:
```
CPU warmup 1000 took 0.00011545199959073216
CPU warmup 10000 took 0.00016227000014623627
CPU warmup 100000 took 0.0013456509987008758
CPU warmup TOTAL time 0.001648657998885028
CPU forward 1000 took 2.627600042615086e-05
CPU forward 10000 took 0.00015939700097078457
CPU forward 100000 took 0.001139313004387077
CPU forward 1000000 took 0.013769682998827193
CPU forward 10000000 took 0.13163026500114938
CPU forward 100000000 took 1.321879123999679
CPU forward TOTAL time 1.4687001089987461
CPU for- & backward 1000 took 0.0002569290008977987
CPU for- & backward 10000 took 0.00033315900509478524
CPU for- & backward 100000 took 0.0016096779945655726
CPU for- & backward 1000000 took 0.014474845003860537
CPU for- & backward 10000000 took 0.1564881520025665
CPU for- & backward 100000000 took 1.5787935900007142
CPU for- & backward TOTAL time 1.7521004869995522

GPU warmup 1000 took 0.00025611399905756116
GPU warmup 10000 took 0.00014123699656920508
GPU warmup 100000 took 0.00012580600014189258
GPU warmup TOTAL time 0.0005591579974861816
GPU forward 1000 took 0.00031183200189843774
GPU forward 10000 took 0.00011483799607958645
GPU forward 100000 took 0.00010807999933604151
GPU forward 1000000 took 0.0007842139966669492
GPU forward 10000000 took 0.0017624700049054809
GPU forward 100000000 took 0.01519905700115487
GPU forward TOTAL time 0.018341148999752477
GPU for- & backward 1000 took 0.00047569099842803553
GPU for- & backward 10000 took 0.0003539700046530925
GPU for- & backward 100000 took 0.000808880002296064
GPU for- & backward 1000000 took 0.001639469999645371
GPU for- & backward 10000000 took 0.021154599002329633
GPU for- & backward 100000000 took 0.19268552300491137
GPU for- & backward TOTAL time 0.2172460189976846
```

### Code used for performance testing
```
import torch
import torch.nn.functional as F
import torch.nn as nn

from timeit import default_timer

torch.manual_seed(0)
cpu = torch.device('cpu')
gpu = torch.device('cuda')

loss_fn = F.smooth_l1_loss

def run_benchmark(name, depth, require_grad, device, fn):
    total_start = default_timer()
    y = None
    a = None
    for i in range(3, 3 + depth):
        start = default_timer()
        n = 10 ** i
        a = torch.rand(n, requires_grad=require_grad, device=device)
        b = torch.rand(n, device=device)
        y = fn(a, b)
        y.cpu() # get result (potentially wait for gpu)
        if a.grad is not None:
            a.grad.cpu()
        end = default_timer()
        print('{} {} took {}'.format(name, n, end-start))
    total_end = default_timer()
    print('{} TOTAL time {}'.format(name, total_end-total_start))

def fwd_only(a, b):
    out = loss_fn(a, b)
    return out

def fwd_bck(a, b):
    out = loss_fn(a, b)
    out.backward()
    return out

def sanity_check(name, device):
    print('{} Operator sanity check:'.format(name))
    a = torch.randn(16, requires_grad=True, device=device)
    b = torch.randn(16, device=device) * 2
    out = loss_fn(a, b)
    print('out', out)
    out.backward()
    print(a.grad)
    print('double backward')
    loss = loss_fn(a, b)
    loss2 = torch.autograd.grad(loss, a, create_graph=True)
    z = loss2[0].sum()
    print(z)
    z.backward()
    print('ok')
    print()

print('PyTorch version:', torch.__version__)
sanity_check('CPU', cpu)
if torch.cuda.is_available():
    sanity_check('GPU', gpu)
print()

run_benchmark('CPU warmup', 3, False, cpu, fwd_only)
run_benchmark('CPU forward', 6, False, cpu, fwd_only)
run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck)
print()

if torch.cuda.is_available():
    run_benchmark('GPU warmup', 3, False, gpu, fwd_only)
    run_benchmark('GPU forward', 6, False, gpu, fwd_only)
    run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck)
```
Pull Request resolved: #27962

Differential Revision: D18061942

Pulled By: ezyang

fbshipit-source-id: 0d1fc528b59d47d4773b03240c3368db021cb9db
zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 23, 2019
@XiaobingSuper XiaobingSuper reopened this Oct 28, 2019
@XiaobingSuper (Collaborator, Author)

@VitalyFedyunin, please help review the code, thanks!

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor)

Looks good; can you please update the PR description with the most recent benchmark numbers?

@XiaobingSuper (Collaborator, Author)

> Looks good; can you please update the PR description with the most recent benchmark numbers?

Benchmark numbers are updated, thanks!

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 31, 2019
Pull Request resolved: pytorch/pytorch#26529

Differential Revision: D18225144

Pulled By: VitalyFedyunin

fbshipit-source-id: ce837a297c70398a3ffa22f26ee9e812cf60d128
@facebook-github-bot (Contributor)

@VitalyFedyunin merged this pull request in c8771f5.

@XiaobingSuper XiaobingSuper deleted the mse branch November 1, 2019 05:53
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020