adding a beta parameter to the smooth_l1 loss fn #44433
```diff
@@ -11,10 +11,11 @@ struct TensorIterator;
 namespace native {
 
 using pointwise_fn = void (*)(TensorIterator&, Scalar scalar);
+using pointwise_fn_beta = void (*)(TensorIterator&, Scalar scalar, double beta);
```
Contributor (author): lmk if you think the name should be more general, or if I just shouldn't use an alias here. Right now smooth_l1_backward looks like the only pointwise op here that takes in two scalar params.
```diff
 
 DECLARE_DISPATCH(pointwise_fn, addcmul_stub);
 DECLARE_DISPATCH(pointwise_fn, addcdiv_stub);
-DECLARE_DISPATCH(pointwise_fn, smooth_l1_backward_stub);
+DECLARE_DISPATCH(pointwise_fn_beta, smooth_l1_backward_stub);
 DECLARE_DISPATCH(pointwise_fn, mse_backward_stub);
 
 } // namespace native
```
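For readers unfamiliar with the pattern the comment above is asking about, the sketch below shows, outside of PyTorch, what the new `pointwise_fn_beta` alias expresses: a second function-pointer type whose signature carries an extra `double beta`. The `FakeIterator`/`FakeScalar` types and kernel bodies are made up for illustration; they are not the real `TensorIterator`/`Scalar` types or the `DECLARE_DISPATCH` machinery.

```cpp
#include <iostream>

// Stand-ins for the real TensorIterator / Scalar types (illustrative only).
struct FakeIterator {};
struct FakeScalar { double value; };

// The existing alias: pointwise kernels that take one scalar argument.
using pointwise_fn = void (*)(FakeIterator&, FakeScalar scalar);
// The new alias: kernels that additionally take a double beta.
using pointwise_fn_beta = void (*)(FakeIterator&, FakeScalar scalar, double beta);

void one_scalar_kernel(FakeIterator&, FakeScalar norm) {
  std::cout << "one-scalar kernel, norm=" << norm.value << "\n";
}
void two_scalar_kernel(FakeIterator&, FakeScalar norm, double beta) {
  std::cout << "two-scalar kernel, norm=" << norm.value << " beta=" << beta << "\n";
}

int main() {
  FakeIterator it;
  pointwise_fn f1 = one_scalar_kernel;        // e.g. addcmul-style stubs
  pointwise_fn_beta f2 = two_scalar_kernel;   // e.g. the smooth_l1 backward stub
  f1(it, FakeScalar{2.0});
  f2(it, FakeScalar{2.0}, 0.5);
}
```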
```diff
@@ -46,28 +46,39 @@ static void addcdiv_cpu_kernel(TensorIterator& iter, Scalar value) {
   });
 }
 
-static void smooth_l1_backward_cpu_kernel(TensorIterator& iter, Scalar norm) {
+static void smooth_l1_backward_cpu_kernel(TensorIterator& iter, Scalar norm, double beta) {
   ScalarType dtype = iter.dtype(0);
   AT_DISPATCH_ALL_TYPES(dtype, "smooth_l1_backward_cpu_out", [&] {
     auto norm_val = norm.to<scalar_t>();
+    scalar_t beta_val(beta);
     auto norm_val_vec = Vec256<scalar_t>(norm_val);
+    auto beta_val_vec = Vec256<scalar_t>(beta_val);
     const auto neg_1_vec = Vec256<scalar_t>(-1);
+    const auto zero_vec = Vec256<scalar_t>(0);
     const auto pos_1_vec = Vec256<scalar_t>(1);
     cpu_kernel_vec(iter,
       [=](scalar_t input, scalar_t target, scalar_t grad_output) -> scalar_t {
         const auto x = input - target;
-        if (x < -1.)
+        if (x <= -beta)
           return -norm_val * grad_output;
-        else if (x > 1.)
+        else if (x >= beta)
           return norm_val * grad_output;
         else
-          return norm_val * x * grad_output;
+          return norm_val * x * grad_output / beta;
```
Contributor (author): the smooth l1 loss when |x| < 1 now divides by the beta parameter, so the derivative does as well.

Contributor: I might be wrong. If it divides by beta in the forward pass, should we multiply the grad by beta in the backward pass?

Contributor: Chatted with @bdhirsh offline. I was wrong. Brian's current implementation is correct.

Contributor: looks like if …

Contributor: I might be missing something, but this looks different from the math given in the doc. Should the three conditions be …

Contributor (author): yep you're right, thanks. And using …
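To make the thread above concrete, here is the per-element forward loss and its derivative as the updated kernel computes them (a reader's reconstruction from the diff, with the `norm` and `grad_output` factors left out):

```latex
\ell_\beta(x) =
\begin{cases}
  \dfrac{x^2}{2\beta}    & \text{if } |x| < \beta,\\[4pt]
  |x| - \dfrac{\beta}{2} & \text{otherwise,}
\end{cases}
\qquad x = \text{input} - \text{target},
\qquad
\frac{d\ell_\beta}{dx} =
\begin{cases}
  -1               & \text{if } x \le -\beta,\\[2pt]
  \dfrac{x}{\beta} & \text{if } |x| < \beta,\\[2pt]
  1                & \text{if } x \ge \beta.
\end{cases}
```

Because the quadratic branch is x²/(2β), its derivative is x/β: the backward pass divides by beta rather than multiplying, which is what the discussion above settles on. The two branches also agree at x = ±β, so the gradient is continuous.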
```diff
       },
-      [norm_val_vec, neg_1_vec, pos_1_vec](
+      [norm_val_vec, beta_val_vec, neg_1_vec, zero_vec, pos_1_vec](
          Vec256<scalar_t> input, Vec256<scalar_t> target, Vec256<scalar_t> grad_output) -> Vec256<scalar_t> {
-        auto x = input - target;
-        x = clamp(x, neg_1_vec, pos_1_vec);
-        return norm_val_vec * x * grad_output;
+        // using two blendv calls to simulate the 3 cases
+        //   1        if x >= beta
+        //  -1        if x <= -beta
+        //  x / beta  if |x| < beta
+        const auto x = input - target;
+        const auto pos_or_neg_1_vec = Vec256<scalar_t>::blendv(
+            neg_1_vec, pos_1_vec, x > zero_vec);
+        const auto x_abs = x.abs();
+        const auto output = Vec256<scalar_t>::blendv(
+            x / beta_val_vec, pos_or_neg_1_vec, x_abs >= beta_val_vec);
+        return norm_val_vec * output * grad_output;
       }
     );
   });
```
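Since the thread above is about getting the gradient right, here is a small standalone check (not part of the PR) that the analytic derivative used by the scalar branch matches a finite-difference estimate of the forward formula. `smooth_l1` and `smooth_l1_grad` are local helpers written for this sketch, with `norm` and `grad_output` taken to be 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Per-element smooth L1 loss with a beta parameter, matching the forward formula.
double smooth_l1(double x, double beta) {
  const double ax = std::fabs(x);
  return ax < beta ? 0.5 * x * x / beta : ax - 0.5 * beta;
}

// Analytic derivative, mirroring the scalar branch of the backward kernel above.
double smooth_l1_grad(double x, double beta) {
  if (x <= -beta) return -1.0;
  if (x >= beta) return 1.0;
  return x / beta;
}

int main() {
  const double beta = 0.5;
  const double eps = 1e-6;
  for (double x = -2.0; x <= 2.0; x += 0.097) {
    // Central finite difference of the forward formula.
    const double numeric = (smooth_l1(x + eps, beta) - smooth_l1(x - eps, beta)) / (2 * eps);
    const double analytic = smooth_l1_grad(x, beta);
    assert(std::fabs(numeric - analytic) < 1e-4);
  }
  std::printf("analytic gradient matches finite differences\n");
  return 0;
}
```

The check passes because the loss is continuously differentiable at x = ±beta, so the finite difference is well behaved even near the branch boundaries.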
```diff
@@ -6545,23 +6545,23 @@
     CPU: nll_loss2d_backward_cpu
     CUDA: legacy::cuda::_thnn_nll_loss2d_backward
 
-- func: smooth_l1_loss.out(Tensor self, Tensor target, int reduction=Mean, *, Tensor(a!) out) -> Tensor(a!)
+- func: smooth_l1_loss.out(Tensor self, Tensor target, int reduction=Mean, float beta=1.0, *, Tensor(a!) out) -> Tensor(a!)
   python_module: nn
   dispatch:
     CPU: smooth_l1_loss_out
     CUDA: smooth_l1_loss_out
 
-- func: smooth_l1_loss(Tensor self, Tensor target, int reduction=Mean) -> Tensor
+- func: smooth_l1_loss(Tensor self, Tensor target, int reduction=Mean, float beta=1.0) -> Tensor
   use_c10_dispatcher: full
   python_module: nn
 
-- func: smooth_l1_loss_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, int reduction, *, Tensor(a!) grad_input) -> Tensor(a!)
+- func: smooth_l1_loss_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, int reduction, float beta=1.0, *, Tensor(a!) grad_input) -> Tensor(a!)
```
Contributor: nit: you shouldn't need to provide a default parameter -- we don't expose these backward functions to Python, they should only be called via autograd. And you'll notice that reduction doesn't have a default parameter. The parts I'm not sure about here are whether TorchScript would let you serialize this function (it shouldn't) and/or whether the backwards-compatibility test would complain.
```diff
   python_module: nn
   dispatch:
     CPU: smooth_l1_loss_backward_out
     CUDA: smooth_l1_loss_backward_out
 
-- func: smooth_l1_loss_backward(Tensor grad_output, Tensor self, Tensor target, int reduction) -> Tensor
+- func: smooth_l1_loss_backward(Tensor grad_output, Tensor self, Tensor target, int reduction, float beta=1.0) -> Tensor
   use_c10_dispatcher: full
   python_module: nn
```
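With the schema above, the generated ATen function gains a trailing `beta` argument after `reduction`. A possible usage sketch, assuming the schema lands exactly as written in this hunk; the tensor shapes and the beta value are arbitrary:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor input = torch::randn({4, 5}, torch::requires_grad());
  torch::Tensor target = torch::randn({4, 5});

  // beta=1.0 reproduces the old behavior; a smaller beta shrinks the quadratic region.
  torch::Tensor loss = at::smooth_l1_loss(input, target, at::Reduction::Mean, /*beta=*/0.5);
  loss.backward();

  std::cout << "loss = " << loss.item<double>() << "\n";
  std::cout << "grad sizes = " << input.grad().sizes() << "\n";
}
```

Since every default in the schema above is `beta=1.0`, existing call sites that omit the argument keep their current semantics.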
```diff
@@ -307,25 +307,26 @@ inline Tensor cosine_embedding_loss(
 
 // ============================================================================
 
-inline Tensor _smooth_l1_loss(const Tensor& input, const Tensor& target) {
+inline Tensor _smooth_l1_loss(const Tensor& input, const Tensor& target, double beta = 1.) {
   auto t = torch::abs(input - target);
-  return torch::where(t < 1, 0.5 * torch::pow(t, 2), t - 0.5);
+  return torch::where(t < beta, 0.5 * torch::pow(t, 2) / beta, t - 0.5 * beta);
```
Contributor (author): it's not 100% clear to me if I need to change this, since this seems independent of the kernel code.
```diff
 }
 
 #ifndef DOXYGEN_SHOULD_SKIP_THIS
 namespace detail {
 inline Tensor smooth_l1_loss(
     const Tensor& input,
     const Tensor& target,
-    SmoothL1LossFuncOptions::reduction_t reduction) {
+    SmoothL1LossFuncOptions::reduction_t reduction,
+    double beta = 1.) {
   if (target.sizes() != input.sizes()) {
     TORCH_WARN("Using a target size (", target.sizes(), ") that is different to the input size (", input.sizes(), "). ",
                "This will likely lead to incorrect results due to broadcasting. ",
                "Please ensure they have the same size.");
   }
 
   std::vector<Tensor> expanded_tensors = torch::broadcast_tensors({input, target});
-  return torch::smooth_l1_loss(expanded_tensors[0], expanded_tensors[1], enumtype::reduction_get_enum(reduction));
+  return torch::smooth_l1_loss(expanded_tensors[0], expanded_tensors[1], enumtype::reduction_get_enum(reduction), beta);
 }
 } // namespace detail
 #endif /* DOXYGEN_SHOULD_SKIP_THIS */
@@ -344,8 +345,9 @@ inline Tensor smooth_l1_loss(
 inline Tensor smooth_l1_loss(
     const Tensor& input,
     const Tensor& target,
-    const SmoothL1LossFuncOptions& options = {}) {
-  return detail::smooth_l1_loss(input, target, options.reduction());
+    const SmoothL1LossFuncOptions& options = {},
+    double beta = 1.) {
+  return detail::smooth_l1_loss(input, target, options.reduction(), beta);
 }
 
 // ============================================================================
```
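And the corresponding C++ frontend call, assuming the signature added in this hunk (options struct plus a trailing `beta` double rather than `beta` living inside the options); the inputs are made up:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  namespace F = torch::nn::functional;

  torch::Tensor input = torch::randn({8});
  torch::Tensor target = torch::randn({8});

  // Default options ({} -> mean reduction) combined with a non-default beta.
  torch::Tensor loss = F::smooth_l1_loss(input, target, /*options=*/{}, /*beta=*/0.25);
  std::cout << loss.item<float>() << "\n";
}
```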
```diff
@@ -952,20 +952,24 @@ Tensor l1_loss_double_backward_grad_output(const Tensor & grad, const Tensor & i
   return output;
 }
 
-Tensor smooth_l1_loss_double_backward(const Tensor & grad, const Tensor & input, const Tensor & target, int64_t reduction) {
+Tensor smooth_l1_loss_double_backward(const Tensor & grad, const Tensor & input, const Tensor & target, int64_t reduction, double beta) {
+  // special case to protect against a divide-by-zero.
+  if (beta == 0) {
+    return at::zeros(grad.sizes(), grad.options());
+  }
   auto d = (input - target).abs();
-  auto grad_input = grad * (d < 1).type_as(grad);
+  auto grad_input = grad * (d < beta).type_as(grad) / beta;
```
Contributor: can beta be 0 here?

Contributor (author): yep, good catch. I'll try rewriting it in a way that prevents divide-by-zero.
```diff
   if (reduction == at::Reduction::Mean) {
     grad_input /= input.numel();
   }
   return grad_input;
 }
 
-Tensor smooth_l1_loss_double_backward_grad_output(const Tensor & grad, const Tensor & grad_output, const Tensor & input, const Tensor & target, int64_t reduction) {
+Tensor smooth_l1_loss_double_backward_grad_output(const Tensor & grad, const Tensor & grad_output, const Tensor & input, const Tensor & target, int64_t reduction, double beta) {
   if (reduction == at::Reduction::None) {
-    return smooth_l1_loss_backward(grad, input, target, reduction);
+    return smooth_l1_loss_backward(grad, input, target, reduction, beta);
   }
-  auto r = smooth_l1_loss_backward(ones_like(grad_output), input, target, reduction);
+  auto r = smooth_l1_loss_backward(ones_like(grad_output), input, target, reduction, beta);
   return (r * grad).sum();
 }
```
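One way to read `(d < beta).type_as(grad) / beta` and the `beta == 0` early return (a reader's note, not PR text): the double backward needs the second derivative of the per-element loss, which is piecewise constant.

```latex
\frac{d^2 \ell_\beta}{dx^2} =
\begin{cases}
  \dfrac{1}{\beta} & \text{if } |x| < \beta,\\[4pt]
  0                & \text{otherwise.}
\end{cases}
```

With beta = 0 the loss degenerates to the plain L1 loss, whose second derivative is zero almost everywhere, which is why the special case can simply return zeros instead of dividing by beta.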
Contributor (author): @gchanan I updated smooth_l1_loss_out to something similar to what MSE does: compute into a temporary, then use the <reduction>_out functions to put the output of the reduction in the out var. My educated guess as to why we need the temporary variable is that result has a different tensor shape than the tensor output of the actual kernel function when the caller is using a reduction.
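Finally, a rough sketch of the "temporary plus `<reduction>_out`" pattern the comment describes, under the assumption that the unreduced loss can be obtained from the functional op; this is illustrative and not the PR's actual `smooth_l1_loss_out` implementation.

```cpp
#include <torch/torch.h>

// Illustrative only: compute the unreduced loss into a temporary, then reduce
// it into `result` with a *_out reduction op.
torch::Tensor& smooth_l1_loss_out_sketch(torch::Tensor& result,
                                         const torch::Tensor& input,
                                         const torch::Tensor& target,
                                         int64_t reduction,
                                         double beta) {
  // Element-wise (unreduced) loss goes into a temporary first.
  torch::Tensor unreduced =
      at::smooth_l1_loss(input, target, at::Reduction::None, beta);
  if (reduction == at::Reduction::None) {
    // No reduction: `result` has the same shape as the element-wise output.
    result.resize_(unreduced.sizes());
    result.copy_(unreduced);
  } else if (reduction == at::Reduction::Mean) {
    // Reduce the flattened temporary into `result` (a 0-dim tensor).
    at::mean_out(result, unreduced.reshape({-1}), /*dim=*/{0});
  } else {
    at::sum_out(result, unreduced.reshape({-1}), /*dim=*/{0});
  }
  return result;
}
```

The temporary is needed because the reduction resizes `result` to a 0-dim tensor, while the element-wise kernel produces a tensor shaped like `input`, which matches the educated guess in the comment above.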