t-vi commented Jul 20, 2018

The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the modification that it works in log space.
There is also a binding for the (much faster) CuDNN implementation.
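
For reference, here is a minimal, deliberately naive Python sketch of that log-space alpha recursion for a single sequence (function and variable names are mine, not the PR's; it assumes at least one target label):

```
import torch

def ctc_neg_log_likelihood(log_probs, targets, blank=0):
    # log_probs: (T, C) log-softmax outputs; targets: (S,) labels, no blanks.
    T, C = log_probs.shape
    S = targets.shape[0]
    # Extended target sequence: blank, y_1, blank, y_2, ..., y_S, blank.
    ext = torch.full((2 * S + 1,), blank, dtype=targets.dtype)
    ext[1::2] = targets
    log_alpha = torch.full((T, 2 * S + 1), float("-inf"))
    log_alpha[0, 0] = log_probs[0, blank]   # start with a blank...
    log_alpha[0, 1] = log_probs[0, ext[1]]  # ...or with the first label
    for t in range(1, T):
        for s in range(2 * S + 1):
            terms = [log_alpha[t - 1, s]]                  # stay
            if s > 0:
                terms.append(log_alpha[t - 1, s - 1])      # advance one step
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(log_alpha[t - 1, s - 2])      # skip over a blank
            log_alpha[t, s] = (torch.logsumexp(torch.stack(terms), 0)
                               + log_probs[t, ext[s]])
    # Valid paths end in the last label or the trailing blank.
    return -torch.logsumexp(log_alpha[T - 1, 2 * S - 1:], 0)
```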

This could eventually fix #3420

I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during testing. I also want to add some more code comments.

I could use feedback on all sorts of things, including:

  • Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
  • Input convention. I use log probs because that is what the gradients are for.
  • Launch parameters for the kernels
  • Errors and omissions and anything else I'm not even aware of.

Thank you for looking!

In terms of performance, it looks superficially comparable to WarpCTC (though I have not systematically investigated this).
I have read that CuDNN is much faster than other implementations because it does not use log space, but its gathering step is also much, much faster (I avoided trying tricky things there, as they seem to contribute to warpctc's fragility). I might think some more about which existing torch function (scatter or index_*) I could learn from for that step.
Average timings for the kernels from nvprof for one problem size:

CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel

Of course, I still have the (silly) outer blocks loop rather than computing consecutive s in each thread, which I might change, and there are a few other things where one could look for better implementations.

Finally, it might not be unreasonable to start with these implementations: the performance of the loss has to be seen in the context of the entire training computation, which likely dilutes the relative speedup considerably.

My performance-measuring test script:

import timeit
import sys
import torch
num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0  # num_labels
batch_size = 16


torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()


def time_cuda_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_cudnn_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, cugra = torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_warp_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()

    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))

I'll also link to a notebook that I used for writing up the algorithm in simple form and then testing the implementations against it.


t-vi commented Jul 23, 2018

So I experimented a bit.

  • It seems beneficial to switch the t and s loops, so that each thread can cache its target at s.
  • For small alphabets, the performance is still noticeably worse than cudnn, but I might just be OK with that if I can avoid too many explicit optimizations.
  • The naive "index_log_add_" step is really, really terrible with large alphabets. Happily, there are two obvious ways to fix this:
    • One could move to non-log-space, but the numerical stability is problematic. The log_likelihood term is easily adjusted for before the exponentiation (a small numerical sketch follows this list), but it is not as clear what to do with the prob term (take the min log_prob over the alphabet and adjust with that?). I would expect real activations to be more extremely spread out than unit random normals, so this is something to be careful about.
    • Or one could do index_log_add_ properly. Yay! (@soumith Would you want index_* ported to native in the process?)
  • At any rate, separating out the blank target and summing over it seems a good idea, so as not to spin in the atomic (log-)add.
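
To make the first option concrete, here is a minimal numerical sketch of the adjust-before-exponentiating idea (my own illustration with made-up values, not the PR's code):

```
import torch

x = torch.tensor([-1000.0, -1001.0, -1005.0])  # extreme log-space values
shift = x.max()
# Subtract a reference value before leaving log space so the largest term
# becomes exp(0) = 1, then add the shift back after summing.
stable = shift + torch.log(torch.exp(x - shift).sum())  # == logsumexp(x)
naive = torch.log(torch.exp(x).sum())                   # underflows to -inf
print(stable, naive)
```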

(Benchmark idea from @galv, http://danielgalvez.me/jekyll/update/2017/12/29/benchmarking-ctc-implementations.html, but with different random target lengths and my own errors; warpctc is unfair because it involves moving the result to the CPU and the gradients back.)

| input_length | batch_size | num_labels | cudnn | cuda (original PR) | warpctc (unfair) | cuda (loop switched) |
|---|---|---|---|---|---|---|
| 150 | 1 | 28 | 264 | 633 | 3597 | 484 |
| 150 | 2 | 28 | 263 | 772 | 3591 | 554 |
| 150 | 4 | 28 | 284 | 859 | 3741 | 640 |
| 150 | 8 | 28 | 350 | 856 | 3758 | 679 |
| 150 | 16 | 28 | 482 | 901 | 3784 | 721 |
| 150 | 32 | 28 | 781 | 1059 | 3887 | 853 |
| 150 | 64 | 28 | 1360 | 2163 | 3934 | 1802 |
| 150 | 128 | 28 | 2646 | 5236 | 4139 | 4712 |
| 150 | 256 | 28 | 5149 | 10854 | 4385 | 9976 |
| 150 | 1 | 5000 | 340 | 2737 | 9460 | 2627 |
| 150 | 2 | 5000 | 450 | 4224 | 10440 | 4003 |
| 150 | 4 | 5000 | 589 | 7589 | 10659 | 7394 |
| 150 | 8 | 5000 | 898 | 7598 | 11068 | 7393 |
| 150 | 16 | 5000 | 1520 | 8189 | 11908 | 7944 |
| 150 | 32 | 5000 | 2736 | 9769 | 13665 | 9572 |
| 150 | 64 | 5000 | 5284 | 23180 | 17500 | 22644 |
| 150 | 128 | 5000 | 10483 | 120424 | 24976 | 117178 |

facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ngimel commented Jul 23, 2018

The CuDNN implementation likely follows this GTC talk: http://on-demand.gputechconf.com/gtc/2016/presentation/s6383-minmin-sun-speech-recognition.pdf. Perhaps you can use it to speed up your cuda implementation.

t-vi commented Jul 23, 2018

Awesome, thank you very much! I'll check that out.

ssnl left a comment

I did a brief pass over the CPU part, since it seems further optimization will be made on the CUDA side. The implementation is straightforward and seems correct (I didn't look into the math details), and it indeed has room for optimization. I left some inline comments.


t-vi commented Jul 25, 2018

I'm still working on incorporating Simon's feedback and the tests, but I wanted to push the cuda updates.

So for small to medium batch sizes we are still slower than CuDNN, but I don't think it's unreasonably slow. For large batch sizes, we are actually pretty fast. The deviation between cuda and cudnn is of order 1e-4 in the gradients. I haven't checked which is closer to the true value (by comparing to full-log-space results in double precision).

I still need to do the tests. I might yet merge the beta calculation into the alpha calculation for cuda, as I think that would give some speedup. Writing a kernel for the larger parts of the gradient calculation currently done in ATen is also on my list of potential speedups.

ngimel commented Jul 25, 2018

Nice cuda improvements! Many people are averse to the non-deterministic results that atomicAdd would produce; I think that is why cudnn added a deterministic grad calculation that does not follow the GTC paper. So it's more of a question for the core developers: are we OK with nondeterminism?

t-vi commented Jul 25, 2018

I think we already have nondeterminism, e.g. in index_add_. I see this as index_add_ fused with exp(x + nll - log_probs), but that is for someone else to decide.
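
For concreteness, a hedged sketch of the unfused equivalent of that step (all names and values here are stand-ins of my own):

```
import torch

num_labels = 6
idx = torch.tensor([1, 3, 1, 5])  # target labels; note that label 1 repeats
x = torch.randn(4)                # stand-in for the log alpha*beta terms
nll = torch.tensor(2.0)           # stand-in for the negative log likelihood
lp = torch.randn(4)               # log_probs gathered at the target labels

grad = torch.zeros(num_labels)
# With repeated indices, the CUDA version of this accumulation needs
# atomicAdd, so the float summation order (and hence the last bits of the
# result) is not deterministic.
grad.index_add_(0, idx, torch.exp(x + nll - lp))
```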

t-vi added 3 commits July 25, 2018 22:04
Based on SsnL's review comments. Thank you!
To get new location of getCurrentCUDAStream
ezyang commented Jul 25, 2018

@pytorchbot retest this please

@t-vi
Copy link
Collaborator Author

t-vi commented Jul 26, 2018

So, it seems that things are converging. Some questions:

  • Is the use of atomicAdd OK (regarding nondeterminism)?
  • Is it true that the targets for cudnn are on the CPU? The documentation doesn't explicitly say, but I took the fact that I need to pass a typed pointer and no descriptor as a sign.
  • Currently we silently fall back to the cuda implementation whenever we detect something unfit for cudnn. Is that OK, or will it confuse users?
  • For the test reference implementation: I skipped the "move to logspace" step. Is that OK? So far the tests seem to pass.

While one can always add more optimizations (and it's kind of fun, too), I think the above are the top questions I could use your input on.

t-vi commented Jul 26, 2018

So the situation with cudnn ctc requiring int targets seems bad enough that I'm inclined to make the target type a template parameter and, for the cuda variant, move the targets to cuda if they are on the CPU.
After all, people would expect something that works with the cudnn variant to also work without cudnn. And the fact that the criterion actually is cudnn >= 7 doesn't help (I missed that for the tests...).

carmocca mentioned this pull request Jul 26, 2018
t-vi commented Jul 26, 2018

So the Windows build failure is about an ATen function I use. I could use a hint as to what goes wrong.
Edit: I think it's a missing AT_API.

facebook-github-bot pushed a commit that referenced this pull request Jul 27, 2018
Summary:
I hope this helps me with the windows build failure in #9628.
Pull Request resolved: #9904

Differential Revision: D9026715

Pulled By: soumith

fbshipit-source-id: bb97d41d060823f5a37bfc9a1659815b8b9f4eab
t-vi commented Jul 28, 2018

So I found TensorAccessors, and I couldn't resist switching to them for the CPU implementation (30 fewer lines of code and much clearer indexing). I also changed the CPU and CUDA versions to all accept int labels (and move them to the GPU as needed). The idea is that if you program against CuDNN (where you have to have CPU int labels), you can switch to CUDA/CPU as needed.

The remaining two CI failures (running out of heap on Windows somewhere else, some timeout for the ROCm one) look like things I cannot do much about.

I had to adapt the test infrastructure a bit to cover all cases (so: additional arguments, and not converting targets).

facebook-github-bot left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

jramseyer pushed a commit to jramseyer/pytorch that referenced this pull request Jul 30, 2018
ssnl commented Jul 31, 2018

@pytorchbot retest this please

In order to use CuDNN, the following must be satisfied: :attr:`targets` must be
in concatenated format, all :attr:`input_lengths` must be `T`, :math:`blank=0`,
:attr:`target_lengths` :math:`\leq 256`, and the integer arguments must be of
type :class:`torch.IntTensor`.
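
To illustrate these conditions, here is a sketch of the dispatch check they imply (a hypothetical helper of my own, not the PR's actual code); when it fails, the native cuda implementation would be used instead:

```
import torch

def _can_use_cudnn(log_probs, targets, input_lengths, target_lengths, blank):
    return (log_probs.is_cuda
            and blank == 0
            and targets.dtype == torch.int32
            and targets.dim() == 1  # concatenated format
            and all(l == log_probs.size(0) for l in input_lengths)
            and all(l <= 256 for l in target_lengths))
```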


zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 31, 2018
Pull Request resolved: pytorch/pytorch#9628

Differential Revision: D8952453

Pulled By: ezyang

fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
soumith commented Jul 31, 2018

As reported by someone internally, the example is broken:

TypeError: ctc_loss(): argument 'input_lengths' (position 3) must be tuple of ints, not Tensor

Can we allow input_lengths and target_lengths to optionally also be a Tensor, not just a tuple of ints?
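
For anyone hitting this, a hedged illustration of the current calling convention (shapes chosen arbitrarily): the length arguments must be tuples of Python ints, not tensors.

```
import torch
import torch.nn.functional as F

T, N, C, S = 50, 16, 11, 30
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)
# Passing tuple(...) instead of tensors avoids the TypeError above.
loss = F.ctc_loss(log_probs, targets, tuple([T] * N), tuple([S] * N))
```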

t-vi commented Aug 1, 2018 via email

goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Pull Request resolved: pytorch#9628

Differential Revision: D8952453

Pulled By: ezyang

fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860