CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED

**System information**
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): source and binary
- TensorFlow version (use command below): 1.15
- Python version: 3.6, 3.7.8
- Bazel version (if compiling from source): 0.26.1
- GCC/Compiler version (if compiling from source): 9.0
- CUDA/cuDNN version: 10.0/7.4.1 ; 10.0/7.4.2.1 ; 10.0/7.5.1.10 ; 10.0/7.6.5.32
- GPU model and memory: 2x RTX 2080 Ti ; 4x GTX 1080 Ti

You can collect some of this information using our environment capture
[script](https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh)
You can also obtain the TensorFlow version with:
1. TF 1.0: `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`: v1.15.3-0-g4386a6640c


**Describe the current behavior**
Training on some datasets triggers the following error:
```
2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
```
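To make the model config in the last log line easier to read, here is a minimal sketch that pairs each field name with its value. The helper name `parse_model_config` is hypothetical and not part of TensorFlow. The enum labels assume the logged integers follow the ordering of the cuDNN 7 `cudnnRNNMode_t`, `cudnnRNNInputMode_t` and `cudnnDirectionMode_t` enums.

```python
import re

# Assumption: the logged integers follow the cuDNN 7 enum ordering.
RNN_MODES = {0: "RNN_RELU", 1: "RNN_TANH", 2: "LSTM", 3: "GRU"}
INPUT_MODES = {0: "LINEAR_INPUT", 1: "SKIP_INPUT"}
DIRECTIONS = {0: "UNIDIRECTIONAL", 1: "BIDIRECTIONAL"}

# Tail of the OP_REQUIRES warning copied from the log above.
MESSAGE = (
    "Internal: Failed to call ThenRnnForward with model config: "
    "[rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , "
    "[num_layers, input_size, num_units, dir_count, max_seq_length, "
    "batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] "
)


def parse_model_config(message):
    """Hypothetical helper: map every '[names]: values' group to {name: int}."""
    config = {}
    # A group looks like "[a, b]: 1, 2" or "[a, b]: [1, 2]".
    pattern = r"\[([a-z_, ]+)\]:\s*\[?([\d, ]+)\]?"
    for names, values in re.findall(pattern, message):
        keys = [name.strip() for name in names.split(",")]
        nums = [int(value) for value in values.split(",") if value.strip()]
        config.update(zip(keys, nums))
    return config


cfg = parse_model_config(MESSAGE)
print("mode:", RNN_MODES[cfg["rnn_mode"]])
print("input mode:", INPUT_MODES[cfg["rnn_input_mode"]])
print("direction:", DIRECTIONS[cfg["rnn_direction_mode"]])
print("max_seq_length:", cfg["max_seq_length"], "batch_size:", cfg["batch_size"])
```

Under those assumptions, the failing call is a single-layer unidirectional LSTM with 2048 units, a batch of 2 and a maximum sequence length of 75.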

**Describe the expected behavior**
Training should succeed, or TensorFlow or cuDNN should report a more actionable error.

**Standalone code to reproduce the issue**
Will be provided later.

**Other info / logs**
Will be provided later. A noisy debugging session can be seen at https://github.com/mozilla/DeepSpeech/issues/3088

