
CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED #41630

@lissyx

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source and binary
  • TensorFlow version (use command below): 1.15
  • Python version: 3.6, 3.7.8
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.0
  • CUDA/cuDNN version: 10.0/7.4.1 ; 10.0/7.4.2.1 ; 10.0/7.5.1.10 ; 10.0/7.6.5.32
  • GPU model and memory: 2x RTX 2080 Ti ; 4x GTX 1080 Ti

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)": v1.15.3-0-g4386a6640c

Describe the current behavior
Training on some datasets triggers the following error:

2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
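
For readers less familiar with the numeric fields in the "model config" part of the log, here is a minimal sketch that decodes such a line into named values. The helper name `decode_model_config` and the lookup tables are illustrative and not part of TensorFlow; the enum numbering is assumed to follow `cudnnRNNMode_t`, `cudnnRNNInputMode_t` and `cudnnDirectionMode_t` as declared in cudnn.h, and the regular expression is only matched against the message format shown above.

```python
import re

# Assumed to mirror the enum order in cudnn.h; these tables are illustrative.
RNN_MODES = {0: "CUDNN_RNN_RELU", 1: "CUDNN_RNN_TANH", 2: "CUDNN_LSTM", 3: "CUDNN_GRU"}
INPUT_MODES = {0: "CUDNN_LINEAR_INPUT", 1: "CUDNN_SKIP_INPUT"}
DIRECTION_MODES = {0: "CUDNN_UNIDIRECTIONAL", 1: "CUDNN_BIDIRECTIONAL"}

# Matches "[rnn_mode, ...]: 2, 0, 0 , [name, name, ...]: [value, value, ...]".
CONFIG_RE = re.compile(
    r"\[rnn_mode, rnn_input_mode, rnn_direction_mode\]:"
    r"\s*(\d+),\s*(\d+),\s*(\d+)\s*,\s*"
    r"\[([^\]]+)\]:\s*\[([^\]]+)\]"
)


def decode_model_config(line):
    """Return a dict of named config fields, or None if the line does not match."""
    match = CONFIG_RE.search(line)
    if match is None:
        return None
    mode, input_mode, direction = (int(match.group(i)) for i in (1, 2, 3))
    names = [name.strip() for name in match.group(4).split(",")]
    values = [int(value) for value in match.group(5).split(",")]
    config = {
        # Fall back to the raw integer if the value is not in the table.
        "rnn_mode": RNN_MODES.get(mode, mode),
        "rnn_input_mode": INPUT_MODES.get(input_mode, input_mode),
        "rnn_direction_mode": DIRECTION_MODES.get(direction, direction),
    }
    config.update(zip(names, values))
    return config
```

Under those assumptions, the failing call above reads as a single-layer, unidirectional LSTM with linear input, 2048 units, a maximum sequence length of 75 and a batch size of 2.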

Describe the expected behavior
Training should succeed, or TensorFlow or cuDNN should expose a more actionable error.

Standalone code to reproduce the issue
Will be provided later.

Other info / logs
Will be provided later. A noisy debugging session can be seen at mozilla/DeepSpeech#3088.

Metadata

Labels

  • TF 1.15 (for issues seen on TF 1.15)
  • comp:gpu (GPU related issues)
  • type:bug (Bug)
