Closed
Labels
TF 1.15 (for issues seen on TF 1.15), comp:gpu (GPU related issues), type:bug (Bug)
Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): source and binary
- TensorFlow version (use command below): 1.15
- Python version: 3.6, 3.7.8
- Bazel version (if compiling from source): 0.26.1
- GCC/Compiler version (if compiling from source): 9.0
- CUDA/cuDNN version: 10.0/7.4.1 ; 10.0/7.4.2.1 ; 10.0/7.5.1.10 ; 10.0/7.6.5.32
- GPU model and memory: 2x RTX 2080 Ti ; 4x GTX 1080 Ti
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
- TF 1.0:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Reported version: v1.15.3-0-g4386a6640c
Describe the current behavior
Training on certain datasets triggers the following error:
2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
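For readers decoding the model config in the last log line, here is a minimal sketch that parses it and maps the three mode integers to the enum names declared in cudnn.h (cudnnRNNMode_t, cudnnRNNInputMode_t, cudnnDirectionMode_t). The helper names parse_model_config and describe_modes are hypothetical and not part of TensorFlow; the regular expression assumes the exact message layout shown above.

```python
import re

# Enum values as declared in cudnn.h.
RNN_MODE = {0: "CUDNN_RNN_RELU", 1: "CUDNN_RNN_TANH", 2: "CUDNN_LSTM", 3: "CUDNN_GRU"}
INPUT_MODE = {0: "CUDNN_LINEAR_INPUT", 1: "CUDNN_SKIP_INPUT"}
DIRECTION_MODE = {0: "CUDNN_UNIDIRECTIONAL", 1: "CUDNN_BIDIRECTIONAL"}

# The message from the op_kernel.cc warning above, copied verbatim.
LOG_LINE = (
    "Failed to call ThenRnnForward with model config: "
    "[rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , "
    "[num_layers, input_size, num_units, dir_count, max_seq_length, "
    "batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]"
)


def parse_model_config(line):
    """Hypothetical helper: return {field_name: int} for every
    '[names]: values' group found in the log message."""
    config = {}
    # Group 1: the bracketed field names; group 2: the integers that follow.
    for names, values in re.findall(r"\[([a-z_, ]+)\]:\s*\[?([0-9, ]+)", line):
        keys = [name.strip() for name in names.split(",")]
        numbers = [int(n) for n in re.findall(r"\d+", values)]
        config.update(zip(keys, numbers))
    return config


def describe_modes(config):
    """Hypothetical helper: translate the mode integers to cuDNN enum names."""
    return {
        "rnn_mode": RNN_MODE[config["rnn_mode"]],
        "rnn_input_mode": INPUT_MODE[config["rnn_input_mode"]],
        "rnn_direction_mode": DIRECTION_MODE[config["rnn_direction_mode"]],
    }


if __name__ == "__main__":
    cfg = parse_model_config(LOG_LINE)
    print(cfg)
    print(describe_modes(cfg))
```

Read this way, the failing call is a single-layer, unidirectional LSTM with linear input, 2048 input features, 2048 units, a maximum sequence length of 75 and a batch size of 2.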
Describe the expected behavior
Training should succeed, or TensorFlow or cuDNN should report a more actionable error.
Standalone code to reproduce the issue
Will be provided later.
Other info / logs
Will be provided later. A noisy debugging session can be seen at mozilla/DeepSpeech#3088.