
[JIT] Memory Leak during tracing? #21454

@chughtapan

Description

Hi,

I am trying to run pytorch-pretrained-BERT through the JIT using the tracing API. Running the example run_squad.py unmodified with the following command works without any issues:

CUDA_VISIBLE_DEVICES="0" python run_squad.py \
       --bert_model bert-large-uncased \
       --fp16 \
       --do_train \
       --do_lower_case \
       --train_file $SQUAD_DIR/train-v1.1.json \
       --predict_file $SQUAD_DIR/dev-v1.1.json \
       --train_batch_size 6 \
       --learning_rate 3e-5 \
       --num_train_epochs 2.0 \
       --max_seq_length 512 \
       --doc_stride 128 \
       --output_dir /tmp/debug_squad/

To run the script with the JIT, I changed the following lines

        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-GPU does the scattering itself
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)

to be

        model.train()
        traced = False
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-GPU does the scattering itself
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                if not traced:
                    model = torch.jit.trace(model, (input_ids, segment_ids, input_mask, start_positions, end_positions), check_trace=False)
                    traced = True
                    logger.info("Tracing complete")
                loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)

I also disabled the FusedLayerNorm here so that the model can be traced.
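For context, by "disabled" I mean forcing the pure-PyTorch LayerNorm fallback that modeling.py already contains for the no-apex case. A monkeypatch along these lines should be equivalent (the pytorch_pretrained_bert.modeling path and the BertLayerNorm name are from version 0.6.2, and the class body below is just the standard LayerNorm formula, so treat this as a sketch rather than the exact edit I made):

    import torch
    from pytorch_pretrained_bert import modeling

    class PlainBertLayerNorm(torch.nn.Module):
        """Pure-PyTorch LayerNorm, matching the apex-free fallback in modeling.py."""

        def __init__(self, hidden_size, eps=1e-12):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.ones(hidden_size))
            self.bias = torch.nn.Parameter(torch.zeros(hidden_size))
            self.variance_epsilon = eps

        def forward(self, x):
            u = x.mean(-1, keepdim=True)
            s = (x - u).pow(2).mean(-1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.variance_epsilon)
            return self.weight * x + self.bias

    # Override before the model is constructed, so every BertLayerNorm instance
    # uses the traceable pure-PyTorch version instead of apex's FusedLayerNorm.
    modeling.BertLayerNorm = PlainBertLayerNorm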

I ran the modified script with the same command, but I got a CUDA out-of-memory (OOM) error.
Error log: log

Since the unmodified code runs within the available GPU memory without issue, I would expect the traced module to fit as well. Am I doing something wrong?
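To make the comparison concrete, here is a minimal sketch of the instrumentation I can add around the trace call to see where the extra memory goes. log_cuda_mem is a hypothetical helper, logger and the loop variables are the ones from run_squad.py above, and the measurements are just torch.cuda.memory_allocated / max_memory_allocated:

    import torch

    def log_cuda_mem(tag, device=0):
        # Report the caching allocator's current and peak usage in MiB.
        alloc = torch.cuda.memory_allocated(device) / 2**20
        peak = torch.cuda.max_memory_allocated(device) / 2**20
        logger.info("%s: allocated=%.0f MiB, peak=%.0f MiB", tag, alloc, peak)

    # Inside the training loop, around the one-time trace:
    if not traced:
        log_cuda_mem("before trace")
        model = torch.jit.trace(
            model,
            (input_ids, segment_ids, input_mask, start_positions, end_positions),
            check_trace=False,
        )
        traced = True
        log_cuda_mem("after trace")
    loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
    log_cuda_mem("after forward, step %d" % step)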

Environment

PyTorch version: 1.1.0
Is debug build: Yes
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 418.67
cuDNN version: /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip] numpy==1.16.3
[pip] pytorch-pretrained-bert==0.6.2
[pip] torch==1.1.0
[conda] blas 1.0 mkl
[conda] magma-cuda100 2.5.0 1 pytorch
[conda] mkl 2019.3 199
[conda] mkl-include 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch-pretrained-bert 0.6.2
[conda] torch 1.1.0

Thanks,
Tapan


Labels

oncall: jit, triaged
