
[JIT] Memory Leak during tracing? #21454

@chughtapan

Description

Hi,

I am trying to run pytorch-pretrained-BERT through the JIT using the tracing API. Running the example run_squad.py unmodified with the following command works without any issues:

CUDA_VISIBLE_DEVICES="0" python run_squad.py \
       --bert_model bert-large-uncased \
       --fp16 \
       --do_train \
       --do_lower_case \
       --train_file $SQUAD_DIR/train-v1.1.json \
       --predict_file $SQUAD_DIR/dev-v1.1.json \
       --train_batch_size 6 \
       --learning_rate 3e-5 \
       --num_train_epochs 2.0 \
       --max_seq_length 512 \
       --doc_stride 128 \
       --output_dir /tmp/debug_squad/

To run the script with the JIT, I changed the following lines

        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-GPU does the scattering itself
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)

to be

        model.train()
        traced = False
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-GPU does the scattering itself
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                if not traced:
                    model = torch.jit.trace(model, (input_ids, segment_ids, input_mask, start_positions, end_positions), check_trace=False)
                    traced = True
                    logger.info("Tracing complete")
                loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)

I also disabled the FusedLayerNorm here so that the model can be traced.
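For context, by "disabled" I mean forcing the pure-PyTorch LayerNorm fallback that modeling.py already contains for the no-apex case. A monkeypatch along these lines should be equivalent (the pytorch_pretrained_bert.modeling path and the BertLayerNorm name are from version 0.6.2, and the class body below is just the standard LayerNorm formula, so treat this as a sketch rather than the exact edit I made):

    import torch
    from pytorch_pretrained_bert import modeling

    class PlainBertLayerNorm(torch.nn.Module):
        """Pure-PyTorch LayerNorm, matching the apex-free fallback in modeling.py."""

        def __init__(self, hidden_size, eps=1e-12):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.ones(hidden_size))
            self.bias = torch.nn.Parameter(torch.zeros(hidden_size))
            self.variance_epsilon = eps

        def forward(self, x):
            u = x.mean(-1, keepdim=True)
            s = (x - u).pow(2).mean(-1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.variance_epsilon)
            return self.weight * x + self.bias

    # Override before the model is constructed, so every BertLayerNorm instance
    # uses the traceable pure-PyTorch version instead of apex's FusedLayerNorm.
    modeling.BertLayerNorm = PlainBertLayerNorm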

I ran the modified script with the same command, but I got a CUDA out-of-memory (OOM) error.
Error log: log

Since the unmodified code runs within the available GPU memory without issue, I would expect the traced module to fit as well. Am I doing something wrong?
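To make the comparison concrete, here is a minimal sketch of the instrumentation I can add around the trace call to see where the extra memory goes. log_cuda_mem is a hypothetical helper, logger and the loop variables are the ones from run_squad.py above, and the measurements are just torch.cuda.memory_allocated / max_memory_allocated:

    import torch

    def log_cuda_mem(tag, device=0):
        # Report the caching allocator's current and peak usage in MiB.
        alloc = torch.cuda.memory_allocated(device) / 2**20
        peak = torch.cuda.max_memory_allocated(device) / 2**20
        logger.info("%s: allocated=%.0f MiB, peak=%.0f MiB", tag, alloc, peak)

    # Inside the training loop, around the one-time trace:
    if not traced:
        log_cuda_mem("before trace")
        model = torch.jit.trace(
            model,
            (input_ids, segment_ids, input_mask, start_positions, end_positions),
            check_trace=False,
        )
        traced = True
        log_cuda_mem("after trace")
    loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
    log_cuda_mem("after forward, step %d" % step)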

Environment

PyTorch version: 1.1.0
Is debug build: Yes
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 418.67
cuDNN version: /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip] numpy==1.16.3
[pip] pytorch-pretrained-bert==0.6.2
[pip] torch==1.1.0
[conda] blas 1.0 mkl
[conda] magma-cuda100 2.5.0 1 pytorch
[conda] mkl 2019.3 199
[conda] mkl-include 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch-pretrained-bert 0.6.2
[conda] torch 1.1.0

Thanks,
Tapan


Labels

oncall: jit, triaged
