
GPT-oss-20b-BF16: torch.OutOfMemoryError: CUDA out of memory. #3966

@Decadz

Description


Hi,

Many thanks for your great library! I am trying to perform a distributed training run of unsloth/gpt-oss-20b-BF16 on 8x A100s (40 GB each); however, I am running into memory issues when loading the model with the code below. I saw it mentioned that ~65 GB of RAM is needed; I have 8 x 40 GB = 320 GB of total VRAM, so I don't see why there should be an issue.
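
For reference, my rough back-of-the-envelope in plain Python (just a sketch; it assumes ~20B parameters at 2 bytes each in bf16):

# Back-of-the-envelope: bf16 weights alone for a ~20B-parameter model.
n_params = 20e9          # ~20B parameters (assumption for this estimate)
bytes_per_param = 2      # bf16 = 2 bytes per parameter
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~40 GB, i.e. one full A100-40GB per copy

So if each of the 8 processes were holding its own full copy of the weights rather than a ZeRO-3 shard, ~40 GB of weights alone would already exceed a single 40 GB card once the CUDA context is counted.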

Question: Do you have any suggestions as to why this might happen? Am I loading the Unsloth model correctly?

Debugging Steps

  • I have checked that all devices are visible (see the sketch below for the check I ran).
  • I have tried manually setting device_map={'': local_rank}.
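
The per-rank visibility check looks roughly like this (a minimal sketch; LOCAL_RANK is set by the accelerate launcher, and check_devices.py is just a placeholder name):

import os
import torch

# Each rank reports which devices it can see; run via `accelerate launch check_devices.py`.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"rank {local_rank}: "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
      f"device_count={torch.cuda.device_count()}")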

Dependencies

torch: 2.10.0
transformers: 4.57.3
trl: 0.24.0
unsloth: 2026.1.4

Accelerate Config

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
debug: false
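
To confirm the script is actually picking up this config, a quick sanity check (a sketch using the standard accelerate API; run it under accelerate launch with the config above):

from accelerate import Accelerator

# When launched with the config above, the DeepSpeed plugin should be populated.
accelerator = Accelerator()
print(accelerator.state.distributed_type)  # expected: DistributedType.DEEPSPEED
print(accelerator.state.deepspeed_plugin)  # expected: a DeepSpeedPlugin with zero_stage=3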

Shell Script

module purge
module load GCCcore/9.3.0 CUDA/12.2.0 cuDNN/8.9.2.26-CUDA-12.2.0

# Ensuring all GPUs on the node are visible.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Path to accelerate config file.
CONFIG="accelerate_config.yaml"

# Launching the script with accelerate
accelerate launch --config_file ${CONFIG} script.py

Python Code

from unsloth import FastLanguageModel
import torch

max_seq_length = 768
lora_rank = 4

# Loading the model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-BF16",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    offload_embedding=True,
    device_map="balanced",
)

# Applying LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=lora_rank * 2,
    use_gradient_checkpointing="unsloth",
)

Checklist

  1. Did you update? pip install --upgrade unsloth unsloth_zoo. Yes.
  2. Colab, Kaggle, or local/cloud? Local.
  3. Number of GPUs used (use nvidia-smi): 8x A100 40 GB.
  4. Which notebook? Please link! Following this page.
  5. Which Unsloth, TRL, transformers, and PyTorch versions? See above.
  6. Which trainer? SFTTrainer, GRPOTrainer, etc. Will be using GRPOTrainer.

nvidia-smi

(screenshot of nvidia-smi output attached)

Error Trace

(screenshot of the error trace attached)

Labels

help wanted (Help from the OSS community wanted!)
