
GPT-oss-20b-BF16: torch.OutOfMemoryError: CUDA out of memory. #3966

@Decadz

Description


Hi,

Many thanks for your great library! I am trying to perform a distributed training run of unsloth/gpt-oss-20b-BF16 on 8x A100s (40 GB each); however, I am running into memory issues when loading the model with the code below. I saw it mentioned that ~65 GB of RAM is needed; I have 8 x 40 GB = 320 GB of total VRAM, so I don't see why there should be an issue.
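
For reference, my rough back-of-the-envelope in plain Python (just a sketch; it assumes ~20B parameters at 2 bytes each in bf16):

# Back-of-the-envelope: bf16 weights alone for a ~20B-parameter model.
n_params = 20e9          # ~20B parameters (assumption for this estimate)
bytes_per_param = 2      # bf16 = 2 bytes per parameter
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~40 GB, i.e. one full A100-40GB per copy

So if each of the 8 processes were holding its own full copy of the weights rather than a ZeRO-3 shard, ~40 GB of weights alone would already exceed a single 40 GB card once the CUDA context is counted.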

Question: Do you have any suggestions as to why this might happen? Am I loading the Unsloth model correctly?

Debugging Steps

  • I have checked that all devices are visible (see the sketch below for the check I ran).
  • I have tried manually setting device_map={'': local_rank}.
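
The per-rank visibility check looks roughly like this (a minimal sketch; LOCAL_RANK is set by the accelerate launcher, and check_devices.py is just a placeholder name):

import os
import torch

# Each rank reports which devices it can see; run via `accelerate launch check_devices.py`.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"rank {local_rank}: "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
      f"device_count={torch.cuda.device_count()}")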

Dependencies

torch: 2.10.0
transformers: 4.57.3
trl: 0.24.0
unsloth: 2026.1.4

Accelerate Config

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
debug: false
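
To confirm the script is actually picking up this config, a quick sanity check (a sketch using the standard accelerate API; run it under accelerate launch with the config above):

from accelerate import Accelerator

# When launched with the config above, the DeepSpeed plugin should be populated.
accelerator = Accelerator()
print(accelerator.state.distributed_type)  # expected: DistributedType.DEEPSPEED
print(accelerator.state.deepspeed_plugin)  # expected: a DeepSpeedPlugin with zero_stage=3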

Shell Script

module purge
module load GCCcore/9.3.0 CUDA/12.2.0 cuDNN/8.9.2.26-CUDA-12.2.0

# Ensuring all GPUs on the node are visible.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Path to accelerate config file.
CONFIG="accelerate_config.yaml"

# Launching the script with accelerate
accelerate launch --config_file ${CONFIG} script.py

Python Code

from unsloth import FastLanguageModel
import torch

max_seq_length = 768
lora_rank = 4

# Loading the model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-BF16",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    offload_embedding=True,
    device_map="balanced",
)

# Applying LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=lora_rank * 2,
    use_gradient_checkpointing="unsloth",
)

Checklist

  1. Did you update? pip install --upgrade unsloth unsloth_zoo. Yes.
  2. Colab, Kaggle, or local/cloud? Local.
  3. Number of GPUs used (use nvidia-smi): 8x A100 40 GB.
  4. Which notebook? Please link! Following this page.
  5. Which Unsloth, TRL, transformers, and PyTorch versions? See above.
  6. Which trainer? SFTTrainer, GRPOTrainer, etc. Will be using GRPOTrainer.

nvidia-smi

(screenshot of nvidia-smi output attached)

Error Trace

(screenshot of the error trace attached)

Labels

help wanted (Help from the OSS community wanted!)
