Labels: help wanted (Help from the OSS community wanted!)
Description
Hi,
Many thanks for your great library! I am trying to run distributed training of unsloth/gpt-oss-20b-BF16 on 8x A100s (40 GB each); however, I am running into memory issues when loading the model with the code below. I saw it mentioned that ~65 GB of RAM is needed, and I have 8 x 40 GB = 320 GB of GPU memory in total, so I don't see why there should be an issue.
Question: Do you have any suggestions as to why this might be happening? Am I loading the unsloth model correctly?
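For context, a back-of-envelope check of the numbers involved (a sketch; it assumes ~20e9 parameters at 2 bytes each in BF16, and that ZeRO-3 shards weights evenly across the 8 processes):

```python
# Rough memory arithmetic for a 20B-parameter model in BF16.
# Assumptions: 20e9 parameters, 2 bytes per parameter (BF16), 8 GPUs.
params = 20e9
bytes_per_param = 2
num_gpus = 8

full_copy_gb = params * bytes_per_param / 1024**3   # one complete weight copy
per_gpu_shard_gb = full_copy_gb / num_gpus          # ideal ZeRO-3 shard

print(f"One full BF16 weight copy:  ~{full_copy_gb:.1f} GB")
print(f"Per-GPU shard under ZeRO-3: ~{per_gpu_shard_gb:.1f} GB")
```

One full copy is ~37 GB, which nearly fills a single 40 GB A100 on its own, so if each of the 8 launched processes materializes a full copy before sharding takes effect, that alone would explain the OOM regardless of the 320 GB aggregate.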
Debugging Steps
- I have checked that all devices are visible.
- I have tried manually setting device_map={'': local_rank}.
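The two steps above, sketched as plain helpers (visible_gpu_ids and device_map_for are hypothetical names of my own; LOCAL_RANK is the environment variable the accelerate launcher sets per worker):

```python
import os

# Step 1: check which devices this process can see.
# CUDA_VISIBLE_DEVICES is exported as "0,1,2,3,4,5,6,7" in the shell script.
def visible_gpu_ids(env=os.environ):
    """Parse CUDA_VISIBLE_DEVICES into a list of integer device ids."""
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in raw.split(",") if x.strip()]

# Step 2: derive the per-process placement from the local rank the
# accelerate launcher assigns to each worker, pinning the whole model
# ("" = the root module) to that worker's GPU.
def device_map_for(env=os.environ):
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return {"": local_rank}
```

For example, the worker with LOCAL_RANK=3 would get device_map={"": 3}, so each of the 8 processes targets a distinct GPU.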
Dependencies
torch: 2.10.0
transformers: 4.57.3
trl: 0.24.0
unsloth: 2026.1.4
Accelerate Config
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
deepspeed_multinode_launcher: standard
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
debug: false
Shell Script
module purge
module load GCCcore/9.3.0 CUDA/12.2.0 cuDNN/8.9.2.26-CUDA-12.2.0
# Ensuring all GPUs on the node are visible.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Path to accelerate config file.
CONFIG="accelerate_config.yaml"
# Launching the script with accelerate
accelerate launch --config_file ${CONFIG} script.py
Python Code
from unsloth import FastLanguageModel
import torch
max_seq_length = 768
lora_rank = 4
# Loading the model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/gpt-oss-20b-BF16",
    max_seq_length=max_seq_length,
load_in_4bit=False,
offload_embedding=True,
device_map = "balanced",
)
# Applying LoRA to the model.
model = FastLanguageModel.get_peft_model(
model,
r = lora_rank,
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = lora_rank * 2,
use_gradient_checkpointing = "unsloth",
)
Checklist
- Did you update? pip install --upgrade unsloth unsloth_zoo — Yes.
- Colab, Kaggle, or local / cloud? Local.
- Number of GPUs used (nvidia-smi)? 8x A100 40 GB.
- Which notebook? Please link! Following this page.
- Which Unsloth version, TRL version, transformers version, PyTorch version? See above.
- Which trainer? SFTTrainer, GRPOTrainer, etc. Will be using GRPOTrainer.
Error Trace
