Description
Hi,
I'm trying to fine-tune Qwen3-30B-A3B-bnb-4bit with GRPO on an NVIDIA A100 (80 GB), but I'm running into problems (high VRAM usage, training errors, instability).
I'm unsure whether this setup is supported or recommended, especially with the following combination (a rough sketch of the GRPO setup I'm aiming for follows this list):
MoE architecture (Qwen3)
BitsAndBytes 4-bit quantization
GRPOTrainer
FlashAttention / SDPA
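For context, the training setup I have in mind is roughly the following. This is only a sketch: the reward function and dataset are placeholders (not my real ones), the hyperparameters are illustrative, and model / tokenizer come from the loading snippet further below.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt dataset; the real dataset is much larger.
dataset = Dataset.from_list([{"prompt": "Solve: 2 + 2 = ?"}])

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer shorter completions.
    return [-float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir = "outputs",
    learning_rate = 5e-6,
    per_device_train_batch_size = 8,
    num_generations = 8,           # completions sampled per prompt
    max_prompt_length = 512,
    max_completion_length = 1024,
    max_steps = 100,
    bf16 = True,
)

trainer = GRPOTrainer(
    model = model,                 # loaded in the "Model loading" section below
    processing_class = tokenizer,
    reward_funcs = [reward_len],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()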
Environment
GPU: A100 80GB
CUDA: 12.x
PyTorch / Transformers / TRL / PEFT / bitsandbytes: latest
Model loading:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048           # can increase for longer RL outputs
lora_rank = 32                  # larger rank = smarter, but slower
gpu_memory_utilization = 0.9

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/workspace/Qwen3-30B-A3B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,        # False for LoRA 16-bit
    fast_inference = True,
    gpu_memory_utilization = gpu_memory_utilization,
)
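After loading, I attach LoRA adapters using the lora_rank defined above, roughly as below. The target module list is the one from the standard Unsloth examples; I'm not sure it is the right choice for the MoE expert layers, so treat it as an assumption.
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,                       # LoRA rank defined above
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],                                   # assumed; may need adjusting for MoE layers
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)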
Questions
Is 4-bit GRPO finetuning of Qwen3-30B-A3B supported?
Any recommended configs for A100 (precision, attention backend, device_map)?
Should FP16/BF16 be used instead of 4-bit for GRPO? (A sketch of the 16-bit load I would try is included below.)
Any guidance would be appreciated. Thanks 🙏
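For question 3, the 16-bit variant I would try looks like the following. The model name is an assumption (the unquantized checkpoint rather than my local bnb-4bit copy), and I'm not sure the bf16 weights plus vLLM fit comfortably in 80 GB, which is part of the question.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen3-30B-A3B",   # unquantized checkpoint (name assumed)
    max_seq_length = max_seq_length,
    dtype = torch.bfloat16,
    load_in_4bit = False,                # LoRA on 16-bit weights instead of bnb 4-bit
    fast_inference = True,
    gpu_memory_utilization = gpu_memory_utilization,
)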
I get this error:
NotImplementedError Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/unsloth_zoo/vllm_utils.py:2103, in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, unsloth_vllm_standby, is_vision_model, return_args, max_num_seqs)
2102 else:
-> 2103 llm = LLM(**engine_args)
2104 pass
...
File /usr/local/lib/python3.10/dist-packages/unsloth/models/vision.py:754, in FastBaseModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, trust_remote_code, model_types, tokenizer_name, auto_model, use_gradient_checkpointing, supports_sdpa, whisper_language, whisper_task, auto_config, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, unsloth_vllm_standby, **kwargs)
751 load_vllm_kwargs[allowed_arg] = kwargs[allowed_arg]
753 # Load vLLM first
--> 754 llm = load_vllm(**load_vllm_kwargs)
756 # Convert to HF format
757 _, quant_state_dict = get_vllm_state_dict(
758 llm,
759 config = model_config,
760 is_vision_model = is_vlm,
761 )
File /usr/local/lib/python3.10/dist-packages/unsloth_zoo/vllm_utils.py:2128, in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, unsloth_vllm_standby, is_vision_model, return_args, max_num_seqs)
2123 print(
2124 f"Unsloth: Retrying vLLM to process {approx_max_num_seqs} sequences and {max_num_batched_tokens} tokens in tandem.\n"
2125 f"Error:\n{error}"
2126 )
2127 else:
-> 2128 raise RuntimeError(error)
2129 pass
2130 pass
RuntimeError: BitsAndBytesMoEMethod must select appropriate gemm implementation based on the prepare_finalize
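From the traceback, the error is raised inside load_vllm, which FastBaseModel.from_pretrained only calls because fast_inference = True. Purely as a point of comparison (not a confirmed workaround), the same load without the vLLM path would be:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/workspace/Qwen3-30B-A3B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = False,   # skips load_vllm, where the RuntimeError above is raised
    gpu_memory_utilization = gpu_memory_utilization,
)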