The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export, All in One Line
Quick Start • Features • Export Formats • Examples • Documentation
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
quantization_config=bnb_config,
device_map="auto",
)
# Then llama.cpp compilation for GGUF...
# Then manual tensor conversion...
```

With QuantLLM, the same workflow collapses to a few lines:

```python
from quantllm import turbo
model = turbo("meta-llama/Llama-3-8B") # Auto-quantizes
model.generate("Hello!") # Generate text
model.export("gguf", quantization="Q4_K_M") # Export to GGUF# Recommended
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With all export formats
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"from quantllm import turbo
# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("Explain quantum computing simply")
print(response)
# Export to GGUF
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")QuantLLM automatically:
- ✅ Detects your GPU and available memory
- ✅ Applies optimal 4-bit quantization
- ✅ Enables Flash Attention 2 when available
- ✅ Configures memory management
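
A rough sketch of what this kind of auto-detection can look like (illustrative only, not QuantLLM's actual internals; the `pick_runtime_config` helper is made up for this example):

```python
# Illustrative auto-detection sketch; not QuantLLM's implementation.
import importlib.util
import torch

def pick_runtime_config():
    has_cuda = torch.cuda.is_available()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9 if has_cuda else 0.0
    return {
        "device_map": "auto" if has_cuda else {"": "cpu"},
        "torch_dtype": torch.bfloat16 if has_cuda and torch.cuda.is_bf16_supported() else torch.float32,
        # 4-bit NF4 weights when VRAM is the bottleneck
        "load_in_4bit": has_cuda and vram_gb < 24,
        # Flash Attention 2 only if the flash-attn package is importable
        "attn_implementation": "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa",
    }

print(pick_runtime_config())
```
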
One unified interface for everything:
model = turbo("mistralai/Mistral-7B")
model.generate("Hello!")
model.finetune(data, epochs=3)
model.export("gguf", quantization="Q4_K_M")
model.push("user/repo", format="gguf")- Flash Attention 2 โ Auto-enabled for speed
- torch.compile โ 2x faster training
- Dynamic Padding โ 50% less VRAM
- Triton Kernels โ Fused operations
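
Dynamic padding is the easiest of these to picture: each batch is padded only to its own longest sequence instead of a fixed `max_length`. A minimal sketch with plain transformers (the GPT-2 tokenizer is used purely for illustration):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

collator = DataCollatorWithPadding(tokenizer)
features = [tokenizer(t) for t in ["short prompt", "a considerably longer prompt with many more tokens"]]
batch = collator(features)

# Padded only to the longest example in this batch, not a global maximum.
print(batch["input_ids"].shape)
```
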
Llama 2/3, Mistral, Mixtral, Qwen 1/2, Phi 1/2/3, Gemma, Falcon, DeepSeek, Yi, StarCoder, ChatGLM, InternLM, Baichuan, StableLM, BLOOM, OPT, MPT, GPT-NeoX...
| Format | Use Case | Command |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | model.export("gguf") |
| ONNX | ONNX Runtime, TensorRT | model.export("onnx") |
| MLX | Apple Silicon (M1/M2/M3/M4) | model.export("mlx") |
| SafeTensors | HuggingFace | model.export("safetensors") |
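
Once exported, the GGUF file is a normal llama.cpp artifact. For example, with llama-cpp-python (`pip install llama-cpp-python`; the filename assumes the `model.export("gguf", "model.Q4_K_M.gguf", ...)` call shown above):

```python
from llama_cpp import Llama

# Load the exported GGUF and run a short completion.
llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=4096)
out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```
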
A typical console summary when loading a model:

```text
┌────────────────────────────────────────┐
│  QuantLLM v2.0.0                       │
│  Ultra-fast LLM Quantization & Export  │
│  GGUF • ONNX • MLX • SafeTensors       │
└────────────────────────────────────────┘

Model: meta-llama/Llama-3.2-3B
  Parameters: 3.21B
  Memory: 6.4 GB → 1.9 GB (70% saved)
```
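
The memory line is easy to sanity-check: 3.21B parameters at 16 bits each is about 6.4 GB, while Q4_K_M averages roughly 4.8 bits per weight (an approximation; K-quants mix precisions):

```python
params = 3.21e9
fp16_gb = params * 2 / 1e9      # 16-bit weights: ~6.4 GB
q4_gb = params * 4.8 / 8 / 1e9  # ~4.8 bits/weight for Q4_K_M: ~1.9 GB
print(f"{fp16_gb:.1f} GB -> {q4_gb:.1f} GB ({1 - q4_gb / fp16_gb:.0%} saved)")
```
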
Auto-generates model cards with YAML frontmatter, usage examples, and "Use this model" button:
model.push("user/my-model", format="gguf", quantization="Q4_K_M")Export to any deployment target with a single line:
```python
from quantllm import turbo

model = turbo("microsoft/phi-3-mini")

# GGUF → For llama.cpp, Ollama, LM Studio
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# ONNX → For ONNX Runtime, TensorRT
model.export("onnx", "./model-onnx/")

# MLX → For Apple Silicon Macs
model.export("mlx", "./model-mlx/", quantization="4bit")

# SafeTensors → For HuggingFace
model.export("safetensors", "./model-hf/")
```
| Type | Bits | Quality | Use Case |
|---|---|---|---|
| `Q2_K` | 2-bit | 🔴 Low | Minimum size |
| `Q3_K_M` | 3-bit | 🟠 Fair | Very constrained |
| `Q4_K_M` | 4-bit | 🟢 Good | Recommended ⭐ |
| `Q5_K_M` | 5-bit | 🟢 High | Quality-focused |
| `Q6_K` | 6-bit | 🔵 Very High | Near-original |
| `Q8_0` | 8-bit | 🔵 Excellent | Best quality |
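
As a rough guide, file size scales with bits per weight. Approximate sizes for a 3.21B-parameter model (the bits-per-weight averages below are estimates, not exact llama.cpp numbers):

```python
approx_bits_per_weight = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}
params = 3.21e9
for name, bits in approx_bits_per_weight.items():
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
```
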
```python
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Simple generation
response = model.generate(
"Write a Python function for fibonacci",
max_new_tokens=200,
temperature=0.7,
)
print(response)
# Chat format
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)
```
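
Under the hood, a chat-style call boils down to rendering the messages with the tokenizer's chat template before generating. This is the plain-transformers equivalent (a sketch using a small instruct tokenizer purely for illustration; QuantLLM's `model.chat` may differ in detail):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the flat string the model actually sees
```
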
```python
from quantllm import TurboModel

model = TurboModel.from_gguf(
"TheBloke/Llama-2-7B-Chat-GGUF",
filename="llama-2-7b-chat.Q4_K_M.gguf"
)
print(model.generate("Hello!"))
```

```python
from quantllm import turbo
model = turbo("mistralai/Mistral-7B")
# Simple training
model.finetune("training_data.json", epochs=3)
# Advanced configuration
model.finetune(
"training_data.json",
epochs=5,
learning_rate=2e-4,
lora_r=32,
lora_alpha=64,
batch_size=4,
)
```
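
The `lora_r` and `lora_alpha` knobs correspond to the usual PEFT LoRA hyperparameters. Roughly the equivalent of the call above in plain peft terms (an assumed mapping for illustration; QuantLLM wires this up internally, and the target modules vary by architecture):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,               # lora_r: rank of the low-rank update matrices
    lora_alpha=64,      # lora_alpha: scaling factor applied to the update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
)
```
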
Supported data formats:

```json
[
{"instruction": "What is Python?", "output": "Python is..."},
{"text": "Full text for language modeling"},
{"prompt": "Question", "completion": "Answer"}
]
```
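
Any of these shapes can be produced with the standard json module, e.g. to write the `training_data.json` used above:

```python
import json

# Two records in the instruction/output format.
records = [
    {"instruction": "What is Python?", "output": "Python is a high-level programming language."},
    {"instruction": "Name a Python web framework.", "output": "Flask is one example."},
]
with open("training_data.json", "w") as f:
    json.dump(records, f, indent=2)
```
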
model = turbo("meta-llama/Llama-3.2-3B")
# Push with auto-generated model card
model.push(
"your-username/my-model",
format="gguf",
quantization="Q4_K_M",
license="apache-2.0"
)
```
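
Once pushed, the artifact is an ordinary Hub file and can be pulled back with `huggingface_hub` (the GGUF filename below is an assumption; check the repo for the exact name that gets written):

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="your-username/my-model", filename="model.Q4_K_M.gguf")
print(path)  # local cache path of the downloaded file
```
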
| Configuration | GPU VRAM | Recommended Models |
|---|---|---|
| 🟢 Entry | 6-8 GB | 1-7B (4-bit) |
| 🟡 Mid-Range | 12-24 GB | 7-30B (4-bit) |
| 🔴 High-End | 24-80 GB | 70B+ |
Tested GPUs: RTX 3060/3070/3080/3090/4070/4080/4090, A100, H100, Apple M1/M2/M3/M4
```bash
# Basic
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With specific features
pip install "quantllm[gguf]" # GGUF export
pip install "quantllm[onnx]" # ONNX export
pip install "quantllm[mlx]" # MLX export (Apple Silicon)
pip install "quantllm[triton]" # Triton kernels
pip install "quantllm[full]" # Everythingquantllm/
Project layout:

```text
quantllm/
├── core/                # Core API
│   ├── turbo_model.py   # TurboModel unified API
│   └── smart_config.py  # Auto-configuration
├── quant/               # Quantization
│   └── llama_cpp.py     # GGUF conversion
├── hub/                 # HuggingFace
│   ├── hub_manager.py   # Push/pull models
│   └── model_card.py    # Auto model cards
├── kernels/             # Custom kernels
│   └── triton/          # Fused operations
└── utils/               # Utilities
    └── progress.py      # Beautiful UI
```
Development setup:

```bash
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
pytest
```

Areas for contribution:
- New model architectures
- Performance optimizations
- Documentation
- Bug fixes
MIT License. See LICENSE for details.
Made with 🧡 by Dark Coder
⭐ Star on GitHub • Report Bug • Sponsor
Happy Quantizing!