
feat(quantized): add multi-dtype support for bf16/f16 activations#3310

Open
danielclough wants to merge 2 commits into huggingface:main from danielclough:feat/quantized-multi-dtype

Conversation

danielclough (Contributor) commented on Jan 18, 2026

feat(quantized): add multi-dtype support for bf16/f16 activations

Summary

This PR enables running quantized models with different activation data types (f32, bf16, f16) via the new --dtype flag in the quantized example. Using half-precision activations can significantly improve inference speed on GPUs with fast fp16/bf16 tensor cores while maintaining model quality.

Key Changes

User-Facing

  • New --dtype flag in the quantized example to select activation precision (a parsing sketch follows this list):
    • --dtype f32 (default): Standard 32-bit floating point
    • --dtype bf16: BFloat16 - better numerical range, ideal for newer NVIDIA GPUs
    • --dtype f16: Float16 - maximum memory savings, widely supported
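
For illustration, here is a minimal sketch of how such a flag can be wired up with clap and mapped to a candle DType. This is not the example's actual code; the ArgDtype enum and to_dtype helper are hypothetical names.

use candle_core::DType;
use clap::{Parser, ValueEnum};

// Hypothetical enum for the --dtype flag; the real example may structure this differently.
#[derive(Clone, Copy, Debug, ValueEnum)]
enum ArgDtype {
    F32,
    Bf16,
    F16,
}

impl ArgDtype {
    fn to_dtype(self) -> DType {
        match self {
            ArgDtype::F32 => DType::F32,
            ArgDtype::Bf16 => DType::BF16,
            ArgDtype::F16 => DType::F16,
        }
    }
}

#[derive(Parser, Debug)]
struct Args {
    /// Activation precision: f32 (default), bf16, or f16.
    #[arg(long, value_enum, default_value = "f32")]
    dtype: ArgDtype,
}

fn main() {
    let args = Args::parse();
    println!("activation dtype: {:?}", args.dtype.to_dtype());
}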

CUDA Kernel Enhancements

  • Extended quantized.cu with F16/BF16 output support for all quantized matmul kernels
  • Added specialized kernels that dequantize directly to half-precision formats
  • Significant kernel expansion (~1900 lines added) to support mixed-precision operations

Metal Kernel Enhancements

  • Updated quantized.metal with F16/BF16 output support for Apple Silicon
  • Extended kernel templates for half-precision dequantization

Core Infrastructure

  • QMatMul::forward() now handles dtype mismatches automatically via auto-conversion (a simplified sketch follows this list)
  • New QMatMul::from_arc_with_transposed_data() for GGUF files from diffusion tools (stable-diffusion.cpp) that use different data layouts
  • New RmsNorm::from_qtensor_with_dtype() for eager dtype conversion at load time
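
As a rough illustration of the auto-conversion idea (not the PR's actual implementation; QMatMulSketch and its fields are hypothetical), the forward path can cast activations to the dtype the quantized kernels expect and cast the output back:

use candle_core::{DType, Result, Tensor};

// Hypothetical stand-in for QMatMul; the real type in candle-core differs.
struct QMatMulSketch {
    // Dtype the underlying quantized kernels work in (F32, or F16/BF16 with
    // the kernels added in this PR).
    kernel_dtype: DType,
}

impl QMatMulSketch {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        if xs.dtype() == self.kernel_dtype {
            self.forward_inner(xs)
        } else {
            // Convert activations in, run the quantized matmul, convert back.
            let converted = xs.to_dtype(self.kernel_dtype)?;
            let ys = self.forward_inner(&converted)?;
            ys.to_dtype(xs.dtype())
        }
    }

    fn forward_inner(&self, xs: &Tensor) -> Result<Tensor> {
        // Placeholder for the actual quantized matmul dispatch.
        Ok(xs.clone())
    }
}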

Model Loading Improvements

  • ModelWeights::from_gguf() and ModelWeights::from_ggml() now auto-infer the activation dtype from the embedding tensor's storage format (F16/BF16 embeddings → matching activation dtype); see the sketch after this list
  • New ModelWeights::from_gguf_with_dtype() and ModelWeights::from_ggml_with_dtype() for explicit dtype control
  • Pre-converted cos/sin/neg_inf tensors to the target dtype during model loading to avoid runtime conversion overhead
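
The inference rule can be pictured as a small mapping from the embedding tensor's storage dtype to the activation dtype. The helper below is a hypothetical sketch operating on an already-loaded tensor, not the PR's code, which reads the storage format from the GGUF/GGML metadata:

use candle_core::{DType, Device, Result, Tensor};

// Hypothetical helper: F16/BF16 embeddings map to the matching half-precision
// activation dtype, everything else falls back to F32.
fn infer_activation_dtype(embeddings: &Tensor) -> DType {
    match embeddings.dtype() {
        DType::F16 => DType::F16,
        DType::BF16 => DType::BF16,
        _ => DType::F32,
    }
}

fn main() -> Result<()> {
    let emb = Tensor::zeros((16, 32), DType::BF16, &Device::Cpu)?;
    assert_eq!(infer_activation_dtype(&emb), DType::BF16);
    Ok(())
}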

Bug Fixes

  • Fixed dtype mismatch in LayerNorm operations
  • Fixed dtype mismatch in rms_norm operations (see the sketch after this list)
  • Fixed routing weights dtype in MoE models
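
The common pattern behind these fixes is to make sure an op's weights share the activations' dtype before the kernel runs. A minimal sketch of that pattern, assuming a BF16 activation tensor and an F32-loaded weight (not the PR's exact code):

use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Activations in BF16, norm weight loaded as F32.
    let xs = Tensor::randn(0f32, 1.0, (2, 8), &dev)?.to_dtype(DType::BF16)?;
    let weight = Tensor::ones(8, DType::F32, &dev)?;
    // Align the weight's dtype with the activations before calling the op.
    let weight = if weight.dtype() == xs.dtype() {
        weight
    } else {
        weight.to_dtype(xs.dtype())?
    };
    let ys = candle_nn::ops::rms_norm(&xs, &weight, 1e-5)?;
    println!("output dtype: {:?}", ys.dtype());
    Ok(())
}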

Files Changed

File | Changes
candle-core/src/quantized/cuda.rs | +392 lines (CUDA backend dtype handling)
candle-core/src/quantized/metal.rs | +98 lines (Metal backend dtype handling)
candle-core/src/quantized/mod.rs | +30 lines (QMatMul dtype conversion)
candle-examples/examples/quantized/main.rs | +18 lines (CLI flag)
candle-kernels/src/quantized.cu | +1901 lines (CUDA kernels)
candle-metal-kernels/src/kernels/quantized.rs | +84 lines (Metal kernel bindings)
candle-metal-kernels/src/metal_src/quantized.metal | +1328 lines (Metal kernels)
candle-nn/src/layer_norm.rs | +17 lines (dtype fixes)
candle-nn/src/ops.rs | +8 lines (dtype fixes)
candle-transformers/src/models/quantized_llama.rs | +76 lines (model loading)
candle-transformers/src/quantized_nn.rs | +11 lines (RmsNorm helper)

Total: +3,441 lines, -522 lines across 11 files

Usage Example

# Run with BFloat16 activations for faster inference
cargo run --release --features cuda --example quantized -- \
    --model llama3.1-8b-instruct \
    --dtype bf16 \
    --prompt "Hello, world!"

# Run with Float16 activations
cargo run --release --features cuda --example quantized -- \
    --model llama3.1-8b-instruct \
    --dtype f16 \
    --prompt "Hello, world!"

Performance Impact

Using --dtype bf16 or --dtype f16 can provide:

  • Faster matrix multiplications on GPUs with dedicated tensor cores
  • Reduced memory bandwidth requirements
  • Lower overall memory footprint for activations

Compatibility Notes

  • GGUF files from llama.cpp: Work as expected with all dtype options
  • GGUF files from stable-diffusion.cpp: May require transposed data handling (use QMatMul::from_arc_with_transposed_data(); a selection sketch follows this list)
  • GGML files: Fully supported with the new dtype parameter
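
A sketch of how a loader might choose between the two constructors for such files. The signature of QMatMul::from_arc_with_transposed_data() is assumed here to mirror QMatMul::from_arc(), and the from_diffusion_tool flag is purely illustrative:

use std::sync::Arc;

use candle_core::quantized::{QMatMul, QTensor};
use candle_core::Result;

// Illustrative helper: pick the transposed-data path for GGUF files written
// by diffusion tools such as stable-diffusion.cpp.
fn build_qmatmul(weight: Arc<QTensor>, from_diffusion_tool: bool) -> Result<QMatMul> {
    if from_diffusion_tool {
        // Added in this PR; signature assumed to mirror from_arc.
        QMatMul::from_arc_with_transposed_data(weight)
    } else {
        QMatMul::from_arc(weight)
    }
}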

Testing

Tested with various quantized models (Q4_0, Q4_K_M, Q5_K_M, Q8_0) using f32, bf16, and f16 activation dtypes on both Metal and CUDA hardware.


Although this is unrelated to my code changes, I'm changing it because it causes clippy to fail on macOS:

98 -    #[error("{op} can only be performed on a single dimension")]                                                    
98 +    #[error("{op} can only be performed on a single dimension, found {dims:?}")]  

Clippy Error:

error: value assigned to `dims` is never read
  --> candle-core/src/error.rs:99:45
   |
99 |     OnlySingleDimension { op: &'static str, dims: Vec<usize> },
   |                                             ^^^^
   |
   = help: maybe it is overwritten before being read?
   = note: `-D unused-assignments` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(unused_assignments)]`

error: could not compile `candle-core` (lib) due to 1 previous error
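
For context, a self-contained reproduction of the pattern (candle-core's error enum uses thiserror): once the format string references dims, the value is actually used and the warning above no longer fires.

use thiserror::Error;

#[derive(Debug, Error)]
enum DemoError {
    // Referencing `dims` in the message means the value is read, which is
    // what the one-line change above achieves in candle-core/src/error.rs.
    #[error("{op} can only be performed on a single dimension, found {dims:?}")]
    OnlySingleDimension { op: &'static str, dims: Vec<usize> },
}

fn main() {
    let err = DemoError::OnlySingleDimension { op: "arg_sort", dims: vec![2, 3] };
    println!("{err}");
}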

Enable running quantized models with different activation dtypes via
the new --dtype flag in the quantized example. This improves inference
speed on GPUs with fast fp16/bf16 tensor cores.

Changes:
- Add --dtype flag to quantized example (f32, bf16, f16)
- Add F16/BF16 output support to CUDA quantized matmul kernels
- Add dtype mismatch auto-conversion in QMatMul::forward()
- Add RmsNorm::from_qtensor_with_dtype() for eager dtype conversion
- Add transposed data layout handling for GGUF files from diffusion tools
- Fix dtype mismatch in LayerNorm and rms_norm operations