klein.c is a compact, pure C implementation of iris.c for text-to-image generation with FLUX.2 Klein transformer models, built specifically for Windows with native Win32 GUI support.
It provides a full CPU inference pipeline that generates images from text prompts using the FLUX.2 diffusion transformer architecture. klein.c is derived from and inspired by iris.c by Salvatore Sanfilippo (@antirez), with optimizations for CPU-only execution on Windows.
- Pure CPU Inference: No GPU required - runs entirely on CPU using BLAS/OpenBLAS
- Windows Native GUI: Built-in Win32 graphical interface for easy image generation
- Memory-Efficient: Sequential model loading (Encoder -> Transformer -> VAE) to minimize RAM usage
- BF16 Hardware Detection: Automatic detection of AVX512-BF16 support (Intel Ice Lake+, AMD Zen 4+)
- High-Resolution Timers: Detailed benchmarking with Windows QueryPerformanceCounter APIs
- Multiple Output Formats: Saves images as both BMP and PNG formats
klein.c implements the complete FLUX.2 inference pipeline:
- Vocabulary: 151,936 tokens
- Hidden Size: 2,560
- Layers: 36 transformer layers
- Attention: 32 heads with 8 KV heads
- Sequence Length: 512 (padded)
- Output Layers: Layers 9, 18, 27 concatenated for final embeddings
- Embedding Dimension: 7,680 (3 × 2,560)
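The 7,680-dim embedding comes from concatenating each token's hidden state from layers 9, 18, and 27. A minimal sketch of that concatenation, assuming a row-major `[seq, hidden]` layout (the layout is our assumption, not taken from klein.c):

```c
#include <string.h>

#define HIDDEN 2560

/* Concatenate per-token hidden states from three intermediate
 * encoder layers into one 3*HIDDEN = 7680-dim embedding. */
static void concat_layers(float *out, const float *l9,
                          const float *l18, const float *l27,
                          int seq_len) {
    for (int t = 0; t < seq_len; t++) {
        float *dst = out + (size_t)t * 3 * HIDDEN;
        memcpy(dst,              l9  + (size_t)t * HIDDEN, HIDDEN * sizeof(float));
        memcpy(dst + HIDDEN,     l18 + (size_t)t * HIDDEN, HIDDEN * sizeof(float));
        memcpy(dst + 2 * HIDDEN, l27 + (size_t)t * HIDDEN, HIDDEN * sizeof(float));
    }
}
```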
The tokenizer uses BPE (Byte Pair Encoding) with a custom vocabulary and merge table.
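The heart of BPE is repeatedly fusing the highest-priority adjacent token pair from the merge table. The fuse step alone can be sketched like this (the function name and signature are illustrative, not klein.c's actual API):

```c
/* Fuse every adjacent (a, b) pair in a token-id sequence into the
 * merged id m, compacting in place. Real BPE loops this, each time
 * picking the best-ranked pair from the merge table. */
static int merge_pair(int *toks, int n, int a, int b, int m) {
    int w = 0;
    for (int r = 0; r < n; r++) {
        if (r + 1 < n && toks[r] == a && toks[r + 1] == b) {
            toks[w++] = m;   /* replace the pair with the merged id */
            r++;             /* skip the second half of the pair    */
        } else {
            toks[w++] = toks[r];
        }
    }
    return w; /* new sequence length */
}
```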
- Hidden Size: 3,072
- Attention Heads: 24
- Head Dimension: 128
- MLP Hidden: 9,216 (3× hidden)
- Double Blocks: 5 (joint image-text attention)
- Single Blocks: 20 (image-only attention)
- Latent Channels: 128
- RoPE Theta: 2,000
- Max Sequence: 52,000 tokens
The transformer uses rectified flow for faster convergence, predicting velocity instead of noise.
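With a velocity prediction, each denoising step is just a linear (Euler) update of the latent along the predicted field. A minimal sketch of one step; the actual step function in klein.c may differ:

```c
#include <stddef.h>

/* One Euler integration step of rectified flow: the transformer
 * predicts a velocity v(x, t), and the latent moves linearly
 * along it by the step size dt. */
static void euler_step(float *latent, const float *velocity,
                       float dt, size_t n) {
    for (size_t i = 0; i < n; i++)
        latent[i] += dt * velocity[i];
}
```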
- Latent Channels: 32 → 128
- Base Channels: 128
- Channel Multipliers: [1, 2, 4, 4]
- Resolution: 8× spatial compression
- Residual Blocks: 2 per layer
- Attention Blocks: Included in decoder
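Putting the numbers together: the VAE compresses 8× spatially into 32 channels, and packing 2×2 latent patches into the channel dimension yields the 128-channel latent at H/16 × W/16 seen in the pipeline diagram. A small helper showing that arithmetic (our interpretation of the stated shapes):

```c
typedef struct { int c, h, w; } LatentShape;

/* Compute the transformer-side latent shape from the image size:
 * 8x VAE compression times 2x2 patchification = /16 spatially,
 * 32 VAE channels * 4 = 128 channels. */
static LatentShape latent_shape(int img_h, int img_w) {
    LatentShape s;
    s.c = 32 * 2 * 2;
    s.h = img_h / 16;
    s.w = img_w / 16;
    return s;
}
```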
```
Text Prompt
    ↓
[1] Qwen3 Encoder (load → encode → free)
    ↓
Text Embeddings [512, 7680]
    ↓
[2] FLUX Transformer (load → denoise → free)
    ↓
Denoised Latent [128, H/16, W/16]
    ↓
[3] VAE Decoder (load → decode → free)
    ↓
Final Image [3, H, W]
    ↓
Save as PNG/BMP
```
```
klein_cpu.exe <model_dir> [prompt] [-s steps] [-S seed] [-W width] [-H height]
```

Arguments:

- `model_dir` - Path to the FLUX.2 model directory (containing safetensors files)
- `prompt` - Text description of the image to generate (default: "a red apple")
- `-s steps` - Number of denoising steps (default: 1)
- `-S seed` - Random seed for reproducibility (default: 42)
- `-W width` - Output image width (default: 64)
- `-H height` - Output image height (default: 64)
Example:
```
klein_cpu.exe C:/models/flux-klein "a beautiful sunset over ocean" -s 4 -S 123 -W 512 -H 512
```

Simply run `klein_cpu.exe` without arguments to launch the graphical interface:

```
klein_cpu.exe
```
The GUI provides:
- Text prompt input
- Model folder selection (with browse button)
- Width/Height/Seed/Steps configuration
- Generate button
- Status display with inference time
- Generated image preview
klein.c includes detailed timing for each pipeline stage:
```
================================================================================
PERFORMANCE TIMINGS
================================================================================
Encoder Loading:     8.50 seconds
Transformer Load:   15.20 seconds
VAE Loading:        12.30 seconds
--------------------------------------------------------------------------------
Text Encoding:       2.10 seconds
Denoising:          45.00 seconds
VAE Decoding:        8.50 seconds
--------------------------------------------------------------------------------
TOTAL INFERENCE:    91.60 seconds
================================================================================
```
The application automatically detects hardware support for BF16:
- Native (AVX512-BF16): Intel Ice Lake+ processors
- Emulated (F32): Older CPUs without BF16 support
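AVX512-BF16 support is reported by CPUID leaf 7, sub-leaf 1, EAX bit 5. A sketch of what such a check may look like (klein.c's actual detection code may differ; non-x86 builds simply report no support):

```c
/* Detect AVX512-BF16 via CPUID.(EAX=7, ECX=1):EAX[bit 5]. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
#  if defined(_MSC_VER)
#    include <intrin.h>
static int has_avx512_bf16(void) {
    int regs[4];
    __cpuidex(regs, 7, 1);
    return (regs[0] >> 5) & 1;    /* EAX bit 5 */
}
#  else
#    include <cpuid.h>
static int has_avx512_bf16(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPU too old for leaf 7 */
    return (eax >> 5) & 1;        /* EAX bit 5 */
}
#  endif
#else
static int has_avx512_bf16(void) { return 0; }
#endif
```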
klein.c requires the FLUX.2 Klein model files in safetensors format:
```
model_dir/
├── model.safetensors       # Main model weights
├── tokenizer.json          # BPE tokenizer
└── tokenizer_config.json   # Tokenizer configuration
```
Expected tensor names:
- `encoder.*` - Qwen3 encoder weights
- `transformer.*` - FLUX transformer weights
- `vae.*` - VAE decoder weights
klein.c uses a low-RAM sequential loading strategy:
- Load encoder → encode text → free encoder
- Load transformer → denoise → free transformer
- Load VAE → decode → free VAE
This keeps peak memory low: only one model is ever resident at a time.
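A toy illustration of the load → use → free discipline, using a resident-model counter to show that the peak never exceeds one (everything here is a stand-in, not klein.c's actual API):

```c
static int resident = 0, peak = 0;

static void stage_load(void) { if (++resident > peak) peak = resident; }
static void stage_free(void) { resident--; }

/* Run the three stages sequentially and return the peak number of
 * models resident at once: 1, versus 3 if all were loaded up front. */
static int run_pipeline(void) {
    stage_load(); /* [1] Qwen3 encoder: encode text      */ stage_free();
    stage_load(); /* [2] FLUX transformer: denoise       */ stage_free();
    stage_load(); /* [3] VAE decoder: decode to pixels   */ stage_free();
    return peak;
}
```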
- QueryPerformanceCounter: High-resolution timing
- Win32 GUI: Native window with controls
- CreateProcess: Spawns CLI for generation from GUI
- SHBrowseForFolder: Folder browser dialog
- BMP/PNG Saving: Windows-compatible image formats
- Weights: Stored as FP16/BF16, converted to FP32 for computation
- Latents: Float32 throughout pipeline
- Attention: Flash attention style with proper masking
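BF16 is simply the top 16 bits of an IEEE-754 float32, so the conversion to FP32 for computation is a 16-bit shift and a bit-pattern reinterpretation. A self-contained sketch (the helper name is ours):

```c
#include <stdint.h>
#include <string.h>

/* Widen one bfloat16 value (stored as uint16_t) to float by shifting
 * its bits into the high half of a float32 and reinterpreting. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* type-pun without UB */
    return f;
}
```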
- Windows 10/11
- MSVC or MinGW-w64 compiler
- OpenBLAS (optional, for faster matrix operations)
```
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022"   # or "MinGW Makefiles"
cmake --build . --config Release
```

```
klein.c/
├── main_cpu.c       # Entry point + GUI implementation
├── klein_cpu.h      # Header with all API definitions
├── klein_cpu.c      # Implementation of all components
├── CMakeLists.txt   # CMake build configuration
└── README.md        # This file
```
| Feature | iris.c | klein.c (klein_cpu) |
|---|---|---|
| Platform | macOS/Linux | Windows |
| GPU | Metal (Apple Silicon) | CPU only |
| Dependencies | Optional BLAS | OpenBLAS (optional) |
| GUI | Terminal display | Win32 native GUI |
| Models | Multiple FLUX variants | FLUX.2 Klein focused |
| Memory | mmap support | Sequential loading |
- Original iris.c: Salvatore Sanfilippo (@antirez)
- FLUX.2 Models: Black Forest Labs
- klein.c/CPU Port: Camenduru
MIT License
This project is derived from iris.c which is also MIT licensed. See the original iris.c repository for more details.