camenduru/klein.c

klein.c - FLUX.2 Klein CPU Text-to-Image Generator

klein.c is a compact, pure C reimplementation of iris.c for text-to-image generation with FLUX.2 Klein transformer models. It targets Windows and ships with a native Win32 GUI.

[Image: Screenshot 2026-02-26 155405]

Overview

klein.c is a full CPU inference pipeline that generates images from text prompts using the FLUX.2 diffusion transformer architecture. It is derived from and inspired by iris.c by Salvatore Sanfilippo (@antirez), with optimizations for CPU-only execution on Windows platforms.

Key Features

  • Pure CPU Inference: No GPU required; runs entirely on the CPU, optionally accelerated by OpenBLAS
  • Windows Native GUI: Built-in Win32 graphical interface for easy image generation
  • Memory-Efficient: Sequential model loading (Encoder -> Transformer -> VAE) to minimize RAM usage
  • BF16 Hardware Detection: Automatic detection of AVX512-BF16 support (Intel Cooper Lake and Sapphire Rapids Xeons, AMD Zen 4+)
  • High-Resolution Timers: Detailed benchmarking with Windows QueryPerformanceCounter APIs
  • Multiple Output Formats: Saves images as both BMP and PNG formats

Architecture

klein.c implements the complete FLUX.2 inference pipeline:

1. Qwen3 Text Encoder

  • Vocabulary: 151,936 tokens
  • Hidden Size: 2,560
  • Layers: 36 transformer layers
  • Attention: 32 heads with 8 KV heads
  • Sequence Length: 512 (padded)
  • Output Layers: Layers 9, 18, 27 concatenated for final embeddings
  • Embedding Dimension: 7,680 (3 × 2,560)
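
The head counts and embedding width above can be checked with a little arithmetic. The following is a minimal sketch, not code from klein.c: the function names are hypothetical, and the contiguous grouping of query heads onto KV heads is the common grouped-query-attention convention, assumed here rather than confirmed by the source.

```c
#include <assert.h>

/* Values taken from the encoder list above. */
enum { Q_HEADS = 32, KV_HEADS = 8, HIDDEN = 2560, TAPPED_LAYERS = 3 };

/* Hypothetical helper: index of the KV head shared by a query head,
 * assuming contiguous groups (grouped-query attention). */
static int kv_head_for(int q_head) {
    assert(q_head >= 0 && q_head < Q_HEADS);
    return q_head / (Q_HEADS / KV_HEADS);   /* 4 query heads per KV head */
}

/* Width of the concatenated embedding: one HIDDEN-wide slice per tapped layer. */
static int embedding_dim(void) {
    return TAPPED_LAYERS * HIDDEN;
}
```

With these numbers, query heads 0 to 3 share KV head 0, and the three tapped layers give 3 × 2,560 = 7,680.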

The tokenizer uses BPE (Byte Pair Encoding) with a custom vocabulary and merge table.

2. FLUX Transformer (Rectified Flow)

  • Hidden Size: 3,072
  • Attention Heads: 24
  • Head Dimension: 128
  • MLP Hidden: 9,216 (3× hidden)
  • Double Blocks: 5 (joint image-text attention)
  • Single Blocks: 20 (image-only attention)
  • Latent Channels: 128
  • RoPE Theta: 2,000
  • Max Sequence: 52,000 tokens

The transformer uses rectified flow for faster convergence, predicting velocity instead of noise.
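
To illustrate what predicting velocity means in practice, here is a minimal sketch of one Euler step of a rectified-flow sampler. It assumes the usual convention that t runs from 1 (pure noise) to 0 (image) and that the model output v approximates dx/dt. The function name and signature are hypothetical, and the actual time schedule used by klein.c is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advance the latent x by one Euler step along the
 * predicted velocity v, moving from time t_curr to time t_next. */
static void euler_step(float *x, const float *v, size_t n,
                       float t_curr, float t_next) {
    assert(x != NULL && v != NULL);
    float dt = t_next - t_curr;   /* negative when stepping from noise toward the image */
    for (size_t i = 0; i < n; i++) {
        x[i] += dt * v[i];
    }
}
```

A run with -s 4 would call a step like this four times, each time with a fresh velocity from the transformer.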

3. VAE (Variational Autoencoder)

  • Latent Channels: 32 (packed into 128 by 2×2 patching for the transformer)
  • Base Channels: 128
  • Channel Multipliers: [1, 2, 4, 4]
  • Resolution: 8× spatial compression
  • Residual Blocks: 2 per layer
  • Attention Blocks: Included in decoder
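
The channel widths and the 8× factor follow from the list above if one 2× resample sits between consecutive levels. That layout is an assumption based on the common VAE design, not something the source states, and the helper names below are hypothetical.

```c
#include <assert.h>

enum { BASE_CH = 128, LEVELS = 4 };
static const int CH_MULT[LEVELS] = {1, 2, 4, 4};

/* Channel width at a given resolution level: base channels times multiplier. */
static int level_channels(int level) {
    assert(level >= 0 && level < LEVELS);
    return BASE_CH * CH_MULT[level];
}

/* Spatial compression, assuming one 2x resample between consecutive levels. */
static int spatial_factor(void) {
    return 1 << (LEVELS - 1);
}
```

Under that assumption the levels are 128, 256, 512 and 512 channels wide, and three resamples give the 8× factor.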

Inference Flow

Text Prompt
    ↓
[1] Qwen3 Encoder (load → encode → free)
    ↓
Text Embeddings [512, 7680]
    ↓
[2] FLUX Transformer (load → denoise → free)
    ↓
Denoised Latent [128, H/16, W/16]
    ↓
[3] VAE Decoder (load → decode → free)
    ↓
Final Image [3, H, W]
    ↓
Save as PNG/BMP

Command-Line Usage

CLI Mode

klein_cpu.exe <model_dir> [prompt] [-s steps] [-S seed] [-W width] [-H height]

Arguments:

  • model_dir - Path to the FLUX.2 model directory (containing safetensors files)
  • prompt - Text description of the image to generate (default: "a red apple")
  • -s steps - Number of denoising steps (default: 1)
  • -S seed - Random seed for reproducibility (default: 42)
  • -W width - Output image width (default: 64)
  • -H height - Output image height (default: 64)

Example:

klein_cpu.exe C:/models/flux-klein "a beautiful sunset over ocean" -s 4 -S 123 -W 512 -H 512
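
For readers who want to see how the documented defaults and flags fit together, here is a minimal argument-parsing sketch. The struct and function are hypothetical and are not taken from main_cpu.c. Only the flag letters and default values come from the list above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical options struct mirroring the documented arguments. */
typedef struct {
    const char *model_dir;
    const char *prompt;
    int steps;
    long seed;
    int width, height;
} klein_args;

/* Parse argv into *out, applying the documented defaults first.
 * Returns 0 on success, -1 on a missing model_dir or a malformed flag. */
static int parse_args(int argc, char **argv, klein_args *out) {
    assert(out != NULL);
    out->model_dir = NULL;
    out->prompt = "a red apple";
    out->steps = 1;
    out->seed = 42;
    out->width = 64;
    out->height = 64;
    if (argc < 2) return -1;            /* no arguments: the real program opens the GUI */
    out->model_dir = argv[1];
    int i = 2;
    if (i < argc && argv[i][0] != '-') out->prompt = argv[i++];
    for (; i < argc; i++) {
        if (i + 1 >= argc) return -1;   /* every flag needs a value */
        const char *flag = argv[i];
        const char *val = argv[++i];
        if (strcmp(flag, "-s") == 0)      out->steps = atoi(val);
        else if (strcmp(flag, "-S") == 0) out->seed = strtol(val, NULL, 10);
        else if (strcmp(flag, "-W") == 0) out->width = atoi(val);
        else if (strcmp(flag, "-H") == 0) out->height = atoi(val);
        else return -1;
    }
    return 0;
}
```

Applied to the example command, this yields 4 steps, seed 123 and a 512×512 image.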

GUI Mode

Simply run klein_cpu.exe without arguments to launch the graphical interface:

klein_cpu.exe

The GUI provides:

  • Text prompt input
  • Model folder selection (with browse button)
  • Width/Height/Seed/Steps configuration
  • Generate button
  • Status display with inference time
  • Generated image preview

Performance

Benchmarking Features

klein.c includes detailed timing for each pipeline stage:

================================================================================
  PERFORMANCE TIMINGS
================================================================================
  Encoder Loading:      8.50 seconds
  Transformer Load:    15.20 seconds
  VAE Loading:         12.30 seconds
  ---------------------------------------------------------------------------
  Text Encoding:        2.10 seconds
  Denoising:           45.00 seconds
  VAE Decoding:         8.50 seconds
  ---------------------------------------------------------------------------
  TOTAL INFERENCE:     91.60 seconds
================================================================================
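
Per-stage timings like these can be collected with a monotonic clock read before and after each stage. The sketch below uses QueryPerformanceCounter on Windows. The clock_gettime branch is added only so the example also compiles elsewhere, and the function name is hypothetical.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime on strict compilers */
#endif
#include <assert.h>
#if defined(_WIN32)
#include <windows.h>
#else
#include <time.h>
#endif

/* Hypothetical helper: seconds from a monotonic high-resolution clock. */
static double now_seconds(void) {
#if defined(_WIN32)
    LARGE_INTEGER freq, ticks;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&ticks);
    assert(freq.QuadPart > 0);
    return (double)ticks.QuadPart / (double)freq.QuadPart;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}
```

A stage time is then the difference between two readings, for example one taken before and one after loading the encoder.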

BF16 Support Detection

The application automatically detects hardware support for BF16:

  • Native (AVX512-BF16): CPUs that report the AVX512-BF16 feature flag, such as Intel Cooper Lake and Sapphire Rapids Xeons and AMD Zen 4+
  • Emulated (F32): Older CPUs without BF16 support
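
One way to perform such a check is to read the AVX512_BF16 bit from CPUID (leaf 7, subleaf 1, EAX bit 5). The sketch below is an assumption about how detection could be done, not the code klein.c uses, and the function names are hypothetical. A complete check would also confirm through XGETBV that the operating system has enabled AVX-512 register state, which is omitted here.

```c
#include <assert.h>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: 1 if CPUID reports AVX512-BF16, otherwise 0. */
static int has_avx512_bf16(void) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 0);                   /* regs[0] = highest supported leaf */
    if (regs[0] < 7) return 0;
    __cpuidex(regs, 7, 1);              /* leaf 7, subleaf 1 */
    return (regs[0] >> 5) & 1;          /* EAX bit 5 = AVX512_BF16 */
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((eax >> 5) & 1u);
#else
    return 0;                           /* non-x86 targets: no AVX512-BF16 */
#endif
}

/* Label matching the two modes listed above. */
static const char *bf16_mode_name(int native) {
    assert(native == 0 || native == 1);
    return native ? "Native (AVX512-BF16)" : "Emulated (F32)";
}
```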

Model Requirements

klein.c requires the FLUX.2 Klein model files in safetensors format:

model_dir/
├── model.safetensors          # Main model weights
├── tokenizer.json             # BPE tokenizer
└── tokenizer_config.json      # Tokenizer configuration

Expected tensor names:

  • encoder.* - Qwen3 encoder weights
  • transformer.* - FLUX transformer weights
  • vae.* - VAE decoder weights
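
A loader can route each tensor to its stage by name prefix. The following is a minimal sketch with a hypothetical enum and function. Only the three prefixes come from the list above.

```c
#include <assert.h>
#include <string.h>

typedef enum { PART_ENCODER, PART_TRANSFORMER, PART_VAE, PART_UNKNOWN } model_part;

/* Hypothetical helper: classify a tensor by its documented name prefix. */
static model_part classify_tensor(const char *name) {
    assert(name != NULL);
    if (strncmp(name, "encoder.", 8) == 0)      return PART_ENCODER;
    if (strncmp(name, "transformer.", 12) == 0) return PART_TRANSFORMER;
    if (strncmp(name, "vae.", 4) == 0)          return PART_VAE;
    return PART_UNKNOWN;
}
```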

Technical Details

Memory Management

klein.c uses a low-RAM sequential loading strategy:

  1. Load encoder → encode text → free encoder
  2. Load transformer → denoise → free transformer
  3. Load VAE → decode → free VAE

Because only one model is resident at a time, peak memory is bounded by the largest of the three models rather than by their sum.
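
The effect of the load-use-free pattern on peak memory can be made concrete with a small calculation. The functions below are hypothetical and work on whatever model sizes the caller supplies. They ignore activations and other working buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Peak when models are loaded and freed one after another: the largest model. */
static size_t peak_sequential(const size_t *sizes, size_t n) {
    assert(sizes != NULL || n == 0);
    size_t peak = 0;
    for (size_t i = 0; i < n; i++) {
        if (sizes[i] > peak) peak = sizes[i];
    }
    return peak;
}

/* Peak when every model stays resident: the sum of all models. */
static size_t peak_resident(const size_t *sizes, size_t n) {
    assert(sizes != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) total += sizes[i];
    return total;
}
```

For three models of made-up sizes 8, 12 and 1 units, sequential loading peaks at 12 units while keeping all three resident needs 21.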

Windows Integration

  • QueryPerformanceCounter: High-resolution timing
  • Win32 GUI: Native window with controls
  • CreateProcess: Spawns CLI for generation from GUI
  • SHBrowseForFolder: Folder browser dialog
  • BMP/PNG Saving: Windows-compatible image formats
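
A GUI that delegates generation to the CLI needs to build a command line and start a child process. The sketch below shows one possible shape for that. All function names are hypothetical, the flags are the ones documented under Command-Line Usage, and arguments containing double quotes are rejected because Windows command-line escaping is not handled here (a path ending in a backslash would also need care).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#if defined(_WIN32)
#include <windows.h>
#endif

/* Build the CLI command line into buf. Returns 0 on success, -1 if it does
 * not fit or if an argument contains a double quote. */
static int build_cmdline(char *buf, size_t cap, const char *exe,
                         const char *model_dir, const char *prompt,
                         int steps, long seed, int w, int h) {
    assert(buf != NULL && exe != NULL && model_dir != NULL && prompt != NULL);
    if (strchr(exe, '"') || strchr(model_dir, '"') || strchr(prompt, '"')) return -1;
    int n = snprintf(buf, cap, "\"%s\" \"%s\" \"%s\" -s %d -S %ld -W %d -H %d",
                     exe, model_dir, prompt, steps, seed, w, h);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

#if defined(_WIN32)
/* Launch the command line without a console window and wait for it to exit. */
static int run_cli(char *cmdline) {
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;
    ZeroMemory(&si, sizeof si);
    si.cb = sizeof si;
    if (!CreateProcessA(NULL, cmdline, NULL, NULL, FALSE, CREATE_NO_WINDOW,
                        NULL, NULL, &si, &pi)) {
        return -1;
    }
    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
#endif
```

CreateProcessA may modify the command-line buffer, which is why run_cli takes a writable char pointer rather than a string literal.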

Data Types

  • Weights: Stored as FP16/BF16, converted to FP32 for computation
  • Latents: Float32 throughout pipeline
  • Attention: Flash attention style with proper masking
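
Converting BF16 weights to FP32 is cheap because BF16 is the upper 16 bits of an IEEE-754 float. The sketch below shows the software path under that definition. The function names are hypothetical and this is not necessarily how klein.c implements the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value: place its 16 bits in the top half of a 32-bit float. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* bit cast without aliasing issues */
    return f;
}

/* Convert an array of BF16 weights to FP32 before computation. */
static void bf16_array_to_f32(const uint16_t *src, float *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = bf16_to_f32(src[i]);
    }
}
```

FP16 uses a different exponent width, so it needs a separate conversion routine that is not shown here.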

Building

Prerequisites

  • Windows 10/11
  • MSVC or MinGW-w64 compiler
  • OpenBLAS (optional, for faster matrix operations)

CMake Build

mkdir build
cd build
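cmake .. -G "Visual Studio 17 2022"
cmake --build . --config Release

For MinGW-w64, configure with -G "MinGW Makefiles" -DCMAKE_BUILD_TYPE=Release instead; single-configuration generators ignore --config.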

File Structure

klein.c/
├── main_cpu.c          # Entry point + GUI implementation
├── klein_cpu.h         # Header with all API definitions
├── klein_cpu.c         # Implementation of all components
├── CMakeLists.txt      # CMake build configuration
└── README.md           # This file

Comparison with iris.c

| Feature      | iris.c                 | klein.c (klein_cpu)  |
|--------------|------------------------|----------------------|
| Platform     | macOS/Linux            | Windows              |
| GPU          | Metal (Apple Silicon)  | CPU only             |
| Dependencies | Optional BLAS          | OpenBLAS (optional)  |
| GUI          | Terminal display       | Win32 native GUI     |
| Models       | Multiple FLUX variants | FLUX.2 Klein focused |
| Memory       | mmap support           | Sequential loading   |

Credits

License

MIT License


This project is derived from iris.c which is also MIT licensed. See the original iris.c repository for more details.
