Comparing changes

…ecode

New kernels:
- z80_focused.cu: sequential focused brute-force with per-target minimal op pools
  Targets: sqr_hi(±29), cbrt(±16), sin_q1(±68), gamma, smoothstep, antilog
  Features: --start/--end for GPU assignment, --gpu, auto depth limit
- cpu_focused_i3.c: CPU 4-thread version for AMD i3 (no CUDA)
- vulkan_graydec.c + graydec_search.comp: Vulkan compute gray_decode solver
  Found EXACT gray_decode in 13 ops, <1 second on RX 580!
- z80_graydec_mini.c: CPU reference solver for gray_decode
- focused_search.comp: generic Vulkan compute shader with op remapping

Branchless library (exhaustive verified):
- z80_branchless.c: ABS(6i,24T), MIN/MAX(8i,32T), CLAMP(16i,64T),
  CMOV CY?B:C (6i,24T), div3 EXACT (A×171>>9)
  All verified: ABS 256/256, MIN/MAX 65536/65536

Flag materialization (exhaustive brute-force):
- z80_flag_idioms.c: all flag↔register conversions, depth 6, 26 ops
  Key results: SBC A,A=CY→A(0xFF) 1i 4T, Z→CY PROVEN IMPOSSIBLE
  (Z flag is write-only on Z80: no ALU instruction reads it)
  Verdict: CY > Z for bool, 0xFF/0x00 > 0/1 representation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
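The "exhaustive verified" claims above can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the code in z80_branchless.c: the function names are hypothetical, `abs8` uses the generic XOR-and-subtract mask idiom rather than the commit's 6-instruction Z80 sequence, and `gray_decode` is the standard prefix-XOR definition of the function the solvers search for.

```python
def div3(a: int) -> int:
    """8-bit divide by 3 via multiply-and-shift, i.e. (A * 171) >> 9."""
    return (a * 171) >> 9


def sbc_a_a(carry: int) -> int:
    """Model of SBC A,A: A = A - A - CY, truncated to 8 bits."""
    return (0 - carry) & 0xFF


def abs8(a: int) -> int:
    """Branchless |a| for a signed 8-bit value stored as 0..255 (generic idiom)."""
    mask = -(a >> 7) & 0xFF          # 0xFF if the sign bit is set, else 0x00
    return ((a ^ mask) - mask) & 0xFF


def gray_decode(g: int) -> int:
    """Reference Gray-to-binary conversion for one byte (prefix XOR)."""
    g ^= g >> 4
    g ^= g >> 2
    g ^= g >> 1
    return g & 0xFF


def verify_all() -> dict:
    """Count how many of the 256 byte inputs each idiom handles correctly."""
    def signed(a: int) -> int:
        return a - 256 if a >= 128 else a

    return {
        "div3": sum(div3(a) == a // 3 for a in range(256)),
        "abs": sum(abs8(a) == (abs(signed(a)) & 0xFF) for a in range(256)),
        "gray": sum(gray_decode(b ^ (b >> 1)) == b for b in range(256)),
    }
```

Calling `verify_all()` checks every byte value, which is the same idea as the 256/256 figures in the commit; `sbc_a_a(1)` gives 0xFF and `sbc_a_a(0)` gives 0x00, the boolean representation the verdict prefers.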

z80_regalloc_batch.cu: exhaustive GPU register allocation evaluator

- Input: JSON batch of functions with interference graphs + constraints
- Output: optimal register assignment + cost per function
- Format matches VIR BuildGPUDesc: edges, ops, fixed, nVregs, nLocs
- Tested: abs_diff(5v,32K assignments), add_xy(3v), negate(2v)
- Scales to 8+ vregs (16M assignments) in seconds on GPU
- Ready for 500-function batch from MinZ compiler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
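The exhaustive evaluation described above can be sketched on the CPU as follows. This is an illustration only, not the CUDA kernel: `best_assignment` and its `cost` callback are hypothetical names, and the location count of 8 is an inference from the quoted sizes (8^5 = 32,768 and 8^8 = 16,777,216), not something the commit states.

```python
from itertools import product


def best_assignment(n_vregs, n_locs, edges, cost):
    """Try all n_locs**n_vregs assignments and keep the cheapest legal one.

    edges: pairs (u, v) of virtual registers that interfere.
    cost:  hypothetical callback mapping an assignment tuple to a number.
    Returns (best_cost, assignment), or None if no assignment is legal.
    """
    best = None
    for assign in product(range(n_locs), repeat=n_vregs):
        # Interfering vregs may not share a location.
        if any(assign[u] == assign[v] for u, v in edges):
            continue
        c = cost(assign)
        if best is None or c < best[0]:
            best = (c, assign)
    return best


# Toy run: 3 vregs in a chain, 8 locations, cost = sum of location indices.
toy = best_assignment(3, 8, [(0, 1), (1, 2)], sum)
```

A GPU version would give each thread one slice of the assignment index range; the sequential loop here only shows the search space and the interference check.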

data/z80_register_graph.json: comprehensive register connectivity data

- Move layer: all LD costs between 11 registers (A-L + IX/IY halves)
- Move tricks: multi-instruction paths (H→IXH via EX DE,HL = 16T)
- ALU layer: 8-bit ops (always through A), costs per operand register
- CB prefix: shift/rotate/bit ops on any register (8T reg, 15T (HL))
- 16-bit ALU: ADD/ADC/SBC HL,rr costs
- Swap operations: EX DE,HL (4T), EX AF,AF' (4T), EXX (4T)
- Dual accumulator patterns: parallel computation via EX AF,AF'
- Shadow bank 32-bit arithmetic: HLH'L' + DED'E' via EXX
- Cross-bank channels: A(0T), IX bridge(16T), stack(22T), memory(26T)

Shortest path analysis: 38% of register pairs at 4T, 40% at 8T, 22% at 16T.
ALU through A: natural=4T, via-A=12T, via-IX=20T.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…e costs

cmd/enrich-regalloc/: reads exhaustive_Nv.bin tables, enriches each feasible
assignment with ALU/INC/DEC costs from register graph model.

Key findings from 4v enrichment (123K feasible shapes):
- 45% of feasible shapes have NO accumulator (A) in assignment
  → ALU-infeasible without additional moves (early infeasibility detection!)
- 49% have natural ALU pair (4T, operand already in A)
- Worst vs best assignment: up to 8T per operation difference
- Average saving from optimal assignment: 2T per operation

Performance: 156K shapes in 0.5s, 17.4M shapes in 42s (pure Go, single core)

Operation patterns scored:
- alu_avg: average binary ALU cost across all variable pairs
- a_centric: cost when one var is in A (accumulator-heavy code)
- best/worst_alu_pair: range of ALU costs in assignment
- inc_dec_total: INC/DEC cost (works on any register)
- no_accumulator: flag for ALU-infeasible assignments

Next: binary output format, operation_bag signature, GPU batch scoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
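A few of the metrics listed above can be illustrated with a small sketch. It is not the Go tool: `pair_cost` and `enrich` are hypothetical names, only the natural (4T) and via-A (12T) costs quoted in this log are modelled, and the assumption is that a binary ALU operation counts as natural exactly when one of its two operands already sits in A.

```python
from itertools import combinations

NATURAL_T = 4   # one operand already in A
VIA_A_T = 12    # neither operand in A: assumed to need moves through A


def pair_cost(r1: str, r2: str) -> int:
    """Assumed cost of one binary ALU op between two assigned registers."""
    return NATURAL_T if "A" in (r1, r2) else VIA_A_T


def enrich(assignment):
    """Compute a subset of the listed metrics for one assignment (2+ vregs)."""
    pairs = [pair_cost(a, b) for a, b in combinations(assignment, 2)]
    return {
        "no_accumulator": "A" not in assignment,
        "best_alu_pair": min(pairs),
        "worst_alu_pair": max(pairs),
        "alu_avg": sum(pairs) / len(pairs),
    }


with_a = enrich(("A", "B", "C", "D"))
without_a = enrich(("B", "C", "D", "E"))
```

Under these assumptions the gap between `worst_alu_pair` and `best_alu_pair` is 8T, which is consistent with the "up to 8T per operation" finding.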

Width-dependent scoring:
- u16 ADD natural (HL+rr=11T) vs u16 via u8 decompose (24T)
- u16 SBC, MUL costs per register assignment
- u16_pair_count, u16_slots_free for pair pressure analysis

Idiom compatibility:
- mul8_safe: 16% of 4v shapes can call mul8 without save/restore
- mul8/mul16_conflicts: count of vregs in clobber zone {C,F,H,L}
- From our data: all 254 mul8 preserve A, all DE-safe

Shadow bank (EXX) enrichment:
- exx_alu_cost: 12T per op via shadow (same as via-A, but preserves A')
- exx_amortized: 8T/N ops when batching ALU in shadow bank

CALL overhead: 0-63T per call site (avg 34T for 4v shapes!)
- Counts PUSH/POP needed for caller-save registers
- Key optimization target: better regalloc → fewer PUSH/POP

DJNZ compatibility: 12% of shapes conflict with B (loop counter)

Key finding: only 9% of 4v shapes are "ideal" (A + HL + mul8-safe)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
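The idiom-compatibility flags above reduce to set membership tests on an assignment. A minimal sketch, assuming an assignment is a tuple of register names with one entry per vreg; `idiom_flags` is a hypothetical helper, and the clobber zone is the {C,F,H,L} set quoted in the commit.

```python
MUL8_CLOBBERS = {"C", "F", "H", "L"}   # clobber zone quoted in the commit


def idiom_flags(assignment):
    """Derive mul8 and DJNZ compatibility flags for one assignment."""
    # Count vregs (not distinct registers) that live in the clobber zone.
    conflicts = sum(1 for reg in assignment if reg in MUL8_CLOBBERS)
    return {
        "mul8_conflicts": conflicts,
        "mul8_safe": conflicts == 0,
        "djnz_conflict": "B" in assignment,   # B is the DJNZ loop counter
    }


safe_shape = idiom_flags(("A", "D", "E", "B"))
clobbered_shape = idiom_flags(("A", "C", "H", "L"))
```

Here `safe_shape` can call mul8 without save/restore but collides with a DJNZ loop, while `clobbered_shape` has three vregs in the clobber zone.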

…tion

CALL save/restore now uses cheapest available channel:
1. Free register: LD free,r (8T), cheapest, uses temp_regs_avail
2. EX AF,AF': swap A+F to shadow (8T), A-only
3. IX/IY halves: LD IXH,r (16T), up to 4 slots, no stack
4. PUSH/POP: classic (21T), fallback for remaining

Result: avg CALL overhead 17T (was 34T with naive PUSH/POP) = 50% reduction!
On 500 functions × ~2 CALLs = ~17,000T total savings.

New metrics: call_regs_to_save, call_free_saves (avg 2 free regs per shape)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
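The channel selection above can be sketched as a greedy pass over the registers that must survive a CALL. The costs are the round-trip figures from the list; everything else is an assumption for illustration: `call_save_plan` is a hypothetical name, cost is charged per register (real PUSH/POP works on pairs), and A is offered EX AF,AF' before a free register, since both cost 8T and this keeps free registers for the others.

```python
CHANNEL_COST = {"free_reg": 8, "ex_af": 8, "ix_half": 16, "push_pop": 21}


def call_save_plan(regs_to_save, free_regs, ix_slots=4, ex_af_available=True):
    """Assign each live register the cheapest remaining save channel."""
    plan = []
    for reg in regs_to_save:
        if reg == "A" and ex_af_available:
            ex_af_available = False      # EX AF,AF' can only hold A once
            channel = "ex_af"
        elif free_regs > 0:
            free_regs -= 1
            channel = "free_reg"
        elif ix_slots > 0:
            ix_slots -= 1
            channel = "ix_half"
        else:
            channel = "push_pop"         # classic fallback
        plan.append((reg, channel))
    total = sum(CHANNEL_COST[channel] for _, channel in plan)
    return plan, total


plan, total = call_save_plan(["A", "B", "C"], free_regs=1)
naive_total = 3 * CHANNEL_COST["push_pop"]
```

For this toy call site the plan costs 32T against 63T for three naive PUSH/POP saves; the 17T and 34T averages in the commit come from the full shape tables, not from this sketch.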

…e costs

Binary tables (data/*.enr.zst):
- enriched_4v.enr.zst (168K): 123K feasible shapes, 15 metrics each
- enriched_5v.enr.zst (22MB): 11.7M feasible shapes
- enriched_6v_dense.enr.zst (56MB): 25.7M feasible shapes
- Total: 37.6M shapes, 78MB compressed

Key findings:
- 43% of shapes lack A register → u8 ALU infeasible without moves
- 21% lack HL pair → u16 ADD infeasible naturally
- 7% mul8-safe (no clobber conflicts with mul8 {C,F,H,L})
- Smart CALL save strategy: avg 17T (vs 34T naive) = 50% reduction

Binary format (.enr): header + per-shape entries with:
- Original assignment + flags bitfield (5 bits)
- 12 cost metrics (uint16 each) in fixed order
- Reader examples in data/ENRICHED_TABLES.md

Usage: compilers, superoptimizers, decompilers, education
Signature: (interference_shape, operation_bag) → O(1) lookup

enrich-regalloc tool: -binary flag for .enr output
Computed in ~10 minutes total on single CPU core.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs/regalloc_deep_dive.md: 620 lines, 12 chapters with Mermaid diagrams

- Graph coloring basics → Z3 → GPU exhaustive → enriched tables
- Complete O(1) lookup architecture with signature system
- Comparison: SDCC vs GCC vs Z3 vs our approach
- Width-aware feasibility (u8/u16/u32 constraints)
- CALL save optimization (50% reduction)
- WFC (Wave Function Collapse) as next frontier
- Real-world data: Hobbit, demos, antique-toy listings
- SBC A,A trick appendix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs/regalloc_paper.md: 1574 lines, 15 chapters + 3 appendices
docs/regalloc_paper.pdf: 679KB with 12 rendered Mermaid diagrams (PNG)
docs/regalloc_paper.epub: 42KB
docs/build_paper.sh: build script (mmdc → PNG → pandoc → lualatex)

Comprehensive technical paper covering:
- Graph coloring from basics to exhaustive enumeration
- Z80 register architecture with complete cost model
- Treewidth analysis (99.5% of real graphs have tw≤3)
- GPU implementation (83.6M shapes, dual RTX 4060 Ti)
- Enrichment: 37.6M shapes × 15 operation-aware metrics
- O(1) lookup architecture replacing Z3 SAT solver
- Comparison: SDCC vs GCC vs Z3 vs our approach
- Real-world analysis: The Hobbit (1982), ZX demos
- SBC A,A branchless foundation + Z flag write-only proof
- Wave Function Collapse for future constraint propagation
- Shadow registers and 32-bit arithmetic via EXX

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cuda/z80_prng_search.cu: GPU brute-force for ZX Spectrum pRNG patterns

- 3 modes: CMWC (Patrik Rak), XORShift, CALL-chain (Hole 17 style)
- CALL-chain simulates RMDA's technique: CALL pushes address to screen-stack
- CPL trick for address clamping (maps $00-$5A → $FF-$A5)
- Fitness: Hamming distance, entropy, structure, symmetry
- 640M seeds/sec on RTX 4060 Ti, exhaustive 4-byte in 7 seconds

Inspired by .ded^RMDA (Maxim Muchkaev) Hole #17 enigma (LoveByte 2021)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
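Two of the ingredients above are easy to check by hand. The sketch below models CPL as an 8-bit complement, shows a plain bitwise Hamming distance as one possible fitness term, and redoes the throughput arithmetic. The function names are hypothetical and the kernel's actual fitness weighting is not shown in this log.

```python
def cpl(a: int) -> int:
    """Model of the Z80 CPL instruction: one's complement of an 8-bit value."""
    return ~a & 0xFF


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# CPL reverses the range: $00 maps to $FF and $5A maps to $A5.
mapped_ends = (cpl(0x00), cpl(0x5A))

# Exhaustive 4-byte seed search at the quoted 640M seeds/sec.
exhaustive_seconds = 2 ** 32 / 640e6
```

The last line evaluates to just under 7 seconds, matching the "exhaustive 4-byte in 7 seconds" figure.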

cuda/z80_partition_opt.cu: brute-force optimal graph partitioning

- Splits N-vreg interference graph into ≤6v subgraphs
- Minimizes total cost = sum(partition costs) + boundary move costs
- 3^N partition space: 7v=2K, 10v=59K, 14v=4.8M, all instant on GPU

Tested on VIR corpus (20 functions, 7-14 vregs):
- All find optimal partition in <1ms
- Cost range: 32-264T, avg 118T
- Largest: 14v → [2v + 6v + 6v], cost=260T

Part of 5-level regalloc pipeline:
  L0: cut vertex decomposition (87%, free)
  L1: enriched table O(1) lookup (79%, ≤6v)
  L2: EXX 2-coloring (70% bipartite)
  L3: GPU partition optimizer (this, 7-14v)
  L4: Z3 fallback (<0.5%)
Combined: 99%+ functions solved without Z3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
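The 3^N search above can be sketched as a loop over every labelling of the vregs with one of three group ids. This is a toy model, not the CUDA kernel: `best_partition` is a hypothetical name, the use of three groups is inferred from the 3^N figure, and the cost only counts cut edges times an assumed 4T move, whereas the commit also adds per-partition costs.

```python
from itertools import product


def best_partition(n, edges, max_size=6, move_cost=4, groups=3):
    """Enumerate all groups**n labellings; return (cost, labels) of the best.

    Toy cost: number of interference edges crossing groups * move_cost.
    Labellings with any group larger than max_size are rejected.
    """
    best = None
    for labels in product(range(groups), repeat=n):
        sizes = [labels.count(g) for g in range(groups)]
        if max(sizes) > max_size:
            continue
        cut = sum(1 for u, v in edges if labels[u] != labels[v])
        cost = cut * move_cost
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best


# 7 vregs in a chain: too big for one 6v table, so one edge must be cut.
chain = [(i, i + 1) for i in range(6)]
result = best_partition(7, chain)
```

For the 7-vreg chain the loop visits 3^7 = 2,187 labellings and settles on a 6v + 1v split with a single boundary move.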

Limits tested:
- 14v: 0.7s (4.8M combos), covers 100% of VIR corpus
- 16v: 7.7s (43M combos)
- 18v: 114s (387M combos), practical exhaustive limit
- 20v+: needs smart search (4^20 = 1.1T)

VIR corpus max = 14v → exhaustive partition covers ALL functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
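The combo counts above follow directly from the size of the labelling space. A short check, assuming the 14v-18v figures are 3^N and the 20v figure is 4^20 as quoted; `partition_space` is a hypothetical helper.

```python
def partition_space(n_vregs: int, groups: int = 3) -> int:
    """Number of ways to label n_vregs with one of `groups` partition ids."""
    return groups ** n_vregs


# Sizes for the tested limits, plus the 4-group space quoted for 20v.
sizes = {n: partition_space(n) for n in (14, 16, 18)}
space_20v_4groups = partition_space(20, groups=4)
```

Each two extra vregs multiply the 3-group space by 9, which is why the exhaustive approach stops being practical shortly after 18v.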

contexts/day3_wisdom.md: complete knowledge from day 3 session
- All proofs (Z flag write-only, branch > branchless, phase transition)
- Architecture decisions (5-level pipeline, VIR format, cost graph)
- Key numbers (37.6M shapes, 246 signatures, move=34%, mul=0%)
- File inventory (kernels, data, docs, tools)
- Running overnight tasks

contexts/day4_seed.md: prioritized next steps
- P1: harvest overnight results (19v, 20v partitions)
- P2: Go reader for enriched tables
- P3: assignmentPerPartition for VIR
- P4: corpus full evaluation
- P5-P8: peephole rules, pRNG search, RL(IX+N),R, 6502

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Updated with production compiler data:
- 820-function corpus analysis (246 unique signatures)
- Operation distribution: move=34%, mul=0%
- GPU partition: 19v optimal in 7min, 20v running
- 5-level pipeline: 91% O(1), 99%+ optimal
- SDCC 4.5 comparison (abs_diff +75%, mul3 +33%)
- New abstract reflecting corpus validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Chronicle updated: 862 lines, 11 chapters covering all 3 days
- Ch.11: Enrichment, gray_decode EXACT, Z flag proof, div3, corpus analysis
- RMDA + Introspec chunk rendering analysis
- "Register Allocation as a Solved Game" paper v2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Commits on Mar 28, 2026
