Comparing changes
base repository: oisee/z80-optimizer
base: v1.1.0
head repository: oisee/z80-optimizer
compare: v1.2.0
- 15 commits
- 27 files changed
- 2 contributors
Commits on Mar 28, 2026
Day 3: Focused search, branchless library, flag idioms, Vulkan gray_decode

New kernels:
- z80_focused.cu: sequential focused brute-force with per-target minimal op pools
  Targets: sqr_hi(±29), cbrt(±16), sin_q1(±68), gamma, smoothstep, antilog
  Features: --start/--end for GPU assignment, --gpu, auto depth limit
- cpu_focused_i3.c: CPU 4-thread version for AMD i3 (no CUDA)
- vulkan_graydec.c + graydec_search.comp: Vulkan compute gray_decode solver
  Found EXACT gray_decode in 13 ops, <1 second on RX 580!
- z80_graydec_mini.c: CPU reference solver for gray_decode
- focused_search.comp: generic Vulkan compute shader with op remapping

Branchless library (exhaustive verified):
- z80_branchless.c: ABS(6i,24T), MIN/MAX(8i,32T), CLAMP(16i,64T),
  CMOV CY?B:C (6i,24T), div3 EXACT (A×171>>9)
  All verified: ABS 256/256, MIN/MAX 65536/65536

Flag materialization (exhaustive brute-force):
- z80_flag_idioms.c: all flag↔register conversions, depth 6, 26 ops
  Key results: SBC A,A=CY→A(0xFF) 1i 4T, Z→CY PROVEN IMPOSSIBLE
  (Z flag is write-only on Z80: no ALU instruction reads it)
  Verdict: CY > Z for bool, 0xFF/0x00 > 0/1 representation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: d0a56ba

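The div3 and SBC A,A results from this commit can be checked in a few lines. The following is a minimal Python sketch, not the contents of z80_branchless.c: the function names are hypothetical, and the mask-based `cmov` is one common way to build a select from a 0xFF/0x00 mask, which may differ from the 6-instruction sequence the commit reports.

```python
def div3_mul_shift(a: int) -> int:
    """Divide an 8-bit value by 3 as (a * 171) >> 9, as stated in the commit."""
    return (a * 171) >> 9


def sbc_a_a(carry: int) -> int:
    """Model the value left in A by SBC A,A: A - A - CY, wrapped to 8 bits."""
    return (0 - carry) & 0xFF


def cmov(carry: int, b: int, c: int) -> int:
    """Branchless select CY ? b : c using the 0xFF/0x00 mask (assumed idiom)."""
    mask = sbc_a_a(carry)
    return (b & mask) | (c & ~mask & 0xFF)


def verify_div3() -> int:
    """Count how many of the 256 inputs agree with floor division by 3."""
    return sum(div3_mul_shift(a) == a // 3 for a in range(256))


print(verify_div3())        # 256
print(hex(sbc_a_a(1)))      # 0xff
print(hex(cmov(0, 0x12, 0x34)))  # 0x34
```

The multiply-shift is exact here because 171 × 3 = 513 = 512 + 1, so the error term stays below one third for every input under 512.
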
Add GPU batch register allocator for VIR codegen integration

z80_regalloc_batch.cu: exhaustive GPU register allocation evaluator
- Input: JSON batch of functions with interference graphs + constraints
- Output: optimal register assignment + cost per function
- Format matches VIR BuildGPUDesc: edges, ops, fixed, nVregs, nLocs
- Tested: abs_diff(5v,32K assignments), add_xy(3v), negate(2v)
- Scales to 8+ vregs (16M assignments) in seconds on GPU
- Ready for 500-function batch from MinZ compiler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 79d130e

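The exhaustive evaluation described in this commit can be illustrated on the CPU. This is a rough Python sketch under stated assumptions: `best_assignment`, the toy interference graph, and the cost function are all hypothetical, and the real kernel also handles `ops` and `fixed` constraints that are not modeled here. With 8 locations, 5 vregs give 8^5 = 32,768 candidates, which matches the "32K assignments" figure for abs_diff.

```python
from itertools import product


def best_assignment(n_vregs, n_locs, edges, cost):
    """Try all n_locs**n_vregs assignments; keep the cheapest conflict-free one."""
    best = None
    tried = 0
    for assign in product(range(n_locs), repeat=n_vregs):
        tried += 1
        # Interfering vregs may not share a location.
        if any(assign[u] == assign[v] for u, v in edges):
            continue
        c = cost(assign)
        if best is None or c < best[0]:
            best = (c, assign)
    return best, tried


# Hypothetical 3-vreg function: v0 interferes with v1 and with v2.
edges = [(0, 1), (0, 2)]


def toy_cost(assign):
    """Hypothetical cost: location 0 stands for A and is free, others cost 4T."""
    return sum(0 if loc == 0 else 4 for loc in assign)


best, tried = best_assignment(3, 8, edges, toy_cost)
print(best, tried)  # (4, (1, 0, 0)) 512
```
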
Add complete Z80 register operation graph with multi-layer costs

data/z80_register_graph.json: comprehensive register connectivity data
- Move layer: all LD costs between 11 registers (A-L + IX/IY halves)
- Move tricks: multi-instruction paths (H→IXH via EX DE,HL = 16T)
- ALU layer: 8-bit ops (always through A), costs per operand register
- CB prefix: shift/rotate/bit ops on any register (8T reg, 15T (HL))
- 16-bit ALU: ADD/ADC/SBC HL,rr costs
- Swap operations: EX DE,HL (4T), EX AF,AF' (4T), EXX (4T)
- Dual accumulator patterns: parallel computation via EX AF,AF'
- Shadow bank 32-bit arithmetic: HLH'L' + DED'E' via EXX
- Cross-bank channels: A(0T), IX bridge(16T), stack(22T), memory(26T)

Shortest path analysis: 38% of register pairs at 4T, 40% at 8T, 22% at 16T.
ALU through A: natural=4T, via-A=12T, via-IX=20T.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: cd0f290

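The three ALU tiers quoted above (4T, 12T, 20T) are consistent with a simple "move in, operate, move out" model. The sketch below is one plausible reading, not the schema of z80_register_graph.json: the names `LD_COST`, `ALU_OP`, and `alu_cost` are hypothetical, and the breakdown assumes 4T for a plain LD r,r', 8T for an LD involving an IX/IY half, and 4T for the ALU instruction itself.

```python
# T-states for a register-to-register LD, keyed by where the value lives.
# "main" = B, C, D, E, H, L; "ix" = IXH, IXL, IYH, IYL (assumed grouping).
LD_COST = {"main": 4, "ix": 8}
ALU_OP = 4  # T-states for an 8-bit ALU op such as ADD A,r


def alu_cost(dest: str) -> int:
    """Cost of `dest = dest op r` given where dest lives: 'A', 'main', or 'ix'."""
    if dest == "A":
        return ALU_OP  # natural: the accumulator is already the destination
    move = LD_COST[dest]
    return move + ALU_OP + move  # LD A,dest ; op ; LD dest,A


for where in ("A", "main", "ix"):
    print(where, alu_cost(where))  # A 4, main 12, ix 20
```
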
Add enrich-regalloc: post-process regalloc tables with operation-aware costs

cmd/enrich-regalloc/: reads exhaustive_Nv.bin tables, enriches each feasible
assignment with ALU/INC/DEC costs from register graph model.

Key findings from 4v enrichment (123K feasible shapes):
- 45% of feasible shapes have NO accumulator (A) in assignment
  → ALU-infeasible without additional moves (early infeasibility detection!)
- 49% have natural ALU pair (4T, operand already in A)
- Worst vs best assignment: up to 8T per operation difference
- Average saving from optimal assignment: 2T per operation

Performance: 156K shapes in 0.5s, 17.4M shapes in 42s (pure Go, single core)

Operation patterns scored:
- alu_avg: average binary ALU cost across all variable pairs
- a_centric: cost when one var is in A (accumulator-heavy code)
- best/worst_alu_pair: range of ALU costs in assignment
- inc_dec_total: INC/DEC cost (works on any register)
- no_accumulator: flag for ALU-infeasible assignments

Next: binary output format, operation_bag signature, GPU batch scoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 4659a55

Enrich regalloc v3: width-aware costs, mul/shadow/CALL/DJNZ analysis

Width-dependent scoring:
- u16 ADD natural (HL+rr=11T) vs u16 via u8 decompose (24T)
- u16 SBC, MUL costs per register assignment
- u16_pair_count, u16_slots_free for pair pressure analysis

Idiom compatibility:
- mul8_safe: 16% of 4v shapes can call mul8 without save/restore
- mul8/mul16_conflicts: count of vregs in clobber zone {C,F,H,L}
- From our data: all 254 mul8 preserve A, all DE-safe

Shadow bank (EXX) enrichment:
- exx_alu_cost: 12T per op via shadow (same as via-A, but preserves A')
- exx_amortized: 8T/N ops when batching ALU in shadow bank

CALL overhead: 0-63T per call site (avg 34T for 4v shapes!)
- Counts PUSH/POP needed for caller-save registers
- Key optimization target: better regalloc → fewer PUSH/POP

DJNZ compatibility: 12% of shapes conflict with B (loop counter)

Key finding: only 9% of 4v shapes are "ideal" (A + HL + mul8-safe)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 1cdc94b

Enrich regalloc v4: smart CALL save strategy, 50% call overhead reduction

CALL save/restore now uses cheapest available channel:
1. Free register: LD free,r (8T), cheapest, uses temp_regs_avail
2. EX AF,AF': swap A+F to shadow (8T), A-only
3. IX/IY halves: LD IXH,r (16T), up to 4 slots, no stack
4. PUSH/POP: classic (21T), fallback for remaining

Result: avg CALL overhead 17T (was 34T with naive PUSH/POP) = 50% reduction!
On 500 functions × ~2 CALLs = ~17,000T total savings.

New metrics: call_regs_to_save, call_free_saves (avg 2 free regs per shape)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 0751490

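The channel ordering in this commit amounts to a greedy choice per saved register. The Python sketch below is an illustration only: `call_save_cost` and `naive_cost` are hypothetical names, the per-channel T-state figures are the round-trip costs quoted in the commit message, and sending A through EX AF,AF' before spending a free register is an assumption made here (both cost 8T) so that free registers stay available for the others.

```python
def call_save_cost(regs, free_regs=0, ix_slots=4):
    """Greedy save/restore cost in T-states, picking the cheapest open channel."""
    total = 0
    af_used = False
    for r in regs:
        if r == "A" and not af_used:
            total += 8          # EX AF,AF' before and after the CALL
            af_used = True
        elif free_regs > 0:
            total += 8          # LD free,r and LD r,free
            free_regs -= 1
        elif ix_slots > 0:
            total += 16         # LD IXH,r and LD r,IXH (or another IX/IY half)
            ix_slots -= 1
        else:
            total += 21         # PUSH + POP fallback
    return total


def naive_cost(regs):
    """Baseline from the previous version: PUSH/POP for every saved register."""
    return 21 * len(regs)


live = ["A", "B", "C"]
print(naive_cost(live), call_save_cost(live, free_regs=1))  # 63 32
```

The naive figure for three registers, 63T, lines up with the upper end of the 0-63T range reported in the v3 commit.
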
Enriched register allocation tables: 37.6M shapes with operation-aware costs

Binary tables (data/*.enr.zst):
- enriched_4v.enr.zst (168K): 123K feasible shapes, 15 metrics each
- enriched_5v.enr.zst (22MB): 11.7M feasible shapes
- enriched_6v_dense.enr.zst (56MB): 25.7M feasible shapes
- Total: 37.6M shapes, 78MB compressed

Key findings:
- 43% of shapes lack A register → u8 ALU infeasible without moves
- 21% lack HL pair → u16 ADD infeasible naturally
- 7% mul8-safe (no clobber conflicts with mul8 {C,F,H,L})
- Smart CALL save strategy: avg 17T (vs 34T naive) = 50% reduction

Binary format (.enr): header + per-shape entries with:
- Original assignment + flags bitfield (5 bits)
- 12 cost metrics (uint16 each) in fixed order
- Reader examples in data/ENRICHED_TABLES.md

Usage: compilers, superoptimizers, decompilers, education
Signature: (interference_shape, operation_bag) → O(1) lookup
enrich-regalloc tool: -binary flag for .enr output
Computed in ~10 minutes total on single CPU core.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: bb62fb5

Add deep dive article: Register Allocation as a Solved Game

docs/regalloc_deep_dive.md: 620 lines, 12 chapters with Mermaid diagrams
- Graph coloring basics → Z3 → GPU exhaustive → enriched tables
- Complete O(1) lookup architecture with signature system
- Comparison: SDCC vs GCC vs Z3 vs our approach
- Width-aware feasibility (u8/u16/u32 constraints)
- CALL save optimization (50% reduction)
- WFC (Wave Function Collapse) as next frontier
- Real-world data: Hobbit, demos, antique-toy listings
- SBC A,A trick appendix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: f13851c

Register Allocation as a Solved Game — full paper with PDF/EPUB

docs/regalloc_paper.md: 1574 lines, 15 chapters + 3 appendices
docs/regalloc_paper.pdf: 679KB with 12 rendered Mermaid diagrams (PNG)
docs/regalloc_paper.epub: 42KB
docs/build_paper.sh: build script (mmdc → PNG → pandoc → lualatex)

Comprehensive technical paper covering:
- Graph coloring from basics to exhaustive enumeration
- Z80 register architecture with complete cost model
- Treewidth analysis (99.5% of real graphs have tw≤3)
- GPU implementation (83.6M shapes, dual RTX 4060 Ti)
- Enrichment: 37.6M shapes × 15 operation-aware metrics
- O(1) lookup architecture replacing Z3 SMT solver
- Comparison: SDCC vs GCC vs Z3 vs our approach
- Real-world analysis: The Hobbit (1982), ZX demos
- SBC A,A branchless foundation + Z flag write-only proof
- Wave Function Collapse for future constraint propagation
- Shadow registers and 32-bit arithmetic via EXX

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 532ddbb

Add pRNG SEED search + CALL-chain simulation for demoscene

cuda/z80_prng_search.cu: GPU brute-force for ZX Spectrum pRNG patterns
- 3 modes: CMWC (Patrik Rak), XORShift, CALL-chain (Hole 17 style)
- CALL-chain simulates RMDA's technique: CALL pushes address to screen-stack
- CPL trick for address clamping (maps $00-$5A → $FF-$A5)
- Fitness: Hamming distance, entropy, structure, symmetry
- 640M seeds/sec on RTX 4060 Ti, exhaustive 4-byte in 7 seconds

Inspired by .ded^RMDA (Maxim Muchkaev) Hole #17 enigma (LoveByte 2021)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 5fb7d1c

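A seed search with a Hamming-distance fitness, as listed in this commit, can be sketched at small scale in Python. Everything named here is hypothetical: the commit does not say which XORShift variant the kernel uses, so the 16-bit (7, 9, 8) shift triple is chosen purely for illustration, and the search space is 16-bit rather than the 4-byte space the GPU covers. The `cpl` helper models the complement used for address clamping.

```python
def xorshift16(state: int) -> int:
    """One step of a 16-bit xorshift with shifts (7, 9, 8); variant is assumed."""
    state ^= (state << 7) & 0xFFFF
    state ^= state >> 9
    state ^= (state << 8) & 0xFFFF
    return state


def stream(seed: int, n: int) -> list:
    """First n successive 16-bit states produced from a seed."""
    out = []
    s = seed
    for _ in range(n):
        s = xorshift16(s)
        out.append(s)
    return out


def hamming(a, b) -> int:
    """Total number of differing bits between two equal-length word lists."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def search(target) -> int:
    """Exhaustively try every non-zero 16-bit seed; return the closest one."""
    n = len(target)
    return min(range(1, 0x10000), key=lambda seed: hamming(stream(seed, n), target))


def cpl(x: int) -> int:
    """Model Z80 CPL on an 8-bit value: bitwise complement."""
    return x ^ 0xFF


target = stream(0x1234, 4)
print(hex(search(target)))            # 0x1234
print(hex(cpl(0x00)), hex(cpl(0x5A)))  # 0xff 0xa5
```

Because each xorshift step is invertible, the first output already pins down the seed, so the distance-0 match is unique in this toy setup.
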
Add GPU partition optimizer for 7-14v interference graphs (Level 3)

cuda/z80_partition_opt.cu: brute-force optimal graph partitioning
- Splits N-vreg interference graph into ≤6v subgraphs
- Minimizes total cost = sum(partition costs) + boundary move costs
- 3^N partition space: 7v=2K, 10v=59K, 14v=4.8M (all instant on GPU)

Tested on VIR corpus (20 functions, 7-14 vregs):
- All find optimal partition in <1ms
- Cost range: 32-264T, avg 118T
- Largest: 14v → [2v + 6v + 6v], cost=260T

Part of 5-level regalloc pipeline:
  L0: cut vertex decomposition (87%, free)
  L1: enriched table O(1) lookup (79%, ≤6v)
  L2: EXX 2-coloring (70% bipartite)
  L3: GPU partition optimizer (this, 7-14v)
  L4: Z3 fallback (<0.5%)
Combined: 99%+ functions solved without Z3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 2a7b4bb

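The objective in this commit, sum of partition costs plus boundary move costs over a 3^N label space, can be shown with a small brute-force loop. This Python sketch is not the CUDA kernel: `best_partition` is a hypothetical name, the per-partition cost (size squared) stands in for the enriched-table lookup, and the 4T boundary cost per cut edge is an assumed placeholder.

```python
from itertools import product


def best_partition(n, edges, parts=3, max_size=6, move_cost=4):
    """Label each vreg with one of `parts` partitions; return the cheapest split."""
    best = None
    tried = 0
    for labels in product(range(parts), repeat=n):
        tried += 1
        sizes = [labels.count(p) for p in range(parts)]
        if max(sizes) > max_size:
            continue  # a subgraph above max_size cannot use the lookup tables
        inner = sum(s * s for s in sizes)  # hypothetical per-partition cost
        boundary = move_cost * sum(labels[u] != labels[v] for u, v in edges)
        total = inner + boundary
        if best is None or total < best[0]:
            best = (total, labels)
    return best, tried


# Hypothetical 7-vreg interference graph shaped as a simple path.
edges = [(i, i + 1) for i in range(6)]
best, tried = best_partition(7, edges)
print(best[0], tried)  # 25 2187
```

For this toy input the optimum is a contiguous 3 + 2 + 2 split: 9 + 4 + 4 for the partitions plus two cut edges at 4T each.
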
Expand partition optimizer to 24v/4 partitions, benchmark results

Limits tested:
- 14v: 0.7s (4.8M combos), covers 100% of VIR corpus
- 16v: 7.7s (43M combos)
- 18v: 114s (387M combos), practical exhaustive limit
- 20v+: needs smart search (4^20 = 1.1T)

VIR corpus max = 14v → exhaustive partition covers ALL functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 09d76a8

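The combination counts in these benchmarks follow directly from parts^N. The short Python sketch below reproduces them and derives a throughput figure from the reported timings; `space_size` and `combos_per_second` are hypothetical helper names, and the rate is a back-of-envelope number computed from the commit's own figures, not a separate measurement.

```python
def space_size(n_vregs: int, n_parts: int) -> int:
    """Number of ways to label n_vregs vertices with n_parts partition ids."""
    return n_parts ** n_vregs


def combos_per_second(n_vregs: int, n_parts: int, seconds: float) -> float:
    """Throughput implied by a reported wall-clock time."""
    return space_size(n_vregs, n_parts) / seconds


for n in (14, 16, 18):
    print(n, space_size(n, 3))
print(20, space_size(20, 4))

# Rate implied by the 18v run (387M combos in 114 s), in combos per second.
print(round(combos_per_second(18, 3, 114.0)))
```
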
Add session wisdom dump + day 4 seed

contexts/day3_wisdom.md: complete knowledge from day 3 session
- All proofs (Z flag write-only, branch > branchless, phase transition)
- Architecture decisions (5-level pipeline, VIR format, cost graph)
- Key numbers (37.6M shapes, 246 signatures, move=34%, mul=0%)
- File inventory (kernels, data, docs, tools)
- Running overnight tasks

contexts/day4_seed.md: prioritized next steps
- P1: harvest overnight results (19v, 20v partitions)
- P2: Go reader for enriched tables
- P3: assignmentPerPartition for VIR
- P4: corpus full evaluation
- P5-P8: peephole rules, pRNG search, RL(IX+N),R, 6502

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 1d18cb3

Paper v2: add corpus validation, partition results, SDCC comparison

Updated with production compiler data:
- 820-function corpus analysis (246 unique signatures)
- Operation distribution: move=34%, mul=0%
- GPU partition: 19v optimal in 7min, 20v running
- 5-level pipeline: 91% O(1), 99%+ optimal
- SDCC 4.5 comparison (abs_diff +75%, mul3 +33%)
- New abstract reflecting corpus validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: ab84508

Day 3 chronicle + v1.2.0 release prep

Chronicle updated: 862 lines, 11 chapters covering all 3 days
- Ch.11: Enrichment, gray_decode EXACT, Z flag proof, div3, corpus analysis
- RMDA + Introspec chunk rendering analysis
- "Register Allocation as a Solved Game" paper v2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit: 5e1cf35

You can try running this command locally to see the comparison on your machine:
git diff v1.1.0...v1.2.0