Comparing changes

base repository: oisee/z80-optimizer
base: v1.1.0
head repository: oisee/z80-optimizer
compare: v1.2.0
  • 15 commits
  • 27 files changed
  • 2 contributors

Commits on Mar 28, 2026

  1. Day 3: Focused search, branchless library, flag idioms, Vulkan gray_decode
    
    New kernels:
    - z80_focused.cu — sequential focused brute-force with per-target minimal op pools
      Targets: sqr_hi(±29), cbrt(±16), sin_q1(±68), gamma, smoothstep, antilog
      Features: --start/--end for GPU assignment, --gpu, auto depth limit
    - cpu_focused_i3.c — CPU 4-thread version for AMD i3 (no CUDA)
    - vulkan_graydec.c + graydec_search.comp — Vulkan compute gray_decode solver
      Found EXACT gray_decode in 13 ops, <1 second on RX 580!
    - z80_graydec_mini.c — CPU reference solver for gray_decode
    - focused_search.comp — generic Vulkan compute shader with op remapping
    
    Branchless library (exhaustive verified):
    - z80_branchless.c — ABS(6i,24T), MIN/MAX(8i,32T), CLAMP(16i,64T)
      CMOV CY?B:C (6i,24T), div3 EXACT (A×171>>9)
      All verified: ABS 256/256, MIN/MAX 65536/65536
    
    Flag materialization (exhaustive brute-force):
    - z80_flag_idioms.c — all flag↔register conversions, depth 6, 26 ops
      Key results: SBC A,A=CY→A(0xFF) 1i 4T, Z→CY PROVEN IMPOSSIBLE
      (Z flag is write-only on Z80 — no ALU instruction reads it)
      Verdict: CY > Z for bool, 0xFF/0x00 > 0/1 representation
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    d0a56ba
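The branchless and flag idioms listed in this commit can be modelled in plain C. The sketch below is a host-side check, not the contents of z80_branchless.c: the function names are hypothetical, and the commit gives instruction counts but not the instruction sequences, so only the arithmetic is reproduced. It shows the `A×171>>9` division by 3, the `SBC A,A` carry-to-mask conversion, and ABS/MIN/select built on that mask, each checked over its full input range.

```c
#include <assert.h>
#include <stdint.h>

/* floor(a / 3) as multiply-and-shift; exact for every 8-bit input. */
static uint8_t div3(uint8_t a) {
    return (uint8_t)(((unsigned)a * 171u) >> 9);
}

/* Models SBC A,A: A - A - CY gives 0xFF when carry is set, 0x00 otherwise. */
static uint8_t sbc_a_a(int carry) {
    return (uint8_t)(0u - (unsigned)(carry != 0));
}

/* CMOV CY ? b : c, built from the carry mask instead of a branch. */
static uint8_t select_cy(int carry, uint8_t b, uint8_t c) {
    uint8_t mask = sbc_a_a(carry);
    return (uint8_t)((b & mask) | (c & (uint8_t)~mask));
}

/* Two's-complement ABS: the sign bit becomes the carry, then (a ^ mask) - mask. */
static uint8_t abs_u8(uint8_t a) {
    uint8_t mask = sbc_a_a(a >> 7);
    return (uint8_t)((a ^ mask) - mask);
}

/* Unsigned MIN: a compare leaves the carry set exactly when a < b. */
static uint8_t min_u8(uint8_t a, uint8_t b) {
    return select_cy(a < b, a, b);
}

/* Exhaustive check, mirroring the 256/256 and 65536/65536 counts. */
static void self_check(void) {
    for (unsigned a = 0; a < 256; a++) {
        int v = a < 128 ? (int)a : (int)a - 256;   /* signed view of a */
        assert(div3((uint8_t)a) == (uint8_t)(a / 3));
        assert(abs_u8((uint8_t)a) == (uint8_t)(v < 0 ? -v : v));
        for (unsigned b = 0; b < 256; b++)
            assert(min_u8((uint8_t)a, (uint8_t)b) == (uint8_t)(a < b ? a : b));
    }
}
```

Calling `self_check()` from a test harness runs all 256 single-input and 65536 two-input cases. The 0xFF/0x00 mask is what makes the AND/OR select work, which is consistent with the commit's preference for that representation over 0/1.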
  2. Add GPU batch register allocator for VIR codegen integration

    z80_regalloc_batch.cu: exhaustive GPU register allocation evaluator
    - Input: JSON batch of functions with interference graphs + constraints
    - Output: optimal register assignment + cost per function
    - Format matches VIR BuildGPUDesc: edges, ops, fixed, nVregs, nLocs
    - Tested: abs_diff(5v,32K assignments), add_xy(3v), negate(2v)
    - Scales to 8+ vregs (16M assignments) in seconds on GPU
    - Ready for 500-function batch from MinZ compiler
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    79d130e
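The exhaustive evaluation described in this commit can be illustrated on the CPU. The sketch below is not the CUDA kernel: the `Func` layout, the 8-location assumption (chosen because 8^5 = 32768 matches "5v, 32K assignments"), and the cost function are all hypothetical stand-ins. It enumerates every assignment, rejects those that violate an interference edge or a fixed constraint, and keeps the cheapest.

```c
#include <assert.h>
#include <limits.h>

#define MAX_VREGS 8
#define N_LOCS 8            /* assumption: 8 candidate locations per vreg */

typedef struct {
    int n_vregs;
    int n_edges;
    int edges[16][2];       /* interference: the two vregs need different locations */
    int fixed[MAX_VREGS];   /* required location, or -1 if unconstrained */
} Func;

/* Hypothetical cost model: location 0 is free, every other location costs 4T. */
static int loc_cost(int loc) { return loc == 0 ? 0 : 4; }

/* Returns the minimum cost (INT_MAX if infeasible) and writes the best assignment. */
static int best_assignment(const Func *f, int best[MAX_VREGS], long *tried) {
    int assign[MAX_VREGS] = {0};
    int best_cost = INT_MAX;
    long total = 1;
    for (int i = 0; i < f->n_vregs; i++) total *= N_LOCS;

    for (long code = 0; code < total; code++) {
        long c = code;
        int ok = 1, cost = 0;
        /* Decode the counter as one base-N_LOCS digit per vreg. */
        for (int v = 0; v < f->n_vregs; v++) {
            assign[v] = (int)(c % N_LOCS);
            c /= N_LOCS;
            if (f->fixed[v] >= 0 && assign[v] != f->fixed[v]) ok = 0;
            cost += loc_cost(assign[v]);
        }
        for (int e = 0; e < f->n_edges && ok; e++)
            if (assign[f->edges[e][0]] == assign[f->edges[e][1]]) ok = 0;
        if (ok && cost < best_cost) {
            best_cost = cost;
            for (int v = 0; v < f->n_vregs; v++) best[v] = assign[v];
        }
    }
    *tried = total;
    return best_cost;
}
```

Each assignment is evaluated independently of the others, which is presumably why the search maps well onto one GPU thread per assignment index.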
  3. Add complete Z80 register operation graph with multi-layer costs

    data/z80_register_graph.json: comprehensive register connectivity data
    - Move layer: all LD costs between 11 registers (A-L + IX/IY halves)
    - Move tricks: multi-instruction paths (H→IXH via EX DE,HL = 16T)
    - ALU layer: 8-bit ops (always through A), costs per operand register
    - CB prefix: shift/rotate/bit ops on any register (8T reg, 15T (HL))
    - 16-bit ALU: ADD/ADC/SBC HL,rr costs
    - Swap operations: EX DE,HL (4T), EX AF,AF' (4T), EXX (4T)
    - Dual accumulator patterns: parallel computation via EX AF,AF'
    - Shadow bank 32-bit arithmetic: HLH'L' + DED'E' via EXX
    - Cross-bank channels: A(0T), IX bridge(16T), stack(22T), memory(26T)
    
    Shortest path analysis: 38% of register pairs at 4T, 40% at 8T, 22% at 16T.
    ALU through A: natural=4T, via-A=12T, via-IX=20T.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    cd0f290
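The 38% / 40% / 22% split can be reproduced with a small classification of the 11 registers. This is a sketch under an assumed cost model, not a reader for z80_register_graph.json: plain-to-plain `LD` costs 4T, a move between an index half and A/B/C/D/E (or within the same index register) costs 8T, and the remaining pairs (H or L against an index half, or IX against IY) cost 16T. The enum and function names are hypothetical.

```c
#include <assert.h>

enum Reg { R_A, R_B, R_C, R_D, R_E, R_H, R_L, R_IXH, R_IXL, R_IYH, R_IYL, N_REGS };

/* 0 = plain register, 1 = half of IX, 2 = half of IY. */
static int index_reg(int r) {
    if (r == R_IXH || r == R_IXL) return 1;
    if (r == R_IYH || r == R_IYL) return 2;
    return 0;
}

/* Assumed move cost in T-states for copying src into dst. */
static int move_cost(int dst, int src) {
    int di = index_reg(dst), si = index_reg(src);
    if (di == 0 && si == 0) return 4;                  /* LD r,r' */
    if (di != 0 && si != 0) return di == si ? 8 : 16;  /* same index reg, or IX<->IY */
    int plain = (di == 0) ? dst : src;
    /* Under a DD/FD prefix, H and L encode the index halves, so no direct LD exists. */
    return (plain == R_H || plain == R_L) ? 16 : 8;
}

/* counts[0], counts[1], counts[2] = ordered pairs at 4T, 8T, 16T. */
static void count_by_cost(int counts[3]) {
    counts[0] = counts[1] = counts[2] = 0;
    for (int d = 0; d < N_REGS; d++)
        for (int s = 0; s < N_REGS; s++) {
            if (d == s) continue;
            int t = move_cost(d, s);
            counts[t == 4 ? 0 : t == 8 ? 1 : 2]++;
        }
}
```

Under this model the 110 ordered pairs split into 42 at 4T, 44 at 8T, and 24 at 16T, that is 38%, 40%, and 22%, in line with the figures quoted in the commit.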
  4. Add enrich-regalloc: post-process regalloc tables with operation-aware costs
    
    cmd/enrich-regalloc/: reads exhaustive_Nv.bin tables, enriches each
    feasible assignment with ALU/INC/DEC costs from register graph model.
    
    Key findings from 4v enrichment (123K feasible shapes):
    - 45% of feasible shapes have NO accumulator (A) in assignment
      → ALU-infeasible without additional moves (early infeasibility detection!)
    - 49% have natural ALU pair (4T, operand already in A)
    - Worst vs best assignment: up to 8T per operation difference
    - Average saving from optimal assignment: 2T per operation
    
    Performance: 156K shapes in 0.5s, 17.4M shapes in 42s (pure Go, single core)
    
    Operation patterns scored:
    - alu_avg: average binary ALU cost across all variable pairs
    - a_centric: cost when one var is in A (accumulator-heavy code)
    - best/worst_alu_pair: range of ALU costs in assignment
    - inc_dec_total: INC/DEC cost (works on any register)
    - no_accumulator: flag for ALU-infeasible assignments
    
    Next: binary output format, operation_bag signature, GPU batch scoring
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    4659a55
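The ALU scoring can be sketched from the three costs quoted in the previous commit (natural=4T, via-A=12T, via-IX=20T). The tool itself is written in Go; the C below is only an illustration, and the decomposition is an assumption: a destination already in A costs one 4T ALU instruction, a plain register adds a 4T load and a 4T store around it, and an index half uses 8T loads and stores instead. The second operand is assumed to sit in a plain register. All names are hypothetical.

```c
#include <assert.h>

enum Loc { LOC_A, LOC_B, LOC_C, LOC_D, LOC_E, LOC_H, LOC_L,
           LOC_IXH, LOC_IXL, LOC_IYH, LOC_IYL };

typedef struct {
    int best_alu;        /* cheapest destination in the assignment */
    int worst_alu;       /* most expensive destination */
    int no_accumulator;  /* 1 when no vreg is assigned to A */
} AluScore;

/* Assumed cost of "x = x op y" for an 8-bit ALU op that must run through A. */
static int alu_dest_cost(enum Loc x) {
    if (x == LOC_A) return 4;            /* natural: op A,y */
    if (x >= LOC_IXH) return 8 + 4 + 8;  /* LD A,IXH / op / LD IXH,A */
    return 4 + 4 + 4;                    /* LD A,x / op / LD x,A */
}

static AluScore score_assignment(const enum Loc *assign, int n) {
    AluScore s = { 1 << 30, 0, 1 };
    for (int i = 0; i < n; i++) {
        int c = alu_dest_cost(assign[i]);
        if (c < s.best_alu) s.best_alu = c;
        if (c > s.worst_alu) s.worst_alu = c;
        if (assign[i] == LOC_A) s.no_accumulator = 0;
    }
    return s;
}
```

With this model, the 8T gap between a natural and a via-A operation matches the "up to 8T per operation difference" reported above.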
  5. Enrich regalloc v3: width-aware costs, mul/shadow/CALL/DJNZ analysis

    Width-dependent scoring:
    - u16 ADD natural (HL+rr=11T) vs u16 via u8 decompose (24T)
    - u16 SBC, MUL costs per register assignment
    - u16_pair_count, u16_slots_free for pair pressure analysis
    
    Idiom compatibility:
    - mul8_safe: 16% of 4v shapes can call mul8 without save/restore
    - mul8/mul16_conflicts: count of vregs in clobber zone {C,F,H,L}
    - From our data: all 254 mul8 preserve A, all DE-safe
    
    Shadow bank (EXX) enrichment:
    - exx_alu_cost: 12T per op via shadow (same as via-A, but preserves A')
    - exx_amortized: 8T/N ops when batching ALU in shadow bank
    
    CALL overhead: 0-63T per call site (avg 34T for 4v shapes!)
    - Counts PUSH/POP needed for caller-save registers
    - Key optimization target: better regalloc → fewer PUSH/POP
    
    DJNZ compatibility: 12% of shapes conflict with B (loop counter)
    
    Key finding: only 9% of 4v shapes are "ideal" (A + HL + mul8-safe)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    1cdc94b
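The 11T versus 24T comparison for u16 ADD follows from instruction timings. The sketch below assumes the decomposed form is the usual six-instruction sequence (LD A,L / ADD A,E / LD L,A / LD A,H / ADC A,D / LD H,A, each 4T); the commit does not spell the sequence out, and the function names are hypothetical. The C model checks that two 8-bit adds chained through the carry give the same result as a 16-bit add.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit add with carry in and out: ADD A,r when carry_in is 0, ADC A,r otherwise. */
static uint8_t add8(uint8_t a, uint8_t b, int carry_in, int *carry_out) {
    unsigned sum = (unsigned)a + b + (unsigned)(carry_in != 0);
    *carry_out = sum > 0xFF;
    return (uint8_t)sum;
}

/* HL + DE computed as two byte-wide additions linked by the carry flag. */
static uint16_t add16_via_u8(uint16_t hl, uint16_t de) {
    int cy = 0;
    uint8_t lo = add8((uint8_t)hl, (uint8_t)de, 0, &cy);                 /* low bytes */
    uint8_t hi = add8((uint8_t)(hl >> 8), (uint8_t)(de >> 8), cy, &cy);  /* high bytes + CY */
    return (uint16_t)(((unsigned)hi << 8) | lo);
}

/* T-state totals under the assumed sequence: four LD r,r' plus two ALU ops. */
enum { T_LD_R_R = 4, T_ALU_A_R = 4, T_ADD_HL_RR = 11 };
static int decomposed_cost(void) { return 4 * T_LD_R_R + 2 * T_ALU_A_R; }
```

`decomposed_cost()` evaluates to 24 against `T_ADD_HL_RR` at 11, so an assignment that keeps a 16-bit value out of HL pays 13T extra per addition under these assumptions.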
  6. Enrich regalloc v4: smart CALL save strategy, 50% call overhead reduction
    
    CALL save/restore now uses cheapest available channel:
    1. Free register: LD free,r (8T) — cheapest, uses temp_regs_avail
    2. EX AF,AF': swap A+F to shadow (8T) — A-only
    3. IX/IY halves: LD IXH,r (16T) — up to 4 slots, no stack
    4. PUSH/POP: classic (21T) — fallback for remaining
    
    Result: avg CALL overhead 17T (was 34T with naive PUSH/POP) = 50% reduction!
    On 500 functions × ~2 CALLs = ~17,000T total savings.
    
    New metrics: call_regs_to_save, call_free_saves (avg 2 free regs per shape)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    0751490
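The channel selection above amounts to a greedy fill from cheapest to most expensive. The sketch below is one possible reading, not the tool's code: it treats each quoted cost as the save-plus-restore price of one register, uses `EX AF,AF'` for A before spending a free register (both cost 8T, so this never increases the total), and takes the struct fields as hypothetical inputs.

```c
#include <assert.h>

enum { T_FREE_REG = 8, T_EX_AF = 8, T_IX_HALF = 16, T_PUSH_POP = 21, IX_SLOTS = 4 };

typedef struct {
    int regs_to_save;  /* live registers that the callee may clobber */
    int a_is_live;     /* nonzero if A is one of them */
    int free_regs;     /* unused registers available as temporaries */
} CallSite;

/* Baseline: every live register goes through PUSH/POP. */
static int naive_cost(const CallSite *c) {
    return c->regs_to_save * T_PUSH_POP;
}

/* Greedy: fill the cheapest channels first, fall back to the stack. */
static int smart_cost(const CallSite *c) {
    int left = c->regs_to_save, cost = 0, n;
    if (c->a_is_live && left > 0) { cost += T_EX_AF; left--; }  /* A-only channel */
    n = left < c->free_regs ? left : c->free_regs;              /* LD free,r */
    cost += n * T_FREE_REG;  left -= n;
    n = left < IX_SLOTS ? left : IX_SLOTS;                      /* LD IXH,r etc. */
    cost += n * T_IX_HALF;   left -= n;
    return cost + left * T_PUSH_POP;                            /* classic fallback */
}
```

For a call site with three live registers including A and one free register, this gives 8 + 8 + 16 = 32T against 63T for the naive scheme, roughly the halving reported in the commit.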
  7. Enriched register allocation tables: 37.6M shapes with operation-aware costs
    
    Binary tables (data/*.enr.zst):
    - enriched_4v.enr.zst (168K): 123K feasible shapes, 15 metrics each
    - enriched_5v.enr.zst (22MB): 11.7M feasible shapes
    - enriched_6v_dense.enr.zst (56MB): 25.7M feasible shapes
    - Total: 37.6M shapes, 78MB compressed
    
    Key findings:
    - 43% of shapes lack A register → u8 ALU infeasible without moves
    - 21% lack HL pair → u16 ADD infeasible naturally
    - 7% mul8-safe (no clobber conflicts with mul8 {C,F,H,L})
    - Smart CALL save strategy: avg 17T (vs 34T naive) = 50% reduction
    
    Binary format (.enr): header + per-shape entries with:
    - Original assignment + flags bitfield (5 bits)
    - 12 cost metrics (uint16 each) in fixed order
    - Reader examples in data/ENRICHED_TABLES.md
    
    Usage: compilers, superoptimizers, decompilers, education
    Signature: (interference_shape, operation_bag) → O(1) lookup
    
    enrich-regalloc tool: -binary flag for .enr output
    Computed in ~10 minutes total on single CPU core.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    bb62fb5
  8. Add deep dive article: Register Allocation as a Solved Game

    docs/regalloc_deep_dive.md: 620 lines, 12 chapters with Mermaid diagrams
    - Graph coloring basics → Z3 → GPU exhaustive → enriched tables
    - Complete O(1) lookup architecture with signature system
    - Comparison: SDCC vs GCC vs Z3 vs our approach
    - Width-aware feasibility (u8/u16/u32 constraints)
    - CALL save optimization (50% reduction)
    - WFC (Wave Function Collapse) as next frontier
    - Real-world data: Hobbit, demos, antique-toy listings
    - SBC A,A trick appendix
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    f13851c
  9. Register Allocation as a Solved Game — full paper with PDF/EPUB

    docs/regalloc_paper.md: 1574 lines, 15 chapters + 3 appendices
    docs/regalloc_paper.pdf: 679KB with 12 rendered Mermaid diagrams (PNG)
    docs/regalloc_paper.epub: 42KB
    docs/build_paper.sh: build script (mmdc → PNG → pandoc → lualatex)
    
    Comprehensive technical paper covering:
    - Graph coloring from basics to exhaustive enumeration
    - Z80 register architecture with complete cost model
    - Treewidth analysis (99.5% of real graphs have tw≤3)
    - GPU implementation (83.6M shapes, dual RTX 4060 Ti)
    - Enrichment: 37.6M shapes × 15 operation-aware metrics
    - O(1) lookup architecture replacing Z3 SAT solver
    - Comparison: SDCC vs GCC vs Z3 vs our approach
    - Real-world analysis: The Hobbit (1982), ZX demos
    - SBC A,A branchless foundation + Z flag write-only proof
    - Wave Function Collapse for future constraint propagation
    - Shadow registers and 32-bit arithmetic via EXX
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    532ddbb
  10. Add pRNG SEED search + CALL-chain simulation for demoscene

    cuda/z80_prng_search.cu: GPU brute-force for ZX Spectrum pRNG patterns
    - 3 modes: CMWC (Patrik Rak), XORShift, CALL-chain (Hole 17 style)
    - CALL-chain simulates RMDA's technique: CALL pushes address to screen-stack
    - CPL trick for address clamping (maps $00-$5A → $FF-$A5)
    - Fitness: Hamming distance, entropy, structure, symmetry
    - 640M seeds/sec on RTX 4060 Ti, exhaustive 4-byte in 7 seconds
    
    Inspired by .ded^RMDA (Maxim Muchkaev) Hole #17 enigma (LoveByte 2021)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    5fb7d1c
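Two of the building blocks named in this commit are simple enough to check directly. The sketch below is not taken from z80_prng_search.cu; the function names are hypothetical. `cpl` models the Z80 `CPL` instruction (bitwise complement of A), which sends the byte range $00-$5A to $FF-$A5, and `hamming` is a plain bit-difference count of the kind a Hamming-distance fitness term would need.

```c
#include <assert.h>
#include <stdint.h>

/* CPL: complement every bit of the accumulator. */
static uint8_t cpl(uint8_t a) {
    return (uint8_t)~a;
}

/* Number of differing bits between two byte buffers of length n. */
static unsigned hamming(const uint8_t *x, const uint8_t *y, unsigned n) {
    unsigned d = 0;
    for (unsigned i = 0; i < n; i++) {
        uint8_t v = (uint8_t)(x[i] ^ y[i]);
        while (v) {          /* count set bits one at a time */
            d += v & 1u;
            v >>= 1;
        }
    }
    return d;
}

/* Returns 1 if every byte in $00-$5A lands in $A5-$FF after CPL. */
static int cpl_range_ok(void) {
    for (unsigned a = 0x00; a <= 0x5A; a++) {
        uint8_t r = cpl((uint8_t)a);
        if (r < 0xA5) return 0;
    }
    return 1;
}
```

Because the complement reverses order, $00 maps to $FF and $5A maps to $A5, which is the direction the commit message writes the range in.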
  11. Add GPU partition optimizer for 7-14v interference graphs (Level 3)

    cuda/z80_partition_opt.cu: brute-force optimal graph partitioning
    - Splits N-vreg interference graph into ≤6v subgraphs
    - Minimizes total cost = sum(partition costs) + boundary move costs
    - 3^N partition space: 7v=2K, 10v=59K, 14v=4.8M, all instant on GPU
    
    Tested on VIR corpus (20 functions, 7-14 vregs):
    - All find optimal partition in <1ms
    - Cost range: 32-264T, avg 118T
    - Largest: 14v → [2v + 6v + 6v], cost=260T
    
    Part of 5-level regalloc pipeline:
    L0: cut vertex decomposition (87%, free)
    L1: enriched table O(1) lookup (79%, ≤6v)
    L2: EXX 2-coloring (70% bipartite)
    L3: GPU partition optimizer (this, 7-14v)
    L4: Z3 fallback (<0.5%)
    
    Combined: 99%+ functions solved without Z3.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    2a7b4bb
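The 3^N search space comes from labelling each vreg with one of three partitions. The sketch below is a CPU illustration, not the CUDA kernel: it keeps only the boundary term of the cost (each interference edge that crosses partitions costs a hypothetical `move_t` T-states) and leaves out the per-partition costs, which would come from the enriched tables. Function names and the cost parameter are assumptions.

```c
#include <assert.h>
#include <limits.h>

#define K 3          /* partitions, giving the 3^N search space */
#define MAX_PART 6   /* each subgraph must fit the <=6v tables */
#define MAX_N 24

static long long pow_k(int n) {
    long long p = 1;
    while (n-- > 0) p *= K;
    return p;
}

/* Returns the minimum boundary cost and writes the best labelling.
   Intended for small n; the loop visits all K^n labellings. */
static int best_partition(int n, int n_edges, const int edges[][2],
                          int move_t, int labels_out[MAX_N]) {
    int best = INT_MAX;
    long long total = pow_k(n);
    for (long long code = 0; code < total; code++) {
        int label[MAX_N], size[K] = {0}, ok = 1, cost = 0;
        long long c = code;
        for (int v = 0; v < n; v++) {           /* one base-K digit per vreg */
            label[v] = (int)(c % K);
            c /= K;
            if (++size[label[v]] > MAX_PART) ok = 0;
        }
        if (!ok) continue;                      /* a part is too large */
        for (int e = 0; e < n_edges; e++)
            if (label[edges[e][0]] != label[edges[e][1]]) cost += move_t;
        if (cost < best) {
            best = cost;
            for (int v = 0; v < n; v++) labels_out[v] = label[v];
        }
    }
    return best;
}
```

`pow_k` gives 2187, 59049, and 4782969 for 7, 10, and 14 vregs. For a 7-vreg path graph the size limit forces at least one cut, so the minimum boundary cost is a single `move_t`.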
  12. Expand partition optimizer to 24v/4 partitions, benchmark results

    Limits tested:
    - 14v: 0.7s (4.8M combos) — covers 100% of VIR corpus
    - 16v: 7.7s (43M combos)
    - 18v: 114s (387M combos) — practical exhaustive limit
    - 20v+: needs smart search (4^20 = 1.1T)
    
    VIR corpus max = 14v → exhaustive partition covers ALL functions.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    09d76a8
  13. Add session wisdom dump + day 4 seed

    contexts/day3_wisdom.md: complete knowledge from day 3 session
    - All proofs (Z flag write-only, branch > branchless, phase transition)
    - Architecture decisions (5-level pipeline, VIR format, cost graph)
    - Key numbers (37.6M shapes, 246 signatures, move=34%, mul=0%)
    - File inventory (kernels, data, docs, tools)
    - Running overnight tasks
    
    contexts/day4_seed.md: prioritized next steps
    - P1: harvest overnight results (19v, 20v partitions)
    - P2: Go reader for enriched tables
    - P3: assignmentPerPartition for VIR
    - P4: corpus full evaluation
    - P5-P8: peephole rules, pRNG search, RL(IX+N),R, 6502
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    1d18cb3
  14. Paper v2: add corpus validation, partition results, SDCC comparison

    Updated with production compiler data:
    - 820-function corpus analysis (246 unique signatures)
    - Operation distribution: move=34%, mul=0%
    - GPU partition: 19v optimal in 7min, 20v running
    - 5-level pipeline: 91% O(1), 99%+ optimal
    - SDCC 4.5 comparison (abs_diff +75%, mul3 +33%)
    - New abstract reflecting corpus validation
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    ab84508
  15. Day 3 chronicle + v1.2.0 release prep

    Chronicle updated: 862 lines, 11 chapters covering all 3 days
    - Ch.11: Enrichment, gray_decode EXACT, Z flag proof, div3, corpus analysis
    - RMDA + Introspec chunk rendering analysis
    - "Register Allocation as a Solved Game" paper v2
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    oisee and claude committed Mar 28, 2026
    5e1cf35