Skip to content

Latest commit

 

History

History
619 lines (514 loc) · 22.4 KB

File metadata and controls

619 lines (514 loc) · 22.4 KB

ADR-003: Caching and Query Optimization

Status

Proposed

Context

Analytics workloads exhibit predictable patterns that can be exploited for performance gains:

Observed Patterns

  1. Repeated Analysis: Users often run similar queries on same datasets
  2. Incremental Exploration: Drill-down queries share base filters
  3. Schema Introspection: Dataset profiles are requested multiple times
  4. Visualization Regeneration: Same charts requested with minor variations

Current Problems

Issue Impact Frequency
Re-parsing large CSVs 5-30s delay Every query
Re-executing identical queries Wasted compute Common
Re-transferring unchanged results Token waste Every round-trip
Re-generating same visualizations CPU + time Chart refreshes
Cold starts on dataset access Latency First access

Performance Targets

Metric Current Target Improvement
Repeated query latency Full execution <100ms 100x
Dataset re-parse time 5-30s 0s (cached)
Cache hit ratio 0% >60% New capability
Memory overhead N/A <1GB Bounded

Decision

Implement a multi-tier caching system with query optimization, result caching, and intelligent prefetching.

Caching Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Multi-Tier Caching System                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TIER 1: IN-MEMORY CACHE (L1)                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  • Dataset schemas                                                  │   │
│  │  • Parsed dataframes (polars LazyFrames)                            │   │
│  │  • Query execution plans                                            │   │
│  │  TTL: 5-15 minutes    Max Size: 512MB    Eviction: LRU              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  TIER 2: LOCAL DISK CACHE (L2)                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  • Query results (Parquet format)                                   │   │
│  │  • Serialized dataframes                                            │   │
│  │  • Visualization outputs                                            │   │
│  │  TTL: 1-24 hours      Max Size: 5GB      Eviction: LFU              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  TIER 3: QUERY PLAN CACHE                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  • Optimized execution plans                                        │   │
│  │  • Compiled SQL expressions                                         │   │
│  │  • Predicate pushdown results                                       │   │
│  │  TTL: Session-based   Max Size: 100MB    Eviction: FIFO             │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Cache Key Strategy

def generate_cache_key(query_spec: dict) -> str:
    """
    Deterministic cache key generation
    """
    components = [
        # File checksums (mtime + size based)
        hash_files(query_spec['files']),
        
        # Normalized query
        normalize_query(query_spec['query']),
        
        # Engine and options
        query_spec.get('engine', 'auto'),
        query_spec.get('return_limit', 100),
        
        # Result format (different formats = different cache entries)
        query_spec.get('return_format', 'json'),
    ]
    
    return hashlib.sha256(
        '|'.join(str(c) for c in components).encode()
    ).hexdigest()[:32]

def hash_files(file_paths: list) -> str:
    """
    Fast file fingerprint using mtime and size
    """
    fingerprints = []
    for path in sorted(file_paths):
        stat = os.stat(path)
        fingerprints.append(f"{path}:{stat.st_mtime}:{stat.st_size}")
    return hashlib.sha256('|'.join(fingerprints).encode()).hexdigest()[:16]

Dataset Metadata Cache

@dataclass
class DatasetCacheEntry:
    file_path: str
    fingerprint: str  # mtime + size hash
    schema: Schema
    row_count: int
    column_stats: Dict[str, ColumnStats]
    sample_data: pd.DataFrame
    parsed_lazyframe: Optional[pl.LazyFrame]
    cached_at: datetime
    access_count: int
    last_accessed: datetime
    
    @property
    def is_stale(self) -> bool:
        # Check if file has been modified
        current_stat = os.stat(self.file_path)
        current_fingerprint = f"{current_stat.st_mtime}:{current_stat.st_size}"
        return current_fingerprint != self.fingerprint

Query Optimization Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                  Query Optimization Flow                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Query                                                     │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────┐                                        │
│  │ 1. Parse & Normalize │  • Standardize SQL/expr syntax          │
│  └──────────┬──────────┘  • Extract predicates                    │
│             │                                                    │
│             ▼                                                    │
│  ┌─────────────────────┐                                        │
│  │ 2. Check L1 Cache    │  • Exact match? Return cached          │
│  └──────────┬──────────┘  • Partial match? Use as base           │
│             │                                                    │
│             ▼                                                    │
│  ┌─────────────────────┐                                        │
│  │ 3. Optimize Plan     │  • Predicate pushdown                  │
│  └──────────┬──────────┘  • Column pruning                       │
│             │             • Partition elimination                │
│             ▼                                                    │
│  ┌─────────────────────┐                                        │
│  │ 4. Execute          │  • Use cached dataset if available      │
│  └──────────┬──────────┘  • Stream if needed                     │
│             │                                                    │
│             ▼                                                    │
│  ┌─────────────────────┐                                        │
│  │ 5. Cache Result      │  • Store in L1 (small results)          │
│  └──────────┬──────────┘  • Store in L2 (large results)          │
│             │             • Update metadata                      │
│             ▼                                                    │
│  Output Result                                                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Query Rewriting Examples

Example 1: Predicate Pushdown

-- Original Query
SELECT category, SUM(revenue) 
FROM sales 
WHERE date >= '2024-01-01'
GROUP BY category;

-- Optimized (Polars LazyFrame)
(df
 .lazy()
 .filter(pl.col('date') >= '2024-01-01')  # Pushdown to scan
 .group_by('category')
 .agg(pl.col('revenue').sum())
 .collect())

Example 2: Column Pruning

# Original: Load all columns
df = pl.read_csv('large_file.csv')
result = df.select(['name', 'revenue'])

# Optimized: Scan only needed columns
df = pl.scan_csv('large_file.csv', columns=['name', 'revenue'])
result = df.collect()

Example 3: Incremental Aggregation

# User queries:
# Q1: Total revenue by category
# Q2: Total revenue by category (with date filter)

# Q1 executes full scan → caches result
# Q2 uses cached aggregated data + applies filter → fast

Cache Storage Implementation

interface CacheManager {
  // L1: In-memory cache
  l1Cache: Map<string, L1CacheEntry>;
  l1MaxSize: number;
  l1CurrentSize: number;
  
  // L2: Disk cache
  l2CachePath: string;
  l2MaxSize: number;
  
  // Operations
  get<T>(key: string, tier: 'L1' | 'L2' | 'any'): Promise<T | null>;
  set<T>(key: string, value: T, tier: 'L1' | 'L2', ttl: number): Promise<void>;
  invalidate(pattern: string): Promise<number>;
  clear(): Promise<void>;
}

class AnalyticsCacheManager implements CacheManager {
  private l1Cache: Map<string, CacheEntry> = new Map();
  private l1MaxSize = 512 * 1024 * 1024; // 512MB
  private l2CachePath: string;
  
  async get<T>(key: string, tier: 'L1' | 'L2' | 'any'): Promise<T | null> {
    // Try L1 first
    if (tier !== 'L2') {
      const l1Entry = this.l1Cache.get(key);
      if (l1Entry && !this.isExpired(l1Entry)) {
        l1Entry.lastAccessed = new Date();
        l1Entry.accessCount++;
        return l1Entry.value as T;
      }
    }
    
    // Try L2
    if (tier !== 'L1') {
      const l2Path = path.join(this.l2CachePath, `${key}.parquet`);
      if (await fs.access(l2Path).then(() => true).catch(() => false)) {
        const result = await pl.read_parquet(l2Path);
        // Promote to L1 if small enough
        if (result.estimated_size() < 10 * 1024 * 1024) {
          await this.set(key, result, 'L1', 300);
        }
        return result as T;
      }
    }
    
    return null;
  }
  
  async set<T>(key: string, value: T, tier: 'L1' | 'L2', ttl: number): Promise<void> {
    if (tier === 'L1') {
      // Check memory limits
      const size = this.estimateSize(value);
      if (this.l1CurrentSize + size > this.l1MaxSize) {
        await this.evictL1(size);
      }
      
      this.l1Cache.set(key, {
        value,
        size,
        createdAt: new Date(),
        expiresAt: new Date(Date.now() + ttl * 1000),
        lastAccessed: new Date(),
        accessCount: 0,
      });
      this.l1CurrentSize += size;
      
    } else {
      // L2: Write to parquet
      const l2Path = path.join(this.l2CachePath, `${key}.parquet`);
      await value.write_parquet(l2Path, { compression: 'snappy' });
    }
  }
  
  private async evictL1(requiredSpace: number): Promise<void> {
    // LRU eviction
    const entries = Array.from(this.l1Cache.entries())
      .sort((a, b) => a[1].lastAccessed.getTime() - b[1].lastAccessed.getTime());
    
    let freed = 0;
    for (const [key, entry] of entries) {
      if (freed >= requiredSpace) break;
      
      this.l1Cache.delete(key);
      this.l1CurrentSize -= entry.size;
      freed += entry.size;
      
      // If recently accessed, move to L2
      if (entry.accessCount > 5) {
        await this.moveToL2(key, entry.value);
      }
    }
  }
}

Smart Prefetching

class PrefetchManager:
    """
    Predict and prefetch likely-needed data
    """
    
    def __init__(self):
        self.query_patterns = {}
        self.access_graph = defaultdict(list)
    
    def record_access(self, dataset: str, query: dict):
        """Track query patterns for prediction"""
        pattern = self.extract_pattern(query)
        self.query_patterns[dataset].append({
            'pattern': pattern,
            'timestamp': time.time(),
            'columns': query.get('columns', []),
        })
    
    def predict_next_queries(self, dataset: str, current_query: dict) -> list:
        """Predict what user might query next"""
        predictions = []
        
        # Pattern 1: Time series → predict next time bucket
        if 'date' in current_query.get('filters', {}):
            predictions.append(self.extend_time_range(current_query))
        
        # Pattern 2: Aggregation → predict drill-down
        if current_query.get('aggregation'):
            predictions.append(self.add_dimension(current_query))
        
        # Pattern 3: Top-N → predict next N
        if current_query.get('limit'):
            predictions.append(self.increase_limit(current_query))
        
        return predictions
    
    async def prefetch(self, dataset: str, query: dict):
        """Execute predicted queries in background"""
        predictions = self.predict_next_queries(dataset, query)
        
        for prediction in predictions[:3]:  # Max 3 prefetches
            cache_key = generate_cache_key(prediction)
            if not await cache.exists(cache_key):
                # Execute in background
                asyncio.create_task(
                    self.execute_and_cache(prediction, cache_key)
                )

Visualization Caching

@dataclass
class VisualizationCache:
    chart_type: str
    data_hash: str
    options_hash: str
    base64_image: str
    width: int
    height: int
    created_at: datetime
    
    @classmethod
    def from_request(cls, data: pd.DataFrame, chart_type: str, options: dict):
        data_hash = hash_dataframe(data)
        options_hash = hash_dict(options)
        
        return cls(
            chart_type=chart_type,
            data_hash=data_hash,
            options_hash=options_hash,
            base64_image=None,  # Generated on demand
            width=options.get('width', 800),
            height=options.get('height', 600),
            created_at=datetime.now(),
        )

def get_or_create_chart(data: pd.DataFrame, chart_type: str, options: dict) -> str:
    """Get cached chart or generate new one"""
    cache_key = VisualizationCache.from_request(data, chart_type, options)
    
    # Check cache
    cached = viz_cache.get(cache_key.data_hash + cache_key.options_hash)
    if cached:
        return cached.base64_image
    
    # Generate new chart
    fig = create_chart(data, chart_type, options)
    buf = io.BytesIO()
    fig.savefig(buf, format='png', dpi=100)
    base64_image = base64.b64encode(buf.getvalue()).decode()
    
    # Cache it
    cache_key.base64_image = base64_image
    viz_cache.set(cache_key.data_hash + cache_key.options_hash, cache_key)
    
    return base64_image

Cache Invalidation Strategies

class CacheInvalidator:
    """
    Intelligent cache invalidation
    """
    
    def on_file_change(self, file_path: str):
        """Called when a dataset file is modified"""
        # Invalidate all caches for this file
        pattern = f"*{hashlib.md5(file_path.encode()).hexdigest()[:8]}*"
        invalidated = cache.invalidate(pattern)
        
        logger.info(f"Invalidated {invalidated} cache entries for {file_path}")
    
    def on_query_error(self, query: dict, error: Exception):
        """Handle query errors - may invalidate related caches"""
        if isinstance(error, SchemaMismatchError):
            # Schema changed - invalidate all dataset caches
            self.on_file_change(query['files'][0])
    
    def scheduled_cleanup(self):
        """Periodic cleanup of stale cache entries"""
        # Remove expired entries
        for key, entry in list(l1_cache.items()):
            if entry.is_expired():
                l1_cache.delete(key)
        
        # Compact L2 cache if over size limit
        if l2_cache.size > l2_cache.max_size:
            l2_cache.evict_lfu(target_size=l2_cache.max_size * 0.8)

Performance Monitoring

interface CacheMetrics {
  // Hit rates
  l1HitRate: number;
  l2HitRate: number;
  overallHitRate: number;
  
  // Latency
  l1LatencyMs: number;    // Should be <10ms
  l2LatencyMs: number;    // Should be <100ms
  missLatencyMs: number;  // Full execution time
  
  // Usage
  l1SizeBytes: number;
  l2SizeBytes: number;
  l1EntryCount: number;
  l2EntryCount: number;
  
  // Savings
  queriesCached: number;
  executionTimeSavedMs: number;
  computeCostSaved: number;
}

class CacheMetricsCollector {
  private metrics: CacheMetrics = {
    l1HitRate: 0,
    l2HitRate: 0,
    overallHitRate: 0,
    l1LatencyMs: 0,
    l2LatencyMs: 0,
    missLatencyMs: 0,
    l1SizeBytes: 0,
    l2SizeBytes: 0,
    l1EntryCount: 0,
    l2EntryCount: 0,
    queriesCached: 0,
    executionTimeSavedMs: 0,
    computeCostSaved: 0,
  };
  
  recordAccess(tier: 'L1' | 'L2' | 'miss', latencyMs: number): void {
    // Update hit rates and latency metrics
    // Emit to monitoring system
  }
  
  getReport(): CacheMetrics {
    return { ...this.metrics };
  }
}

Consequences

Positive

  1. Sub-100ms Repeated Queries: 100x+ speedup for cached results
  2. Reduced Compute Costs: Fewer full query executions
  3. Better User Experience: Instant responses for common queries
  4. Predictable Performance: Cache hits have consistent latency
  5. Offline Capability: Cached results available without re-execution
  6. Cost Savings: Reduced CPU/memory usage on code-server

Negative

  1. Memory Overhead: Up to 512MB L1 + 5GB L2 cache storage
  2. Cache Invalidation Complexity: Must handle file changes correctly
  3. Stale Data Risk: Users may see outdated results
  4. Warm-up Time: Cold cache requires initial full executions
  5. Monitoring Overhead: Need to track cache health and hit rates

Neutral

  1. Storage Requirements: Additional disk space for L2 cache
  2. Configuration Complexity: More tuning parameters
  3. Debugging Complexity: Cached results may hide issues

Implementation Plan

Phase 1: Basic Caching (MVP)

  • Implement L1 in-memory cache for schemas
  • Cache parsed LazyFrames
  • Simple LRU eviction
  • Basic cache key generation

Phase 2: Full Caching

  • L2 disk cache with Parquet storage
  • Query result caching
  • Visualization caching
  • Cache metrics and monitoring

Phase 3: Advanced Optimization

  • Query plan optimization
  • Smart prefetching
  • Incremental query refinement
  • Cross-query optimization

Phase 4: Intelligent Features

  • Machine learning for prefetch prediction
  • Automatic cache sizing
  • Query result compression
  • Distributed caching (if needed)

Success Metrics

Metric Target Measurement
Cache hit rate >60% Access logs
L1 hit latency <10ms Timing metrics
L2 hit latency <100ms Timing metrics
Memory overhead <512MB System metrics
Disk overhead <5GB Storage metrics
Invalidation accuracy 100% Stale data incidents
Query speedup (cached) 50x A/B testing

Configuration

{
  "cache": {
    "l1": {
      "enabled": true,
      "maxSizeMB": 512,
      "defaultTTLSeconds": 900,
      "evictionPolicy": "lru"
    },
    "l2": {
      "enabled": true,
      "path": "/tmp/mcp-cache",
      "maxSizeGB": 5,
      "defaultTTLHours": 24,
      "compression": "snappy"
    },
    "prefetch": {
      "enabled": true,
      "maxPrefetchQueries": 3,
      "confidenceThreshold": 0.7
    },
    "invalidation": {
      "checkIntervalSeconds": 30,
      "aggressiveCleanup": false
    }
  }
}

Related Decisions

  • ADR-001: Analytics Architecture
  • ADR-002: Context Window Management Strategy
  • MCP-SPECIFICATION: Detailed tool specifications

Decision Record

Date Author Decision Rationale
2026-03-17 Sisyphus Adopt multi-tier caching with query optimization Essential for performance at scale

Status Legend:

  • Proposed: Under review
  • Accepted: Approved for implementation
  • Deprecated: Replaced by newer ADR
  • Superseded: See referenced ADR