Skip to content

Latest commit

 

History

History
708 lines (638 loc) · 21.4 KB

File metadata and controls

708 lines (638 loc) · 21.4 KB

MCP Analytics Server Specification v2.0

Overview

This document defines the complete specification for the Code-Server MCP Analytics Server — a high-performance analytical backend for LobeHub v2 that enables processing of large-scale datasets while minimizing context window usage.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          LobeHub v2 (Agent)                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Context Window (~50k tokens reserved)                              │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │    │
│  │  │   Schema    │  │   Queries   │  │  Results    │  │  Metadata   │ │    │
│  │  │   (~300)    │  │   (~400)    │  │  (~1500)    │  │   (~200)    │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │ MCP Protocol 2.0
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    Code-Server MCP Analytics Server                          │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    Analytics Tool Registry                           │    │
│  │  ┌────────────────┐ ┌────────────────┐ ┌─────────────────────────┐  │    │
│  │  │  profile_dataset│ │ execute_query  │ │    stream_sample        │  │    │
│  │  │  export_result  │ │ cache_manager  │ │    context_optimizer    │  │    │
│  │  └────────────────┘ └────────────────┘ └─────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐   │
│  │   Execution Engine   │  │   Context Manager    │  │   Cache Layer    │   │
│  │  ┌────────────────┐  │  │  ┌────────────────┐  │  │  ┌────────────┐  │   │
│  │  │  Polars        │  │  │  │  Token Budget  │  │  │  │  Memory    │  │   │
│  │  │  DuckDB        │  │  │  │  Prioritization│  │  │  │  Disk      │  │   │
│  │  │  Pandas        │  │  │  │  Compression   │  │  │  │  Query     │  │   │
│  │  └────────────────┘  │  │  └────────────────┘  │  │  └────────────┘  │   │
│  └──────────────────────┘  └──────────────────────┘  └──────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│   Polars Engine  │  │   DuckDB Engine  │  │   Pandas Engine  │
│   (Primary)      │  │   (Secondary)    │  │   (Fallback)     │
│                  │  │                  │  │                  │
│  • LazyFrames    │  │  • SQL Interface │  │  • Full compat   │
│  • Streaming     │  │  • CSV-native    │  │  • ML libraries  │
│  • 10-50x faster │  │  • Join optimize │  │  • Rich API      │
└──────────────────┘  └──────────────────┘  └──────────────────┘

Core Principles

1. Compute at Edge

All heavy computations are performed within the Code-Server environment, not in the LLM context. This includes:

  • Data aggregation and grouping
  • Statistical calculations
  • Joins and merges
  • Filtering and transformations
  • Visualization generation

2. Context Budgeting

Strict token budget management ensures efficient context window usage:

Component Budget (tokens) Priority
System Prompt 500 Required
Conversation History 2000 Required
Data Schema 300 High
Sample Data 800 Medium
Query Results 1500 High
Visualization 400 Low
Buffer 1000 -
Total Reserved ~6500 -

3. Streaming Over Loading

Data is processed in streaming fashion:

  • No full dataset loading into memory
  • Chunked processing for large files
  • Incremental result generation
  • Memory-mapped file access where possible

4. Lazy Evaluation

Queries are optimized before execution:

  • Predicate pushdown
  • Column pruning
  • Query planning and optimization
  • Result caching

MCP Tool Specification

Tool: analytics/profile_dataset

Purpose: Perform comprehensive dataset profiling without loading data into context.

Input Schema:

{
  "type": "object",
  "properties": {
    "file_path": {
      "type": "string",
      "description": "Absolute path to dataset file (CSV, Parquet, JSON, etc.)"
    },
    "sample_size": {
      "type": "integer",
      "default": 1000,
      "minimum": 100,
      "maximum": 10000,
      "description": "Number of rows to sample for type inference"
    },
    "compute_stats": {
      "type": "boolean",
      "default": true,
      "description": "Compute detailed statistics"
    },
    "max_categories": {
      "type": "integer",
      "default": 50,
      "description": "Maximum unique categories to report for categorical columns"
    }
  },
  "required": ["file_path"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "schema": {
      "type": "object",
      "properties": {
        "columns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "type": { "type": "string", "enum": ["int64", "float64", "string", "bool", "datetime", "category"] },
              "null_pct": { "type": "number" },
              "unique_count": { "type": "integer" },
              "sample_values": { "type": "array" }
            }
          }
        }
      }
    },
    "statistics": {
      "type": "object",
      "properties": {
        "row_count": { "type": "integer" },
        "column_count": { "type": "integer" },
        "memory_estimate": { "type": "string" },
        "file_size": { "type": "string" },
        "quality_score": { "type": "number" }
      }
    },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact: ~200-400 tokens

Example:

{
  "file_path": "/workspace/sales_2024.csv",
  "sample_size": 5000,
  "compute_stats": true
}

Tool: analytics/execute_query

Purpose: Execute analytical queries with configurable engine and return optimized results.

Input Schema:

{
  "type": "object",
  "properties": {
    "engine": {
      "type": "string",
      "enum": ["polars", "duckdb", "pandas", "auto"],
      "default": "auto",
      "description": "Query execution engine"
    },
    "query_type": {
      "type": "string",
      "enum": ["sql", "polars_expr", "python"],
      "description": "Type of query expression"
    },
    "query": {
      "type": "string",
      "description": "Query string or expression"
    },
    "files": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Input file paths"
    },
    "return_limit": {
      "type": "integer",
      "default": 100,
      "minimum": 1,
      "maximum": 1000,
      "description": "Maximum rows to return"
    },
    "return_format": {
      "type": "string",
      "enum": ["json", "csv", "summary", "chart"],
      "default": "json",
      "description": "Result format"
    },
    "generate_chart": {
      "type": "boolean",
      "default": false,
      "description": "Generate visualization"
    },
    "chart_type": {
      "type": "string",
      "enum": ["bar", "line", "scatter", "histogram", "heatmap"],
      "description": "Chart type for visualization"
    },
    "cache_key": {
      "type": "string",
      "description": "Cache key for result caching"
    }
  },
  "required": ["query_type", "query", "files"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "result_type": {
      "type": "string",
      "enum": ["tabular", "aggregation", "chart", "error"]
    },
    "data": {
      "type": "array",
      "description": "Query results (limited to return_limit)"
    },
    "summary": {
      "type": "object",
      "properties": {
        "rows_processed": { "type": "integer" },
        "execution_time_ms": { "type": "integer" },
        "cache_hit": { "type": "boolean" },
        "engine_used": { "type": "string" }
      }
    },
    "columns": {
      "type": "array",
      "items": { "type": "string" }
    },
    "visualization": {
      "type": "object",
      "properties": {
        "type": { "type": "string" },
        "data_url": { "type": "string" },
        "base64_image": { "type": "string" }
      }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact:

  • Aggregation: ~200-500 tokens
  • Tabular (limited): ~500-2000 tokens
  • Chart: ~100-300 tokens (base64)

Example:

{
  "engine": "duckdb",
  "query_type": "sql",
  "query": "SELECT category, SUM(revenue) as total, COUNT(*) as orders FROM read_csv_auto('sales_*.csv') GROUP BY category ORDER BY total DESC",
  "files": ["/workspace/sales_2024.csv", "/workspace/sales_2023.csv"],
  "return_limit": 20,
  "return_format": "json",
  "generate_chart": true,
  "chart_type": "bar"
}

Tool: analytics/stream_sample

Purpose: Extract representative data samples for context window inspection.

Input Schema:

{
  "type": "object",
  "properties": {
    "file_path": {
      "type": "string",
      "description": "Path to dataset file"
    },
    "strategy": {
      "type": "string",
      "enum": ["head", "random", "stratified", "systematic"],
      "default": "random",
      "description": "Sampling strategy"
    },
    "sample_size": {
      "type": "integer",
      "default": 20,
      "minimum": 1,
      "maximum": 100,
      "description": "Number of rows to sample"
    },
    "columns": {
      "type": ["array", "null"],
      "items": { "type": "string" },
      "description": "Columns to include (null = all)"
    },
    "stratify_column": {
      "type": "string",
      "description": "Column for stratified sampling"
    }
  },
  "required": ["file_path"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "sample": {
      "type": "array",
      "items": { "type": "object" }
    },
    "sampling_info": {
      "type": "object",
      "properties": {
        "strategy": { "type": "string" },
        "rows_sampled": { "type": "integer" },
        "total_rows": { "type": "integer" },
        "columns_included": { "type": "array" }
      }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact: ~20-100 tokens per row

Example:

{
  "file_path": "/workspace/customers.csv",
  "strategy": "stratified",
  "sample_size": 15,
  "stratify_column": "segment"
}

Tool: analytics/export_result

Purpose: Persist query results to workspace for subsequent access.

Input Schema:

{
  "type": "object",
  "properties": {
    "query_id": {
      "type": "string",
      "description": "Query result identifier"
    },
    "format": {
      "type": "string",
      "enum": ["csv", "parquet", "json", "xlsx"],
      "default": "parquet"
    },
    "destination": {
      "type": "string",
      "description": "Output directory path"
    },
    "filename": {
      "type": "string",
      "description": "Output filename (optional)"
    },
    "partition_by": {
      "type": "string",
      "description": "Column to partition output by"
    },
    "compression": {
      "type": "string",
      "enum": ["none", "gzip", "snappy", "zstd"],
      "default": "snappy"
    }
  },
  "required": ["query_id", "destination"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "exported": { "type": "boolean" },
    "output_path": { "type": "string" },
    "row_count": { "type": "integer" },
    "file_size": { "type": "string" },
    "files_created": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Context Impact: ~50 tokens

Tool: analytics/cache_invalidate

Purpose: Manually invalidate cached results.

Input Schema:

{
  "type": "object",
  "properties": {
    "pattern": {
      "type": "string",
      "description": "Cache key pattern to invalidate (supports wildcards)"
    },
    "file_path": {
      "type": "string",
      "description": "Invalidate all caches for specific file"
    }
  }
}

Output Schema:

{
  "type": "object",
  "properties": {
    "invalidated_count": { "type": "integer" },
    "keys_removed": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Engine Specifications

Polars (Primary Engine)

Strengths:

  • Lazy evaluation with query optimization
  • Streaming execution for large datasets
  • 10-50x faster than pandas
  • Memory-efficient columnar format
  • Native expression API

Best For:

  • Large CSV/Parquet files (>100MB)
  • Complex transformations
  • Aggregation queries
  • Columnar operations

Example Usage:

# Polars expression query
{
  "engine": "polars",
  "query_type": "polars_expr",
  "query": """
    pl.scan_csv('data.csv')
      .filter(pl.col('revenue') > 1000)
      .group_by('category')
      .agg([
        pl.col('revenue').sum().alias('total_revenue'),
        pl.col('orders').count().alias('order_count')
      ])
  """
}

DuckDB (Secondary Engine)

Strengths:

  • Full SQL support
  • CSV-native reading
  • Efficient joins
  • Window functions
  • Familiar syntax

Best For:

  • SQL-heavy workflows
  • Cross-file analysis (glob patterns)
  • Complex joins
  • Window analytics

Example Usage:

-- DuckDB SQL query
{
  "engine": "duckdb",
  "query_type": "sql",
  "query": """
    SELECT 
      year,
      SUM(revenue) as total_revenue,
      AVG(quantity) as avg_quantity,
      LAG(SUM(revenue)) OVER (ORDER BY year) as prev_year
    FROM read_csv_auto('sales_*.csv')
    GROUP BY year
  """
}

Pandas (Fallback Engine)

Strengths:

  • Maximum ecosystem compatibility
  • Rich visualization libraries
  • Machine learning integration
  • Most comprehensive API

Best For:

  • ML preprocessing
  • Custom visualizations
  • Complex data manipulation
  • Libraries requiring pandas

Example Usage:

# Pandas Python query
{
  "engine": "pandas",
  "query_type": "python",
  "query": """
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    df = pd.read_csv('data.csv')
    scaler = StandardScaler()
    df['normalized'] = scaler.fit_transform(df[['value']])
    result = df.describe()
  """
}

Data Flow Patterns

Pattern 1: Large File Analysis

User: "Проанализируй sales.csv (500MB)"
    │
    ▼
LobeHub: [Detects large file]
    │
    ▼
MCP: profile_dataset(sales.csv)
    │
    ▼
LobeHub: "Обнаружено 10M строк, 25 колонок.
          Ключевые метрики: [агрегаты]"
    │
    ▼
User: "Покажи топ-10 категорий по выручке"
    │
    ▼
MCP: execute_query(aggregation, limit=10, chart=bar)
    │
    ▼
LobeHub: "[Таблица + график]"
          Context used: ~800 tokens vs 2M+ tokens

Pattern 2: Cross-File Analytics

User: "Сравни метрики 2023 vs 2024"
    │
    ▼
MCP: execute_query(
       engine="duckdb",
       query="SELECT year, SUM(revenue)... FROM read_csv_auto('sales_*.csv')..."
     )
    │
    ▼
LobeHub: "2023: $24.5M, 2024: $31.2M (+27%)"
          Context used: ~300 tokens

Pattern 3: Iterative Exploration

Step 1: profile_dataset        → Schema, stats        (~250 tokens)
Step 2: stream_sample          → Representative rows  (~800 tokens)
Step 3: execute_query (filter) → Filtered subset      (~400 tokens)
Step 4: execute_query (agg)    → Aggregations         (~500 tokens)
Step 5: export_result          → Save to workspace    (~50 tokens)
                                    ─────────────────────────
Total: ~2000 tokens vs 100k+ for full file

Error Handling

Error Categories

Code Category Description
4001 Invalid Input Malformed query or parameters
4002 File Not Found Dataset file doesn't exist
4003 Schema Error Schema mismatch or invalid column
4004 Query Error Syntax error in query
5001 Engine Error Execution engine failure
5002 Memory Error Out of memory during execution
5003 Timeout Query exceeded time limit
5004 Cache Error Cache operation failed

Error Response Format

{
  "error": {
    "code": 4004,
    "category": "Query Error",
    "message": "Column 'revenu' not found. Did you mean 'revenue'?",
    "suggestions": ["Check column names with profile_dataset", "Use schema information"],
    "query_context": "..."
  }
}

Configuration

Server Configuration

{
  "mcpServers": {
    "code-server-analytics": {
      "command": "python",
      "args": ["-m", "code_server_mcp.analytics"],
      "env": {
        "POLARS_MAX_THREADS": "4",
        "DUCKDB_MEMORY_LIMIT": "2GB",
        "CACHE_MAX_SIZE": "1GB",
        "CACHE_TTL_SECONDS": "3600",
        "MAX_QUERY_TIME_MS": "30000",
        "STREAMING_THRESHOLD_MB": "100"
      }
    }
  }
}

Engine Selection Strategy

Auto Engine Selection Logic:

  1. File size < 10MB → Pandas (ecosystem compatibility)
  2. SQL-like query → DuckDB
  3. Complex transforms → Polars
  4. Multiple files → DuckDB
  5. ML operations → Pandas

Performance Thresholds

Metric Threshold Action
Query time > 30s Timeout, return partial
Memory usage > 2GB Switch to streaming
Result rows > 10,000 Truncate + warning
Context tokens > 2000 Compress + summary
File size > 1GB Force streaming mode

Version History

Version Date Changes
2.0.0 2026-03-17 Initial analytics specification
2.0.1 Planned Multi-engine query optimization
2.1.0 Planned Real-time streaming support
2.2.0 Planned Advanced caching strategies

Related Documents

  • ADR-001: Analytics Architecture
  • ADR-002: Context Management Strategy
  • ADR-003: Caching and Query Optimization
  • LobeHub v2 Integration Guide