MCP-SPECIFICATION.md

MCP Analytics Server Specification v2.0

Overview

This document defines the complete specification for the Code-Server MCP Analytics Server — a high-performance analytical backend for LobeHub v2 that enables processing of large-scale datasets while minimizing context window usage.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          LobeHub v2 (Agent)                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Context Window (~50k tokens reserved)                              │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │    │
│  │  │   Schema    │  │   Queries   │  │  Results    │  │  Metadata   │ │    │
│  │  │   (~300)    │  │   (~400)    │  │  (~1500)    │  │   (~200)    │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │ MCP Protocol 2.0
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    Code-Server MCP Analytics Server                          │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    Analytics Tool Registry                           │    │
│  │  ┌────────────────┐ ┌────────────────┐ ┌─────────────────────────┐  │    │
│  │  │  profile_dataset│ │ execute_query  │ │    stream_sample        │  │    │
│  │  │  export_result  │ │ cache_manager  │ │    context_optimizer    │  │    │
│  │  └────────────────┘ └────────────────┘ └─────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐   │
│  │   Execution Engine   │  │   Context Manager    │  │   Cache Layer    │   │
│  │  ┌────────────────┐  │  │  ┌────────────────┐  │  │  ┌────────────┐  │   │
│  │  │  Polars        │  │  │  │  Token Budget  │  │  │  │  Memory    │  │   │
│  │  │  DuckDB        │  │  │  │  Prioritization│  │  │  │  Disk      │  │   │
│  │  │  Pandas        │  │  │  │  Compression   │  │  │  │  Query     │  │   │
│  │  └────────────────┘  │  │  └────────────────┘  │  │  └────────────┘  │   │
│  └──────────────────────┘  └──────────────────────┘  └──────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│   Polars Engine  │  │   DuckDB Engine  │  │   Pandas Engine  │
│   (Primary)      │  │   (Secondary)    │  │   (Fallback)     │
│                  │  │                  │  │                  │
│  • LazyFrames    │  │  • SQL Interface │  │  • Full compat   │
│  • Streaming     │  │  • CSV-native    │  │  • ML libraries  │
│  • 10-50x faster │  │  • Join optimize │  │  • Rich API      │
└──────────────────┘  └──────────────────┘  └──────────────────┘

Core Principles

1. Compute at Edge

All heavy computations are performed within the Code-Server environment, not in the LLM context. This includes:

Data aggregation and grouping
Statistical calculations
Joins and merges
Filtering and transformations
Visualization generation

2. Context Budgeting

Strict token budget management ensures efficient context window usage:

Component	Budget (tokens)	Priority
System Prompt	500	Required
Conversation History	2000	Required
Data Schema	300	High
Sample Data	800	Medium
Query Results	1500	High
Visualization	400	Low
Buffer	1000	-
Total Reserved	~6500	-

3. Streaming Over Loading

Data is processed in streaming fashion:

No full dataset loading into memory
Chunked processing for large files
Incremental result generation
Memory-mapped file access where possible

4. Lazy Evaluation

Queries are optimized before execution:

Predicate pushdown
Column pruning
Query planning and optimization
Result caching

MCP Tool Specification

Tool: `analytics/profile_dataset`

Purpose: Perform comprehensive dataset profiling without loading data into context.

Input Schema:

{
  "type": "object",
  "properties": {
    "file_path": {
      "type": "string",
      "description": "Absolute path to dataset file (CSV, Parquet, JSON, etc.)"
    },
    "sample_size": {
      "type": "integer",
      "default": 1000,
      "minimum": 100,
      "maximum": 10000,
      "description": "Number of rows to sample for type inference"
    },
    "compute_stats": {
      "type": "boolean",
      "default": true,
      "description": "Compute detailed statistics"
    },
    "max_categories": {
      "type": "integer",
      "default": 50,
      "description": "Maximum unique categories to report for categorical columns"
    }
  },
  "required": ["file_path"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "schema": {
      "type": "object",
      "properties": {
        "columns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "type": { "type": "string", "enum": ["int64", "float64", "string", "bool", "datetime", "category"] },
              "null_pct": { "type": "number" },
              "unique_count": { "type": "integer" },
              "sample_values": { "type": "array" }
            }
          }
        }
      }
    },
    "statistics": {
      "type": "object",
      "properties": {
        "row_count": { "type": "integer" },
        "column_count": { "type": "integer" },
        "memory_estimate": { "type": "string" },
        "file_size": { "type": "string" },
        "quality_score": { "type": "number" }
      }
    },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact: ~200-400 tokens

Example:

{
  "file_path": "/workspace/sales_2024.csv",
  "sample_size": 5000,
  "compute_stats": true
}

Tool: `analytics/execute_query`

Purpose: Execute analytical queries with configurable engine and return optimized results.

Input Schema:

{
  "type": "object",
  "properties": {
    "engine": {
      "type": "string",
      "enum": ["polars", "duckdb", "pandas", "auto"],
      "default": "auto",
      "description": "Query execution engine"
    },
    "query_type": {
      "type": "string",
      "enum": ["sql", "polars_expr", "python"],
      "description": "Type of query expression"
    },
    "query": {
      "type": "string",
      "description": "Query string or expression"
    },
    "files": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Input file paths"
    },
    "return_limit": {
      "type": "integer",
      "default": 100,
      "minimum": 1,
      "maximum": 1000,
      "description": "Maximum rows to return"
    },
    "return_format": {
      "type": "string",
      "enum": ["json", "csv", "summary", "chart"],
      "default": "json",
      "description": "Result format"
    },
    "generate_chart": {
      "type": "boolean",
      "default": false,
      "description": "Generate visualization"
    },
    "chart_type": {
      "type": "string",
      "enum": ["bar", "line", "scatter", "histogram", "heatmap"],
      "description": "Chart type for visualization"
    },
    "cache_key": {
      "type": "string",
      "description": "Cache key for result caching"
    }
  },
  "required": ["query_type", "query", "files"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "result_type": {
      "type": "string",
      "enum": ["tabular", "aggregation", "chart", "error"]
    },
    "data": {
      "type": "array",
      "description": "Query results (limited to return_limit)"
    },
    "summary": {
      "type": "object",
      "properties": {
        "rows_processed": { "type": "integer" },
        "execution_time_ms": { "type": "integer" },
        "cache_hit": { "type": "boolean" },
        "engine_used": { "type": "string" }
      }
    },
    "columns": {
      "type": "array",
      "items": { "type": "string" }
    },
    "visualization": {
      "type": "object",
      "properties": {
        "type": { "type": "string" },
        "data_url": { "type": "string" },
        "base64_image": { "type": "string" }
      }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact:

Aggregation: ~200-500 tokens
Tabular (limited): ~500-2000 tokens
Chart: ~100-300 tokens (base64)

Example:

{
  "engine": "duckdb",
  "query_type": "sql",
  "query": "SELECT category, SUM(revenue) as total, COUNT(*) as orders FROM read_csv_auto('sales_*.csv') GROUP BY category ORDER BY total DESC",
  "files": ["/workspace/sales_2024.csv", "/workspace/sales_2023.csv"],
  "return_limit": 20,
  "return_format": "json",
  "generate_chart": true,
  "chart_type": "bar"
}

Tool: `analytics/stream_sample`

Purpose: Extract representative data samples for context window inspection.

Input Schema:

{
  "type": "object",
  "properties": {
    "file_path": {
      "type": "string",
      "description": "Path to dataset file"
    },
    "strategy": {
      "type": "string",
      "enum": ["head", "random", "stratified", "systematic"],
      "default": "random",
      "description": "Sampling strategy"
    },
    "sample_size": {
      "type": "integer",
      "default": 20,
      "minimum": 1,
      "maximum": 100,
      "description": "Number of rows to sample"
    },
    "columns": {
      "type": ["array", "null"],
      "items": { "type": "string" },
      "description": "Columns to include (null = all)"
    },
    "stratify_column": {
      "type": "string",
      "description": "Column for stratified sampling"
    }
  },
  "required": ["file_path"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "sample": {
      "type": "array",
      "items": { "type": "object" }
    },
    "sampling_info": {
      "type": "object",
      "properties": {
        "strategy": { "type": "string" },
        "rows_sampled": { "type": "integer" },
        "total_rows": { "type": "integer" },
        "columns_included": { "type": "array" }
      }
    },
    "context_tokens_used": { "type": "integer" }
  }
}

Context Impact: ~20-100 tokens per row

Example:

{
  "file_path": "/workspace/customers.csv",
  "strategy": "stratified",
  "sample_size": 15,
  "stratify_column": "segment"
}

Tool: `analytics/export_result`

Purpose: Persist query results to workspace for subsequent access.

Input Schema:

{
  "type": "object",
  "properties": {
    "query_id": {
      "type": "string",
      "description": "Query result identifier"
    },
    "format": {
      "type": "string",
      "enum": ["csv", "parquet", "json", "xlsx"],
      "default": "parquet"
    },
    "destination": {
      "type": "string",
      "description": "Output directory path"
    },
    "filename": {
      "type": "string",
      "description": "Output filename (optional)"
    },
    "partition_by": {
      "type": "string",
      "description": "Column to partition output by"
    },
    "compression": {
      "type": "string",
      "enum": ["none", "gzip", "snappy", "zstd"],
      "default": "snappy"
    }
  },
  "required": ["query_id", "destination"]
}

Output Schema:

{
  "type": "object",
  "properties": {
    "exported": { "type": "boolean" },
    "output_path": { "type": "string" },
    "row_count": { "type": "integer" },
    "file_size": { "type": "string" },
    "files_created": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Context Impact: ~50 tokens

Tool: `analytics/cache_invalidate`

Purpose: Manually invalidate cached results.

Input Schema:

{
  "type": "object",
  "properties": {
    "pattern": {
      "type": "string",
      "description": "Cache key pattern to invalidate (supports wildcards)"
    },
    "file_path": {
      "type": "string",
      "description": "Invalidate all caches for specific file"
    }
  }
}

Output Schema:

{
  "type": "object",
  "properties": {
    "invalidated_count": { "type": "integer" },
    "keys_removed": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Engine Specifications

Polars (Primary Engine)

Strengths:

Lazy evaluation with query optimization
Streaming execution for large datasets
10-50x faster than pandas
Memory-efficient columnar format
Native expression API

Best For:

Large CSV/Parquet files (>100MB)
Complex transformations
Aggregation queries
Columnar operations

Example Usage:

# Polars expression query
{
  "engine": "polars",
  "query_type": "polars_expr",
  "query": """
    pl.scan_csv('data.csv')
      .filter(pl.col('revenue') > 1000)
      .group_by('category')
      .agg([
        pl.col('revenue').sum().alias('total_revenue'),
        pl.col('orders').count().alias('order_count')
      ])
  """
}

DuckDB (Secondary Engine)

Strengths:

Full SQL support
CSV-native reading
Efficient joins
Window functions
Familiar syntax

Best For:

SQL-heavy workflows
Cross-file analysis (glob patterns)
Complex joins
Window analytics

Example Usage:

-- DuckDB SQL query
{
  "engine": "duckdb",
  "query_type": "sql",
  "query": """
    SELECT 
      year,
      SUM(revenue) as total_revenue,
      AVG(quantity) as avg_quantity,
      LAG(SUM(revenue)) OVER (ORDER BY year) as prev_year
    FROM read_csv_auto('sales_*.csv')
    GROUP BY year
  """
}

Pandas (Fallback Engine)

Strengths:

Maximum ecosystem compatibility
Rich visualization libraries
Machine learning integration
Most comprehensive API

Best For:

ML preprocessing
Custom visualizations
Complex data manipulation
Libraries requiring pandas

Example Usage:

# Pandas Python query
{
  "engine": "pandas",
  "query_type": "python",
  "query": """
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    df = pd.read_csv('data.csv')
    scaler = StandardScaler()
    df['normalized'] = scaler.fit_transform(df[['value']])
    result = df.describe()
  """
}

Data Flow Patterns

Pattern 1: Large File Analysis

User: "Проанализируй sales.csv (500MB)"
    │
    ▼
LobeHub: [Detects large file]
    │
    ▼
MCP: profile_dataset(sales.csv)
    │
    ▼
LobeHub: "Обнаружено 10M строк, 25 колонок.
          Ключевые метрики: [агрегаты]"
    │
    ▼
User: "Покажи топ-10 категорий по выручке"
    │
    ▼
MCP: execute_query(aggregation, limit=10, chart=bar)
    │
    ▼
LobeHub: "[Таблица + график]"
          Context used: ~800 tokens vs 2M+ tokens

Pattern 2: Cross-File Analytics

User: "Сравни метрики 2023 vs 2024"
    │
    ▼
MCP: execute_query(
       engine="duckdb",
       query="SELECT year, SUM(revenue)... FROM read_csv_auto('sales_*.csv')..."
     )
    │
    ▼
LobeHub: "2023: $24.5M, 2024: $31.2M (+27%)"
          Context used: ~300 tokens

Pattern 3: Iterative Exploration

Step 1: profile_dataset        → Schema, stats        (~250 tokens)
Step 2: stream_sample          → Representative rows  (~800 tokens)
Step 3: execute_query (filter) → Filtered subset      (~400 tokens)
Step 4: execute_query (agg)    → Aggregations         (~500 tokens)
Step 5: export_result          → Save to workspace    (~50 tokens)
                                    ─────────────────────────
Total: ~2000 tokens vs 100k+ for full file

Error Handling

Error Categories

Code	Category	Description
4001	Invalid Input	Malformed query or parameters
4002	File Not Found	Dataset file doesn't exist
4003	Schema Error	Schema mismatch or invalid column
4004	Query Error	Syntax error in query
5001	Engine Error	Execution engine failure
5002	Memory Error	Out of memory during execution
5003	Timeout	Query exceeded time limit
5004	Cache Error	Cache operation failed

Error Response Format

{
  "error": {
    "code": 4004,
    "category": "Query Error",
    "message": "Column 'revenu' not found. Did you mean 'revenue'?",
    "suggestions": ["Check column names with profile_dataset", "Use schema information"],
    "query_context": "..."
  }
}

Configuration

Server Configuration

{
  "mcpServers": {
    "code-server-analytics": {
      "command": "python",
      "args": ["-m", "code_server_mcp.analytics"],
      "env": {
        "POLARS_MAX_THREADS": "4",
        "DUCKDB_MEMORY_LIMIT": "2GB",
        "CACHE_MAX_SIZE": "1GB",
        "CACHE_TTL_SECONDS": "3600",
        "MAX_QUERY_TIME_MS": "30000",
        "STREAMING_THRESHOLD_MB": "100"
      }
    }
  }
}

Engine Selection Strategy

Auto Engine Selection Logic:

File size < 10MB → Pandas (ecosystem compatibility)
SQL-like query → DuckDB
Complex transforms → Polars
Multiple files → DuckDB
ML operations → Pandas

Performance Thresholds

Metric	Threshold	Action
Query time	> 30s	Timeout, return partial
Memory usage	> 2GB	Switch to streaming
Result rows	> 10,000	Truncate + warning
Context tokens	> 2000	Compress + summary
File size	> 1GB	Force streaming mode

Version History

Version	Date	Changes
2.0.0	2026-03-17	Initial analytics specification
2.0.1	Planned	Multi-engine query optimization
2.1.0	Planned	Real-time streaming support
2.2.0	Planned	Advanced caching strategies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MCP Analytics Server Specification v2.0

Overview

Architecture Overview

Core Principles

1. Compute at Edge

2. Context Budgeting

3. Streaming Over Loading

4. Lazy Evaluation

MCP Tool Specification

Tool: `analytics/profile_dataset`

Tool: `analytics/execute_query`

Tool: `analytics/stream_sample`

Tool: `analytics/export_result`

Tool: `analytics/cache_invalidate`

Engine Specifications

Polars (Primary Engine)

DuckDB (Secondary Engine)

Pandas (Fallback Engine)

Data Flow Patterns

Pattern 1: Large File Analysis

Pattern 2: Cross-File Analytics

Pattern 3: Iterative Exploration

Error Handling

Error Categories

Error Response Format

Configuration

Server Configuration

Engine Selection Strategy

Performance Thresholds

Version History

Related Documents

FilesExpand file tree

MCP-SPECIFICATION.md

Latest commit

History

MCP-SPECIFICATION.md

File metadata and controls

MCP Analytics Server Specification v2.0

Overview

Architecture Overview

Core Principles

1. Compute at Edge

2. Context Budgeting

3. Streaming Over Loading

4. Lazy Evaluation

MCP Tool Specification

Tool: analytics/profile_dataset

Tool: analytics/execute_query

Tool: analytics/stream_sample

Tool: analytics/export_result

Tool: analytics/cache_invalidate

Engine Specifications

Polars (Primary Engine)

DuckDB (Secondary Engine)

Pandas (Fallback Engine)

Data Flow Patterns

Pattern 1: Large File Analysis

Pattern 2: Cross-File Analytics

Pattern 3: Iterative Exploration

Error Handling

Error Categories

Error Response Format

Configuration

Server Configuration

Engine Selection Strategy

Performance Thresholds

Version History

Related Documents

Tool: `analytics/profile_dataset`

Tool: `analytics/execute_query`

Tool: `analytics/stream_sample`

Tool: `analytics/export_result`

Tool: `analytics/cache_invalidate`