This document defines the complete specification for the Code-Server MCP Analytics Server — a high-performance analytical backend for LobeHub v2 that enables processing of large-scale datasets while minimizing context window usage.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LobeHub v2 (Agent) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Context Window (~50k tokens reserved) │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Schema │ │ Queries │ │ Results │ │ Metadata │ │ │
│ │ │ (~300) │ │ (~400) │ │ (~1500) │ │ (~200) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────┬──────────────────────────────────────────┘
│ MCP Protocol 2.0
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Code-Server MCP Analytics Server │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Analytics Tool Registry │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ profile_dataset│ │ execute_query │ │ stream_sample │ │ │
│ │ │ export_result │ │ cache_manager │ │ context_optimizer │ │ │
│ │ └────────────────┘ └────────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────┐ │
│ │ Execution Engine │ │ Context Manager │ │ Cache Layer │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │ ┌────────────┐ │ │
│ │ │ Polars │ │ │ │ Token Budget │ │ │ │ Memory │ │ │
│ │ │ DuckDB │ │ │ │ Prioritization│ │ │ │ Disk │ │ │
│ │ │ Pandas │ │ │ │ Compression │ │ │ │ Query │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │ └────────────┘ │ │
│ └──────────────────────┘ └──────────────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Polars Engine │ │ DuckDB Engine │ │ Pandas Engine │
│ (Primary) │ │ (Secondary) │ │ (Fallback) │
│ │ │ │ │ │
│ • LazyFrames │ │ • SQL Interface │ │ • Full compat │
│ • Streaming │ │ • CSV-native │ │ • ML libraries │
│ • 10-50x faster │ │ • Join optimize │ │ • Rich API │
└──────────────────┘ └──────────────────┘ └──────────────────┘
All heavy computations are performed within the Code-Server environment, not in the LLM context. This includes:
- Data aggregation and grouping
- Statistical calculations
- Joins and merges
- Filtering and transformations
- Visualization generation
Strict token budget management ensures efficient context window usage:
| Component | Budget (tokens) | Priority |
|---|---|---|
| System Prompt | 500 | Required |
| Conversation History | 2000 | Required |
| Data Schema | 300 | High |
| Sample Data | 800 | Medium |
| Query Results | 1500 | High |
| Visualization | 400 | Low |
| Buffer | 1000 | - |
| Total Reserved | ~6500 | - |
Data is processed in streaming fashion:
- No full dataset loading into memory
- Chunked processing for large files
- Incremental result generation
- Memory-mapped file access where possible
Queries are optimized before execution:
- Predicate pushdown
- Column pruning
- Query planning and optimization
- Result caching
Purpose: Perform comprehensive dataset profiling without loading data into context.
Input Schema:
{
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "Absolute path to dataset file (CSV, Parquet, JSON, etc.)"
},
"sample_size": {
"type": "integer",
"default": 1000,
"minimum": 100,
"maximum": 10000,
"description": "Number of rows to sample for type inference"
},
"compute_stats": {
"type": "boolean",
"default": true,
"description": "Compute detailed statistics"
},
"max_categories": {
"type": "integer",
"default": 50,
"description": "Maximum unique categories to report for categorical columns"
}
},
"required": ["file_path"]
}Output Schema:
{
"type": "object",
"properties": {
"schema": {
"type": "object",
"properties": {
"columns": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"type": { "type": "string", "enum": ["int64", "float64", "string", "bool", "datetime", "category"] },
"null_pct": { "type": "number" },
"unique_count": { "type": "integer" },
"sample_values": { "type": "array" }
}
}
}
}
},
"statistics": {
"type": "object",
"properties": {
"row_count": { "type": "integer" },
"column_count": { "type": "integer" },
"memory_estimate": { "type": "string" },
"file_size": { "type": "string" },
"quality_score": { "type": "number" }
}
},
"recommendations": {
"type": "array",
"items": { "type": "string" }
},
"context_tokens_used": { "type": "integer" }
}
}Context Impact: ~200-400 tokens
Example:
{
"file_path": "/workspace/sales_2024.csv",
"sample_size": 5000,
"compute_stats": true
}Purpose: Execute analytical queries with configurable engine and return optimized results.
Input Schema:
{
"type": "object",
"properties": {
"engine": {
"type": "string",
"enum": ["polars", "duckdb", "pandas", "auto"],
"default": "auto",
"description": "Query execution engine"
},
"query_type": {
"type": "string",
"enum": ["sql", "polars_expr", "python"],
"description": "Type of query expression"
},
"query": {
"type": "string",
"description": "Query string or expression"
},
"files": {
"type": "array",
"items": { "type": "string" },
"description": "Input file paths"
},
"return_limit": {
"type": "integer",
"default": 100,
"minimum": 1,
"maximum": 1000,
"description": "Maximum rows to return"
},
"return_format": {
"type": "string",
"enum": ["json", "csv", "summary", "chart"],
"default": "json",
"description": "Result format"
},
"generate_chart": {
"type": "boolean",
"default": false,
"description": "Generate visualization"
},
"chart_type": {
"type": "string",
"enum": ["bar", "line", "scatter", "histogram", "heatmap"],
"description": "Chart type for visualization"
},
"cache_key": {
"type": "string",
"description": "Cache key for result caching"
}
},
"required": ["query_type", "query", "files"]
}Output Schema:
{
"type": "object",
"properties": {
"result_type": {
"type": "string",
"enum": ["tabular", "aggregation", "chart", "error"]
},
"data": {
"type": "array",
"description": "Query results (limited to return_limit)"
},
"summary": {
"type": "object",
"properties": {
"rows_processed": { "type": "integer" },
"execution_time_ms": { "type": "integer" },
"cache_hit": { "type": "boolean" },
"engine_used": { "type": "string" }
}
},
"columns": {
"type": "array",
"items": { "type": "string" }
},
"visualization": {
"type": "object",
"properties": {
"type": { "type": "string" },
"data_url": { "type": "string" },
"base64_image": { "type": "string" }
}
},
"context_tokens_used": { "type": "integer" }
}
}Context Impact:
- Aggregation: ~200-500 tokens
- Tabular (limited): ~500-2000 tokens
- Chart: ~100-300 tokens (base64)
Example:
{
"engine": "duckdb",
"query_type": "sql",
"query": "SELECT category, SUM(revenue) as total, COUNT(*) as orders FROM read_csv_auto('sales_*.csv') GROUP BY category ORDER BY total DESC",
"files": ["/workspace/sales_2024.csv", "/workspace/sales_2023.csv"],
"return_limit": 20,
"return_format": "json",
"generate_chart": true,
"chart_type": "bar"
}Purpose: Extract representative data samples for context window inspection.
Input Schema:
{
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "Path to dataset file"
},
"strategy": {
"type": "string",
"enum": ["head", "random", "stratified", "systematic"],
"default": "random",
"description": "Sampling strategy"
},
"sample_size": {
"type": "integer",
"default": 20,
"minimum": 1,
"maximum": 100,
"description": "Number of rows to sample"
},
"columns": {
"type": ["array", "null"],
"items": { "type": "string" },
"description": "Columns to include (null = all)"
},
"stratify_column": {
"type": "string",
"description": "Column for stratified sampling"
}
},
"required": ["file_path"]
}Output Schema:
{
"type": "object",
"properties": {
"sample": {
"type": "array",
"items": { "type": "object" }
},
"sampling_info": {
"type": "object",
"properties": {
"strategy": { "type": "string" },
"rows_sampled": { "type": "integer" },
"total_rows": { "type": "integer" },
"columns_included": { "type": "array" }
}
},
"context_tokens_used": { "type": "integer" }
}
}Context Impact: ~20-100 tokens per row
Example:
{
"file_path": "/workspace/customers.csv",
"strategy": "stratified",
"sample_size": 15,
"stratify_column": "segment"
}Purpose: Persist query results to workspace for subsequent access.
Input Schema:
{
"type": "object",
"properties": {
"query_id": {
"type": "string",
"description": "Query result identifier"
},
"format": {
"type": "string",
"enum": ["csv", "parquet", "json", "xlsx"],
"default": "parquet"
},
"destination": {
"type": "string",
"description": "Output directory path"
},
"filename": {
"type": "string",
"description": "Output filename (optional)"
},
"partition_by": {
"type": "string",
"description": "Column to partition output by"
},
"compression": {
"type": "string",
"enum": ["none", "gzip", "snappy", "zstd"],
"default": "snappy"
}
},
"required": ["query_id", "destination"]
}Output Schema:
{
"type": "object",
"properties": {
"exported": { "type": "boolean" },
"output_path": { "type": "string" },
"row_count": { "type": "integer" },
"file_size": { "type": "string" },
"files_created": {
"type": "array",
"items": { "type": "string" }
}
}
}Context Impact: ~50 tokens
Purpose: Manually invalidate cached results.
Input Schema:
{
"type": "object",
"properties": {
"pattern": {
"type": "string",
"description": "Cache key pattern to invalidate (supports wildcards)"
},
"file_path": {
"type": "string",
"description": "Invalidate all caches for specific file"
}
}
}Output Schema:
{
"type": "object",
"properties": {
"invalidated_count": { "type": "integer" },
"keys_removed": {
"type": "array",
"items": { "type": "string" }
}
}
}Strengths:
- Lazy evaluation with query optimization
- Streaming execution for large datasets
- 10-50x faster than pandas
- Memory-efficient columnar format
- Native expression API
Best For:
- Large CSV/Parquet files (>100MB)
- Complex transformations
- Aggregation queries
- Columnar operations
Example Usage:
# Polars expression query
{
"engine": "polars",
"query_type": "polars_expr",
"query": """
pl.scan_csv('data.csv')
.filter(pl.col('revenue') > 1000)
.group_by('category')
.agg([
pl.col('revenue').sum().alias('total_revenue'),
pl.col('orders').count().alias('order_count')
])
"""
}Strengths:
- Full SQL support
- CSV-native reading
- Efficient joins
- Window functions
- Familiar syntax
Best For:
- SQL-heavy workflows
- Cross-file analysis (glob patterns)
- Complex joins
- Window analytics
Example Usage:
-- DuckDB SQL query
{
"engine": "duckdb",
"query_type": "sql",
"query": """
SELECT
year,
SUM(revenue) as total_revenue,
AVG(quantity) as avg_quantity,
LAG(SUM(revenue)) OVER (ORDER BY year) as prev_year
FROM read_csv_auto('sales_*.csv')
GROUP BY year
"""
}Strengths:
- Maximum ecosystem compatibility
- Rich visualization libraries
- Machine learning integration
- Most comprehensive API
Best For:
- ML preprocessing
- Custom visualizations
- Complex data manipulation
- Libraries requiring pandas
Example Usage:
# Pandas Python query
{
"engine": "pandas",
"query_type": "python",
"query": """
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('data.csv')
scaler = StandardScaler()
df['normalized'] = scaler.fit_transform(df[['value']])
result = df.describe()
"""
}User: "Проанализируй sales.csv (500MB)"
│
▼
LobeHub: [Detects large file]
│
▼
MCP: profile_dataset(sales.csv)
│
▼
LobeHub: "Обнаружено 10M строк, 25 колонок.
Ключевые метрики: [агрегаты]"
│
▼
User: "Покажи топ-10 категорий по выручке"
│
▼
MCP: execute_query(aggregation, limit=10, chart=bar)
│
▼
LobeHub: "[Таблица + график]"
Context used: ~800 tokens vs 2M+ tokens
User: "Сравни метрики 2023 vs 2024"
│
▼
MCP: execute_query(
engine="duckdb",
query="SELECT year, SUM(revenue)... FROM read_csv_auto('sales_*.csv')..."
)
│
▼
LobeHub: "2023: $24.5M, 2024: $31.2M (+27%)"
Context used: ~300 tokens
Step 1: profile_dataset → Schema, stats (~250 tokens)
Step 2: stream_sample → Representative rows (~800 tokens)
Step 3: execute_query (filter) → Filtered subset (~400 tokens)
Step 4: execute_query (agg) → Aggregations (~500 tokens)
Step 5: export_result → Save to workspace (~50 tokens)
─────────────────────────
Total: ~2000 tokens vs 100k+ for full file
| Code | Category | Description |
|---|---|---|
| 4001 | Invalid Input | Malformed query or parameters |
| 4002 | File Not Found | Dataset file doesn't exist |
| 4003 | Schema Error | Schema mismatch or invalid column |
| 4004 | Query Error | Syntax error in query |
| 5001 | Engine Error | Execution engine failure |
| 5002 | Memory Error | Out of memory during execution |
| 5003 | Timeout | Query exceeded time limit |
| 5004 | Cache Error | Cache operation failed |
{
"error": {
"code": 4004,
"category": "Query Error",
"message": "Column 'revenu' not found. Did you mean 'revenue'?",
"suggestions": ["Check column names with profile_dataset", "Use schema information"],
"query_context": "..."
}
}{
"mcpServers": {
"code-server-analytics": {
"command": "python",
"args": ["-m", "code_server_mcp.analytics"],
"env": {
"POLARS_MAX_THREADS": "4",
"DUCKDB_MEMORY_LIMIT": "2GB",
"CACHE_MAX_SIZE": "1GB",
"CACHE_TTL_SECONDS": "3600",
"MAX_QUERY_TIME_MS": "30000",
"STREAMING_THRESHOLD_MB": "100"
}
}
}
}Auto Engine Selection Logic:
- File size < 10MB → Pandas (ecosystem compatibility)
- SQL-like query → DuckDB
- Complex transforms → Polars
- Multiple files → DuckDB
- ML operations → Pandas
| Metric | Threshold | Action |
|---|---|---|
| Query time | > 30s | Timeout, return partial |
| Memory usage | > 2GB | Switch to streaming |
| Result rows | > 10,000 | Truncate + warning |
| Context tokens | > 2000 | Compress + summary |
| File size | > 1GB | Force streaming mode |
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2026-03-17 | Initial analytics specification |
| 2.0.1 | Planned | Multi-engine query optimization |
| 2.1.0 | Planned | Real-time streaming support |
| 2.2.0 | Planned | Advanced caching strategies |
- ADR-001: Analytics Architecture
- ADR-002: Context Management Strategy
- ADR-003: Caching and Query Optimization
- LobeHub v2 Integration Guide