Skip to content

BDuba/code-server-mcp

Repository files navigation

MCP Analytics Server for LobeHub v2

A high-performance analytics skill for LobeHub v2 that enables processing large-scale datasets with minimal context window usage. Built on the Model Context Protocol (MCP), this server brings the power of Polars, DuckDB, and Pandas to your AI conversations.

✨ Key Features

🚀 High-Performance Analytics

  • Multi-Engine Support: Polars (primary), DuckDB (SQL), Pandas (fallback)
  • 10-50x Faster: Polars engine outperforms pandas on large datasets
  • Streaming Processing: Handle files >10GB without loading into memory
  • Lazy Evaluation: Query optimization before execution

🧠 Smart Context Management

  • Token Budgeting: Automatic optimization for LobeHub's context window limits
  • Adaptive Compression: Four levels of compression based on context pressure
  • Intelligent Sampling: Get representative data samples without overwhelming the context
  • ~90% Token Reduction: Process 50MB+ files using <2000 tokens in context

📊 Advanced Analytics Tools

  • Dataset Profiling: Understand schema, statistics, and data quality
  • SQL Queries: Full DuckDB SQL support with cross-file analysis
  • Data Sampling: Head, random, and stratified sampling strategies
  • Result Export: CSV, Parquet, JSON, Excel formats
  • Automatic Engine Selection: Chooses optimal engine based on file size and query type
  • Direct File Download: ⭐ NEW — Download files from S3/URLs in seconds (500MB = ~50 tokens)

🔒 Enterprise Security

  • Docker Isolation: Each session in separate container
  • No Network Access: Air-gapped execution
  • Resource Limits: 512MB RAM, 1 CPU per container
  • Non-root Execution: Unprivileged sandbox user
  • Auto-cleanup: Sessions expire after 30 minutes

🎯 Why Use This in LobeHub?

The Problem

LobeHub v2 has fundamental limitations with large data:

  • Context window: ~40k-200k tokens
  • Performance degradation at >40k tokens
  • Large CSV files (>500KB) overwhelm the context

Our Solution

User: "Analyze this 500MB sales.csv file"

Traditional approach:
❌ Load entire file into context (millions of tokens)
❌ Context overflow, slow responses, high costs

Our approach:
✅ Profile file in Docker (200 tokens for metadata)
✅ Execute query in isolated environment
✅ Return only aggregated results (500 tokens)
✅ Total: ~700 tokens vs millions

Real-World Use Cases

📈 Sales Analysis

"Analyze sales_2024.csv (2GB file)"
→ Profile: Row count, columns, data types
→ Query: Top 10 products by revenue
→ Compare: 2023 vs 2024 sales
→ Export: Results to Excel

🔬 Scientific Computing

"Process experiment_data.csv with 10M rows"
→ Sample: Get representative subset
→ Filter: Specific conditions
→ Aggregate: Statistical summaries
→ Visualize: Charts (if needed)

📊 Business Intelligence

"Join customer.csv with orders.csv"
→ Cross-file analysis with DuckDB
→ Calculate: Customer lifetime value
→ Segment: By region and category

📋 Prerequisites

  • Docker 20.10+ installed and running
  • Node.js 18+ and npm
  • RAM: 4GB+ recommended for large file processing
  • Disk: 10GB+ for Docker images and caches

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/BDuba/code-server-mcp.git
cd code-server-mcp

# Install dependencies
npm install

# Build TypeScript
npm run build

# Build Docker image (optional - server auto-builds on first use)
npm run docker:build

# Start server
npm run start:http

LobeHub Configuration

  1. Start the server:
npm run start:http

Server runs on http://172.17.0.1:8004

  1. Go to LobeHub SettingsMCP ServersAdd Server

  2. Add configuration:

{
  "mcpServers": {
    "analytics": {
      "url": "http://172.17.0.1:8004",
      "type": "http"
    }
  }
}
  1. Click Test to verify connection

🔧 Available Tools

Analytics Tools

analytics_profile_dataset

Analyze dataset structure and statistics without loading data into context.

Example:

{
  "name": "analytics_profile_dataset",
  "arguments": {
    "filePath": "/workspace/sales.csv",
    "sampleSize": 5000
  }
}

Response:

{
  "schema": {
    "columns": [
      {"name": "revenue", "type": "float64", "nullPct": 0.02},
      {"name": "category", "type": "string", "uniqueCount": 15}
    ]
  },
  "statistics": {
    "rowCount": 2500000,
    "memoryEstimate": "120MB"
  },
  "recommendations": [
    "Consider category dtype for 'status' column"
  ]
}

analytics_execute_query

Execute analytical queries with automatic engine selection.

Polars Example:

{
  "name": "analytics_execute_query",
  "arguments": {
    "engine": "polars",
    "queryType": "polars_expr",
    "query": "pl.scan_csv('sales.csv').group_by('category').agg(pl.col('revenue').sum())",
    "files": ["sales.csv"],
    "returnLimit": 50
  }
}

DuckDB SQL Example:

{
  "name": "analytics_execute_query",
  "arguments": {
    "engine": "duckdb",
    "queryType": "sql",
    "query": "SELECT category, SUM(revenue) FROM read_csv_auto('sales_*.csv') GROUP BY category",
    "files": ["sales_2023.csv", "sales_2024.csv"],
    "returnLimit": 100
  }
}

Response:

{
  "resultType": "tabular",
  "data": [
    {"category": "Electronics", "revenue": 1500000},
    {"category": "Clothing", "revenue": 890000}
  ],
  "summary": {
    "rowsProcessed": 2500000,
    "executionTimeMs": 450,
    "engineUsed": "duckdb"
  }
}

analytics_stream_sample

Get representative data samples for context inspection.

Example:

{
  "name": "analytics_stream_sample",
  "arguments": {
    "filePath": "/workspace/customers.csv",
    "strategy": "random",
    "sampleSize": 20,
    "columns": ["name", "segment", "revenue"]
  }
}

Core Tools

create_session

Create isolated sandbox session.

execute_code

Execute Python/JavaScript/TypeScript code.

write_file / read_file / list_files

File operations in session workspace.

destroy_session

Clean up session resources.

download_file_from_url ⭐ NEW

Download files directly from URLs (S3, HTTP, HTTPS) into the session workspace.

Example:

{
  "name": "download_file_from_url",
  "arguments": {
    "sessionId": "your-session-id",
    "url": "https://lobechat.hb.ru-msk.vkcloud-storage.ru/files/.../data.csv?X-Amz-...",
    "filename": "data.csv",
    "headers": {
      "User-Agent": "MCP-Client/1.0"
    }
  }
}

Response:

File downloaded successfully: data.csv (5242880 bytes)

Why use this?

  • Download 500MB+ files in seconds (one API call)
  • Supports S3 pre-signed URLs
  • No token overhead vs write_file line-by-line
  • Automatic HTTP header support

📊 Working with Large Data

Pattern 1: Profile → Query → Export

# Step 1: Profile the dataset
profile_dataset("sales.csv")
# → Returns: 2.5M rows, 25 columns, ~120MB

# Step 2: Execute targeted query
execute_query(
    engine="duckdb",
    query="SELECT category, SUM(revenue) FROM sales.csv GROUP BY category"
)
# → Returns: Aggregated results (15 rows)

# Step 3: Export if needed
export_result(query_id="q1", format="parquet")

Pattern 2: Iterative Exploration

# Step 1: Get sample
stream_sample("data.csv", strategy="random", sample_size=20)
# → Context impact: ~800 tokens

# Step 2: Profile specific columns
profile_dataset("data.csv")
# → Context impact: ~200 tokens

# Step 3: Drill-down query
execute_query(query="SELECT * WHERE revenue > 1000")
# → Context impact: ~500 tokens

# Total: ~1500 tokens vs 2M+ for full file

Pattern 3: Cross-File Analysis

# Analyze multiple files with glob patterns
execute_query(
    engine="duckdb",
    query="""
        SELECT 
            year,
            SUM(revenue) as total_revenue,
            AVG(quantity) as avg_quantity
        FROM read_csv_auto('sales_*.csv')
        GROUP BY year
    """,
    files=["sales_2023.csv", "sales_2024.csv"]
)

Pattern 4: Download Large Files from URL ⭐ NEW

# Step 1: Create session
session = create_session()

# Step 2: Download file from S3/URL (one call, no token overhead!)
download_file_from_url(
    sessionId=session.id,
    url="https://lobechat.../sales_2024.csv?X-Amz-...",
    filename="sales.csv"
)
# → Downloads 500MB file in seconds (~50 tokens in context)
# → vs millions of tokens with write_file line-by-line

# Step 3: Profile and analyze immediately
profile_dataset("sales.csv")
execute_query(query="SELECT * FROM sales.csv LIMIT 10")

Benefits:

  • Fast: Download 500MB in seconds, not minutes
  • 💰 Cheap: ~50 tokens vs millions for write_file
  • 🔒 Secure: File downloaded to host, then mounted to isolated container
  • 🌐 Universal: Works with S3, HTTP, HTTPS, any URL

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        LobeHub v2                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Context Window (~50k tokens)                            │  │
│  │  • Schema metadata (~300 tokens)                         │  │
│  │  • Query results (~1500 tokens)                          │  │
│  │  • Sample data (~800 tokens)                             │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │ MCP Protocol
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                   MCP Analytics Server                           │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │  Engine Selector│  │ Context Manager │  │  Analytics      │  │
│  │  • Auto-select  │  │ • Token budgets │  │  Service        │  │
│  │  • Fallback     │  │ • Compression   │  │                 │  │
│  └────────┬────────┘  └─────────────────┘  └────────┬────────┘  │
└───────────┼──────────────────────────────────────────┼───────────┘
            │                                          │
    ┌───────┴───────┐                         ┌────────┴────────┐
    ▼               ▼                         ▼                 ▼
┌────────┐  ┌────────────┐            ┌────────────┐  ┌──────────┐
│ Polars │  │   DuckDB   │            │  Pandas    │  │  Viz     │
│(Primary)│  │(Secondary)│            │ (Fallback) │  │ (3-tier) │
└────────┘  └────────────┘            └────────────┘  └──────────┘

Engine Selection Logic

File Size Query Type Selected Engine Reason
< 10MB Python Pandas Ecosystem compatibility
> 100MB Any Polars Memory efficiency
SQL SQL DuckDB Native SQL support
Multiple files Any DuckDB Cross-file joins
ML operations Python Pandas sklearn integration

📈 Performance Benchmarks

Query Execution Speed

Dataset Size Pandas Polars DuckDB Improvement
100K rows 1.2s 0.15s 0.25s 8x faster
1M rows 12s 0.8s 1.5s 15x faster
10M rows OOM 8s 15s Streaming

Context Token Usage

Operation Traditional Our Approach Savings
Profile 10M rows 5M tokens 250 tokens 99.9%
Aggregate query 2M tokens 500 tokens 99.9%
Sample 100 rows 50K tokens 800 tokens 98%

🧪 Testing

# Run all tests
npm test

# Unit tests only
npm run test:unit

# E2E tests (requires Docker)
npm run test:e2e

# Specific test file
npm run test:e2e -- tests/e2e/Analytics.test.ts

Current Status:

  • ✅ 60/60 E2E tests passing
  • ✅ 14/14 unit tests passing
  • ✅ TypeScript compilation: No errors

🔍 Troubleshooting

"Engine not available"

Cause: Docker image doesn't have Polars/DuckDB installed
Solution: Rebuild Docker image: npm run docker:build

"Query timeout"

Cause: Query too complex or file too large
Solution: Use sampling or increase limit in engine config

"Out of memory"

Cause: File too large for Pandas
Solution: Polars will be auto-selected for large files

Connection issues in LobeHub

# Check server is running
curl http://172.17.0.1:8004/health

# Verify Docker image
docker images | grep mcp-code-execution

"MCP error -32603: no such container - No such image"

Cause: Docker image was removed by system cleanup scripts
Solution: The server now auto-builds the image on first use. To manually rebuild:

npm run docker:build

Note: The Docker image is now protected with label mcp.keep=true to prevent automatic cleanup by maintenance scripts.

📚 Documentation

🛣️ Roadmap

Phase 1: MVP ✅ (Current)

  • ✅ Multi-engine support (Polars/DuckDB/Pandas)
  • ✅ Basic analytics tools
  • ✅ Context management
  • ✅ Auto engine selection
  • ✅ Docker image auto-build and protection

Phase 2: Advanced (Planned)

  • 🔄 Multi-tier caching (L1/L2)
  • 🔄 Visualization manager (3-tier strategy)
  • 🔄 Query optimization
  • 🔄 Smart prefetching

Phase 3: Enterprise (Planned)

  • 📋 Distributed processing
  • 📋 Real-time streaming
  • 📋 Advanced security features

🤝 Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open Pull Request

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments


Made with ❤️ for the LobeHub community

Status: Production Ready ✅ | Version: 2.0.0 | Tests: 60/60 Passing

About

Token‑efficient MCP server that gives AI agents a secure, containerized VS Code/code‑server workspace for coding, debugging, and automating development tasks via HTTP and CLI

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages