A high-performance analytics skill for LobeHub v2 that enables processing large-scale datasets with minimal context window usage. Built on the Model Context Protocol (MCP), this server brings the power of Polars, DuckDB, and Pandas to your AI conversations.
- Multi-Engine Support: Polars (primary), DuckDB (SQL), Pandas (fallback)
- 10-50x Faster: Polars engine outperforms pandas on large datasets
- Streaming Processing: Handle files >10GB without loading into memory
- Lazy Evaluation: Query optimization before execution
- Token Budgeting: Automatic optimization for LobeHub's context window limits
- Adaptive Compression: Four levels of compression based on context pressure
- Intelligent Sampling: Get representative data samples without overwhelming the context
- ~90% Token Reduction: Process 50MB+ files using <2000 tokens in context
- Dataset Profiling: Understand schema, statistics, and data quality
- SQL Queries: Full DuckDB SQL support with cross-file analysis
- Data Sampling: Head, random, and stratified sampling strategies
- Result Export: CSV, Parquet, JSON, Excel formats
- Automatic Engine Selection: Chooses optimal engine based on file size and query type
- Direct File Download: ⭐ NEW — Download files from S3/URLs in seconds (500MB = ~50 tokens)
- Docker Isolation: Each session in separate container
- No Network Access: Air-gapped execution
- Resource Limits: 512MB RAM, 1 CPU per container
- Non-root Execution: Unprivileged sandbox user
- Auto-cleanup: Sessions expire after 30 minutes
LobeHub v2 has fundamental limitations with large data:
- Context window: ~40k-200k tokens
- Performance degradation at >40k tokens
- Large CSV files (>500KB) overwhelm the context
User: "Analyze this 500MB sales.csv file"
Traditional approach:
❌ Load entire file into context (millions of tokens)
❌ Context overflow, slow responses, high costs
Our approach:
✅ Profile file in Docker (200 tokens for metadata)
✅ Execute query in isolated environment
✅ Return only aggregated results (500 tokens)
✅ Total: ~700 tokens vs millions
"Analyze sales_2024.csv (2GB file)"
→ Profile: Row count, columns, data types
→ Query: Top 10 products by revenue
→ Compare: 2023 vs 2024 sales
→ Export: Results to Excel
"Process experiment_data.csv with 10M rows"
→ Sample: Get representative subset
→ Filter: Specific conditions
→ Aggregate: Statistical summaries
→ Visualize: Charts (if needed)
"Join customer.csv with orders.csv"
→ Cross-file analysis with DuckDB
→ Calculate: Customer lifetime value
→ Segment: By region and category
- Docker 20.10+ installed and running
- Node.js 18+ and npm
- RAM: 4GB+ recommended for large file processing
- Disk: 10GB+ for Docker images and caches
# Clone repository
git clone https://github.com/BDuba/code-server-mcp.git
cd code-server-mcp
# Install dependencies
npm install
# Build TypeScript
npm run build
# Build Docker image (optional - server auto-builds on first use)
npm run docker:build
# Start server
npm run start:http- Start the server:
npm run start:httpServer runs on http://172.17.0.1:8004
-
Go to LobeHub Settings → MCP Servers → Add Server
-
Add configuration:
{
"mcpServers": {
"analytics": {
"url": "http://172.17.0.1:8004",
"type": "http"
}
}
}- Click Test to verify connection
Analyze dataset structure and statistics without loading data into context.
Example:
{
"name": "analytics_profile_dataset",
"arguments": {
"filePath": "/workspace/sales.csv",
"sampleSize": 5000
}
}Response:
{
"schema": {
"columns": [
{"name": "revenue", "type": "float64", "nullPct": 0.02},
{"name": "category", "type": "string", "uniqueCount": 15}
]
},
"statistics": {
"rowCount": 2500000,
"memoryEstimate": "120MB"
},
"recommendations": [
"Consider category dtype for 'status' column"
]
}Execute analytical queries with automatic engine selection.
Polars Example:
{
"name": "analytics_execute_query",
"arguments": {
"engine": "polars",
"queryType": "polars_expr",
"query": "pl.scan_csv('sales.csv').group_by('category').agg(pl.col('revenue').sum())",
"files": ["sales.csv"],
"returnLimit": 50
}
}DuckDB SQL Example:
{
"name": "analytics_execute_query",
"arguments": {
"engine": "duckdb",
"queryType": "sql",
"query": "SELECT category, SUM(revenue) FROM read_csv_auto('sales_*.csv') GROUP BY category",
"files": ["sales_2023.csv", "sales_2024.csv"],
"returnLimit": 100
}
}Response:
{
"resultType": "tabular",
"data": [
{"category": "Electronics", "revenue": 1500000},
{"category": "Clothing", "revenue": 890000}
],
"summary": {
"rowsProcessed": 2500000,
"executionTimeMs": 450,
"engineUsed": "duckdb"
}
}Get representative data samples for context inspection.
Example:
{
"name": "analytics_stream_sample",
"arguments": {
"filePath": "/workspace/customers.csv",
"strategy": "random",
"sampleSize": 20,
"columns": ["name", "segment", "revenue"]
}
}Create isolated sandbox session.
Execute Python/JavaScript/TypeScript code.
File operations in session workspace.
Clean up session resources.
Download files directly from URLs (S3, HTTP, HTTPS) into the session workspace.
Example:
{
"name": "download_file_from_url",
"arguments": {
"sessionId": "your-session-id",
"url": "https://lobechat.hb.ru-msk.vkcloud-storage.ru/files/.../data.csv?X-Amz-...",
"filename": "data.csv",
"headers": {
"User-Agent": "MCP-Client/1.0"
}
}
}Response:
File downloaded successfully: data.csv (5242880 bytes)
Why use this?
- Download 500MB+ files in seconds (one API call)
- Supports S3 pre-signed URLs
- No token overhead vs write_file line-by-line
- Automatic HTTP header support
# Step 1: Profile the dataset
profile_dataset("sales.csv")
# → Returns: 2.5M rows, 25 columns, ~120MB
# Step 2: Execute targeted query
execute_query(
engine="duckdb",
query="SELECT category, SUM(revenue) FROM sales.csv GROUP BY category"
)
# → Returns: Aggregated results (15 rows)
# Step 3: Export if needed
export_result(query_id="q1", format="parquet")# Step 1: Get sample
stream_sample("data.csv", strategy="random", sample_size=20)
# → Context impact: ~800 tokens
# Step 2: Profile specific columns
profile_dataset("data.csv")
# → Context impact: ~200 tokens
# Step 3: Drill-down query
execute_query(query="SELECT * WHERE revenue > 1000")
# → Context impact: ~500 tokens
# Total: ~1500 tokens vs 2M+ for full file# Analyze multiple files with glob patterns
execute_query(
engine="duckdb",
query="""
SELECT
year,
SUM(revenue) as total_revenue,
AVG(quantity) as avg_quantity
FROM read_csv_auto('sales_*.csv')
GROUP BY year
""",
files=["sales_2023.csv", "sales_2024.csv"]
)# Step 1: Create session
session = create_session()
# Step 2: Download file from S3/URL (one call, no token overhead!)
download_file_from_url(
sessionId=session.id,
url="https://lobechat.../sales_2024.csv?X-Amz-...",
filename="sales.csv"
)
# → Downloads 500MB file in seconds (~50 tokens in context)
# → vs millions of tokens with write_file line-by-line
# Step 3: Profile and analyze immediately
profile_dataset("sales.csv")
execute_query(query="SELECT * FROM sales.csv LIMIT 10")Benefits:
- ⚡ Fast: Download 500MB in seconds, not minutes
- 💰 Cheap: ~50 tokens vs millions for write_file
- 🔒 Secure: File downloaded to host, then mounted to isolated container
- 🌐 Universal: Works with S3, HTTP, HTTPS, any URL
┌─────────────────────────────────────────────────────────────────┐
│ LobeHub v2 │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Context Window (~50k tokens) │ │
│ │ • Schema metadata (~300 tokens) │ │
│ │ • Query results (~1500 tokens) │ │
│ │ • Sample data (~800 tokens) │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│ MCP Protocol
▼
┌─────────────────────────────────────────────────────────────────┐
│ MCP Analytics Server │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Engine Selector│ │ Context Manager │ │ Analytics │ │
│ │ • Auto-select │ │ • Token budgets │ │ Service │ │
│ │ • Fallback │ │ • Compression │ │ │ │
│ └────────┬────────┘ └─────────────────┘ └────────┬────────┘ │
└───────────┼──────────────────────────────────────────┼───────────┘
│ │
┌───────┴───────┐ ┌────────┴────────┐
▼ ▼ ▼ ▼
┌────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐
│ Polars │ │ DuckDB │ │ Pandas │ │ Viz │
│(Primary)│ │(Secondary)│ │ (Fallback) │ │ (3-tier) │
└────────┘ └────────────┘ └────────────┘ └──────────┘
| File Size | Query Type | Selected Engine | Reason |
|---|---|---|---|
| < 10MB | Python | Pandas | Ecosystem compatibility |
| > 100MB | Any | Polars | Memory efficiency |
| SQL | SQL | DuckDB | Native SQL support |
| Multiple files | Any | DuckDB | Cross-file joins |
| ML operations | Python | Pandas | sklearn integration |
| Dataset Size | Pandas | Polars | DuckDB | Improvement |
|---|---|---|---|---|
| 100K rows | 1.2s | 0.15s | 0.25s | 8x faster |
| 1M rows | 12s | 0.8s | 1.5s | 15x faster |
| 10M rows | OOM | 8s | 15s | Streaming |
| Operation | Traditional | Our Approach | Savings |
|---|---|---|---|
| Profile 10M rows | 5M tokens | 250 tokens | 99.9% |
| Aggregate query | 2M tokens | 500 tokens | 99.9% |
| Sample 100 rows | 50K tokens | 800 tokens | 98% |
# Run all tests
npm test
# Unit tests only
npm run test:unit
# E2E tests (requires Docker)
npm run test:e2e
# Specific test file
npm run test:e2e -- tests/e2e/Analytics.test.tsCurrent Status:
- ✅ 60/60 E2E tests passing
- ✅ 14/14 unit tests passing
- ✅ TypeScript compilation: No errors
Cause: Docker image doesn't have Polars/DuckDB installed
Solution: Rebuild Docker image: npm run docker:build
Cause: Query too complex or file too large
Solution: Use sampling or increase limit in engine config
Cause: File too large for Pandas
Solution: Polars will be auto-selected for large files
# Check server is running
curl http://172.17.0.1:8004/health
# Verify Docker image
docker images | grep mcp-code-executionCause: Docker image was removed by system cleanup scripts
Solution: The server now auto-builds the image on first use. To manually rebuild:
npm run docker:buildNote: The Docker image is now protected with label mcp.keep=true to prevent automatic cleanup by maintenance scripts.
- MCP Specification - Detailed API reference
- ADR-001 - Analytics architecture
- ADR-002 - Context management strategy
- ADR-003 - Caching strategy
- ADR-004 - Visualization approach
- Test Cases - Comprehensive test documentation
- ✅ Multi-engine support (Polars/DuckDB/Pandas)
- ✅ Basic analytics tools
- ✅ Context management
- ✅ Auto engine selection
- ✅ Docker image auto-build and protection
- 🔄 Multi-tier caching (L1/L2)
- 🔄 Visualization manager (3-tier strategy)
- 🔄 Query optimization
- 🔄 Smart prefetching
- 📋 Distributed processing
- 📋 Real-time streaming
- 📋 Advanced security features
- Fork the repository
- Create feature branch:
git checkout -b feature/amazing-feature - Commit changes:
git commit -m 'Add amazing feature' - Push to branch:
git push origin feature/amazing-feature - Open Pull Request
MIT License - see LICENSE file for details.
- Polars - Lightning-fast DataFrame library
- DuckDB - In-process analytical database
- Model Context Protocol - MCP specification
- LobeHub - AI chat platform
- Docker - Containerization platform
Made with ❤️ for the LobeHub community
Status: Production Ready ✅ | Version: 2.0.0 | Tests: 60/60 Passing