Note: Tested with Crawl4AI version 0.7.4
TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.
- Prerequisites
- Quick Start
- Configuration
- Client-Specific Instructions
- Available Tools
- Advanced Configuration
- Development
- License
- Node.js 18+ and npm
- A running Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).

Using npx:
{
"mcpServers": {
"crawl4ai": {
"command": "npx",
"args": ["mcp-crawl4ai-ts"],
"env": {
"CRAWL4AI_BASE_URL": "http://localhost:11235"
}
}
}
}

Using local installation:

{
"mcpServers": {
"crawl4ai": {
"command": "node",
"args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
"env": {
"CRAWL4AI_BASE_URL": "http://localhost:11235"
}
}
}
}

With all optional environment variables:

{
"mcpServers": {
"crawl4ai": {
"command": "npx",
"args": ["mcp-crawl4ai-ts"],
"env": {
"CRAWL4AI_BASE_URL": "http://localhost:11235",
"CRAWL4AI_API_KEY": "your-api-key",
"SERVER_NAME": "custom-name",
"SERVER_VERSION": "1.0.0"
}
}
}
}

Environment variables:

# Required
CRAWL4AI_BASE_URL=http://localhost:11235
# Optional - Server Configuration
CRAWL4AI_API_KEY= # If your server requires auth
SERVER_NAME=crawl4ai-mcp # Custom name for the MCP server
SERVER_VERSION=1.0.0 # Custom version

Claude Desktop: add one of the JSON configurations above to ~/Library/Application Support/Claude/claude_desktop_config.json
Claude Code:

claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts

For other MCP clients, consult your client's documentation for MCP server configuration. The key details:
- Command: `npx mcp-crawl4ai-ts` or `node /path/to/dist/index.js`
- Required env: `CRAWL4AI_BASE_URL`
- Optional env: `CRAWL4AI_API_KEY`, `SERVER_NAME`, `SERVER_VERSION`
{
url: string, // Required: URL to extract markdown from
filter?: 'raw'|'fit'|'bm25'|'llm', // Filter type (default: 'fit')
query?: string, // Query for bm25/llm filters
cache?: string // Cache-bust parameter (default: '0')
}

Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
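For example, to pull only the parts of a page that match a query, combine the 'bm25' filter with a query (the URL and query below are illustrative):

// BM25-filtered markdown extraction (illustrative values)
{ url: 'https://example.com/docs', filter: 'bm25', query: 'installation requirements' }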
{
url: string, // Required: URL to capture
screenshot_wait_for?: number // Seconds to wait before screenshot (default: 2)
}

Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use crawl with screenshot: true.
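A sketch of that stateful alternative via the crawl tool (URL and script are illustrative):

// Screenshot after JS execution, using the crawl tool (illustrative values)
{ url: 'https://example.com', js_code: "document.querySelector('#load-more')?.click()", screenshot: true }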
{
url: string // Required: URL to convert to PDF
}

Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use crawl with pdf: true.
{
url: string, // Required: URL to load
scripts: string | string[] // Required: JavaScript to execute
}

Executes JavaScript and returns results. Each script can use 'return' to get values back. Stateless - for persistent JS execution use crawl with js_code.
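For example, each script can return a value that comes back in the results (URL and scripts are illustrative):

// Two scripts, each returning a value (illustrative values)
{
  url: 'https://example.com',
  scripts: [
    'return document.title',
    'return document.querySelectorAll("a").length'
  ]
}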
{
urls: string[], // Required: List of URLs to crawl
max_concurrent?: number, // Parallel request limit (default: 5)
remove_images?: boolean, // Remove images from output (default: false)
bypass_cache?: boolean, // Bypass cache for all URLs (default: false)
configs?: Array<{ // Optional: Per-URL configurations (v3.0.0+)
url: string,
[key: string]: any // Any crawl parameters for this specific URL
}>
}

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With the configs array, you can specify different parameters for each URL.
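For example, to apply different settings per URL (URLs and options are illustrative):

// Per-URL configuration overrides (illustrative values)
{
  urls: ['https://example.com/a', 'https://example.com/b'],
  max_concurrent: 2,
  configs: [
    { url: 'https://example.com/a', screenshot: true },
    { url: 'https://example.com/b', css_selector: 'article' }
  ]
}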
{
url: string, // Required: URL to crawl
max_depth?: number, // Maximum depth for recursive crawling (default: 2)
follow_links?: boolean, // Follow links in content (default: true)
bypass_cache?: boolean // Bypass cache (default: false)
}

Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
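For example, pointing it at a sitemap lets it pick the right processing path (URL is illustrative):

// Smart crawl of a sitemap URL (illustrative values)
{ url: 'https://example.com/sitemap.xml', max_depth: 1, bypass_cache: true }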
{
url: string // Required: URL to extract HTML from
}

Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.
{
url: string, // Required: URL to extract links from
categorize?: boolean // Group by type (default: true)
}

Extracts all links and groups them by type: internal, external, social media, documents, images.
{
url: string, // Required: Starting URL
max_depth?: number, // Maximum depth to crawl (default: 3)
max_pages?: number, // Maximum pages to crawl (default: 50)
include_pattern?: string, // Regex pattern for URLs to include
exclude_pattern?: string // Regex pattern for URLs to exclude
}

Crawls a website following internal links up to the specified depth. Returns content from all discovered pages.
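For example, to stay within a documentation subtree (URL and patterns are illustrative):

// Recursive crawl restricted to /docs/ pages (illustrative values)
{
  url: 'https://example.com/docs/',
  max_depth: 2,
  max_pages: 20,
  include_pattern: '.*/docs/.*'
}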
{
url: string, // Required: Sitemap URL (e.g., /sitemap.xml)
filter_pattern?: string // Optional: Regex pattern to filter URLs
}

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
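For example, to keep only blog URLs (URL and pattern are illustrative):

// Sitemap parsing with a regex filter (illustrative values)
{ url: 'https://example.com/sitemap.xml', filter_pattern: '.*/blog/.*' }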
{
url: string, // URL to crawl
// Browser Configuration
browser_type?: 'chromium'|'firefox'|'webkit'|'undetected', // Browser engine (undetected = stealth mode)
viewport_width?: number, // Browser width (default: 1080)
viewport_height?: number, // Browser height (default: 600)
user_agent?: string, // Custom user agent
proxy_server?: string | { // Proxy URL (string or object format)
server: string,
username?: string,
password?: string
},
proxy_username?: string, // Proxy auth (if using string format)
proxy_password?: string, // Proxy password (if using string format)
cookies?: Array<{name, value, domain}>, // Pre-set cookies
headers?: Record<string,string>, // Custom headers
// Crawler Configuration
word_count_threshold?: number, // Min words per block (default: 200)
excluded_tags?: string[], // HTML tags to exclude
remove_overlay_elements?: boolean, // Remove popups/modals
js_code?: string | string[], // JavaScript to execute
wait_for?: string, // Wait condition (selector or JS)
wait_for_timeout?: number, // Wait timeout (default: 30000)
delay_before_scroll?: number, // Pre-scroll delay
scroll_delay?: number, // Between-scroll delay
process_iframes?: boolean, // Include iframe content
exclude_external_links?: boolean, // Remove external links
screenshot?: boolean, // Capture screenshot
pdf?: boolean, // Generate PDF
session_id?: string, // Reuse browser session (only works with crawl tool)
cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED', // Cache control
// New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
css_selector?: string, // CSS selector to filter content
delay_before_return_html?: number, // Delay in seconds before returning HTML
include_links?: boolean, // Include extracted links in response
resolve_absolute_urls?: boolean, // Convert relative URLs to absolute
// LLM Extraction (REST API only supports 'llm' type)
extraction_type?: 'llm', // Only 'llm' extraction is supported via REST API
extraction_schema?: object, // Schema for structured extraction
extraction_instruction?: string, // Natural language extraction prompt
extraction_strategy?: { // Advanced extraction configuration
provider?: string,
api_key?: string,
model?: string,
[key: string]: any
},
table_extraction_strategy?: { // Table extraction configuration
enable_chunking?: boolean,
thresholds?: object,
[key: string]: any
},
markdown_generator_options?: { // Markdown generation options
include_links?: boolean,
preserve_formatting?: boolean,
[key: string]: any
},
timeout?: number, // Overall timeout (default: 60000)
verbose?: boolean // Detailed logging
}
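Crawls a URL with full control over browser and crawler behavior; per the schema above, it is also the only tool that supports persistent sessions via session_id. A sketch combining a session, JavaScript execution, and a screenshot (all values are illustrative; 'my-session' assumes a session created with the session management tool below):

// Session-based crawl with JS interaction and screenshot (illustrative values)
{
  url: 'https://example.com/dashboard',
  session_id: 'my-session',
  js_code: "document.querySelector('#accept-cookies')?.click()",
  wait_for: '#main-content',
  screenshot: true,
  cache_mode: 'BYPASS'
}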
{
action: 'create' | 'clear' | 'list', // Required: Action to perform
session_id?: string, // For 'create' and 'clear' actions
initial_url?: string, // For 'create' action: URL to load
browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected' // For 'create' action
}

Unified tool for managing browser sessions. Supports three actions:
- create: Start a persistent browser session
- clear: Remove a session from local tracking
- list: Show all active sessions
Examples:
// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }
// Clear a session
{ action: 'clear', session_id: 'my-session' }
// List all sessions
{ action: 'list' }

{
url: string, // URL to extract data from
query: string // Natural language extraction instructions
}Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.
For detailed information about all available configuration options, extraction strategies, and advanced features, please refer to the official Crawl4AI documentation:
See CHANGELOG.md for detailed version history and recent updates.
# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# 2. Install MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env
# 3. Development commands
npm run dev # Development mode
npm test # Run tests
npm run lint # Check code quality
npm run build # Production build
# 4. Add to your MCP client (See "Using local installation")

Integration tests require a running Crawl4AI server. Configure your environment:
# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key # If authentication is required
# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1 # If using custom endpoint
# Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
npm run test:integration
# Run a single integration test file
npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts
> IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules`, which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

Integration tests cover:
- Dynamic content and JavaScript execution
- Session management and cookies
- Content extraction (LLM-based only)
- Media handling (screenshots, PDFs)
- Performance and caching
- Content filtering
- Bot detection avoidance
- Error handling
- Docker container healthy:
  docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
  curl -sf http://localhost:11235/health || echo "Health check failed"
- Env vars loaded (either exported or in `.env`): `CRAWL4AI_BASE_URL` (required); optional: `CRAWL4AI_API_KEY`, `LLM_PROVIDER`, `LLM_API_TOKEN`, `LLM_BASE_URL`.
- Use `npm run test:integration` (never raw `jest`).
- To target one file, add it after `--` (see example above).
- Expect total runtime of ~2–3 minutes; a longer run or an immediate hang usually means a missing `NODE_OPTIONS` or the wrong Jest version.
| Symptom | Likely Cause | Fix |
|---|---|---|
| `SyntaxError: Cannot use import statement outside a module` | Ran `jest` directly without script flags | Re-run with `npm run test:integration` |
| Hangs on first test (`RUNS ...`) | Missing experimental VM modules flag | Use the npm script / ensure `NODE_OPTIONS=--experimental-vm-modules` |
| Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart the container: `docker restart <name>` |
| LLM tests skipped | Missing `LLM_PROVIDER` or `LLM_API_TOKEN` | Export the required LLM vars |
| New Jest major upgrade breaks tests | Version mismatch with `ts-jest` | Keep Jest 29.x unless `ts-jest` is upgraded accordingly |
Current stack: `jest@29.x` + `ts-jest@29.x` + ESM (`"type": "module"`). Updating Jest to 30+ requires upgrading `ts-jest` and revisiting `jest.config.cjs`. Keep versions aligned to avoid parse errors.
MIT