Cross-platform AI agent system for autonomous desktop and web automation.
Pilot enables AI to control your computer like a human would. Describe what you want in natural language, and the system figures out how to do it — across browsers, desktop apps, and the terminal. Uses multi-agent orchestration with specialist agents for web automation, GUI control, system commands, and code generation.
Get running in 3 commands:
# Clone and install
git clone https://github.com/lahfir/pilot.git && cd pilot
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --extra macos # or windows/linux
# Configure (create .env file)
echo "LLM_PROVIDER=google
LLM_MODEL=gemini-2.0-flash-exp
GOOGLE_API_KEY=your_key_here" > .env
# Run
uv run python -m pilot.mainmacOS users: Grant accessibility permissions in System Settings → Privacy & Security → Accessibility (add Terminal to the list).
git clone https://github.com/lahfir/pilot.git && cd pilot
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --extra macosPermissions: System Settings → Privacy & Security → Accessibility → Add Terminal (or your IDE)
git clone https://github.com/lahfir/pilot.git && cd pilot
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
uv sync --extra windowsPermissions: Run terminal as Administrator for full UI Automation access
git clone https://github.com/lahfir/pilot.git && cd pilot
curl -LsSf https://astral.sh/uv/install.sh | sh
sudo apt-get install -y python3-pyatspi python3-xlib
uv sync --extra linuxNote: Requires X11 (Wayland support is limited)
git clone https://github.com/lahfir/pilot.git && cd pilot
./install.shThe installer detects your platform, installs dependencies, and configures permissions.
uv run python -m pilot.main [OPTIONS]| Flag | Description |
|---|---|
-v, --verbose |
Enable detailed logging output |
-q, --quiet |
Minimal output, no dashboard (useful for scripts) |
--voice-input |
Start with voice input mode enabled (toggle with F5) |
--use-browser-profile |
Use existing Chrome user profile for authenticated sessions |
--browser-profile NAME |
Chrome profile directory name (default: "Default", requires --use-browser-profile) |
Note: --verbose and --quiet are mutually exclusive.
# Basic usage
uv run python -m pilot.main
# With voice input enabled
uv run python -m pilot.main --voice-input
# Verbose mode for debugging
uv run python -m pilot.main --verbose
# Quiet mode (minimal output)
uv run python -m pilot.main --quiet
# Use Chrome profile for authenticated sessions
uv run python -m pilot.main --use-browser-profile --browser-profile "Profile 1"| Key | Action |
|---|---|
| F5 | Toggle between voice and text input modes |
| ESC | Cancel current task immediately |
| Ctrl+C | Quit the application |
| Ctrl+J / Alt+Enter | Insert newline in text input (multiline tasks) |
While in the interactive prompt, you can use:
| Command | Aliases | Action |
|---|---|---|
quit |
exit, q |
Exit the application |
history |
h, recent |
Show last 10 tasks for quick re-selection |
Here's what a typical session looks like:
$ uv run python -m pilot.main
══════════════════════════════════════════════════════════════════
Computer Use Agent
══════════════════════════════════════════════════════════════════
Platform: macOS (ARM64) │ LLM: gemini-2.0-flash-exp
Voice: disabled │ Twilio: not configured
══════════════════════════════════════════════════════════════════
What would you like me to do?
➤ Open Calculator and compute 25 × 36
┌─ Manager Agent ──────────────────────────────── ● ACTIVE ─┐
┊ Analyzing task: Need to open Calculator app and perform calculation
→ Delegating to GUI Agent: Open Calculator and compute 25 × 36
└─ Manager Agent ───────────────────────── COMPLETE ─┘
┌─ GUI Agent ──────────────────────────────── ● ACTIVE ─┐
• Opening Calculator application
⟳ open_application
→ app_name: Calculator
✓ open_application (0.85s)
← Application opened successfully
• Typing calculation: 25 * 36 =
⟳ type_text
→ text: 25 * 36 =
✓ type_text (0.12s)
← Text typed successfully
└─ GUI Agent ───────────────────────── COMPLETE ─┘
│ Duration: 2s │ Tools: 2/2 │
✅ Result: Calculator opened and calculation completed. Result: 900
══════════════════════════════════════════════════════════════════
SESSION SUMMARY
══════════════════════════════════════════════════════════════════
Duration: 3s │ Tools: 2/2 │ Tokens: 45↑ 12↓ │ Agents: 2
Time Breakdown: LLM 1s (33%) │ Tools 1s (33%) │ Other 1s (33%) │ LLM Calls: 2
══════════════════════════════════════════════════════════════════
What would you like me to do?
➤ _Create a .env file in the project root:
# Required: LLM provider and model
LLM_PROVIDER=google
LLM_MODEL=gemini-2.0-flash-exp
GOOGLE_API_KEY=your_key_here
# Vision LLM (for screenshot analysis)
VISION_LLM_PROVIDER=google
VISION_LLM_MODEL=gemini-2.0-flash-exp
# Browser LLM (for web automation)
BROWSER_LLM_PROVIDER=google
BROWSER_LLM_MODEL=gemini-2.0-flash-exp| Provider | Models | Environment Variable |
|---|---|---|
| gemini-2.0-flash-exp, gemini-pro | GOOGLE_API_KEY |
|
| OpenAI | gpt-4o, gpt-4-turbo | OPENAI_API_KEY |
| Anthropic | claude-3-5-sonnet, claude-3-opus | ANTHROPIC_API_KEY |
TWILIO_ACCOUNT_SID=your_sid
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+1234567890
WEBHOOK_PORT=5000 # Optional, defaults to 5000Enables SMS verification for account signups during web automation.
DEEPGRAM_API_KEY=your_deepgram_key
VOICE_INPUT_LANGUAGE=multi # or 'en', 'es', 'fr', etc.Enables voice-to-text input (100+ languages supported). Toggle with F5 key.
The system uses a hierarchical multi-agent architecture:
| Agent | Role | Description |
|---|---|---|
| Manager | Task Orchestration Manager | Analyzes requests, decomposes into subtasks, delegates to specialists |
| Browser | Web Automation Specialist | Web navigation, downloads, forms, phone verification via Browser-Use |
| GUI | Desktop Application Automation Expert | Native app control with multi-tier accuracy (accessibility → OCR → vision) |
| System | System Command & Terminal Expert | Shell commands, file operations, process management |
| Coding | Code Automation Specialist | Code writing, refactoring, bug fixes via Cline AI |
| Tool | Description |
|---|---|
open_application |
Launch and focus desktop applications |
get_accessible_elements |
Get interactive UI elements via accessibility APIs |
click_element |
Click elements using multi-tier detection |
type_text |
Keyboard input, shortcuts, and text entry |
read_screen_text |
Extract text from screen via OCR |
scroll |
Scroll content in applications |
get_window_image |
Capture specific window as image |
check_app_running |
Check if an application is running |
list_running_apps |
List all running applications |
request_human_input |
Escalate to human for CAPTCHAs, 2FA, or decisions |
find_application |
LLM-based app selection for a given capability |
take_screenshot |
Capture screenshots (full screen, region, or app window) |
Browser-Use Core Actions:
| Tool | Description |
|---|---|
go_to_url |
Navigate to a URL |
click_element |
Click on page elements |
input_text |
Type text into form fields |
scroll_down/up |
Scroll the page |
extract_content |
Extract text/data from pages |
upload_file |
Upload files to web forms |
screenshot |
Capture page screenshots |
wait |
Wait for elements or conditions |
Custom Browser Tools:
| Tool | Description |
|---|---|
paste_text |
Instant text paste via JavaScript (faster than typing) |
type_to_focused |
Type into canvas editors (Google Docs, Notion, Figma) |
get_phone_number |
Get Twilio number for SMS verification |
get_verify_code |
Wait for and retrieve SMS verification code |
request_human_help |
Request human help for CAPTCHAs, 2FA, QR codes |
generate_image |
Generate AI images using Gemini for ads/marketing |
delegate_to_gui |
Delegate OS-native dialogs to GUI agent |
| Tool | Description |
|---|---|
execute_shell_command |
Safe shell command execution with validation |
| Tool | Description |
|---|---|
coding_automation |
Autonomous code writing/modification via Cline AI |
| Tool | Description |
|---|---|
get_system_state |
Observe current system state (OPAV pattern) |
verify_app_focused |
Verify that a specific application is focused |
flowchart TB
%% =========================
%% GitHub-friendly styling
%% =========================
classDef io fill:#FFF7ED,stroke:#C2410C,stroke-width:1.5px,color:#111827;
classDef orchestrator fill:#F0FDF4,stroke:#16A34A,stroke-width:1.5px,color:#052E16;
classDef step fill:#EEF2FF,stroke:#4338CA,stroke-width:1.5px,color:#111827;
classDef tool fill:#F8FAFC,stroke:#94A3B8,stroke-width:1px,color:#0F172A;
classDef terminator fill:#111827,stroke:#111827,color:#F9FAFB;
linkStyle default stroke:#64748B,stroke-width:1.25px;
%% =========================
%% Orchestration
%% =========================
U([User Request]):::io --> MGR[Manager Agent]:::orchestrator
MGR --> AN[Analyze]:::step --> PL[Plan]:::step --> DE[Delegate]:::step
DE --> DISPATCH[Dispatch to specialists]:::orchestrator
DISPATCH --> R([Results]):::terminator
%% =========================
%% Specialists (stacked)
%% =========================
DISPATCH --> BROW_START
DISPATCH --> GUI_START
DISPATCH --> SYS_START
DISPATCH --> CODE_START
subgraph SA[Specialist Agents]
direction TB
%% ---------- Browser ----------
subgraph BROW[Browser Agent]
direction TB
BROW_START(( )):::orchestrator
subgraph BROW_A[Input & Typing]
direction LR
B1[paste_text]:::tool
B2[type_to_focused]:::tool
end
subgraph BROW_B[Auth]
direction LR
B3[get_phone_number]:::tool
B4[get_verify_code]:::tool
end
subgraph BROW_C[Escalation & Handoff]
direction LR
B5[request_human_help]:::tool
B7[delegate_to_gui]:::tool
end
subgraph BROW_D[Media]
direction LR
B6[generate_image]:::tool
end
end
%% ---------- GUI ----------
subgraph GUI[GUI Agent]
direction TB
GUI_START(( )):::orchestrator
subgraph GUI_A[Core UI Controls]
direction LR
G1[open_application]:::tool
G3[click_element]:::tool
G4[type_text]:::tool
end
subgraph GUI_B[Read & Navigate]
direction LR
G2[get_accessible_elements]:::tool
G5[read_screen_text]:::tool
G6[scroll]:::tool
end
subgraph GUI_C[Window & Process]
direction LR
G7[get_window_image]:::tool
G8[check_app_running]:::tool
G9[list_running_apps]:::tool
end
subgraph GUI_D[Discovery & Human]
direction LR
G11[find_application]:::tool
G10[request_human_input]:::tool
end
end
%% ---------- System ----------
subgraph SYS[System Agent]
direction TB
SYS_START(( )):::orchestrator
S1[execute_shell_command]:::tool
end
%% ---------- Coding ----------
subgraph CODE[Coding Agent]
direction TB
CODE_START(( )):::orchestrator
C1[coding_automation]:::tool
end
end
- Manager Agent receives your request and breaks it into subtasks
- Specialist agents execute their portion using appropriate tools
- Context flows automatically between agents (file paths, data, state)
- Multi-tier accuracy ensures reliable element detection across platforms
For detailed architecture, see docs/ARCHITECTURE_OVERVIEW.md and docs/CREWAI_ARCHITECTURE.md.
Use the ComputerUseCrew class for programmatic control:
import asyncio
from pilot.crew import ComputerUseCrew
from pilot.utils.platform_detector import detect_platform
from pilot.utils.safety_checker import SafetyChecker
async def main():
# Initialize crew
crew = ComputerUseCrew(
capabilities=detect_platform(),
safety_checker=SafetyChecker(),
)
# Execute a task
result = await crew.execute_task("Open Safari and search for Python tutorials")
# Check result
print(f"Success: {result.overall_success}")
print(f"Result: {result.result}")
if result.error:
print(f"Error: {result.error}")
asyncio.run(main())conversation_history = []
# First task
result1 = await crew.execute_task(
"Download a PDF from example.com",
conversation_history=conversation_history
)
# Second task (can reference previous results)
result2 = await crew.execute_task(
"Open that PDF in Preview",
conversation_history=conversation_history
)crew = ComputerUseCrew(
capabilities=detect_platform(),
safety_checker=SafetyChecker(),
use_browser_profile=True,
browser_profile_directory="Profile 1", # Chrome profile name
)class TaskExecutionResult:
task: str # Original task description
result: str | None # Execution result text (None if failed)
overall_success: bool # Whether execution succeeded
error: str | None # Error message if failedDownload HD image of Cristiano Ronaldo
Search for Tesla stock price and save to file
Sign up for account on website with phone verification
Open Calculator and compute 1234 × 5678
Create new document in TextEdit with content "Hello World"
Open System Settings and change theme to dark mode
Create folder named "reports" in Documents
Move all PDF files from Downloads to Documents
List all Python files in current directory
Create a snake game in Python using pygame
Write unit tests for the user authentication module
Refactor the database queries to use async/await
Download census data from census.gov and create chart in Excel
Research fashion trends online and create summary in TextEdit
Scrape product prices from Amazon and generate a comparison report
- Deleting system files or protected directories
- Formatting drives
- Operations on system folders (
/etc,/usr,C:\Windows)
- Any file deletion
- Commands with destructive potential (
rm,del,format) - Operations outside the user's home directory
| Control | Action |
|---|---|
| ESC | Cancel current task immediately |
| Confirmation prompts | Approve or reject dangerous operations |
LOG_LEVEL=DEBUG |
Enable verbose logging |
- Open System Settings → Privacy & Security → Accessibility
- Click the lock icon and authenticate
- Click "+" and add Terminal (or your IDE)
- Restart Terminal/IDE
Run PowerShell or Terminal as Administrator (right-click → "Run as Administrator").
sudo apt-get install -y python3-pyatspi python3-xlibEnsure you're running X11 (Wayland support is limited).
The system defaults to EasyOCR. Alternative engines:
| Engine | Platform | Notes |
|---|---|---|
| EasyOCR | All | Default, 80+ languages |
| PaddleOCR | All | Lighter, faster |
| macOS Vision | macOS only | Native, high accuracy |
- Verify no quotes around the key value in
.env - Test the key in the provider's console
- Confirm correct environment variable name
| Issue | Solution |
|---|---|
| "DEEPGRAM_API_KEY not found" | Add DEEPGRAM_API_KEY=... to .env |
| "No microphone detected" | Check system permissions and hardware |
| Language detection failing | Set VOICE_INPUT_LANGUAGE=multi for auto-detection |
| Version | Status |
|---|---|
| 3.11+ | Required |
| Platform | Version | Accessibility API | Status |
|---|---|---|---|
| macOS | 10.14+ (Mojave) | NSAccessibility via atomacos | Full support |
| Windows | 10+ | UI Automation via pywinauto | Full support |
| Linux | Ubuntu 20.04+, Debian 11+ | AT-SPI via pyatspi | Full support |
git clone https://github.com/lahfir/pilot.git
cd pilot
uv sync --dev --extra macos # or windows/linuxuv run pytestuv run ruff check .
uv run ruff check --fix .
uv run ruff format .- Maximum 400 lines per file
- Docstrings only (no inline comments)
- Type hints required
- All files must pass ruff
| Operation | Typical Time |
|---|---|
| Task analysis | 1-2s |
| Accessibility detection | <100ms |
| OCR detection | 200-500ms |
| Vision AI fallback | 1-3s |
| Click/type execution | <50ms |
| Full browser workflow | 10-30s |
MIT License
Built on these excellent open-source projects:
- CrewAI — Multi-agent orchestration framework
- Browser-Use — Web automation engine
- Cline — Autonomous AI coding agent
- LangChain — LLM integration framework
- EasyOCR — OCR with 80+ language support
- OpenCV — Computer vision
- PyAutoGUI — Cross-platform input control
- atomacos (macOS), pywinauto (Windows), pyatspi (Linux) — Platform accessibility APIs
- Deepgram — Voice-to-text transcription
- Twilio — SMS verification