Agent-Browser Technical Analysis for CODITECT Integration
Task: H.17.1.7 | Track: H (Framework Autonomy)
Date: 2026-02-08
Author: Claude (Opus 4.6)
Source: submodules/labs/agent-browser (commit 4d8097a)
Repository: https://github.com/vercel-labs/agent-browser
Executive Summary
Agent-browser is a hybrid Rust CLI + Node.js Playwright daemon providing 154 browser automation commands via a JSON DSL protocol. It achieves 93% token reduction vs raw HTML through an accessibility-tree snapshot engine with element references (@e1, @e2). The architecture maps cleanly to CODITECT's agent/skill/hook/command framework with zero blockers for integration.
Key Numbers:
- 154 commands across 26 categories
- Rust CLI: <50ms boot, 5 retries with exponential backoff
- Snapshot engine: ~200-400 tokens vs 3000-5000 for raw HTML
- Supported engines: Chromium, Firefox, WebKit, iOS Safari
- Remote providers: Browserbase, Kernel, Browser Use
1. Architecture Overview
          +-------------------+
          |  CODITECT Agent   |
          | (coditect-browser |
          |  -agent.md)       |
          +--------+----------+
                   |
       JSON DSL (newline-delimited)
                   |
+------------------v------------------+
|           Rust CLI Binary           |
|  cli/src/main.rs (530 lines)        |
|  - Command parsing (commands.rs)    |
|  - Flag parsing (flags.rs)          |
|  - IPC connection (connection.rs)   |
+------------------+------------------+
                   |
     Unix Domain Socket (macOS/Linux)
     TCP Port (Windows)
                   |
+------------------v------------------+
|           Node.js Daemon            |
|  src/daemon.ts (453 lines)          |
|  - IPC server                       |
|  - Session management               |
|  - Command dispatch                 |
+------------------+------------------+
                   |
+------------------v------------------+
|           Playwright Core           |
|  src/browser.ts (1902 lines)        |
|  - Multi-engine support             |
|  - Multi-tab/window management      |
|  - CDP integration                  |
|  - Screencast/input injection       |
+------------------+------------------+
                   |
+--------+---------+---------+--------+
|        |         |         |        |
Chromium Firefox   WebKit  iOS Safari  Remote
                            (Appium)  Providers
Component Summary
| Component | File | Lines | Purpose |
|---|---|---|---|
| CLI entry | cli/src/main.rs | 530 | Arg parsing, daemon lifecycle |
| Commands | cli/src/commands.rs | 895 | 40+ CLI command handlers |
| Flags | cli/src/flags.rs | 183 | Two-phase flag parsing (env + CLI) |
| IPC | cli/src/connection.rs | 557 | Socket/TCP, retry logic, daemon start |
| Daemon | src/daemon.ts | 453 | IPC server, session management |
| Browser | src/browser.ts | 1902 | Playwright lifecycle, state tracking |
| Actions | src/actions.ts | 2045 | 154 command handlers |
| Protocol | src/protocol.ts | 977 | Zod schema validation |
| Types | src/types.ts | 1075 | TypeScript command/response types |
| Snapshot | src/snapshot.ts | 618 | Accessibility tree + element refs |
| Stream | src/stream-server.ts | 382 | WebSocket screencast + input |
| iOS | src/ios-manager.ts | 1299 | Appium/WebdriverIO Safari |
2. Rust CLI Architecture (H.17.1.2)
Flag Parsing (Two-Phase)
Phase 1 - Environment Variables (Priority): flags.rs:38-66
- AGENT_BROWSER_SESSION (default: "default")
- AGENT_BROWSER_EXECUTABLE_PATH, AGENT_BROWSER_EXTENSIONS
- AGENT_BROWSER_PROFILE, AGENT_BROWSER_STATE
- AGENT_BROWSER_PROXY, AGENT_BROWSER_PROXY_BYPASS
- AGENT_BROWSER_PROVIDER, AGENT_BROWSER_IOS_DEVICE
Phase 2 - CLI Arguments: flags.rs:80-182
- Supported: --json, --full/-f, --headed, --debug, --session, --executable-path, --extension, --cdp, --profile, --state, --proxy, --user-agent, -p/--provider, --device
- Tracking: cli_*_path booleans warn when flags are ignored because a daemon is already running
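The two phases can be sketched as a pair of pure functions. This is a minimal illustration and not the code in flags.rs: the `Flags` shape, the function names, and the handful of flags modeled are hypothetical, and the sketch assumes that an explicit CLI argument replaces a value seeded from the environment, which this analysis does not spell out.

```typescript
// Hypothetical subset of the flags listed above; the real struct has more fields.
interface Flags {
  session: string;
  headed: boolean;
  json: boolean;
  proxy: string | undefined;
}

// Phase 1: seed values from environment variables.
function flagsFromEnv(env: Record<string, string | undefined>): Flags {
  return {
    session: env["AGENT_BROWSER_SESSION"] ?? "default",
    headed: false,
    json: false,
    proxy: env["AGENT_BROWSER_PROXY"],
  };
}

// Phase 2: walk the CLI arguments. An explicit flag replaces the seeded value
// (assumed precedence); anything unrecognized is kept as a positional argument.
function applyCliArgs(seed: Flags, args: string[]): { flags: Flags; positional: string[] } {
  const flags: Flags = { ...seed };
  const positional: string[] = [];
  const queue = args.slice();
  while (queue.length > 0) {
    const arg = queue.shift() as string;
    if (arg === "--json") {
      flags.json = true;
    } else if (arg === "--headed") {
      flags.headed = true;
    } else if (arg === "--session" || arg === "--proxy") {
      const value = queue.shift();
      if (value === undefined) continue; // flag without a value: ignored in this sketch
      if (arg === "--session") flags.session = value;
      else flags.proxy = value;
    } else {
      positional.push(arg);
    }
  }
  return { flags, positional };
}

const parsedFlags = applyCliArgs(flagsFromEnv({ AGENT_BROWSER_SESSION: "ci" }), [
  "open", "https://example.com", "--session", "local", "--json",
]);
```

Here `parsedFlags.flags.session` ends up as "local" because the CLI value replaces the "ci" seeded from the environment, and the command words remain in `parsedFlags.positional` for dispatch.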
Command Dispatch
commands.rs:81-895 - Match-based dispatch to 40+ CLI command handlers.
Categories:
- Navigation: open/goto/navigate, back, forward, reload
- Core Actions: click, dblclick, fill, type, hover, focus, check, select, drag, upload
- Keyboard: press/key, keydown, keyup
- Scroll: scroll, scrollintoview
- Wait: Complex multi-flag (--url, --load, --fn, --text, --download)
- Evaluation: eval with optional base64 encoding
- Session: close/quit/exit, connect (CDP)
- Queries: get, is, find, mouse, set, network
- Data: cookies, storage
IPC Mechanism
Socket Resolution (Ordered Priority): connection.rs:86-108
- AGENT_BROWSER_SOCKET_DIR env var
- XDG_RUNTIME_DIR (Linux: /run/user/1000/agent-browser)
- ~/.agent-browser home directory fallback
- env::temp_dir() last resort
Platform-Specific:
- Unix/macOS: {socket_dir}/{session}.sock (Unix domain socket, max 103 bytes)
- Windows: TCP on hash-derived port (formula: 49152 + ((hash % 16383) as u16))
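The resolution order and the Windows port formula can be illustrated as follows. This is a sketch under stated assumptions, not a port of connection.rs: the function names are hypothetical, and `hashSession` uses djb2 purely as a stand-in because the hash function used by the real CLI is not stated in this analysis.

```typescript
// Walk the candidate directories in the priority order listed above.
function resolveSocketDir(
  env: Record<string, string | undefined>,
  homeDir: string | undefined,
  tempDir: string,
): string {
  const explicit = env["AGENT_BROWSER_SOCKET_DIR"];
  if (explicit) return explicit;
  const runtime = env["XDG_RUNTIME_DIR"];
  if (runtime) return `${runtime}/agent-browser`;
  if (homeDir) return `${homeDir}/.agent-browser`;
  return tempDir;
}

// Unix socket paths are limited to 103 bytes. Length is measured in UTF-16
// code units here, which equals the byte count only for ASCII paths.
function socketPath(dir: string, session: string): string {
  const path = `${dir}/${session}.sock`;
  if (path.length > 103) throw new Error(`socket path too long: ${path.length} > 103`);
  return path;
}

// Stand-in string hash (djb2, kept within 32 bits); an assumption, see lead-in.
function hashSession(session: string): number {
  let hash = 5381;
  for (let i = 0; i < session.length; i++) {
    hash = (hash * 33 + session.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Windows: 49152 + (hash % 16383) always lands in 49152..65534.
function windowsPort(session: string): number {
  return 49152 + (hashSession(session) % 16383);
}
```

Because the port depends only on the session name, the CLI and the daemon can each derive it independently without exchanging any state.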
Protocol: Newline-delimited JSON. Read timeout: 30s, Write timeout: 5s.
Retry Logic
connection.rs:484-513 - 5 retries, 200ms exponential backoff.
Transient Error Detection (connection.rs:521-535):
- macOS: os error 35 (EAGAIN), 54 (reset), 61 (refused)
- Linux: os error 11 (EAGAIN), 104 (reset), 111 (refused)
- Cross-platform: WouldBlock, EOF, empty JSON, Broken pipe
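A minimal sketch of the classification and the retry schedule follows. The function names are hypothetical, matching on the error message text is an assumption about how the codes surface, and the doubling schedule is one reading of "200ms exponential backoff" that the analysis does not confirm.

```typescript
// OS error codes from the list above: macOS 35/54/61, Linux 11/104/111.
const TRANSIENT_OS_CODES: number[] = [35, 54, 61, 11, 104, 111];
// Cross-platform markers; an empty JSON read is assumed to surface as an EOF parse error.
const TRANSIENT_MARKERS: string[] = ["WouldBlock", "EOF", "Broken pipe"];

// Decide whether an error message describes a failure worth retrying.
function isTransient(message: string): boolean {
  const match = /os error (\d+)/.exec(message);
  if (match !== null && TRANSIENT_OS_CODES.indexOf(Number(match[1])) !== -1) {
    return true;
  }
  return TRANSIENT_MARKERS.some((marker) => message.indexOf(marker) !== -1);
}

// Delay before each retry, doubling from the base (assumed schedule).
function backoffDelays(retries: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(baseMs * Math.pow(2, attempt));
  }
  return delays;
}

const retrySchedule = backoffDelays(5, 200); // [200, 400, 800, 1600, 3200]
```

Under this reading the five retries add up to 6.2 seconds of waiting in the worst case, which is well inside the 30s read timeout.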
Daemon Lifecycle
ensure_daemon() (connection.rs:206-465):
- Check if daemon running (double-check with 150ms sleep for race condition)
- Clean stale .sock/.pid files
- Validate socket path length (103 bytes max)
- Test directory writeability
- Fork+detach: libc::setsid() (Unix), CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS (Windows)
- Readiness polling: 50 iterations x 100ms = 5s timeout
Binary Distribution
- GitHub releases per platform: agent-browser-{os}-{arch}[.exe]
- npm postinstall.js downloads binary, patches global npm shims to bypass Node.js wrapper
- Cargo release profile: opt-level=3, lto=true, codegen-units=1, strip=true
3. Node.js Daemon Architecture (H.17.1.3)
Daemon Entry (daemon.ts)
- Creates net.Server listening on Unix socket or TCP port
- Writes PID file for lifecycle management
- Rejects HTTP requests (security: detects GET/POST/PUT/... pattern)
- Per-command try/catch with graceful error responses
- Signal handlers: SIGINT, SIGTERM, SIGHUP for cleanup
- uncaughtException/unhandledRejection handlers clean socket before exit
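The HTTP rejection step can be pictured as a check on the first line a client sends. The sketch below is an assumption about the shape of that check, not the code in daemon.ts: it matches any HTTP/1.x request line generically and does not reproduce the daemon's actual method list, which this analysis elides.

```typescript
// Generic HTTP request-line shape: METHOD, target, then "HTTP/<digit>".
const HTTP_REQUEST_LINE = /^[A-Z]+ \S+ HTTP\/\d/;

// True when the first line looks like HTTP and not like a JSON DSL command.
function looksLikeHttp(firstLine: string): boolean {
  return HTTP_REQUEST_LINE.test(firstLine);
}

const httpProbe = looksLikeHttp("GET / HTTP/1.1");
const dslProbe = looksLikeHttp('{"id":"r1","action":"snapshot"}');
```

A check like this keeps a web page from driving the daemon with plain HTTP requests on platforms where the daemon listens on a TCP port.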
Browser Manager (browser.ts)
State Tracking:
- contexts: BrowserContext[], pages: Page[], activePageIndex: number
- refMap: RefMap, consoleMessages, pageErrors
- cdpSession: CDPSession | null
Multi-Engine Support: Chromium (default), Firefox, WebKit via Playwright.
- Extensions: Chromium only
- File access (--allow-file-access): Chromium only
Remote Providers:
- Browserbase (browser.ts:743-799): BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID
- Kernel (browser.ts:850-940): KERNEL_API_KEY
- Browser Use (browser.ts:946-1013): BROWSER_USE_API_KEY
Session Isolation
One daemon per session. Each session has:
- Dedicated socket/port/PID files
- Independent BrowserManager instance
- Separate cookie/storage/auth state
Multi-Tab Within Session:
- newTab(): New page in first context
- newWindow(): New context with separate page
- switchTo(index): Switch active page
- closeTab(index): Close specific tab
CDP Integration (browser.ts:1437-1605)
- Screencast: Page.startScreencast via CDP (JPEG/PNG frames, configurable quality/resolution)
- Mouse injection: Input.dispatchMouseEvent (mousePressed/Released/Moved/Wheel)
- Keyboard injection: Input.dispatchKeyEvent (keyDown/keyUp/char)
- Touch injection: Input.dispatchTouchEvent (touchStart/End/Move/Cancel)
WebSocket Stream Server (stream-server.ts)
- Port 9223 default
- Origin validation: rejects browser origins (prevents CORS bypass)
- Message types: Frame, InputMouse, InputKeyboard, InputTouch, Status, Error
4. JSON DSL Protocol (H.17.1.4)
Envelope Format
Request:
{"id": "r123456", "action": "click", "selector": "@e1"}
Success Response:
{"id": "r123456", "success": true, "data": {"clicked": true}}
Error Response:
{"id": "r123456", "success": false, "error": "Element \"@e1\" is blocked by another element"}
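To make the framing concrete, the sketch below types the envelope and shows how newline-delimited frames might be written and split. The type and function names (`DslRequest`, `DslResponse`, `encodeFrame`, `splitFrames`, `summarize`) are hypothetical; only the `id`, `action`, `success`, `data`, and `error` fields come from the examples above.

```typescript
interface DslRequest {
  id: string;
  action: string;
  [field: string]: unknown; // action-specific fields such as "selector"
}

type DslResponse =
  | { id: string; success: true; data: unknown }
  | { id: string; success: false; error: string };

// One request per line: serialize and terminate with "\n".
function encodeFrame(message: DslRequest): string {
  return JSON.stringify(message) + "\n";
}

// Split buffered socket data into complete lines plus a trailing partial line
// that must be kept until more bytes arrive.
function splitFrames(buffer: string): { lines: string[]; remainder: string } {
  const parts = buffer.split("\n");
  const remainder = parts.pop() ?? "";
  return { lines: parts.filter((line) => line.length > 0), remainder };
}

// The "success" flag discriminates the two response shapes.
function summarize(response: DslResponse): string {
  return response.success
    ? `ok ${response.id}`
    : `failed ${response.id}: ${response.error}`;
}

const frame = encodeFrame({ id: "r123456", action: "click", selector: "@e1" });
const chunk = splitFrames('{"id":"r1","success":true,"data":{}}\n{"id":"r2",');
const firstReply = JSON.parse(chunk.lines[0] ?? "{}") as DslResponse;
```

Keeping the `remainder` between reads matters because a socket read can end in the middle of a JSON object; only complete lines are safe to parse.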
Validation
Zod discriminated union (protocol.ts:796-922) validates all 154 commands at runtime. parseCommand() returns typed result or validation error with field paths.
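The idea of a discriminated union keyed on `action` can be shown without the Zod dependency. The sketch below is a hand-rolled stand-in, not the schema in protocol.ts: it covers three actions only, the `url` field on `navigate` is an assumed name, and `parseCommandSketch` is a hypothetical function that mimics the "typed result or error with field path" contract.

```typescript
type Command =
  | { id: string; action: "click"; selector: string }
  | { id: string; action: "navigate"; url: string }
  | { id: string; action: "snapshot" };

type ParseResult =
  | { ok: true; command: Command }
  | { ok: false; error: string };

function parseCommandSketch(line: string): ParseResult {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, error: "expected a JSON object" };
  }
  const fields = raw as Record<string, unknown>;
  const id = fields["id"];
  if (typeof id !== "string") return { ok: false, error: "id: expected string" };

  // The "action" value selects which variant's fields are required.
  switch (fields["action"]) {
    case "click": {
      const selector = fields["selector"];
      if (typeof selector !== "string") return { ok: false, error: "selector: expected string" };
      return { ok: true, command: { id, action: "click", selector } };
    }
    case "navigate": {
      const url = fields["url"];
      if (typeof url !== "string") return { ok: false, error: "url: expected string" };
      return { ok: true, command: { id, action: "navigate", url } };
    }
    case "snapshot":
      return { ok: true, command: { id, action: "snapshot" } };
    default:
      return { ok: false, error: "action: unknown action" };
  }
}

const goodCommand = parseCommandSketch('{"id":"r1","action":"click","selector":"@e1"}');
const badCommand = parseCommandSketch('{"id":"r2","action":"click"}');
```

With a schema library the per-variant checks are declared once and the error paths are generated, which is what makes runtime validation of a large command set practical.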
Command Catalog (154 Commands, 26 Categories)
| Category | Count | Key Commands |
|---|---|---|
| Session/Lifecycle | 8 | launch, close, tab_new, tab_list, tab_switch, connect |
| Navigation | 6 | navigate, back, forward, reload, url, title |
| Element Interaction | 24 | click, type, fill, press, hover, check, select, drag, upload |
| Element State | 8 | gettext, getattribute, getvalue, isvisible, isenabled, ischecked |
| Element Measurement | 4 | count, boundingbox, styles, content |
| Frame Handling | 2 | frame, mainframe |
| Semantic Locators | 7 | getbyrole, getbytext, getbylabel, getbyplaceholder, getbytestid |
| Position Selection | 1 | nth |
| Wait Operations | 5 | wait, waitforurl, waitforloadstate, waitforfunction, waitfordownload |
| Cookies | 3 | cookies_get, cookies_set, cookies_clear |
| Storage | 3 | storage_get, storage_set, storage_clear |
| Network | 4 | route, unroute, requests, responsebody |
| Dialog Handling | 1 | dialog |
| Emulation | 8 | viewport, device, useragent, geolocation, permissions, timezone |
| HTTP/Headers | 2 | headers, offline |
| Media Emulation | 1 | emulatemedia |
| Download/PDF | 2 | download, pdf |
| Screenshots/Snapshots | 2 | screenshot, snapshot |
| JS Execution | 6 | evaluate, evalhandle, addscript, addstyle, addinitscript, expose |
| Debugging | 4 | console, errors, highlight, pause |
| Video/Recording | 5 | video_start/stop, recording_start/stop/restart |
| Tracing/HAR | 4 | trace_start/stop, har_start/stop |
| State Persistence | 2 | state_save, state_load |
| Mouse Control | 5 | mousemove, mousedown, mouseup, wheel, bringtofront |
| Streaming/Input | 5 | screencast_start/stop, input_mouse, input_keyboard, input_touch |
| iOS-Specific | 2 | swipe, device_list |
AI-Friendly Error Translation (actions.ts:151-204)
Playwright errors are converted to actionable AI messages:
- Multiple matches: "Selector matched N elements. Run 'snapshot' to get updated refs."
- Blocked by overlay: "Element blocked by another element. Try dismissing modals/cookie banners."
- Not visible: "Element not visible. Try scrolling into view."
- Timeout: "Action timed out. Run 'snapshot' to check current page state."
- Not found: "Element not found. Run 'snapshot' to see current page elements."
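A translation layer of this kind is typically a small ordered rule table. The sketch below assumes the Playwright message fragments to match on ("resolved to N elements", "intercepts pointer events", and so on); those patterns are illustrative guesses at what actions.ts inspects, while the advice strings are the ones listed above.

```typescript
interface TranslationRule {
  pattern: RegExp;
  advice: string;
}

// Checked in order; the first matching rule wins.
const TRANSLATION_RULES: TranslationRule[] = [
  { pattern: /intercepts pointer events/i, advice: "Element blocked by another element. Try dismissing modals/cookie banners." },
  { pattern: /not visible/i, advice: "Element not visible. Try scrolling into view." },
  { pattern: /timeout/i, advice: "Action timed out. Run 'snapshot' to check current page state." },
  { pattern: /not found|no element/i, advice: "Element not found. Run 'snapshot' to see current page elements." },
];

function toAiFriendly(message: string): string {
  // Multiple matches: carry the element count into the advice.
  const strict = /resolved to (\d+) elements/.exec(message);
  if (strict !== null) {
    return `Selector matched ${strict[1]} elements. Run 'snapshot' to get updated refs.`;
  }
  for (const rule of TRANSLATION_RULES) {
    if (rule.pattern.test(message)) return rule.advice;
  }
  return message; // unknown errors pass through unchanged
}

const translated = toAiFriendly("locator.click: Timeout 30000ms exceeded.");
```

Each message ends with a concrete next step, so an agent that receives the error has an obvious recovery action and does not need to parse a stack trace.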
5. Snapshot Engine (H.17.1.5)
How It Works
- Calls Playwright's ariaSnapshot() on root or scoped CSS selector
- Processes accessibility tree line-by-line (O(n) single pass)
- Assigns auto-incrementing refs (e1, e2, ...) to interactive/named-content elements
- Returns enhanced tree text + RefMap for subsequent commands
Element Reference System
Ref format: @e1, @e2, etc.
- Generated per-snapshot (counter resets each time)
- Cached in BrowserManager.refMap until next snapshot
- Invalidated on page navigation (must re-snapshot)
- Resolution: browser.getLocator("@e1") -> Playwright getByRole() with exact name match
RefMap Structure:
interface RefMap {
  [ref: string]: {
    selector: string; // e.g., "getByRole('button', { name: \"Submit\", exact: true })"
    role: string;     // e.g., 'button', 'link', 'textbox'
    name?: string;    // e.g., "Submit"
    nth?: number;     // Disambiguation index (only for duplicates)
  };
}
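How such a map could be populated from an accessibility-tree dump is sketched below. This is an illustration, not snapshot.ts: `buildRefs` and `RefEntry` are hypothetical names, the role list is a small subset chosen for the example (the engine's 17 roles are not listed in this analysis), the `[ref=eN]` annotation format is assumed, and the input is assumed to use `- role "name"` lines.

```typescript
interface RefEntry {
  selector: string;
  role: string;
  name?: string;
  nth?: number;
}

// Illustrative subset of interactive roles (assumption, see lead-in).
const INTERACTIVE_ROLES: string[] = ["button", "link", "textbox", "checkbox", "radio", "combobox"];

function buildRefs(snapshot: string): { tree: string; refs: Record<string, RefEntry> } {
  const refs: Record<string, RefEntry> = {};
  const groups: Record<string, string[]> = {}; // "role:name" -> refs sharing that pair
  let counter = 0; // restarts at zero for every snapshot

  const lines = snapshot.split("\n").map((line) => {
    const match = /^\s*- (\w+)(?: "([^"]*)")?/.exec(line);
    if (match === null) return line;
    const role = match[1] ?? "";
    const name = match[2];
    if (INTERACTIVE_ROLES.indexOf(role) === -1) return line;

    counter += 1;
    const ref = `e${counter}`;
    const options = name === undefined ? "" : `, { name: ${JSON.stringify(name)}, exact: true }`;
    const entry: RefEntry = { selector: `getByRole('${role}'${options})`, role };
    if (name !== undefined) entry.name = name;
    refs[ref] = entry;

    const key = `${role}:${name ?? ""}`;
    const group = groups[key] ?? [];
    group.push(ref);
    groups[key] = group;

    // Insert the annotation right after the role/name prefix of the line.
    const head = match[0];
    return `${head} [ref=${ref}]${line.slice(head.length)}`;
  });

  // Add nth only where a role/name pair occurred more than once.
  for (const key of Object.keys(groups)) {
    const group = groups[key] ?? [];
    if (group.length < 2) continue;
    group.forEach((ref, index) => {
      const entry = refs[ref];
      if (entry !== undefined) entry.nth = index;
    });
  }
  return { tree: lines.join("\n"), refs };
}

const built = buildRefs(
  ['- heading "Login"', '- textbox "Email"', '- button "Submit"', '- button "Submit"'].join("\n"),
);
```

In this example the heading gets no ref, the textbox becomes e1 without an `nth`, and the two identically named buttons become e2 and e3 with `nth` 0 and 1, so a later click on @e3 can be resolved unambiguously.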
Filtering Modes
| Mode | Flag | Effect |
|---|---|---|
| Interactive only | -i | Only buttons, links, inputs, etc. (17 ARIA roles) |
| Compact | -c | Removes unnamed structural elements without ref-containing children |
| Depth limit | -d N | Cuts tree at depth N |
| CSS scope | -s "selector" | Scopes to CSS selector subtree |
| Cursor detection | --cursor | Detects cursor:pointer, onclick, tabindex elements |
Performance
- 93% token reduction: ~200-400 tokens vs 3000-5000 for raw HTML
- Single-pass O(n) line processing
- Duplicate handling: RoleNameTracker adds nth only when >1 match
6. Capability Mapping to CODITECT Patterns (H.17.1.6)
Agent Mapping
| agent-browser Feature | CODITECT Agent | Integration Point |
|---|---|---|
| Browser automation (154 commands) | coditect-browser-agent.md | Primary agent for all browser tasks |
| Screenshot/snapshot | frontend-development-agent | Visual testing, component screenshots |
| Network interception | api-integration-specialist | API mocking, request capture |
| Accessibility tree | accessibility-testing-specialist | WCAG compliance scanning |
| Session management | multi-agent-coordinator | Multi-browser session orchestration |
| Error translation | debugger | Browser error diagnosis |
| iOS automation | mobile-testing-specialist | Cross-platform mobile testing |
Skill Mapping
| agent-browser Capability | CODITECT Skill | Track |
|---|---|---|
| Browser control patterns | browser-automation-patterns/SKILL.md | H.17 |
| Snapshot + ref system | Extension of memory-context-patterns/SKILL.md | J |
| JSON DSL protocol | Extension of api-design-patterns/SKILL.md | A |
| Binary distribution | Extension of binary-distribution-patterns/SKILL.md | C |
| Error recovery | Extension of error-handling-resilience/SKILL.md | H |
| State persistence | Extension of cloud-native-patterns/SKILL.md | C |
Hook Mapping
| Hook | Trigger | Purpose |
|---|---|---|
| browser-auto-launch.py | PreToolUse:Bash | Auto-launch daemon when browser commands detected |
| browser-screenshot-on-error.py | PostToolUse:Bash | Auto-screenshot on page/navigation errors |
| browser-snapshot-cache.py | PostToolUse:Bash | Cache last snapshot in context for /cxq queries |
| browser-session-cleanup.py | SessionEnd | Clean up daemon processes on session end |
Command Mapping
| CODITECT Command | Implementation | Purpose |
|---|---|---|
| /browser navigate <url> | navigate action | Open URL in browser |
| /browser click <selector> | click action | Click element |
| /browser snapshot | snapshot action | Get page accessibility tree |
| /browser screenshot [path] | screenshot action | Capture screenshot |
| /browser fill <selector> <value> | fill action | Fill form field |
| /browser eval <script> | evaluate action | Execute JavaScript |
| /browser session list | session list | List active browser sessions |
| /browser close | close action | Close browser |
MCP Server Integration
Expose browser tools to all CODITECT agents via MCP:
- browser_navigate - Navigate to URL
- browser_click - Click element by ref or selector
- browser_snapshot - Get page accessibility tree with refs
- browser_screenshot - Capture screenshot
- browser_fill - Fill form field
- browser_evaluate - Execute JavaScript on page
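One way to bridge these tools to the daemon is a thin adapter that turns a tool call into a JSON DSL request line. The sketch below is an assumption about how that adapter could look: `toolCallToRequest` is a hypothetical helper, and the argument names passed through (such as `selector`) are not confirmed by this analysis beyond the envelope example in section 4.

```typescript
// Tool name -> JSON DSL action, following the tool list above.
const TOOL_ACTIONS: Record<string, string> = {
  browser_navigate: "navigate",
  browser_click: "click",
  browser_snapshot: "snapshot",
  browser_screenshot: "screenshot",
  browser_fill: "fill",
  browser_evaluate: "evaluate",
};

// Build one newline-terminated request; tool arguments become top-level fields.
function toolCallToRequest(id: string, tool: string, args: Record<string, unknown>): string {
  const action = TOOL_ACTIONS[tool];
  if (action === undefined) throw new Error(`unknown tool: ${tool}`);
  return JSON.stringify({ id, action, ...args }) + "\n";
}

const clickRequest = toolCallToRequest("r1", "browser_click", { selector: "@e1" });
```

Keeping the adapter this thin means schema validation stays in the daemon's protocol layer and is not duplicated in the MCP server.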
Context System Integration
| Feature | Integration |
|---|---|
| /cx (capture context) | Captures current page URL, title, snapshot, console errors |
| /cxq "search" (query context) | Queries cached browser snapshots for element/content search |
| /session-log | Logs browser actions with timestamps and screenshots |
| Message bus | Registers browser session in messaging.db for cross-LLM coordination |
MoE Agent Dispatcher Integration
Auto-route browser-related tasks to coditect-browser-agent:
- Keywords: browser, webpage, click, navigate, screenshot, form, login, scrape
- Confidence threshold: 0.7 for automatic routing
- Fallback: senior-architect for complex browser automation workflows
7. Implementation Recommendations
Phase 1: Core (H.17.2-H.17.3) - Estimated 30-45 hours
- Fork protocol layer from agent-browser (Zod schemas, types, protocol)
- Implement coditect-browser daemon wrapping Playwright with CODITECT session integration
- Implement Rust CLI based on agent-browser patterns (reuse IPC/retry logic)
- Add CODITECT-specific commands: /cx integration, session-bus registration, MoE routing
Phase 2: Framework Integration (H.17.4) - Estimated 15-20 hours
- Create agent, skill, command, hooks
- MCP server for tool exposure
- Context system integration (/cx, /cxq)
- MoE dispatcher keyword routing
Phase 3: Testing & Docs (H.17.5) - Estimated 7-12 hours
- Protocol validation tests
- E2E browser workflow tests
- Performance benchmarks
- User documentation
Key Design Decisions
| Decision | Recommendation | Rationale |
|---|---|---|
| Fork vs wrap | Wrap agent-browser as dependency | Faster, maintained upstream, MIT license |
| Protocol extension | Extend JSON DSL with CODITECT ops | Backward compatible, reuse Zod schemas |
| Binary distribution | npm + native optional deps | Follows agent-browser pattern, proven |
| Snapshot storage | Cache in BrowserManager + context.db | Enables /cxq queries across sessions |
| Session isolation | One daemon per session | Proven isolation, no cross-contamination |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Playwright version drift | Medium | Pin playwright-core version, test on upgrade |
| Binary build CI complexity | Low | Reuse agent-browser's build matrix |
| Token budget for snapshots | Low | Already 93% reduced; -i flag for minimal |
| CDP API changes | Low | Only used for screencast/input (optional) |
| iOS automation complexity | Medium | Defer to Phase 2+; not needed for PILOT |
8. Dependencies
Runtime
| Package | Version | Purpose |
|---|---|---|
| playwright-core | ^1.57.0 | Browser automation |
| zod | ^3.22.4 | Schema validation |
| ws | ^8.19.0 | WebSocket (stream server) |
| node-simctl | ^7.4.0 | iOS simulator control |
| webdriverio | ^9.15.0 | iOS automation via Appium |
Rust CLI Crates
| Crate | Version | Purpose |
|---|---|---|
| serde | 1.0 | JSON serialization |
| serde_json | 1.0 | JSON parsing |
| dirs | 5.0 | Cross-platform home directory |
| base64 | 0.22 | Script encoding |
| libc | 0.2 | Unix syscalls |
| windows-sys | 0.52 | Win32 process management |
9. Test Coverage
| Test File | Lines | Coverage |
|---|---|---|
| protocol.test.ts | 1076 | All 154 command schemas |
| browser.test.ts | 744 | Browser launch, tab/window management |
| ios-manager.test.ts | 157 | iOS device listing, session management |
| daemon.test.ts | 96 | HTTP detection, socket directory resolution |
| actions.test.ts | 39 | AI-friendly error translation |
Appendix: Key File Paths
All paths relative to submodules/labs/agent-browser/:
| File | Purpose |
|---|---|
| cli/src/main.rs | CLI entry point |
| cli/src/commands.rs | Command parsing and dispatch |
| cli/src/flags.rs | Two-phase flag parsing |
| cli/src/connection.rs | IPC, daemon lifecycle, retry |
| cli/Cargo.toml | Rust dependencies |
| src/daemon.ts | Node.js IPC server |
| src/browser.ts | Playwright browser manager |
| src/actions.ts | 154 command handlers |
| src/protocol.ts | Zod validation schemas |
| src/types.ts | TypeScript types |
| src/snapshot.ts | Snapshot engine |
| src/stream-server.ts | WebSocket stream server |
| src/ios-manager.ts | iOS automation |
| skills/agent-browser/references/snapshot-refs.md | Ref system documentation |
| skills/agent-browser/references/commands.md | CLI command reference |