Agent-Browser Technical Analysis for CODITECT Integration

Task: H.17.1.7 | Track: H (Framework Autonomy)
Date: 2026-02-08
Author: Claude (Opus 4.6)
Source: submodules/labs/agent-browser (commit 4d8097a)
Repository: https://github.com/vercel-labs/agent-browser


Executive Summary

Agent-browser is a hybrid Rust CLI + Node.js Playwright daemon providing 154 browser automation commands via a JSON DSL protocol. It achieves 93% token reduction vs raw HTML through an accessibility-tree snapshot engine with element references (@e1, @e2). The architecture maps cleanly to CODITECT's agent/skill/hook/command framework with zero blockers for integration.

Key Numbers:

  • 154 commands across 26 categories
  • Rust CLI: <50ms boot, 5 retries with exponential backoff
  • Snapshot engine: ~200-400 tokens vs 3000-5000 for raw HTML
  • Supported engines: Chromium, Firefox, WebKit, iOS Safari
  • Remote providers: Browserbase, Kernel, Browser Use

1. Architecture Overview

                 +-------------------+
                 |  CODITECT Agent   |
                 | (coditect-browser |
                 |    -agent.md)     |
                 +---------+---------+
                           |
             JSON DSL (newline-delimited)
                           |
        +------------------v------------------+
        |           Rust CLI Binary           |
        |     cli/src/main.rs (530 lines)     |
        |  - Command parsing (commands.rs)    |
        |  - Flag parsing (flags.rs)          |
        |  - IPC connection (connection.rs)   |
        +------------------+------------------+
                           |
           Unix Domain Socket (macOS/Linux)
                  TCP Port (Windows)
                           |
        +------------------v------------------+
        |           Node.js Daemon            |
        |      src/daemon.ts (453 lines)      |
        |  - IPC server                       |
        |  - Session management               |
        |  - Command dispatch                 |
        +------------------+------------------+
                           |
        +------------------v------------------+
        |           Playwright Core           |
        |     src/browser.ts (1902 lines)     |
        |  - Multi-engine support             |
        |  - Multi-tab/window management      |
        |  - CDP integration                  |
        |  - Screencast/input injection       |
        +------------------+------------------+
                           |
        +--------+---------+---------+--------+
        |        |         |         |        |
    Chromium  Firefox   WebKit  iOS Safari Remote
                                 (Appium) Providers

Component Summary

| Component | File | Lines | Purpose |
|-----------|------|-------|---------|
| CLI entry | cli/src/main.rs | 530 | Arg parsing, daemon lifecycle |
| Commands | cli/src/commands.rs | 895 | 40+ CLI command handlers |
| Flags | cli/src/flags.rs | 183 | Two-phase flag parsing (env + CLI) |
| IPC | cli/src/connection.rs | 557 | Socket/TCP, retry logic, daemon start |
| Daemon | src/daemon.ts | 453 | IPC server, session management |
| Browser | src/browser.ts | 1902 | Playwright lifecycle, state tracking |
| Actions | src/actions.ts | 2045 | 154 command handlers |
| Protocol | src/protocol.ts | 977 | Zod schema validation |
| Types | src/types.ts | 1075 | TypeScript command/response types |
| Snapshot | src/snapshot.ts | 618 | Accessibility tree + element refs |
| Stream | src/stream-server.ts | 382 | WebSocket screencast + input |
| iOS | src/ios-manager.ts | 1299 | Appium/WebdriverIO Safari |

2. Rust CLI Architecture (H.17.1.2)

Flag Parsing (Two-Phase)

Phase 1 - Environment Variables (Priority): flags.rs:38-66

  • AGENT_BROWSER_SESSION (default: "default")
  • AGENT_BROWSER_EXECUTABLE_PATH, AGENT_BROWSER_EXTENSIONS
  • AGENT_BROWSER_PROFILE, AGENT_BROWSER_STATE
  • AGENT_BROWSER_PROXY, AGENT_BROWSER_PROXY_BYPASS
  • AGENT_BROWSER_PROVIDER, AGENT_BROWSER_IOS_DEVICE

Phase 2 - CLI Arguments: flags.rs:80-182

  • Supported: --json, --full/-f, --headed, --debug, --session, --executable-path, --extension, --cdp, --profile, --state, --proxy, --user-agent, -p/--provider, --device
  • Tracking: cli_*_path booleans record which flags were passed on the command line, so the CLI can warn when they are ignored because a daemon is already running

Command Dispatch

commands.rs:81-895 - Match-based dispatch to 40+ CLI command handlers.

Categories:

  1. Navigation: open/goto/navigate, back, forward, reload
  2. Core Actions: click, dblclick, fill, type, hover, focus, check, select, drag, upload
  3. Keyboard: press/key, keydown, keyup
  4. Scroll: scroll, scrollintoview
  5. Wait: Complex multi-flag (--url, --load, --fn, --text, --download)
  6. Evaluation: eval with optional base64 encoding
  7. Session: close/quit/exit, connect (CDP)
  8. Queries: get, is, find, mouse, set, network
  9. Data: cookies, storage

IPC Mechanism

Socket Resolution (Ordered Priority): connection.rs:86-108

  1. AGENT_BROWSER_SOCKET_DIR env var
  2. XDG_RUNTIME_DIR (Linux: /run/user/1000/agent-browser)
  3. ~/.agent-browser home directory fallback
  4. env::temp_dir() last resort

Platform-Specific:

  • Unix/macOS: {socket_dir}/{session}.sock (Unix domain socket, max 103 bytes)
  • Windows: TCP on hash-derived port (formula: 49152 + ((hash % 16383) as u16))
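
The resolution order and the Windows port formula above can be sketched as follows. This is a minimal illustration, not the code in connection.rs: the function names, the injected environment map, the temp-dir subdirectory, and the djb2-style hash are all assumptions, since the analysis states only the lookup order and the port formula.

```typescript
// Sketch only: names, the env-map parameter, and the hash are illustrative
// assumptions; the source gives just the lookup order and the port formula.
type EnvVars = { [name: string]: string | undefined };

function resolveSocketDir(env: EnvVars, homeDir: string | undefined, tmpDir: string): string {
  const override = env["AGENT_BROWSER_SOCKET_DIR"];
  if (override) return override;                    // 1. explicit override
  const xdg = env["XDG_RUNTIME_DIR"];
  if (xdg) return `${xdg}/agent-browser`;           // 2. XDG runtime dir
  if (homeDir) return `${homeDir}/.agent-browser`;  // 3. home directory fallback
  return `${tmpDir}/agent-browser`;                 // 4. temp dir (subdirectory name assumed)
}

// Stand-in string hash (djb2 variant); the hash the CLI actually uses is not given above.
function hashSession(session: string): number {
  let h = 5381;
  for (let i = 0; i < session.length; i++) {
    h = (Math.imul(h, 33) + session.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// Mirrors 49152 + (hash % 16383): the result always lands in 49152..65534.
function windowsPort(session: string): number {
  return 49152 + (hashSession(session) % 16383);
}
```

Because the port is derived from the session name alone, the CLI and the daemon can agree on it without any shared state, at the cost of possible collisions between session names that hash to the same port.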

Protocol: Newline-delimited JSON. Read timeout: 30s, Write timeout: 5s.

Retry Logic

connection.rs:484-513 - 5 retries, 200ms exponential backoff.

Transient Error Detection (connection.rs:521-535):

  • macOS: os error 35 (EAGAIN), 54 (reset), 61 (refused)
  • Linux: os error 11 (EAGAIN), 104 (reset), 111 (refused)
  • Cross-platform: WouldBlock, EOF, empty JSON, Broken pipe
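
A minimal sketch of the retry behaviour described above, assuming the 200ms figure is the base delay, that the delay doubles on each attempt, and that "5 retries" means five attempts after the initial one. All names are hypothetical, and the empty-JSON case is left out for brevity. The sleep function is injected so the schedule can be inspected without real waiting.

```typescript
// Sketch only: the doubling schedule, the 200ms base, and all names are assumptions.
function isTransient(message: string): boolean {
  // \b keeps "os error 11" from also matching e.g. "os error 110".
  if (/os error (35|54|61|11|104|111)\b/.test(message)) return true;
  return ["WouldBlock", "EOF", "Broken pipe"].some((marker) => message.includes(marker));
}

// Delay before each retry: base, 2x base, 4x base, ...
function backoffDelays(retries: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) delays.push(baseMs * 2 ** attempt);
  return delays;
}

function sendWithRetry<T>(
  send: () => T,
  sleep: (ms: number) => void,
  retries = 5,
  baseMs = 200,
): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return send();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Give up on permanent errors or once the retry budget is spent.
      if (attempt >= retries || !isTransient(message)) throw err;
      sleep(baseMs * 2 ** attempt);
    }
  }
}
```

Under these assumptions the worst case waits 200 + 400 + 800 + 1600 + 3200 = 6200ms before the final error is surfaced.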

Daemon Lifecycle

ensure_daemon() (connection.rs:206-465):

  1. Check whether a daemon is already running (re-check after a 150ms sleep to guard against a startup race)
  2. Clean stale .sock/.pid files
  3. Validate socket path length (103 bytes max)
  4. Test directory writeability
  5. Fork+detach: libc::setsid() (Unix), CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS (Windows)
  6. Readiness polling: 50 iterations x 100ms = 5s timeout
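
Step 6 amounts to a bounded polling loop. The sketch below is an assumption-laden illustration: `isReady` stands in for whatever probe the CLI uses to test the socket, and `sleep` is injected in place of a real thread sleep.

```typescript
// Sketch only: `isReady` and `sleep` are hypothetical stand-ins for the
// socket probe and the thread sleep used by the real CLI.
function waitForDaemon(
  isReady: () => boolean,
  sleep: (ms: number) => void,
  iterations = 50,
  intervalMs = 100,
): boolean {
  for (let i = 0; i < iterations; i++) {
    if (isReady()) return true;
    sleep(intervalMs);
  }
  return false; // 50 x 100ms = 5s elapsed without a ready daemon
}
```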

Binary Distribution

  • GitHub releases per platform: agent-browser-{os}-{arch}[.exe]
  • npm postinstall.js downloads binary, patches global npm shims to bypass Node.js wrapper
  • Cargo release profile: opt-level=3, lto=true, codegen-units=1, strip=true

3. Node.js Daemon Architecture (H.17.1.3)

Daemon Entry (daemon.ts)

  • Creates net.Server listening on Unix socket or TCP port
  • Writes PID file for lifecycle management
  • Rejects HTTP requests (security: detects GET/POST/PUT/... pattern)
  • Per-command try/catch with graceful error responses
  • Signal handlers: SIGINT, SIGTERM, SIGHUP for cleanup
  • uncaughtException/unhandledRejection handlers clean socket before exit
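
The HTTP rejection can be pictured as a check on the first line a client sends. The pattern below is a guess at what "detects GET/POST/PUT/..." could look like, not the expression used in daemon.ts, and the function name is hypothetical.

```typescript
// Sketch only: the method list and pattern are assumptions, not the daemon's actual check.
const HTTP_REQUEST_LINE = /^(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS|CONNECT|TRACE)\s+\S+\s+HTTP\//;

// True when the first line looks like an HTTP request line rather than a JSON command.
function looksLikeHttp(firstLine: string): boolean {
  return HTTP_REQUEST_LINE.test(firstLine);
}
```

A daemon that refuses such input cannot be driven by a web page issuing plain HTTP requests at the local port, which is presumably the security concern behind the check.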

Browser Manager (browser.ts)

State Tracking:

  • contexts: BrowserContext[], pages: Page[], activePageIndex: number
  • refMap: RefMap, consoleMessages, pageErrors
  • cdpSession: CDPSession | null

Multi-Engine Support: Chromium (default), Firefox, WebKit via Playwright.

  • Extensions: Chromium only
  • File access (--allow-file-access): Chromium only

Remote Providers:

  • Browserbase (browser.ts:743-799): BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID
  • Kernel (browser.ts:850-940): KERNEL_API_KEY
  • Browser Use (browser.ts:946-1013): BROWSER_USE_API_KEY

Session Isolation

One daemon per session. Each session has:

  • Dedicated socket/port/PID files
  • Independent BrowserManager instance
  • Separate cookie/storage/auth state

Multi-Tab Within Session:

  • newTab(): New page in first context
  • newWindow(): New context with separate page
  • switchTo(index): Switch active page
  • closeTab(index): Close specific tab

CDP Integration (browser.ts:1437-1605)

  • Screencast: Page.startScreencast via CDP (JPEG/PNG frames, configurable quality/resolution)
  • Mouse injection: Input.dispatchMouseEvent (mousePressed/Released/Moved/Wheel)
  • Keyboard injection: Input.dispatchKeyEvent (keyDown/keyUp/char)
  • Touch injection: Input.dispatchTouchEvent (touchStart/End/Move/Cancel)

WebSocket Stream Server (stream-server.ts)

  • Port 9223 default
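  • Origin validation: rejects connections that carry a browser Origin header, so arbitrary web pages cannot attach to the stream (WebSocket handshakes are not restricted by CORS)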
  • Message types: Frame, InputMouse, InputKeyboard, InputTouch, Status, Error

4. JSON DSL Protocol (H.17.1.4)

Envelope Format

Request:

{"id": "r123456", "action": "click", "selector": "@e1"}

Success Response:

{"id": "r123456", "success": true, "data": {"clicked": true}}

Error Response:

{"id": "r123456", "success": false, "error": "Element \"@e1\" is blocked by another element"}
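
The envelope can be modelled with a pair of types plus newline-delimited framing. This is a sketch under stated assumptions: the type and function names are hypothetical, and only the `id`, `action`, `success`, `data`, and `error` fields come from the examples above.

```typescript
// Sketch only: type and function names are illustrative, not from protocol.ts.
interface DslRequest {
  id: string;
  action: string;
  [field: string]: unknown; // action-specific fields such as `selector`
}

type DslResponse =
  | { id: string; success: true; data: unknown }
  | { id: string; success: false; error: string };

// One request per line: JSON followed by a newline.
function encodeRequest(req: DslRequest): string {
  return JSON.stringify(req) + "\n";
}

// Split a receive buffer into complete responses plus any trailing partial line.
function decodeResponses(buffer: string): { responses: DslResponse[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // empty when the buffer ended on "\n"
  const responses = lines
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as DslResponse);
  return { responses, rest };
}
```

Keeping the unfinished tail in `rest` lets a caller append the next chunk from the socket and decode again, so a response split across reads is never parsed half-way.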

Validation

Zod discriminated union (protocol.ts:796-922) validates all 154 commands at runtime. parseCommand() returns typed result or validation error with field paths.
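
To make the idea concrete without depending on Zod, the sketch below hand-rolls the same pattern: the `action` field selects which fields are required, and failures are reported with field paths. It is a stand-in for the Zod schema, not a copy of it. Only three commands are covered, and the `value` and `url` field names are assumptions.

```typescript
// Sketch only: a hand-written stand-in for the Zod discriminated union.
// The `value` and `url` field names are assumed for illustration.
type Command =
  | { id: string; action: "click"; selector: string }
  | { id: string; action: "fill"; selector: string; value: string }
  | { id: string; action: "navigate"; url: string };

type ParseResult =
  | { success: true; command: Command }
  | { success: false; error: string };

// Required string fields per action (the discriminator).
const REQUIRED_FIELDS = new Map<string, string[]>([
  ["click", ["selector"]],
  ["fill", ["selector", "value"]],
  ["navigate", ["url"]],
]);

function parseCommand(line: string): ParseResult {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return { success: false, error: "Invalid JSON" };
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { success: false, error: "Expected a JSON object" };
  }
  const obj = raw as { [field: string]: unknown };
  const problems: string[] = [];
  if (typeof obj["id"] !== "string") problems.push("id: expected string");
  const action = obj["action"];
  const required = typeof action === "string" ? REQUIRED_FIELDS.get(action) : undefined;
  if (required === undefined) {
    problems.push("action: unknown or missing action");
  } else {
    for (const field of required) {
      if (typeof obj[field] !== "string") problems.push(`${field}: expected string`);
    }
  }
  if (problems.length > 0) return { success: false, error: problems.join("; ") };
  return { success: true, command: obj as unknown as Command };
}
```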

Command Catalog (154 Commands, 26 Categories)

| Category | Count | Key Commands |
|----------|-------|--------------|
| Session/Lifecycle | 8 | launch, close, tab_new, tab_list, tab_switch, connect |
| Navigation | 6 | navigate, back, forward, reload, url, title |
| Element Interaction | 24 | click, type, fill, press, hover, check, select, drag, upload |
| Element State | 8 | gettext, getattribute, getvalue, isvisible, isenabled, ischecked |
| Element Measurement | 4 | count, boundingbox, styles, content |
| Frame Handling | 2 | frame, mainframe |
| Semantic Locators | 7 | getbyrole, getbytext, getbylabel, getbyplaceholder, getbytestid |
| Position Selection | 1 | nth |
| Wait Operations | 5 | wait, waitforurl, waitforloadstate, waitforfunction, waitfordownload |
| Cookies | 3 | cookies_get, cookies_set, cookies_clear |
| Storage | 3 | storage_get, storage_set, storage_clear |
| Network | 4 | route, unroute, requests, responsebody |
| Dialog Handling | 1 | dialog |
| Emulation | 8 | viewport, device, useragent, geolocation, permissions, timezone |
| HTTP/Headers | 2 | headers, offline |
| Media Emulation | 1 | emulatemedia |
| Download/PDF | 2 | download, pdf |
| Screenshots/Snapshots | 2 | screenshot, snapshot |
| JS Execution | 6 | evaluate, evalhandle, addscript, addstyle, addinitscript, expose |
| Debugging | 4 | console, errors, highlight, pause |
| Video/Recording | 5 | video_start/stop, recording_start/stop/restart |
| Tracing/HAR | 4 | trace_start/stop, har_start/stop |
| State Persistence | 2 | state_save, state_load |
| Mouse Control | 5 | mousemove, mousedown, mouseup, wheel, bringtofront |
| Streaming/Input | 5 | screencast_start/stop, input_mouse, input_keyboard, input_touch |
| iOS-Specific | 2 | swipe, device_list |

AI-Friendly Error Translation (actions.ts:151-204)

Playwright errors are converted to actionable AI messages:

  • Multiple matches: "Selector matched N elements. Run 'snapshot' to get updated refs."
  • Blocked by overlay: "Element blocked by another element. Try dismissing modals/cookie banners."
  • Not visible: "Element not visible. Try scrolling into view."
  • Timeout: "Action timed out. Run 'snapshot' to check current page state."
  • Not found: "Element not found. Run 'snapshot' to see current page elements."
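
A translation layer of this kind can be sketched as an ordered list of pattern checks. The target messages are the ones listed above. The patterns matched against the raw Playwright text are assumptions based on commonly seen Playwright error wording and are not taken from actions.ts. The function name is hypothetical.

```typescript
// Sketch only: the raw-message patterns are assumptions, not the rules in actions.ts.
function toAiFriendlyError(raw: string): string {
  // Strict-mode violations report how many elements the locator resolved to.
  const strict = /strict mode violation[\s\S]*?resolved to (\d+) elements/.exec(raw);
  if (strict) {
    return `Selector matched ${strict[1]} elements. Run 'snapshot' to get updated refs.`;
  }
  // Checked before the timeout rule: blocked clicks usually surface as timeouts too.
  if (raw.includes("intercepts pointer events")) {
    return "Element blocked by another element. Try dismissing modals/cookie banners.";
  }
  if (raw.includes("not visible")) {
    return "Element not visible. Try scrolling into view.";
  }
  if (/Timeout \d+ms exceeded/.test(raw)) {
    return "Action timed out. Run 'snapshot' to check current page state.";
  }
  if (/not found/i.test(raw)) {
    return "Element not found. Run 'snapshot' to see current page elements.";
  }
  return raw; // unknown errors pass through unchanged
}
```

The ordering matters: a more specific cause is reported in preference to the generic timeout that often wraps it, which gives the agent a concrete next step.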

5. Snapshot Engine (H.17.1.5)

How It Works

  1. Calls Playwright's ariaSnapshot() on root or scoped CSS selector
  2. Processes accessibility tree line-by-line (O(n) single pass)
  3. Assigns auto-incrementing refs (e1, e2, ...) to interactive/named-content elements
  4. Returns enhanced tree text + RefMap for subsequent commands

Element Reference System

Ref format: @e1, @e2, etc.

  • Generated per-snapshot (counter resets each time)
  • Cached in BrowserManager.refMap until next snapshot
  • Invalidated on page navigation (must re-snapshot)
  • Resolution: browser.getLocator("@e1") -> Playwright getByRole() with exact name match

RefMap Structure:

interface RefMap {
  [ref: string]: {
    selector: string; // e.g., "getByRole('button', { name: \"Submit\", exact: true })"
    role: string;     // e.g., 'button', 'link', 'textbox'
    name?: string;    // e.g., "Submit"
    nth?: number;     // Disambiguation index (only for duplicates)
  };
}
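
The ref-assignment pass can be illustrated on a simplified accessibility tree. This is a sketch, not snapshot.ts: the line shape (`- role "name"`), the `[ref=eN]` annotation, and the five-role interactive set are assumptions, and named-content elements are ignored for brevity. The entry shape follows the RefMap structure above.

```typescript
// Sketch only: line format, annotation style, and the role subset are assumptions.
interface RefEntry {
  selector: string;
  role: string;
  name?: string;
  nth?: number;
}
type RefMap = { [ref: string]: RefEntry };

// Illustrative subset; the source mentions 17 interactive ARIA roles without listing them.
const INTERACTIVE_ROLES = new Set<string>(["button", "link", "textbox", "checkbox", "combobox"]);

function assignRefs(ariaTree: string): { tree: string; refs: RefMap } {
  const refs: RefMap = {};
  const byRoleName = new Map<string, string[]>(); // "role|name" -> refs sharing it
  let counter = 0; // starts from zero on every snapshot

  const lines = ariaTree.split("\n").map((line) => {
    const match = /^(\s*)- (\w+)(?: "([^"]*)")?/.exec(line);
    if (!match) return line;
    const role = match[2] ?? "";
    const name: string | undefined = match[3];
    if (!INTERACTIVE_ROLES.has(role)) return line;

    const ref = `e${++counter}`;
    const key = `${role}|${name ?? ""}`;
    const group = byRoleName.get(key) ?? [];
    const nameArg = name === undefined ? "" : `, { name: ${JSON.stringify(name)}, exact: true }`;
    const entry: RefEntry = { selector: `getByRole('${role}'${nameArg})`, role, nth: group.length };
    if (name !== undefined) entry.name = name;
    refs[ref] = entry;
    group.push(ref);
    byRoleName.set(key, group);

    // Insert the ref right after the role/name prefix, before any trailing attributes.
    const prefix = match[0];
    return `${prefix} [ref=${ref}]${line.slice(prefix.length)}`;
  });

  // Keep nth only where a role+name pair occurred more than once.
  byRoleName.forEach((group) => {
    if (group.length === 1) {
      const only = refs[group[0] ?? ""];
      if (only) delete only.nth;
    }
  });

  return { tree: lines.join("\n"), refs };
}
```

Two identical "Submit" buttons would come back as e1 and e2 with `nth` 0 and 1, while a unique textbox carries no `nth`, matching the "only for duplicates" rule in the structure above.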

Filtering Modes

| Mode | Flag | Effect |
|------|------|--------|
| Interactive only | -i | Only buttons, links, inputs, etc. (17 ARIA roles) |
| Compact | -c | Removes unnamed structural elements without ref-containing children |
| Depth limit | -d N | Cuts tree at depth N |
| CSS scope | -s "selector" | Scopes to CSS selector subtree |
| Cursor detection | --cursor | Detects cursor:pointer, onclick, tabindex elements |

Performance

  • 93% token reduction: ~200-400 tokens vs 3000-5000 for raw HTML
  • Single-pass O(n) line processing
  • Duplicate handling: RoleNameTracker adds nth only when >1 match
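
As a quick check on how the headline figure relates to the quoted ranges, the reduction can be computed directly. The function name is hypothetical, and the inputs are simply the range endpoints and midpoints from the bullet above.

```typescript
// Percentage of tokens saved when a snapshot replaces raw HTML.
function tokenReduction(snapshotTokens: number, htmlTokens: number): number {
  return (1 - snapshotTokens / htmlTokens) * 100;
}

const midpoint = tokenReduction(300, 4000); // midpoints of both ranges: about 92.5
const best = tokenReduction(200, 5000);     // smallest snapshot vs largest page: about 96
const worst = tokenReduction(400, 3000);    // largest snapshot vs smallest page: about 86.7
```

Taken at face value, the ranges imply a reduction of roughly 87% to 96%, with the midpoints landing close to the 93% quoted.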

6. Capability Mapping to CODITECT Patterns (H.17.1.6)

Agent Mapping

| agent-browser Feature | CODITECT Agent | Integration Point |
|-----------------------|----------------|-------------------|
| Browser automation (154 commands) | coditect-browser-agent.md | Primary agent for all browser tasks |
| Screenshot/snapshot | frontend-development-agent | Visual testing, component screenshots |
| Network interception | api-integration-specialist | API mocking, request capture |
| Accessibility tree | accessibility-testing-specialist | WCAG compliance scanning |
| Session management | multi-agent-coordinator | Multi-browser session orchestration |
| Error translation | debugger | Browser error diagnosis |
| iOS automation | mobile-testing-specialist | Cross-platform mobile testing |

Skill Mapping

| agent-browser Capability | CODITECT Skill | Track |
|--------------------------|----------------|-------|
| Browser control patterns | browser-automation-patterns/SKILL.md | H.17 |
| Snapshot + ref system | Extension of memory-context-patterns/SKILL.md | J |
| JSON DSL protocol | Extension of api-design-patterns/SKILL.md | A |
| Binary distribution | Extension of binary-distribution-patterns/SKILL.md | C |
| Error recovery | Extension of error-handling-resilience/SKILL.md | H |
| State persistence | Extension of cloud-native-patterns/SKILL.md | C |

Hook Mapping

| Hook | Trigger | Purpose |
|------|---------|---------|
| browser-auto-launch.py | PreToolUse:Bash | Auto-launch daemon when browser commands detected |
| browser-screenshot-on-error.py | PostToolUse:Bash | Auto-screenshot on page/navigation errors |
| browser-snapshot-cache.py | PostToolUse:Bash | Cache last snapshot in context for /cxq queries |
| browser-session-cleanup.py | SessionEnd | Clean up daemon processes on session end |

Command Mapping

| CODITECT Command | Implementation | Purpose |
|------------------|----------------|---------|
| /browser navigate <url> | navigate action | Open URL in browser |
| /browser click <selector> | click action | Click element |
| /browser snapshot | snapshot action | Get page accessibility tree |
| /browser screenshot [path] | screenshot action | Capture screenshot |
| /browser fill <selector> <value> | fill action | Fill form field |
| /browser eval <script> | evaluate action | Execute JavaScript |
| /browser session list | session list | List active browser sessions |
| /browser close | close action | Close browser |

MCP Server Integration

Expose browser tools to all CODITECT agents via MCP:

  • browser_navigate - Navigate to URL
  • browser_click - Click element by ref or selector
  • browser_snapshot - Get page accessibility tree with refs
  • browser_screenshot - Capture screenshot
  • browser_fill - Fill form field
  • browser_evaluate - Execute JavaScript on page
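
One way to picture the MCP layer is as a thin translation from tool calls to JSON DSL requests. The sketch below assumes each tool maps one-to-one onto the action of the same name and passes its arguments through unchanged. The function, type, and argument names are hypothetical, and no MCP SDK calls are shown.

```typescript
// Sketch only: a hypothetical tool-to-action mapping; no MCP SDK is involved.
const MCP_TOOL_ACTIONS = new Map<string, string>([
  ["browser_navigate", "navigate"],
  ["browser_click", "click"],
  ["browser_snapshot", "snapshot"],
  ["browser_screenshot", "screenshot"],
  ["browser_fill", "fill"],
  ["browser_evaluate", "evaluate"],
]);

interface BrowserRequest {
  id: string;
  action: string;
  [field: string]: unknown;
}

function toolCallToRequest(
  tool: string,
  args: { [field: string]: unknown },
  id: string,
): BrowserRequest {
  const action = MCP_TOOL_ACTIONS.get(tool);
  if (action === undefined) throw new Error(`Unknown browser tool: ${tool}`);
  // Spread args first so a caller cannot override id or action.
  return { ...args, id, action };
}
```

For example, `toolCallToRequest("browser_click", { selector: "@e1" }, "r1")` yields a request with `id` "r1", `action` "click", and `selector` "@e1", which is the envelope shape the daemon expects.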

Context System Integration

| Feature | Integration |
|---------|-------------|
| /cx (capture context) | Captures current page URL, title, snapshot, console errors |
| /cxq "search" (query context) | Queries cached browser snapshots for element/content search |
| /session-log | Logs browser actions with timestamps and screenshots |
| Message bus | Registers browser session in messaging.db for cross-LLM coordination |

MoE Agent Dispatcher Integration

Auto-route browser-related tasks to coditect-browser-agent:

  • Keywords: browser, webpage, click, navigate, screenshot, form, login, scrape
  • Confidence threshold: 0.7 for automatic routing
  • Fallback: senior-architect for complex browser automation workflows
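
The keyword routing rule can be sketched as below. The keyword list and the 0.7 threshold come from the bullets above. The scoring function is purely hypothetical, since the analysis does not say how confidence is computed, and the senior-architect fallback is left out because its trigger condition is not specified.

```typescript
// Sketch only: the scoring formula is invented for illustration.
const BROWSER_KEYWORDS = new Set<string>([
  "browser", "webpage", "click", "navigate", "screenshot", "form", "login", "scrape",
]);
const ROUTING_THRESHOLD = 0.7;

// Hypothetical score: every distinct keyword hit halves the remaining doubt
// (1 hit = 0.5, 2 hits = 0.75, 3 hits = 0.875).
function browserConfidence(task: string): number {
  const words = task.toLowerCase().split(/[^a-z]+/);
  const hits = new Set<string>(words.filter((word) => BROWSER_KEYWORDS.has(word)));
  return 1 - Math.pow(0.5, hits.size);
}

// Returns the agent to route to, or null when confidence is below the threshold.
function routeTask(task: string): string | null {
  return browserConfidence(task) >= ROUTING_THRESHOLD ? "coditect-browser-agent" : null;
}
```

With this scoring a single keyword is not enough to trigger automatic routing, which avoids sending every task that mentions "form" or "click" to the browser agent.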

7. Implementation Recommendations

Phase 1: Core (H.17.2-H.17.3) - Estimated 30-45 hours

  1. Fork protocol layer from agent-browser (Zod schemas, types, protocol)
  2. Implement coditect-browser daemon wrapping Playwright with CODITECT session integration
  3. Implement Rust CLI based on agent-browser patterns (reuse IPC/retry logic)
  4. Add CODITECT-specific commands: /cx integration, session-bus registration, MoE routing

Phase 2: Framework Integration (H.17.4) - Estimated 15-20 hours

  1. Create agent, skill, command, hooks
  2. MCP server for tool exposure
  3. Context system integration (/cx, /cxq)
  4. MoE dispatcher keyword routing

Phase 3: Testing & Docs (H.17.5) - Estimated 7-12 hours

  1. Protocol validation tests
  2. E2E browser workflow tests
  3. Performance benchmarks
  4. User documentation

Key Design Decisions

| Decision | Recommendation | Rationale |
|----------|----------------|-----------|
| Fork vs wrap | Wrap agent-browser as dependency | Faster, maintained upstream, MIT license |
| Protocol extension | Extend JSON DSL with CODITECT ops | Backward compatible, reuse Zod schemas |
| Binary distribution | npm + native optional deps | Follows agent-browser pattern, proven |
| Snapshot storage | Cache in BrowserManager + context.db | Enables /cxq queries across sessions |
| Session isolation | One daemon per session | Proven isolation, no cross-contamination |

Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Playwright version drift | Medium | Pin playwright-core version, test on upgrade |
| Binary build CI complexity | Low | Reuse agent-browser's build matrix |
| Token budget for snapshots | Low | Already 93% reduced; -i flag for minimal |
| CDP API changes | Low | Only used for screencast/input (optional) |
| iOS automation complexity | Medium | Defer to Phase 2+; not needed for PILOT |

8. Dependencies

Runtime

| Package | Version | Purpose |
|---------|---------|---------|
| playwright-core | ^1.57.0 | Browser automation |
| zod | ^3.22.4 | Schema validation |
| ws | ^8.19.0 | WebSocket (stream server) |
| node-simctl | ^7.4.0 | iOS simulator control |
| webdriverio | ^9.15.0 | iOS automation via Appium |

Rust CLI Crates

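| Crate | Version | Purpose |
|-------|---------|---------|
| serde | 1.0 | JSON serialization |
| serde_json | 1.0 | JSON parsing |
| dirs | 5.0 | Cross-platform home directory |
| base64 | 0.22 | Script encoding |
| libc | 0.2 | Unix syscalls |
| windows-sys | 0.52 | Win32 process management |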

9. Test Coverage

| Test File | Lines | Coverage |
|-----------|-------|----------|
| protocol.test.ts | 1076 | All 154 command schemas |
| browser.test.ts | 744 | Browser launch, tab/window management |
| ios-manager.test.ts | 157 | iOS device listing, session management |
| daemon.test.ts | 96 | HTTP detection, socket directory resolution |
| actions.test.ts | 39 | AI-friendly error translation |

Appendix: Key File Paths

All paths relative to submodules/labs/agent-browser/:

| File | Purpose |
|------|---------|
| cli/src/main.rs | CLI entry point |
| cli/src/commands.rs | Command parsing and dispatch |
| cli/src/flags.rs | Two-phase flag parsing |
| cli/src/connection.rs | IPC, daemon lifecycle, retry |
| cli/Cargo.toml | Rust dependencies |
| src/daemon.ts | Node.js IPC server |
| src/browser.ts | Playwright browser manager |
| src/actions.ts | 154 command handlers |
| src/protocol.ts | Zod validation schemas |
| src/types.ts | TypeScript types |
| src/snapshot.ts | Snapshot engine |
| src/stream-server.ts | WebSocket stream server |
| src/ios-manager.ts | iOS automation |
| skills/agent-browser/references/snapshot-refs.md | Ref system documentation |
| skills/agent-browser/references/commands.md | CLI command reference |