Agent-Browser Technical Analysis for CODITECT Integration

Task: H.17.1.7 | Track: H (Framework Autonomy)
Date: 2026-02-08
Author: Claude (Opus 4.6)
Source: submodules/labs/agent-browser (commit 4d8097a)
Repository: https://github.com/vercel-labs/agent-browser

Executive Summary

Agent-browser is a hybrid Rust CLI + Node.js Playwright daemon that exposes 154 browser automation commands through a JSON DSL protocol. Its accessibility-tree snapshot engine tags elements with references (@e1, @e2). This cuts token usage by roughly 93% compared with raw HTML. The architecture maps cleanly onto CODITECT's agent/skill/hook/command framework, and this analysis found no blockers to integration.

Key Numbers:

154 commands across 26 categories
Rust CLI: <50ms boot, 5 retries with exponential backoff
Snapshot engine: ~200-400 tokens vs 3000-5000 for raw HTML
Supported engines: Chromium, Firefox, WebKit, iOS Safari
Remote providers: Browserbase, Kernel, Browser Use

1. Architecture Overview

                        +-------------------+
                        |   CODITECT Agent  |
                        |  (coditect-browser|
                        |    -agent.md)     |
                        +--------+----------+
                                 |
                    JSON DSL (newline-delimited)
                                 |
              +------------------v------------------+
              |           Rust CLI Binary            |
              |  cli/src/main.rs (530 lines)         |
              |  - Command parsing (commands.rs)     |
              |  - Flag parsing (flags.rs)           |
              |  - IPC connection (connection.rs)    |
              +------------------+------------------+
                                 |
                   Unix Domain Socket (macOS/Linux)
                   TCP Port (Windows)
                                 |
              +------------------v------------------+
              |         Node.js Daemon               |
              |  src/daemon.ts (453 lines)           |
              |  - IPC server                        |
              |  - Session management                |
              |  - Command dispatch                  |
              +------------------+------------------+
                                 |
              +------------------v------------------+
              |        Playwright Core               |
              |  src/browser.ts (1902 lines)         |
              |  - Multi-engine support              |
              |  - Multi-tab/window management       |
              |  - CDP integration                   |
              |  - Screencast/input injection        |
              +------------------+------------------+
                                 |
              +--------+---------+---------+--------+
              |        |         |         |        |
          Chromium  Firefox  WebKit   iOS Safari  Remote
                                    (Appium)    Providers

Component Summary

Component	File	Lines	Purpose
CLI entry	`cli/src/main.rs`	530	Arg parsing, daemon lifecycle
Commands	`cli/src/commands.rs`	895	40+ CLI command handlers
Flags	`cli/src/flags.rs`	183	Two-phase flag parsing (env + CLI)
IPC	`cli/src/connection.rs`	557	Socket/TCP, retry logic, daemon start
Daemon	`src/daemon.ts`	453	IPC server, session management
Browser	`src/browser.ts`	1902	Playwright lifecycle, state tracking
Actions	`src/actions.ts`	2045	154 command handlers
Protocol	`src/protocol.ts`	977	Zod schema validation
Types	`src/types.ts`	1075	TypeScript command/response types
Snapshot	`src/snapshot.ts`	618	Accessibility tree + element refs
Stream	`src/stream-server.ts`	382	WebSocket screencast + input
iOS	`src/ios-manager.ts`	1299	Appium/WebdriverIO Safari

2. Rust CLI Architecture (H.17.1.2)

Flag Parsing (Two-Phase)

Phase 1 - Environment Variables (Priority): flags.rs:38-66

AGENT_BROWSER_SESSION (default: "default")
AGENT_BROWSER_EXECUTABLE_PATH, AGENT_BROWSER_EXTENSIONS
AGENT_BROWSER_PROFILE, AGENT_BROWSER_STATE
AGENT_BROWSER_PROXY, AGENT_BROWSER_PROXY_BYPASS
AGENT_BROWSER_PROVIDER, AGENT_BROWSER_IOS_DEVICE

Phase 2 - CLI Arguments: flags.rs:80-182

Supported: --json, --full/-f, --headed, --debug, --session, --executable-path, --extension, --cdp, --profile, --state, --proxy, --user-agent, -p/--provider, --device
Tracking: cli_*_path booleans record which flags were passed on the command line, so the CLI can warn when those flags are ignored because a daemon is already running

Command Dispatch

commands.rs:81-895 - Match-based dispatch to 40+ CLI command handlers.

Categories:

Navigation: open/goto/navigate, back, forward, reload
Core Actions: click, dblclick, fill, type, hover, focus, check, select, drag, upload
Keyboard: press/key, keydown, keyup
Scroll: scroll, scrollintoview
Wait: Complex multi-flag (--url, --load, --fn, --text, --download)
Evaluation: eval with optional base64 encoding
Session: close/quit/exit, connect (CDP)
Queries: get, is, find, mouse, set, network
Data: cookies, storage
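The alias handling above can be pictured as a match from CLI words to JSON DSL requests. The sketch is in TypeScript rather than Rust so that all examples use one language. The id scheme and the field names (`url`, `selector`, `value`) are assumptions and are not taken from commands.rs.

```typescript
// Hypothetical sketch of commands.rs-style dispatch (not the upstream code).
// Field names and the id scheme are assumptions for illustration.
type DslRequest = { id: string; action: string; [field: string]: unknown };

let requestCounter = 0;
function nextId(): string {
  requestCounter += 1;
  return `r${requestCounter}`; // hypothetical id scheme echoing the "r123456" examples
}

function parseCliCommand(args: string[]): DslRequest {
  const [cmd, ...rest] = args;
  switch (cmd) {
    case "open":
    case "goto":
    case "navigate":
      // Three CLI aliases collapse onto one protocol action.
      return { id: nextId(), action: "navigate", url: rest[0] };
    case "back":
    case "forward":
    case "reload":
      return { id: nextId(), action: cmd };
    case "click":
    case "hover":
    case "focus":
      return { id: nextId(), action: cmd, selector: rest[0] };
    case "fill":
      // Everything after the selector is treated as the value to fill.
      return { id: nextId(), action: "fill", selector: rest[0], value: rest.slice(1).join(" ") };
    case "close":
    case "quit":
    case "exit":
      return { id: nextId(), action: "close" };
    default:
      throw new Error(`Unknown command: ${cmd ?? "(none)"}`);
  }
}
```

Resolving aliases in the CLI keeps the daemon's action set small. The daemon validates only canonical action names.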

IPC Mechanism

Socket Resolution (Ordered Priority): connection.rs:86-108

AGENT_BROWSER_SOCKET_DIR env var
XDG_RUNTIME_DIR (Linux; e.g. /run/user/1000/agent-browser for UID 1000)
~/.agent-browser home directory fallback
env::temp_dir() last resort
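Below is a minimal sketch of the ordered fallback. The environment, home directory, and temp directory are injected so the function can be tested. Two details are assumptions: the subdirectory name under the temp dir, and treating empty variables as unset.

```typescript
// Sketch of the socket-directory priority order (inputs injected for testability).
type Env = Record<string, string | undefined>;

function resolveSocketDir(env: Env, homeDir: string | null, tempDir: string): string {
  // 1. Explicit override wins.
  const explicit = env["AGENT_BROWSER_SOCKET_DIR"];
  if (explicit) return explicit;
  // 2. Per-user runtime directory (Linux).
  const xdg = env["XDG_RUNTIME_DIR"];
  if (xdg) return `${xdg}/agent-browser`;
  // 3. Home-directory fallback.
  if (homeDir) return `${homeDir}/.agent-browser`;
  // 4. Last resort: system temp dir (subdirectory name is an assumption).
  return `${tempDir}/agent-browser`;
}
```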

Platform-Specific:

Unix/macOS: {socket_dir}/{session}.sock (Unix domain socket, max 103 bytes)
Windows: TCP on hash-derived port (formula: 49152 + ((hash % 16383) as u16))
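The two endpoint rules can be sketched as follows. The analysis does not name the hash function, so FNV-1a serves here as a hypothetical stand-in. The port arithmetic and the 103-byte limit come from the text above. macOS reserves 104 bytes for `sun_path`, and one of those is the NUL terminator.

```typescript
// Hypothetical stand-in hash: 32-bit FNV-1a over UTF-16 code units.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// 49152 + (hash % 16383) keeps the port within 49152..65534 (IANA dynamic range).
function windowsPort(session: string): number {
  return 49152 + (fnv1a32(session) % 16383);
}

const MAX_SOCKET_PATH_BYTES = 103;

// Unix: {socket_dir}/{session}.sock, rejected if it exceeds the byte limit.
function unixSocketPath(socketDir: string, session: string): string {
  const path = `${socketDir}/${session}.sock`;
  const bytes = new TextEncoder().encode(path).length;
  if (bytes > MAX_SOCKET_PATH_BYTES) {
    throw new Error(`Socket path is ${bytes} bytes; max is ${MAX_SOCKET_PATH_BYTES}`);
  }
  return path;
}
```

Deriving the port from the session name lets the CLI and the daemon find each other without a registry. The cost is a small chance that two session names collide on the same port.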

Protocol: Newline-delimited JSON. Read timeout: 30s, Write timeout: 5s.
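Newline-delimited JSON needs a small framing layer, because one socket read may hold a partial message or several messages at once. Here is a sketch of such a decoder. The class and function names are hypothetical.

```typescript
// Accumulates raw socket text and emits one parsed value per complete line.
class NdjsonDecoder {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const messages: unknown[] = [];
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      const line = this.buffer.slice(0, newline).trim();
      this.buffer = this.buffer.slice(newline + 1);
      if (line.length > 0) messages.push(JSON.parse(line));
      newline = this.buffer.indexOf("\n");
    }
    return messages; // an incomplete trailing line stays buffered
  }
}

// One message per line: JSON.stringify escapes any newlines inside strings.
function encodeNdjson(message: unknown): string {
  return JSON.stringify(message) + "\n";
}
```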

Retry Logic

connection.rs:484-513 - 5 retries, 200ms exponential backoff.

Transient Error Detection (connection.rs:521-535):

macOS: os error 35 (EAGAIN), 54 (ECONNRESET), 61 (ECONNREFUSED)
Linux: os error 11 (EAGAIN), 104 (ECONNRESET), 111 (ECONNREFUSED)
Cross-platform: WouldBlock, EOF, empty JSON, Broken pipe
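The retry policy can be sketched as an errno classifier plus a backoff schedule. This sketch assumes 200 ms is the first delay and that each retry doubles it. The message substrings for the cross-platform cases are also assumptions, and the empty-JSON case is not modeled.

```typescript
type Platform = "macos" | "linux";

// errno values worth retrying: EAGAIN, ECONNRESET, ECONNREFUSED.
const TRANSIENT_ERRNOS: Record<Platform, number[]> = {
  macos: [35, 54, 61],
  linux: [11, 104, 111],
};

function isTransient(platform: Platform, message: string): boolean {
  const m = /os error (\d+)/.exec(message);
  if (m && TRANSIENT_ERRNOS[platform].indexOf(Number(m[1])) !== -1) return true;
  // Assumed substrings for the cross-platform cases.
  return /WouldBlock|EOF|Broken pipe/.test(message);
}

// Assumption: 200 ms base delay, doubling per attempt.
function backoffDelays(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseMs * 2 ** attempt);
}

async function withRetry<T>(
  op: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  delays: number[] = backoffDelays(5, 200),
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Give up once the schedule is exhausted or the error is not transient.
      if (attempt >= delays.length || !shouldRetry(err)) throw err;
      await sleep(delays[attempt] ?? 0);
    }
  }
}
```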

Daemon Lifecycle

ensure_daemon() (connection.rs:206-465):

Check if daemon running (double-check with 150ms sleep for race condition)
Clean stale .sock/.pid files
Validate socket path length (103 bytes max)
Test directory writeability
Fork+detach: libc::setsid() (Unix), CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS (Windows)
Readiness polling: 50 iterations x 100ms = 5s timeout
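The readiness step amounts to a bounded polling loop. In this sketch the probe and the sleep function are injected, which is an assumption made for testability. A probe could, for example, attempt a socket connection.

```typescript
// Polls up to `attempts` times, `intervalMs` apart (50 x 100 ms = about 5 s by default).
async function waitForDaemon(
  probe: () => Promise<boolean>, // e.g. "can we connect to the socket/port?"
  attempts = 50,
  intervalMs = 100,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  for (let i = 1; i <= attempts; i++) {
    if (await probe()) return i; // number of probes it took
    if (i < attempts) await sleep(intervalMs);
  }
  throw new Error(`Daemon not ready after ${attempts * intervalMs} ms`);
}
```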

Binary Distribution

GitHub releases per platform: agent-browser-{os}-{arch}[.exe]
npm postinstall.js downloads binary, patches global npm shims to bypass Node.js wrapper
Cargo release profile: opt-level=3, lto=true, codegen-units=1, strip=true

3. Node.js Daemon Architecture (H.17.1.3)

Daemon Entry (`daemon.ts`)

Creates net.Server listening on Unix socket or TCP port
Writes PID file for lifecycle management
Rejects HTTP requests (security: detects GET/POST/PUT/... pattern)
Per-command try/catch with graceful error responses
Signal handlers: SIGINT, SIGTERM, SIGHUP for cleanup
uncaughtException/unhandledRejection handlers clean socket before exit
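The HTTP rejection can be approximated by checking whether a connection's first bytes look like an HTTP request line. A valid DSL message always starts with `{`, so JSON traffic never matches. The method list below is an assumption, because the exact pattern in daemon.ts is elided here.

```typescript
// Assumed method list; a DSL line starts with "{" and never matches.
const HTTP_REQUEST_LINE = /^(GET|POST|PUT|DELETE|HEAD|OPTIONS|PATCH|CONNECT|TRACE) /;

function looksLikeHttp(firstChunk: string): boolean {
  return HTTP_REQUEST_LINE.test(firstChunk);
}
```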

Browser Manager (`browser.ts`)

State Tracking:

contexts: BrowserContext[], pages: Page[], activePageIndex: number
refMap: RefMap, consoleMessages, pageErrors
cdpSession: CDPSession | null

Multi-Engine Support: Chromium (default), Firefox, WebKit via Playwright.

Extensions: Chromium only
File access (--allow-file-access): Chromium only

Remote Providers:

Browserbase (browser.ts:743-799): BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID
Kernel (browser.ts:850-940): KERNEL_API_KEY
Browser Use (browser.ts:946-1013): BROWSER_USE_API_KEY

Session Isolation

One daemon per session. Each session has:

Dedicated socket/port/PID files
Independent BrowserManager instance
Separate cookie/storage/auth state

Multi-Tab Within Session:

newTab(): New page in first context
newWindow(): New context with separate page
switchTo(index): Switch active page
closeTab(index): Close specific tab
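Closing a tab must keep `activePageIndex` consistent with the shrinking `pages` array. Below is a generic sketch of that bookkeeping, in which `T` stands in for Playwright's `Page`. Two rules are assumptions: which tab becomes active after a close, and that newly opened tabs become active.

```typescript
class TabList<T> {
  private pages: T[] = [];
  private active = -1;

  open(page: T): number {
    this.pages.push(page);
    this.active = this.pages.length - 1; // assumption: a new tab becomes active
    return this.active;
  }

  switchTo(index: number): T {
    if (index < 0 || index >= this.pages.length) throw new RangeError(`No tab at index ${index}`);
    this.active = index;
    return this.pages[index] as T;
  }

  close(index: number): void {
    if (index < 0 || index >= this.pages.length) throw new RangeError(`No tab at index ${index}`);
    this.pages.splice(index, 1);
    if (index < this.active) {
      this.active -= 1; // same page, shifted one slot left
    } else if (index === this.active) {
      this.active = Math.min(index, this.pages.length - 1); // next tab, or -1 if none remain
    }
  }

  get activeIndex(): number {
    return this.active;
  }

  get count(): number {
    return this.pages.length;
  }
}
```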

CDP Integration (`browser.ts:1437-1605`)

Screencast: Page.startScreencast via CDP (JPEG/PNG frames, configurable quality/resolution)
Mouse injection: Input.dispatchMouseEvent (mousePressed/Released/Moved/Wheel)
Keyboard injection: Input.dispatchKeyEvent (keyDown/keyUp/char)
Touch injection: Input.dispatchTouchEvent (touchStart/End/Move/Cancel)
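The injection path boils down to translating client input into CDP parameters. This sketch maps a mouse message onto documented parameters of `Input.dispatchMouseEvent`: `type`, `x`, `y`, `button`, `clickCount`, `deltaX`, and `deltaY`. The input message shape (`kind` and its values) is hypothetical.

```typescript
// Hypothetical client message shape.
type MouseInput = {
  kind: "down" | "up" | "move" | "wheel";
  x: number;
  y: number;
  button?: "left" | "middle" | "right";
  deltaX?: number;
  deltaY?: number;
};

// Subset of CDP Input.dispatchMouseEvent parameters.
type DispatchMouseEventParams = {
  type: "mousePressed" | "mouseReleased" | "mouseMoved" | "mouseWheel";
  x: number;
  y: number;
  button?: "none" | "left" | "middle" | "right";
  clickCount?: number;
  deltaX?: number;
  deltaY?: number;
};

function toCdpMouseEvent(input: MouseInput): DispatchMouseEventParams {
  const base = { x: input.x, y: input.y };
  switch (input.kind) {
    case "down":
      return { ...base, type: "mousePressed", button: input.button ?? "left", clickCount: 1 };
    case "up":
      return { ...base, type: "mouseReleased", button: input.button ?? "left", clickCount: 1 };
    case "move":
      return { ...base, type: "mouseMoved", button: "none" };
    case "wheel":
      // Wheel events carry scroll deltas in CSS pixels.
      return { ...base, type: "mouseWheel", deltaX: input.deltaX ?? 0, deltaY: input.deltaY ?? 0 };
    default:
      throw new Error("Unsupported mouse input");
  }
}
```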

WebSocket Stream Server (`stream-server.ts`)

Port 9223 default
Origin validation: rejects connections from browser origins, so arbitrary web pages cannot attach to the stream (WebSockets are not covered by CORS)
Message types: Frame, InputMouse, InputKeyboard, InputTouch, Status, Error

4. JSON DSL Protocol (H.17.1.4)

Envelope Format

Request:

{"id": "r123456", "action": "click", "selector": "@e1"}

Success Response:

{"id": "r123456", "success": true, "data": {"clicked": true}}

Error Response:

{"id": "r123456", "success": false, "error": "Element \"@e1\" is blocked by another element"}

Validation

Zod discriminated union (protocol.ts:796-922) validates all 154 commands at runtime. parseCommand() returns either a typed command or a validation error with field paths.
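Zod is not needed to see how a discriminated union validates the envelope. The `action` field selects which other fields are required. The dependency-free stand-in below models only two actions with assumed field names. `parseCommandSketch` is hypothetical and is not the upstream `parseCommand()`.

```typescript
type Command =
  | { id: string; action: "click"; selector: string }
  | { id: string; action: "navigate"; url: string };

type DslResponse =
  | { id: string; success: true; data: unknown }
  | { id: string; success: false; error: string };

type ParseResult = { ok: true; command: Command } | { ok: false; error: string };

function parseCommandSketch(line: string): ParseResult {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return { ok: false, error: "Invalid JSON" };
  }
  if (typeof raw !== "object" || raw === null) return { ok: false, error: "Expected an object" };
  const obj = raw as Record<string, unknown>;
  if (typeof obj.id !== "string") return { ok: false, error: "id: expected string" };
  // The discriminant decides which fields must be present.
  switch (obj.action) {
    case "click":
      if (typeof obj.selector !== "string") return { ok: false, error: "selector: expected string" };
      return { ok: true, command: { id: obj.id, action: "click", selector: obj.selector } };
    case "navigate":
      if (typeof obj.url !== "string") return { ok: false, error: "url: expected string" };
      return { ok: true, command: { id: obj.id, action: "navigate", url: obj.url } };
    default:
      return { ok: false, error: `action: unknown action ${JSON.stringify(obj.action)}` };
  }
}

// Builds the error envelope shown above.
function errorResponse(id: string, error: string): DslResponse {
  return { id, success: false, error };
}
```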

Command Catalog (154 Commands, 26 Categories)

Note: the per-category counts below sum to 124, so they do not account for all 154 commands.

Category	Count	Key Commands
Session/Lifecycle	8	`launch`, `close`, `tab_new`, `tab_list`, `tab_switch`, `connect`
Navigation	6	`navigate`, `back`, `forward`, `reload`, `url`, `title`
Element Interaction	24	`click`, `type`, `fill`, `press`, `hover`, `check`, `select`, `drag`, `upload`
Element State	8	`gettext`, `getattribute`, `getvalue`, `isvisible`, `isenabled`, `ischecked`
Element Measurement	4	`count`, `boundingbox`, `styles`, `content`
Frame Handling	2	`frame`, `mainframe`
Semantic Locators	7	`getbyrole`, `getbytext`, `getbylabel`, `getbyplaceholder`, `getbytestid`
Position Selection	1	`nth`
Wait Operations	5	`wait`, `waitforurl`, `waitforloadstate`, `waitforfunction`, `waitfordownload`
Cookies	3	`cookies_get`, `cookies_set`, `cookies_clear`
Storage	3	`storage_get`, `storage_set`, `storage_clear`
Network	4	`route`, `unroute`, `requests`, `responsebody`
Dialog Handling	1	`dialog`
Emulation	8	`viewport`, `device`, `useragent`, `geolocation`, `permissions`, `timezone`
HTTP/Headers	2	`headers`, `offline`
Media Emulation	1	`emulatemedia`
Download/PDF	2	`download`, `pdf`
Screenshots/Snapshots	2	`screenshot`, `snapshot`
JS Execution	6	`evaluate`, `evalhandle`, `addscript`, `addstyle`, `addinitscript`, `expose`
Debugging	4	`console`, `errors`, `highlight`, `pause`
Video/Recording	5	`video_start/stop`, `recording_start/stop/restart`
Tracing/HAR	4	`trace_start/stop`, `har_start/stop`
State Persistence	2	`state_save`, `state_load`
Mouse Control	5	`mousemove`, `mousedown`, `mouseup`, `wheel`, `bringtofront`
Streaming/Input	5	`screencast_start/stop`, `input_mouse`, `input_keyboard`, `input_touch`
iOS-Specific	2	`swipe`, `device_list`

AI-Friendly Error Translation (`actions.ts:151-204`)

Playwright errors are converted to actionable AI messages:

Multiple matches: "Selector matched N elements. Run 'snapshot' to get updated refs."
Blocked by overlay: "Element blocked by another element. Try dismissing modals/cookie banners."
Not visible: "Element not visible. Try scrolling into view."
Timeout: "Action timed out. Run 'snapshot' to check current page state."
Not found: "Element not found. Run 'snapshot' to see current page elements."
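A sketch of the translation as an ordered chain of pattern checks. The matched substrings are assumptions based on common Playwright error wording. The not-found case is left out because its trigger text is not described here.

```typescript
// Maps raw Playwright error text to an agent-oriented hint; unknown errors pass through.
function toAiFriendlyError(raw: string): string {
  const strict = /resolved to (\d+) elements/.exec(raw);
  if (strict) {
    return `Selector matched ${strict[1]} elements. Run 'snapshot' to get updated refs.`;
  }
  if (raw.includes("intercepts pointer events")) {
    return "Element blocked by another element. Try dismissing modals/cookie banners.";
  }
  if (raw.includes("not visible")) {
    return "Element not visible. Try scrolling into view.";
  }
  if (/Timeout \d+ms exceeded/.test(raw)) {
    return "Action timed out. Run 'snapshot' to check current page state.";
  }
  return raw;
}
```

The order of the checks matters. Playwright timeout errors often embed a call log that mentions visibility, so the more specific checks run before the generic timeout check.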

5. Snapshot Engine (H.17.1.5)

How It Works

Calls Playwright's ariaSnapshot() on root or scoped CSS selector
Processes accessibility tree line-by-line (O(n) single pass)
Assigns auto-incrementing refs (e1, e2, ...) to interactive/named-content elements
Returns enhanced tree text + RefMap for subsequent commands
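The single pass can be sketched as follows. It rests on three assumptions:

- refs are written inline as `[ref=eN]`,
- only a subset of interactive roles receives refs,
- quoted names contain no escaped quotes.

Duplicate role+name pairs get an `nth` index in a second pass over the collected refs, which keeps the whole process O(n).

```typescript
interface RefEntry {
  selector: string;
  role: string;
  name?: string;
  nth?: number;
}
type RefMap = Record<string, RefEntry>;

// Assumed subset of interactive roles.
const INTERACTIVE = new Set(["button", "link", "textbox", "checkbox", "radio", "combobox", "searchbox", "tab"]);
// Matches ariaSnapshot lines such as `- button "Submit"` or `- link "Docs":`.
const LINE = /^(\s*)- (\w+)(?: "([^"]*)")?(.*)$/;

function assignRefs(ariaText: string): { tree: string; refs: RefMap } {
  const refs: RefMap = {};
  const byRoleName = new Map<string, string[]>(); // role+name -> refs sharing that pair
  let counter = 0;

  // Single pass: annotate interactive lines and record locator data.
  const tree = ariaText
    .split("\n")
    .map((line) => {
      const m = LINE.exec(line);
      if (!m) return line;
      const role = m[2] ?? "";
      if (!INTERACTIVE.has(role)) return line;
      const name: string | undefined = m[3];
      const ref = `e${++counter}`;
      refs[ref] = {
        role,
        name,
        selector: name === undefined
          ? `getByRole('${role}')`
          : `getByRole('${role}', { name: ${JSON.stringify(name)}, exact: true })`,
      };
      const key = JSON.stringify([role, name ?? null]);
      const list = byRoleName.get(key) ?? [];
      list.push(ref);
      byRoleName.set(key, list);
      const label = name === undefined ? "" : ` "${name}"`;
      return `${m[1] ?? ""}- ${role}${label} [ref=${ref}]${m[4] ?? ""}`;
    })
    .join("\n");

  // Add nth only where a role+name pair occurs more than once.
  byRoleName.forEach((list) => {
    if (list.length < 2) return;
    list.forEach((ref, i) => {
      const entry = refs[ref];
      if (entry) entry.nth = i;
    });
  });

  return { tree, refs };
}
```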

Element Reference System

Ref format: @e1, @e2, etc.

Generated per-snapshot (counter resets each time)
Cached in BrowserManager.refMap until next snapshot
Invalidated on page navigation (must re-snapshot)
Resolution: browser.getLocator("@e1") -> Playwright getByRole() with exact name match

RefMap Structure:

interface RefMap {
  [ref: string]: {
    selector: string;  // e.g., "getByRole('button', { name: \"Submit\", exact: true })"
    role: string;      // e.g., 'button', 'link', 'textbox'
    name?: string;     // e.g., "Submit"
    nth?: number;      // Disambiguation index (only for duplicates)
  };
}
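Resolving a ref is then a map lookup. The sketch returns the arguments a caller would pass to Playwright's `page.getByRole()`, followed by `locator.nth()` when `nth` is set. The function and type names are hypothetical. A missing ref is reported as stale, because the counter resets on every snapshot.

```typescript
interface RefMapEntry {
  selector: string;
  role: string;
  name?: string;
  nth?: number;
}

// Arguments for page.getByRole(role, { name, exact }) and, if set, .nth(nth).
type RoleQuery = { role: string; name?: string; exact: true; nth?: number };

function resolveRef(refMap: Record<string, RefMapEntry>, selector: string): RoleQuery | null {
  // Anything other than "@eN" is left to ordinary selector handling.
  if (!/^@e\d+$/.test(selector)) return null;
  const entry = refMap[selector.slice(1)];
  if (!entry) {
    throw new Error(`Unknown ref ${selector}. Run 'snapshot' to get updated refs.`);
  }
  return { role: entry.role, name: entry.name, exact: true, nth: entry.nth };
}
```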

Filtering Modes

Mode	Flag	Effect
Interactive only	`-i`	Only buttons, links, inputs, etc. (17 ARIA roles)
Compact	`-c`	Removes unnamed structural elements without ref-containing children
Depth limit	`-d N`	Cuts tree at depth N
CSS scope	`-s "selector"`	Scopes to CSS selector subtree
Cursor detection	`--cursor`	Detects `cursor:pointer`, `onclick`, `tabindex` elements
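Two of these filters are simple line operations on the snapshot text. The sketch rests on three assumptions: ariaSnapshot uses two-space indentation per level, the root counts as depth 0, and only a subset of interactive roles is recognized. Compact mode and cursor detection are not modeled.

```typescript
// Assumed subset of the roles behind -i.
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "radio", "combobox"]);

// Depth from leading spaces, two per level (root = 0).
function depthOf(line: string): number {
  return Math.floor(line.search(/\S|$/) / 2);
}

// -d N: keep lines at depth N or shallower.
function limitDepth(lines: string[], maxDepth: number): string[] {
  return lines.filter((line) => depthOf(line) <= maxDepth);
}

// -i: keep only lines whose role is interactive.
function interactiveOnly(lines: string[]): string[] {
  return lines.filter((line) => {
    const m = /^\s*- (\w+)/.exec(line);
    return m !== null && INTERACTIVE_ROLES.has(m[1] ?? "");
  });
}
```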

Performance

93% token reduction: ~200-400 tokens vs 3000-5000 for raw HTML
Single-pass O(n) line processing
Duplicate handling: RoleNameTracker adds nth only when the same role+name pair matches more than one element
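The 93% figure matches the midpoints of the quoted ranges. At the extremes of those ranges, the reduction falls between roughly 87% and 96%:

```latex
\text{reduction} = 1 - \frac{T_{\text{snapshot}}}{T_{\text{HTML}}},\qquad
1 - \frac{300}{4000} = 0.925 \approx 93\%,\qquad
1 - \frac{400}{3000} \approx 0.867,\qquad
1 - \frac{200}{5000} = 0.96
```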

6. Capability Mapping to CODITECT Patterns (H.17.1.6)

Agent Mapping

agent-browser Feature	CODITECT Agent	Integration Point
Browser automation (154 commands)	`coditect-browser-agent.md`	Primary agent for all browser tasks
Screenshot/snapshot	`frontend-development-agent`	Visual testing, component screenshots
Network interception	`api-integration-specialist`	API mocking, request capture
Accessibility tree	`accessibility-testing-specialist`	WCAG compliance scanning
Session management	`multi-agent-coordinator`	Multi-browser session orchestration
Error translation	`debugger`	Browser error diagnosis
iOS automation	`mobile-testing-specialist`	Cross-platform mobile testing

Skill Mapping

agent-browser Capability	CODITECT Skill	Track
Browser control patterns	`browser-automation-patterns/SKILL.md`	H.17
Snapshot + ref system	Extension of `memory-context-patterns/SKILL.md`	J
JSON DSL protocol	Extension of `api-design-patterns/SKILL.md`	A
Binary distribution	Extension of `binary-distribution-patterns/SKILL.md`	C
Error recovery	Extension of `error-handling-resilience/SKILL.md`	H
State persistence	Extension of `cloud-native-patterns/SKILL.md`	C

Hook Mapping

Hook	Trigger	Purpose
`browser-auto-launch.py`	`PreToolUse:Bash`	Auto-launch daemon when browser commands detected
`browser-screenshot-on-error.py`	`PostToolUse:Bash`	Auto-screenshot on page/navigation errors
`browser-snapshot-cache.py`	`PostToolUse:Bash`	Cache last snapshot in context for `/cxq` queries
`browser-session-cleanup.py`	`SessionEnd`	Clean up daemon processes on session end

Command Mapping

CODITECT Command	Implementation	Purpose
`/browser navigate <url>`	`navigate` action	Open URL in browser
`/browser click <selector>`	`click` action	Click element
`/browser snapshot`	`snapshot` action	Get page accessibility tree
`/browser screenshot [path]`	`screenshot` action	Capture screenshot
`/browser fill <selector> <value>`	`fill` action	Fill form field
`/browser eval <script>`	`evaluate` action	Execute JavaScript
`/browser session list`	session list	List active browser sessions
`/browser close`	`close` action	Close browser

MCP Server Integration

Expose browser tools to all CODITECT agents via MCP:

browser_navigate - Navigate to URL
browser_click - Click element by ref or selector
browser_snapshot - Get page accessibility tree with refs
browser_screenshot - Capture screenshot
browser_fill - Fill form field
browser_evaluate - Execute JavaScript on page
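Each MCP tool is a name, a description, and a JSON Schema `inputSchema`, so the bridge to the DSL can stay thin. The sketch defines three of the six tools; the others follow the same shape. It assumes a one-to-one mapping from tool name to DSL action.

```typescript
type Tool = {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
};

const TOOLS: Tool[] = [
  {
    name: "browser_navigate",
    description: "Navigate to URL",
    inputSchema: { type: "object", properties: { url: { type: "string" } }, required: ["url"] },
  },
  {
    name: "browser_click",
    description: "Click element by ref or selector",
    inputSchema: {
      type: "object",
      properties: { selector: { type: "string", description: "Ref such as @e1, or a CSS selector" } },
      required: ["selector"],
    },
  },
  {
    name: "browser_snapshot",
    description: "Get page accessibility tree with refs",
    inputSchema: { type: "object", properties: {} },
  },
];

// Assumed one-to-one mapping from tool name to DSL action.
const ACTIONS: Record<string, string> = {
  browser_navigate: "navigate",
  browser_click: "click",
  browser_snapshot: "snapshot",
};

function toolCallToRequest(
  id: string,
  tool: string,
  args: Record<string, unknown>,
): Record<string, unknown> & { id: string; action: string } {
  const action = ACTIONS[tool];
  const spec = TOOLS.find((t) => t.name === tool);
  if (!action || !spec) throw new Error(`Unknown tool: ${tool}`);
  // Enforce required fields before anything reaches the daemon.
  for (const key of spec.inputSchema.required ?? []) {
    if (!(key in args)) throw new Error(`${tool}: missing required argument '${key}'`);
  }
  // id and action go last so tool arguments cannot override them.
  return { ...args, id, action };
}
```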

Context System Integration

Feature	Integration
`/cx` (capture context)	Captures current page URL, title, snapshot, console errors
`/cxq "search"` (query context)	Queries cached browser snapshots for element/content search
`/session-log`	Logs browser actions with timestamps and screenshots
Message bus	Registers browser session in `messaging.db` for cross-LLM coordination

MoE Agent Dispatcher Integration

Auto-route browser-related tasks to coditect-browser-agent:

Keywords: browser, webpage, click, navigate, screenshot, form, login, scrape
Confidence threshold: 0.7 for automatic routing
Fallback: senior-architect for complex browser automation workflows
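The routing rule can be sketched as a keyword scorer compared against the 0.7 threshold. Only the keyword list and the threshold come from this analysis. The saturating score `1 - 0.5^hits` and the whole-word matching are assumptions, and the senior-architect fallback is not modeled.

```typescript
const BROWSER_KEYWORDS = ["browser", "webpage", "click", "navigate", "screenshot", "form", "login", "scrape"];
const THRESHOLD = 0.7;

// Hypothetical score: 1 hit -> 0.5, 2 hits -> 0.75, 3 hits -> 0.875, ...
function browserConfidence(task: string): number {
  const words = new Set(task.toLowerCase().match(/[a-z]+/g) ?? []);
  const hits = BROWSER_KEYWORDS.filter((k) => words.has(k)).length;
  return 1 - 0.5 ** hits;
}

function route(task: string): "coditect-browser-agent" | "unrouted" {
  return browserConfidence(task) >= THRESHOLD ? "coditect-browser-agent" : "unrouted";
}
```

Under this scoring, one matching keyword alone never crosses the threshold; at least two are needed.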

7. Implementation Recommendations

Phase 1: Core (H.17.2-H.17.3) - Estimated 30-45 hours

Fork protocol layer from agent-browser (Zod schemas, types, protocol)
Implement coditect-browser daemon wrapping Playwright with CODITECT session integration
Implement Rust CLI based on agent-browser patterns (reuse IPC/retry logic)
Add CODITECT-specific commands: /cx integration, session-bus registration, MoE routing

Phase 2: Framework Integration (H.17.4) - Estimated 15-20 hours

Create agent, skill, command, hooks
MCP server for tool exposure
Context system integration (/cx, /cxq)
MoE dispatcher keyword routing

Phase 3: Testing & Docs (H.17.5) - Estimated 7-12 hours

Protocol validation tests
E2E browser workflow tests
Performance benchmarks
User documentation

Key Design Decisions

Decision	Recommendation	Rationale
Fork vs wrap	Wrap agent-browser as dependency	Faster, maintained upstream, MIT license
Protocol extension	Extend JSON DSL with CODITECT ops	Backward compatible, reuse Zod schemas
Binary distribution	npm + native optional deps	Follows agent-browser pattern, proven
Snapshot storage	Cache in BrowserManager + context.db	Enables `/cxq` queries across sessions
Session isolation	One daemon per session	Proven isolation, no cross-contamination

Risks & Mitigations

Risk	Impact	Mitigation
Playwright version drift	Medium	Pin playwright-core version, test on upgrade
Binary build CI complexity	Low	Reuse agent-browser's build matrix
Token budget for snapshots	Low	Already 93% reduced; `-i` flag for minimal output
CDP API changes	Low	Only used for screencast/input (optional)
iOS automation complexity	Medium	Defer to Phase 2+; not needed for PILOT

8. Dependencies

Runtime

Package	Version	Purpose
playwright-core	^1.57.0	Browser automation
zod	^3.22.4	Schema validation
ws	^8.19.0	WebSocket (stream server)
node-simctl	^7.4.0	iOS simulator control
webdriverio	^9.15.0	iOS automation via Appium

Rust CLI Crates

Crate	Version	Purpose
serde	1.0	JSON serialization
serde_json	1.0	JSON parsing
dirs	5.0	Cross-platform home directory
base64	0.22	Script encoding
libc	0.2	Unix syscalls
windows-sys	0.52	Win32 process management

9. Test Coverage

Test File	Lines	Coverage
`protocol.test.ts`	1076	All 154 command schemas
`browser.test.ts`	744	Browser launch, tab/window management
`ios-manager.test.ts`	157	iOS device listing, session management
`daemon.test.ts`	96	HTTP detection, socket directory resolution
`actions.test.ts`	39	AI-friendly error translation

Appendix: Key File Paths

All paths relative to submodules/labs/agent-browser/:

File	Purpose
`cli/src/main.rs`	CLI entry point
`cli/src/commands.rs`	Command parsing and dispatch
`cli/src/flags.rs`	Two-phase flag parsing
`cli/src/connection.rs`	IPC, daemon lifecycle, retry
`cli/Cargo.toml`	Rust dependencies
`src/daemon.ts`	Node.js IPC server
`src/browser.ts`	Playwright browser manager
`src/actions.ts`	154 command handlers
`src/protocol.ts`	Zod validation schemas
`src/types.ts`	TypeScript types
`src/snapshot.ts`	Snapshot engine
`src/stream-server.ts`	WebSocket stream server
`src/ios-manager.ts`	iOS automation
`skills/agent-browser/references/snapshot-refs.md`	Ref system documentation
`skills/agent-browser/references/commands.md`	CLI command reference
