ADR-167: Claude Code CLI Integration for Zero-Cost LLM Access
Status: Proposed
Date: 2026-02-09
Author: Claude (Opus 4.6)
Deciders: Hal Casteel, Engineering Team
Tags: llm, claude-code, cli, cost-optimization, sidecar
Context
The codestoryai/sidecar supports 12 LLM providers (OpenAI, Anthropic, Ollama, OpenRouter, Groq, Azure, etc.) via its llm_client crate. Each requires API keys and incurs per-token costs.
CODITECT's development workflow already relies on Claude Code CLI (claude), which provides access to Claude models through the user's existing Anthropic subscription at no additional per-token API cost. The serve.py API server for the UDOM Pipeline Navigator already demonstrated this pattern: invoking `claude --print --message <prompt>` as a subprocess with a 5-second minimum between calls.
We need to integrate Claude Code CLI as an LLM provider in the sidecar to eliminate API costs during development.
Decision
Add ClaudeCodeCLI as a new LLM provider in the sidecar's llm_client crate, using subprocess invocation of the locally-installed claude CLI binary.
Implementation
- Add a `ClaudeCodeCLI` variant to the `LLMProvider` enum in `llm_client/src/provider.rs`
- Create a `ClaudeCodeCLIClient` implementing the `LLMClient` trait
- Use `tokio::process::Command` for async subprocess management
- Implement rate limiting (5-second minimum between calls) to stay within the CLI's rate limits
- Capture stdout in batch initially; add streaming later by reading stdout line-by-line
- Fall back to API providers if the CLI is not installed or fails
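To make the shape of the client concrete, here is a minimal synchronous sketch. It is illustrative only: the type name `ClaudeCodeCLIClient` comes from the list above, but the real sidecar client would implement the `LLMClient` trait asynchronously with `tokio::process::Command`; this version uses only the standard library and assumes `claude --print <prompt>` prints a completion to stdout.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

/// Sketch of a CLI-backed client (synchronous stand-in for the real
/// tokio-based implementation).
struct ClaudeCodeCLIClient {
    min_delay: Duration,
    last_call: Option<Instant>,
}

impl ClaudeCodeCLIClient {
    fn new() -> Self {
        Self { min_delay: Duration::from_secs(5), last_call: None }
    }

    /// Availability probe: true if a `claude` binary is on PATH and
    /// exits successfully. Used to decide whether to fall back to API
    /// providers.
    fn is_available(&self) -> bool {
        Command::new("claude")
            .arg("--version")
            .output()
            .map(|o| o.status.success())
            .unwrap_or(false)
    }

    /// Run one prompt through the CLI, enforcing the 5 s minimum
    /// spacing between calls.
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        if let Some(t) = self.last_call {
            let elapsed = t.elapsed();
            if elapsed < self.min_delay {
                std::thread::sleep(self.min_delay - elapsed);
            }
        }
        self.last_call = Some(Instant::now());
        let out = Command::new("claude")
            .args(["--print", prompt])
            .output()
            .map_err(|e| e.to_string())?;
        if out.status.success() {
            Ok(String::from_utf8_lossy(&out.stdout).into_owned())
        } else {
            Err(String::from_utf8_lossy(&out.stderr).into_owned())
        }
    }
}
```

The availability probe is what drives the fallback behavior: if `is_available()` returns false, the provider chain moves on to the next entry.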
Provider Priority Chain
ClaudeCodeCLI (zero cost) -> Ollama (free, local) -> OpenRouter (cheap) -> Anthropic API (direct)
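The chain can be expressed as a simple first-available scan. The sketch below is hypothetical (the sidecar's actual `LLMProvider` enum has 12 variants; only the four in the chain are shown), with availability abstracted as a probe function:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LLMProvider {
    ClaudeCodeCLI, // zero cost
    Ollama,        // free, local
    OpenRouter,    // cheap
    AnthropicAPI,  // direct
}

/// Walk the priority chain and return the first provider whose
/// availability probe succeeds, or None if none is usable.
fn select_provider(available: impl Fn(LLMProvider) -> bool) -> Option<LLMProvider> {
    use LLMProvider::*;
    [ClaudeCodeCLI, Ollama, OpenRouter, AnthropicAPI]
        .into_iter()
        .find(|p| available(*p))
}
```

Keeping the chain as an ordered array makes the cost-based priority explicit and easy to reorder in config.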
Rate Limiting
| Parameter | Value | Rationale |
|---|---|---|
| Min delay between calls | 5 seconds | Respect CLI rate limits |
| Max concurrent calls | 1 | CLI is single-threaded |
| Timeout per call | 120 seconds | Long reasoning tasks |
| Retry on failure | 2 attempts | CLI can be flaky |
| Queue depth | 10 | Backpressure limit |
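The retry row of the table ("2 attempts, CLI can be flaky") amounts to a small generic helper. This is a sketch, not the sidecar's actual retry machinery; `call_with_retry` is a hypothetical name:

```rust
/// Try `call` up to `attempts` times, returning the first success or
/// the last error. `attempts` must be >= 1.
fn call_with_retry<T, E>(
    attempts: u32,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for _ in 0..attempts {
        match call() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("attempts must be >= 1"))
}
```

In practice each attempt would also be wrapped in the 120-second timeout and spaced by the 5-second rate limit from the table.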
Alternatives Considered
Alternative 1: Anthropic API Direct
Use Anthropic API with API keys stored in sidecar config.
Rejected as default because:
- Per-token cost accumulates rapidly during development
- User already pays for Claude Code subscription
- API key management is an extra burden
- Retained as fallback when CLI unavailable
Alternative 2: MCP Server Bridge
Create an MCP server that wraps Claude Code CLI and connect via mcp_client_rs.
Rejected because:
- Over-engineered for subprocess invocation
- MCP adds protocol overhead for a simple request-response
- No benefit over direct subprocess for single-provider use
Alternative 3: Ollama with Claude-Compatible Model
Use Ollama with a local model (e.g., Llama 3) instead of Claude.
Rejected as primary because:
- Significantly lower quality for code editing tasks
- Retained as secondary fallback in the priority chain
Consequences
Positive
- Zero incremental LLM cost for all sidecar operations
- No API key configuration required
- Uses same Claude model quality as direct API
- Proven pattern: serve.py already validated this approach
- Transparent fallback to API providers if CLI unavailable
Negative
- 5s rate limit constrains throughput (1 request per 5 seconds max)
- Single-threaded: no concurrent LLM requests
- Depends on the `claude` binary being installed and authenticated
- No streaming support initially (stdout capture is batch)
- CLI subprocess is ~500ms slower startup than direct API call
Mitigations
- Queue with backpressure prevents overloading
- MCTS planning batches multiple decisions into single prompts
- Agent loop can pre-fetch context while waiting for rate limit
- Streaming support can be added later by reading stdout incrementally
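The incremental-streaming mitigation reduces to reading the child's piped stdout line by line instead of waiting for the full output. A std-only sketch (the real implementation would use `tokio::process::Command` with an async reader; `stream_lines` is a hypothetical helper name):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Spawn the command, pipe its stdout, and forward each line to the
/// callback as it arrives rather than buffering the whole response.
fn stream_lines(
    mut cmd: Command,
    mut on_line: impl FnMut(&str),
) -> std::io::Result<()> {
    let mut child = cmd.stdout(Stdio::piped()).spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    for line in BufReader::new(stdout).lines() {
        on_line(&line?);
    }
    child.wait()?;
    Ok(())
}
```

Swapping the batch `output()` call for this loop is the later change the mitigation refers to; the rate limiter and queue are unaffected.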
Related
- ADR-165: WASM Split Architecture
- SDD: `docs/architecture/browser-ide/SDD-CODITECT-BROWSER-IDE.md`
- serve.py: `analyze-new-artifacts/udom-batch-runs/serve.py` (reference implementation)