ADR-111: Token Economics Instrumentation

Status

Proposed

Context

Problem Statement

Multi-agent autonomous development has a 15x token multiplier compared to single-agent workflows. Without instrumentation:

  • Costs are unpredictable and can spike unexpectedly
  • Budget overruns are detected too late
  • Cost attribution per task/agent is impossible
  • Optimization opportunities are invisible
  • Enterprise customers cannot forecast spend

Key Insight from Ralph Wiggum Analysis

"Token economics matter—15x multiplier for multi-agent means cost awareness is essential."

Current State

  • No per-agent token tracking
  • No per-task cost attribution
  • No budget enforcement
  • No cost forecasting
  • No real-time cost visibility
  • No throttling on budget exceeded

Decision

Implement comprehensive token economics instrumentation that tracks consumption per agent/task/iteration, enforces budget limits with auto-throttling, enables cost projection, and supports enterprise chargeback.

1. Token Tracking Schema

token_record:
  record_id: string             # UUID
  timestamp: datetime           # When consumption occurred

  context:
    organization_id: string     # For multi-tenant
    project_id: string          # Project/workspace
    task_id: string             # Parent task
    agent_id: string            # Executing agent
    iteration: integer          # Loop iteration
    checkpoint_id: string       # Associated checkpoint

  consumption:
    model: string               # claude-opus-4-5, claude-sonnet-4-5, etc.
    input_tokens: integer       # Prompt tokens
    output_tokens: integer      # Completion tokens
    cache_read_tokens: integer  # Prompt cache hits
    cache_write_tokens: integer # Prompt cache writes
    total_tokens: integer       # Sum of all

  cost:
    input_cost_usd: decimal     # Calculated cost
    output_cost_usd: decimal
    cache_cost_usd: decimal
    total_cost_usd: decimal

  metadata:
    tool_calls: array           # Tools invoked in this call
    latency_ms: integer         # API response time
    success: boolean            # Whether call succeeded
    error_type: string          # If failed, error category
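The schema above can be mirrored as a TypeScript type for use in the recording pipeline. This is an illustrative sketch: the field names follow the YAML schema, but the type names (`TokenConsumption`, `TokenRecord`) and the invariant-check helper are assumptions, not part of the spec.

```typescript
// Illustrative TypeScript mirror of the token record schema above.
interface TokenConsumption {
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_tokens: number;
  cache_write_tokens: number;
  total_tokens: number;          // sum of the four counts above
}

interface TokenRecord {
  record_id: string;
  timestamp: string;             // ISO-8601 datetime
  context: {
    organization_id: string;
    project_id: string;
    task_id: string;
    agent_id: string;
    iteration: number;
    checkpoint_id: string;
  };
  consumption: TokenConsumption;
  cost: {
    input_cost_usd: number;
    output_cost_usd: number;
    cache_cost_usd: number;
    total_cost_usd: number;
  };
  metadata: {
    tool_calls: string[];
    latency_ms: number;
    success: boolean;
    error_type?: string;         // only present when success is false
  };
}

// Invariant: total_tokens must equal the sum of its components.
function hasConsistentTotals(c: TokenConsumption): boolean {
  return c.total_tokens ===
    c.input_tokens + c.output_tokens + c.cache_read_tokens + c.cache_write_tokens;
}
```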

2. Budget Hierarchy

budgets:
  organization:
    monthly_limit_usd: 10000
    alert_threshold_percent: 80
    hard_limit_action: "throttle"  # throttle, pause, alert_only

  project:
    daily_limit_usd: 500
    task_limit_usd: 50
    alert_threshold_percent: 75

  task:
    max_tokens: 1000000
    max_iterations: 50
    max_cost_usd: 100
    per_iteration_limit_tokens: 50000

  agent:
    max_tokens_per_call: 100000
    max_cost_per_call_usd: 5
    rate_limit_calls_per_minute: 60

inheritance:
  # Lower level cannot exceed parent
  organization → project → task → agent
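The inheritance rule (a lower level cannot exceed its parent) amounts to taking the minimum limit along the chain of ancestors. A minimal sketch, assuming a hypothetical `BudgetNode` shape and `effectiveLimit` helper not defined by this ADR:

```typescript
// Sketch of budget inheritance: the effective limit at any level is the
// minimum of its own configured limit and every ancestor's limit.
interface BudgetNode {
  limit_usd: number;
  parent?: BudgetNode;
}

function effectiveLimit(node: BudgetNode): number {
  let limit = node.limit_usd;
  for (let p = node.parent; p !== undefined; p = p.parent) {
    limit = Math.min(limit, p.limit_usd);
  }
  return limit;
}
```

For example, a task misconfigured with a $800 limit under a $500/day project is clamped to $500 at check time rather than rejected at configuration time; either policy is workable, this sketch shows the clamping variant.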

3. Budget Enforcement Protocol

PRE-CALL CHECK:
  1. Retrieve current consumption for context
  2. Estimate call cost (based on prompt size + expected output)
  3. Check against all applicable budgets:
     ├── Agent call limit
     ├── Task iteration limit
     ├── Task total limit
     ├── Project daily limit
     └── Organization monthly limit
  4. If ANY limit would be exceeded:
     ├── If hard limit: Block call, return budget_exceeded error
     ├── If soft limit: Log warning, proceed with monitoring
     └── If alert threshold: Notify but proceed

POST-CALL RECORDING:
  1. Record actual consumption
  2. Update running totals
  3. Check if any threshold newly crossed
  4. Emit events for alerts/dashboard

THROTTLING BEHAVIOR:
  ├── Approach threshold (80%): Emit warning event
  ├── At threshold (100%):
  │   ├── throttle: Delay calls by 1s, then 2s, then 4s...
  │   ├── pause: Block all calls until budget reset/increase
  │   └── alert_only: Log and continue
  └── Override available for emergency/human approval
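The pre-call decision and the exponential throttle delay can be sketched as pure functions. The names (`checkBudget`, `throttleDelayMs`) and the single-scope signature are simplifications for illustration; the real check runs against every scope in the hierarchy.

```typescript
// Sketch of the pre-call check: block on hard-limit breach, warn at the
// alert threshold, otherwise proceed.
type BudgetDecision = "proceed" | "warn" | "block";

function checkBudget(
  current: number,           // consumption so far in this scope (USD)
  estimated: number,         // estimated cost of the next call (USD)
  limit: number,             // configured limit for this scope (USD)
  alertThresholdPercent: number,
  hardLimitAction: "throttle" | "pause" | "alert_only"
): BudgetDecision {
  const projected = current + estimated;
  if (projected > limit) {
    // alert_only budgets log and continue; throttle/pause block the call
    return hardLimitAction === "alert_only" ? "warn" : "block";
  }
  if (projected >= limit * (alertThresholdPercent / 100)) {
    return "warn";           // threshold crossed: notify but proceed
  }
  return "proceed";
}

// Exponential throttle delay: 1s, 2s, 4s, ... capped at a maximum,
// matching the throttling config defaults later in this ADR.
function throttleDelayMs(attempt: number, initial = 1000, max = 60000): number {
  return Math.min(initial * 2 ** attempt, max);
}
```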

4. Pricing Model (Configurable)

interface ModelPricing {
  model: string;
  inputPerMillionTokens: number;   // USD
  outputPerMillionTokens: number;  // USD
  cacheReadPerMillionTokens: number;
  cacheWritePerMillionTokens: number;
}

const PRICING: ModelPricing[] = [
  {
    model: "claude-opus-4-5-20251101",
    inputPerMillionTokens: 15.00,
    outputPerMillionTokens: 75.00,
    cacheReadPerMillionTokens: 1.50,
    cacheWritePerMillionTokens: 18.75
  },
  {
    model: "claude-sonnet-4-5-20250929",
    inputPerMillionTokens: 3.00,
    outputPerMillionTokens: 15.00,
    cacheReadPerMillionTokens: 0.30,
    cacheWritePerMillionTokens: 3.75
  },
  {
    model: "claude-haiku-4-5-20251001",
    inputPerMillionTokens: 0.80,
    outputPerMillionTokens: 4.00,
    cacheReadPerMillionTokens: 0.08,
    cacheWritePerMillionTokens: 1.00
  }
];

function calculateCost(record: TokenConsumption): CostBreakdown {
  const pricing = PRICING.find(p => p.model === record.model);
  if (!pricing) {
    throw new Error(`No pricing configured for model: ${record.model}`);
  }
  const input = (record.input_tokens / 1_000_000) * pricing.inputPerMillionTokens;
  const output = (record.output_tokens / 1_000_000) * pricing.outputPerMillionTokens;
  const cacheRead = (record.cache_read_tokens / 1_000_000) * pricing.cacheReadPerMillionTokens;
  const cacheWrite = (record.cache_write_tokens / 1_000_000) * pricing.cacheWritePerMillionTokens;
  return {
    input,
    output,
    cacheRead,
    cacheWrite,
    total: input + output + cacheRead + cacheWrite
  };
}
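A worked example using the Sonnet pricing above, for a call with 200,000 input tokens, 10,000 output tokens, and 50,000 cache-read tokens (no cache writes):

```typescript
// Cost breakdown at Sonnet rates: $3.00/M input, $15.00/M output,
// $0.30/M cache read.
const inputCost = (200_000 / 1_000_000) * 3.00;       // 0.60 USD
const outputCost = (10_000 / 1_000_000) * 15.00;      // 0.15 USD
const cacheReadCost = (50_000 / 1_000_000) * 0.30;    // 0.015 USD
const total = inputCost + outputCost + cacheReadCost; // 0.765 USD
```

Note that without the cache, the 50,000 cached tokens would have been billed as input at $3.00/M ($0.15 instead of $0.015), which is why the cache hit rate appears among the efficiency metrics below.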

5. Aggregation and Efficiency Metrics

real_time_aggregations:  # Updated on each record
  - per_agent_running_total
  - per_task_running_total
  - per_project_daily_total
  - per_organization_monthly_total
  - global_system_total

periodic_rollups:  # Hourly, daily, monthly
  - task_summaries
  - project_summaries
  - model_usage_breakdown
  - agent_efficiency_metrics
  - cost_trend_analysis

efficiency_metrics:
  tokens_per_tool_call: total_tokens / tool_calls
  cost_per_iteration: total_cost / iterations
  cache_hit_rate: cache_read / (cache_read + input)
  output_input_ratio: output_tokens / input_tokens
  cost_per_completed_task: total_cost / successful_tasks
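The efficiency metrics above are plain ratios over aggregate counters. A minimal sketch, assuming a hypothetical `Aggregates` input shape and guarding against empty denominators:

```typescript
// Compute the efficiency metrics defined above from raw aggregates.
interface Aggregates {
  total_tokens: number;
  total_cost: number;
  tool_calls: number;
  iterations: number;
  cache_read: number;       // cache_read_tokens
  input: number;            // uncached input tokens
  output_tokens: number;
  input_tokens: number;
  successful_tasks: number;
}

// Avoid NaN/Infinity when a denominator is zero (e.g. no tool calls yet).
const safeDiv = (a: number, b: number): number => (b === 0 ? 0 : a / b);

function efficiencyMetrics(a: Aggregates) {
  return {
    tokens_per_tool_call: safeDiv(a.total_tokens, a.tool_calls),
    cost_per_iteration: safeDiv(a.total_cost, a.iterations),
    cache_hit_rate: safeDiv(a.cache_read, a.cache_read + a.input),
    output_input_ratio: safeDiv(a.output_tokens, a.input_tokens),
    cost_per_completed_task: safeDiv(a.total_cost, a.successful_tasks),
  };
}
```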

6. Forecasting

forecasting:
  short_term:  # Task-level
    based_on: "Current consumption rate, remaining work estimate"
    formula: "current_rate * estimated_remaining_iterations"
    confidence: "Based on variance in iteration costs"
    update_frequency: "Every iteration"

  medium_term:  # Project-level
    based_on: "Historical project patterns, active task count"
    formula: "avg_task_cost * remaining_tasks"
    considers: "Task complexity estimates"
    update_frequency: "Daily"

  long_term:  # Organization-level
    based_on: "Monthly trends, growth rate, planned projects"
    formula: "Exponential smoothing of historical data"
    considers: "Seasonality, team growth"
    update_frequency: "Weekly"
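The short-term formula can be sketched concretely: mean per-iteration cost times estimated remaining iterations, with a band derived from the variance of recent iteration costs. The function name and the exact band construction are illustrative assumptions, not the ADR's mandated model.

```typescript
// Task-level forecast: current_rate * estimated_remaining_iterations,
// with a confidence band widened by observed iteration-cost variance.
function forecastTaskCost(
  recentIterationCosts: number[],  // e.g. the last 10 iterations
  remainingIterations: number
): { expected: number; low: number; high: number } {
  const n = recentIterationCosts.length;
  const mean = recentIterationCosts.reduce((s, c) => s + c, 0) / n;
  const variance =
    recentIterationCosts.reduce((s, c) => s + (c - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  const expected = mean * remainingIterations;
  // Independent iterations: the sum's std grows with sqrt(remaining).
  const spread = std * Math.sqrt(remainingIterations);
  return {
    expected,
    low: Math.max(0, expected - spread),
    high: expected + spread,
  };
}
```

A task whose iteration costs are stable produces a tight band; erratic costs widen it, which is the "confidence based on variance" behavior described above.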

7. Database Schema

Database Architecture Note (ADR-002, ADR-089):

  • Local: SQLite (context.db) for offline operation and immediate recording
  • Cloud: PostgreSQL with RLS for multi-tenant isolation and aggregations
  • Sync via cursor-based polling (ADR-053)

Local SQLite (context.db):

CREATE TABLE token_records (
  id TEXT PRIMARY KEY,
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  task_id TEXT,
  agent_id TEXT,
  iteration INTEGER,
  checkpoint_id TEXT,
  model TEXT NOT NULL,
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  cache_read_tokens INTEGER DEFAULT 0,
  cache_write_tokens INTEGER DEFAULT 0,
  total_tokens INTEGER GENERATED ALWAYS AS (input_tokens + output_tokens + cache_read_tokens + cache_write_tokens) STORED,
  total_cost_usd REAL,
  tool_calls TEXT, -- JSON array
  latency_ms INTEGER,
  success INTEGER DEFAULT 1,
  synced INTEGER DEFAULT 0
);

CREATE TABLE budgets (
  id TEXT PRIMARY KEY,
  scope TEXT NOT NULL, -- organization, project, task, agent
  scope_id TEXT NOT NULL,
  limit_type TEXT NOT NULL, -- tokens, cost_usd, calls
  limit_value REAL NOT NULL,
  current_value REAL DEFAULT 0,
  period TEXT, -- hourly, daily, monthly, total
  action TEXT DEFAULT 'alert_only', -- throttle, pause, alert_only
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_token_records_task ON token_records(task_id, timestamp DESC);
CREATE INDEX idx_token_records_unsynced ON token_records(synced) WHERE synced = 0;

Cloud PostgreSQL (with RLS):

CREATE TABLE token_records (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID NOT NULL REFERENCES organizations(id),
  project_id UUID REFERENCES projects(id),
  task_id UUID REFERENCES tasks(id),
  agent_id TEXT NOT NULL,
  iteration INTEGER DEFAULT 1,
  checkpoint_id UUID REFERENCES checkpoints(id),
  model TEXT NOT NULL,
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  cache_read_tokens INTEGER DEFAULT 0,
  cache_write_tokens INTEGER DEFAULT 0,
  total_tokens INTEGER GENERATED ALWAYS AS (input_tokens + output_tokens + cache_read_tokens + cache_write_tokens) STORED,
  input_cost_usd DECIMAL(10,6),
  output_cost_usd DECIMAL(10,6),
  cache_cost_usd DECIMAL(10,6),
  total_cost_usd DECIMAL(10,6),
  tool_calls JSONB,
  latency_ms INTEGER,
  success BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Row-Level Security
ALTER TABLE token_records ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation_tokens ON token_records
  FOR ALL USING (
    organization_id IN (
      SELECT organization_id FROM organization_members
      WHERE user_id = current_setting('app.current_user_id')::UUID
    )
  );

-- Budgets with hierarchy
CREATE TABLE budgets (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID NOT NULL REFERENCES organizations(id),
  project_id UUID REFERENCES projects(id),
  task_id UUID REFERENCES tasks(id),
  scope TEXT NOT NULL,
  limit_type TEXT NOT NULL,
  limit_value DECIMAL(12,2) NOT NULL,
  current_value DECIMAL(12,2) DEFAULT 0,
  period TEXT,
  action TEXT DEFAULT 'alert_only',
  alert_threshold_percent INTEGER DEFAULT 80,
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

ALTER TABLE budgets ENABLE ROW LEVEL SECURITY;

-- Materialized view for aggregations (refreshed periodically)
CREATE MATERIALIZED VIEW token_aggregations AS
SELECT
  organization_id,
  project_id,
  DATE_TRUNC('day', created_at) as period_day,
  model,
  SUM(total_tokens) as total_tokens,
  SUM(total_cost_usd) as total_cost,
  COUNT(*) as call_count,
  AVG(latency_ms) as avg_latency_ms
FROM token_records
GROUP BY organization_id, project_id, DATE_TRUNC('day', created_at), model;

CREATE UNIQUE INDEX idx_token_agg ON token_aggregations(organization_id, project_id, period_day, model);

8. Event Definitions

type TokenEconomicsEvents =
  | { type: 'TOKEN_RECORDED'; payload: { recordId: string; taskId: string; tokens: number; cost: number } }
  | { type: 'BUDGET_THRESHOLD_CROSSED'; payload: { context: string; threshold: number; current: number } }
  | { type: 'BUDGET_EXHAUSTED'; payload: { context: string; limit: number; action: string } }
  | { type: 'THROTTLE_ACTIVATED'; payload: { context: string; delayMs: number } }
  | { type: 'CONSUMPTION_SPIKE'; payload: { context: string; rate: number; threshold: number } }
  | { type: 'FORECAST_WARNING'; payload: { context: string; forecastedCost: number; budget: number } };
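Because the union is discriminated on `type`, consumers get payload narrowing for free when switching on it. A small sketch (redefining a two-member subset of the union so the snippet is self-contained; the `severity` routing is an illustrative assumption):

```typescript
// Switching on the discriminant narrows `payload` automatically.
type TokenEconomicsEvent =
  | { type: "BUDGET_THRESHOLD_CROSSED"; payload: { context: string; threshold: number; current: number } }
  | { type: "BUDGET_EXHAUSTED"; payload: { context: string; limit: number; action: string } };

function severity(event: TokenEconomicsEvent): "warning" | "critical" {
  switch (event.type) {
    case "BUDGET_THRESHOLD_CROSSED":
      return "warning";    // e.g. Slack, per the alerting config below
    case "BUDGET_EXHAUSTED":
      return "critical";   // e.g. PagerDuty, per the alerting config below
  }
}
```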

9. Configuration Schema

token_economics:
  enabled: true

  recording:
    buffer_size: 100
    flush_interval_ms: 1000
    retry_attempts: 3

  pricing:
    source: "config"  # config, api, or external_service
    update_interval: "daily"

  budgets:
    default_organization_monthly_usd: 10000
    default_project_daily_usd: 500
    default_task_max_usd: 100
    alert_threshold_percent: 80
    hard_limit_action: "throttle"

  throttling:
    initial_delay_ms: 1000
    max_delay_ms: 60000
    backoff_multiplier: 2

  forecasting:
    task_lookback_iterations: 10
    project_lookback_days: 30
    confidence_level: 0.9

  alerting:
    channels:
      - type: "slack"
        webhook_url: "${SLACK_WEBHOOK_URL}"
        events: ["budget_threshold", "budget_exhausted", "spike"]
      - type: "pagerduty"
        routing_key: "${PAGERDUTY_KEY}"
        events: ["budget_exhausted"]

  dashboard:
    refresh_interval_ms: 5000
    retention_days: 90

Consequences

Positive

  1. Cost Visibility - Real-time tracking of all token consumption
  2. Budget Control - Automatic enforcement prevents overruns
  3. Cost Attribution - Per-task/agent chargeback for enterprise
  4. Optimization - Efficiency metrics identify improvement opportunities
  5. Forecasting - Predictable costs enable planning
  6. Throttling - Graceful degradation when limits approached

Negative

  1. Recording Overhead - Each API call has tracking overhead (~10ms)
  2. Storage Growth - Token records accumulate (mitigated by retention)
  3. Complexity - Budget checks on every call path
  4. Pricing Maintenance - Must track model pricing changes

Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Pricing changes | Medium | Incorrect costs | Configurable pricing, alerts on API changes |
| High-volume recording impact | Medium | Performance degradation | Async buffered writes |
| Budget race conditions | Low | Over-budget execution | Atomic budget checks with reservations |
| Forecast inaccuracy | Medium | Poor planning | Confidence intervals, multiple models |

Performance Requirements

| Metric | Target |
|--------|--------|
| Token recording latency | < 10ms |
| Budget check latency | < 50ms |
| Aggregation query latency | < 100ms |
| Recording throughput | 10,000+ records/minute |
| Dashboard refresh | < 5s |
| Token tracking accuracy | > 99% |
| Budget enforcement latency | < 1s |
| Cost projection accuracy | ±10% |

Implementation Phases

Phase 1: Token Recording (Week 1)

  • Token record schema definition
  • Recording service with buffering
  • API client wrapper for consumption capture

Phase 2: Budget System (Week 2)

  • Budget configuration and hierarchy
  • Budget enforcement service
  • Running totals with real-time updates

Phase 3: Aggregation and Reporting (Week 3)

  • Aggregation service (real-time + periodic)
  • Query API for consumption data
  • Forecasting engine

Phase 4: Dashboard and Alerts (Week 4)

  • Dashboard data service (WebSocket updates)
  • Alert evaluation and notification
  • Testing and documentation

Related ADRs

  • ADR-002: PostgreSQL as Primary Database (cloud storage with RLS)
  • ADR-053: Cloud Context Sync Architecture (local-to-cloud sync)
  • ADR-089: Two-Database Architecture (SQLite local + PostgreSQL cloud)
  • ADR-108: Checkpoint Protocol (token metrics in checkpoint)
  • ADR-109: Browser Automation (browser test cost attribution)
  • ADR-110: Health Monitoring (consumption anomaly detection)
  • ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)

Glossary

| Term | Definition |
|------|------------|
| Token | Basic unit of text processed by LLMs; approximately 4 characters or 0.75 words in English |
| Input Token | Token sent to the model as part of the prompt (context, instructions, user message) |
| Output Token | Token generated by the model in its response |
| Cache Token | Token retrieved from or written to prompt cache, reducing API costs |
| Budget | Spending limit configured at organization, project, task, or agent level |
| Throttling | Slowing request rate when approaching budget limits with exponential backoff |
| Chargeback | Attribution of costs to specific organizational units for billing |
| Forecasting | Predicting future token consumption based on historical patterns |
| Aggregation | Combining token records into summary statistics (hourly, daily, monthly) |
| RLS | Row-Level Security; PostgreSQL feature for automatic tenant data isolation |
| Materialized View | Pre-computed query result stored in database for fast aggregation queries |
| JSONB | PostgreSQL binary JSON type with indexing support for efficient queries |
| Exponential Backoff | Delay strategy that doubles wait time between retries (1s, 2s, 4s, ...) |
| WebSocket | Full-duplex communication protocol for real-time updates |


ADR-111 | Created: 2026-01-24 | Status: Proposed