
Software Design Document (SDD): Coditect Web Intelligence Layer

Document ID: SDD-2026-0204-003
Date: February 4, 2026
Version: 1.0
Status: Draft
Authors: Coditect Architecture Team


1. Introduction

1.1 Purpose

This SDD defines the design for integrating Gemini API URL Context capabilities into the Coditect autonomous development platform. The integration creates a Web Intelligence Layer that enables Coditect's multi-agent system to autonomously research, analyze, and ground decisions in live web content.

1.2 Scope

| In Scope | Out of Scope |
|---|---|
| URL Context tool adapter for Coditect agents | Authenticated/private URL access |
| Compliance document research agent | Custom web scraping infrastructure |
| FoundationDB caching for fetched content | YouTube/video content processing |
| Audit trail for compliance grounding | Google Workspace document access |
| Multi-agent delegation patterns | Real-time streaming URL monitoring |

1.3 Referenced Documents

  • ADR-2026-007: Adoption of Gemini URL Context for Web Intelligence
  • ADR-2026-008: REST API vs SDK Strategy for Gemini Integration
  • ADR-2026-009: Compliance Document Caching Architecture
  • EXEC-2026-0204-001: Executive Summary
  • TECH-2026-0204-002: Detailed Technical Analysis

2. System Architecture

2.1 High-Level Architecture

The Web Intelligence Layer sits between Coditect's Agent Orchestrator and the Gemini API, providing a unified interface for web content access with compliance-aware caching and audit capabilities.

┌─────────────────────────────────────────────────────────┐
│                    CODITECT PLATFORM                    │
│                                                         │
│  ┌──────────────┐   ┌──────────────┐    ┌────────────┐  │
│  │ Orchestrator │   │  Researcher  │    │ Compliance │  │
│  │    Agent     │   │    Agents    │    │   Agent    │  │
│  └──────┬───────┘   └──────┬───────┘    └─────┬──────┘  │
│         │                  │                  │         │
│  ┌──────┴──────────────────┴──────────────────┴──────┐  │
│  │              WEB INTELLIGENCE LAYER               │  │
│  │                                                   │  │
│  │  ┌─────────────┐  ┌────────────┐  ┌────────────┐  │  │
│  │  │ URL Context │  │  Content   │  │   Audit    │  │  │
│  │  │   Adapter   │  │   Cache    │  │   Trail    │  │  │
│  │  └──────┬──────┘  └─────┬──────┘  └─────┬──────┘  │  │
│  │         │               │               │         │  │
│  │  ┌──────┴───────────────┴───────────────┴──────┐  │  │
│  │  │          FoundationDB State Store           │  │  │
│  │  └─────────────────────┬───────────────────────┘  │  │
│  └────────────────────────┼──────────────────────────┘  │
└───────────────────────────┼─────────────────────────────┘
                   ┌────────┴────────┐
                   │   Gemini API    │
                   │  URL Context +  │
                   │  Google Search  │
                   └─────────────────┘

2.2 Component Design

2.2.1 URL Context Adapter

Responsibility: Wraps the Gemini API URL Context tool into Coditect's standardized tool interface.

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# CacheStrategy, URLMetadata, and TokenUsage are platform types defined elsewhere;
# AuditRecord is defined in section 2.2.3.

@dataclass(frozen=True)
class URLContextRequest:
    """Request to fetch and analyze URL content."""
    urls: List[str]                       # 1-20 URLs
    analysis_prompt: str                  # What to extract/analyze
    model: str = "gemini-2.5-flash"       # Model selection
    combine_with_search: bool = False     # Also enable Google Search
    compliance_mode: bool = False         # Enable audit trail
    cache_strategy: CacheStrategy = CacheStrategy.READ_THROUGH
    max_retries: int = 3
    timeout_seconds: float = 30.0

@dataclass
class URLContextResponse:
    """Response from URL Context analysis."""
    content: str                          # Generated analysis
    url_metadata: List[URLMetadata]       # Retrieval status per URL
    token_usage: TokenUsage               # Input/output tokens consumed
    cached: bool                          # Whether result was from local cache
    audit_record: Optional[AuditRecord]   # Compliance audit trail
    timestamp: datetime
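
To illustrate how the adapter might translate a request into a Gemini REST call, the sketch below builds a JSON body for a `generateContent` request with the `url_context` tool enabled. The helper name `build_generate_content_payload`, the prompt layout, and the example URL are assumptions made for illustration. The URLs travel inside the prompt text, which is how the URL Context tool is documented to discover them; the exact request schema should be checked against the current Gemini API reference.

```python
import json
from typing import Any, Dict, List

MAX_URLS_PER_REQUEST = 20  # mirrors the 1-20 URL limit noted on URLContextRequest


def build_generate_content_payload(
    urls: List[str],
    analysis_prompt: str,
    combine_with_search: bool = False,
) -> Dict[str, Any]:
    """Build a generateContent request body (hypothetical helper, not a Coditect API)."""
    if not 1 <= len(urls) <= MAX_URLS_PER_REQUEST:
        raise ValueError(f"expected 1-{MAX_URLS_PER_REQUEST} URLs, got {len(urls)}")

    # Assumption: URLs are appended to the prompt, one per line.
    prompt = analysis_prompt + "\n\nSources:\n" + "\n".join(urls)

    tools: List[Dict[str, Any]] = [{"url_context": {}}]
    if combine_with_search:
        tools.append({"google_search": {}})

    return {"contents": [{"parts": [{"text": prompt}]}], "tools": tools}


payload = build_generate_content_payload(
    urls=["https://example.com/regulation"],
    analysis_prompt="Summarize the electronic signature requirements.",
    combine_with_search=True,
)
print(json.dumps(payload, indent=2))
```

Keeping payload construction in a pure function makes it straightforward to unit test without network access.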

2.2.2 Content Cache

Responsibility: FoundationDB-backed cache to reduce redundant API calls and maintain compliance document snapshots.

class ContentCache:
    """FoundationDB-backed URL content cache with TTL management."""

    NAMESPACE = "web_intelligence"

    # Cache TTL by content category
    TTL_MAP = {
        ContentCategory.REGULATORY: timedelta(hours=24),    # FDA/HIPAA docs
        ContentCategory.DOCUMENTATION: timedelta(hours=6),  # API docs
        ContentCategory.NEWS: timedelta(hours=1),           # News/blog posts
        ContentCategory.STATIC: timedelta(days=7),          # Reference material
        ContentCategory.VOLATILE: timedelta(minutes=30),    # Frequently updated
    }

    async def get_or_fetch(
        self,
        url: str,
        category: ContentCategory,
        force_refresh: bool = False
    ) -> CachedContent:
        """Read-through cache with TTL-based invalidation."""
        ...
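
The read-through behavior can be sketched with an in-memory dictionary standing in for FoundationDB. Everything here (`InMemoryReadThroughCache`, `CachedEntry`, the injected `fetch` and `now` callables) is a hypothetical stand-in chosen so the TTL logic can be exercised deterministically. It is synchronous for brevity, whereas the interface above is async.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict


@dataclass
class CachedEntry:
    content: str
    fetched_at: datetime


class InMemoryReadThroughCache:
    """Dict-backed stand-in for the FoundationDB cache (illustration only)."""

    def __init__(self, fetch: Callable[[str], str], now: Callable[[], datetime]):
        self._fetch = fetch          # called on a miss or an expired entry
        self._now = now              # injectable clock for testing
        self._entries: Dict[str, CachedEntry] = {}
        self.fetch_count = 0

    def get_or_fetch(self, url: str, ttl: timedelta, force_refresh: bool = False) -> str:
        entry = self._entries.get(url)
        fresh = entry is not None and self._now() - entry.fetched_at < ttl
        if fresh and not force_refresh:
            return entry.content     # cache hit within TTL
        content = self._fetch(url)   # miss, expired, or forced refresh
        self.fetch_count += 1
        self._entries[url] = CachedEntry(content, self._now())
        return content


clock = [datetime(2026, 2, 4, tzinfo=timezone.utc)]
cache = InMemoryReadThroughCache(fetch=lambda u: f"body of {u}", now=lambda: clock[0])
ttl = timedelta(hours=6)

cache.get_or_fetch("https://example.com/docs", ttl)   # miss: fetches
cache.get_or_fetch("https://example.com/docs", ttl)   # hit: served from cache
clock[0] += timedelta(hours=7)
cache.get_or_fetch("https://example.com/docs", ttl)   # expired: fetches again
print(cache.fetch_count)
```

Injecting the clock avoids sleeping in tests of the TTL calculations listed under section 7.1.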

2.2.3 Audit Trail

Responsibility: Records all URL Context usage as compliance evidence. These records support the audit-trail requirements of FDA 21 CFR Part 11, HIPAA, and SOC 2.

@dataclass(frozen=True)
class AuditRecord:
    """Immutable audit record for compliance."""
    record_id: str
    timestamp: datetime
    agent_id: str
    agent_role: AgentRole
    urls_requested: List[str]
    urls_retrieved: List[str]
    retrieval_statuses: Dict[str, str]
    model_used: str
    tokens_consumed: TokenUsage
    cache_hit: bool
    prompt_hash: str                    # SHA-256 of the analysis prompt
    response_hash: str                  # SHA-256 of the generated response
    compliance_context: Optional[str]   # e.g., "FDA 21 CFR Part 11.10(e)"
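
A minimal sketch of how the hash fields might be populated, using only the Python standard library. The helper names `sha256_hex` and `make_audit_fields` are hypothetical, and UTF-8 encoding of the prompt and response is an assumption; whatever encoding is chosen must be fixed so that hashes can be reproduced during an audit.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from typing import Dict


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of the UTF-8 bytes of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_audit_fields(prompt: str, response: str) -> Dict[str, str]:
    """Build the identifier, timestamp, and hash fields of an audit record."""
    return {
        "record_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
    }


fields = make_audit_fields(
    prompt="Extract electronic signature requirements.",
    response="Generated analysis text.",
)
print(fields["prompt_hash"])
```

Storing hashes rather than raw text lets a reviewer confirm that a retained prompt or response is unmodified without the audit record itself holding the full content.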

2.3 Agent Integration Patterns

2.3.1 Researcher Agent with URL Context

researcher_task = SubagentTask(
    objective="Research FDA 21 CFR Part 11 electronic signature requirements",
    output_format={
        "requirements": ["list of requirements"],
        "source_urls": ["urls referenced"],
        "confidence": "high|medium|low",
        "last_verified": "ISO timestamp"
    },
    tool_priorities=["url_context", "google_search"],
    boundaries=[
        "Only use authoritative sources (.gov, .org)",
        "Include URL retrieval metadata in response",
        "Flag any content that may be outdated"
    ],
    effort_budget=5,
    success_criteria=[
        "All requirements cited with source URLs",
        "Retrieval status verified for each source"
    ],
    role=AgentRole.RESEARCHER,
    regulatory_frameworks=["fda_21cfr11"],
    checkpoint_required=True
)

2.3.2 Orchestrator-Workers with URL Context

class WebIntelligenceOrchestrator:
    """Orchestrator pattern for multi-URL research tasks."""

    async def research_and_analyze(
        self,
        research_topic: str,
        authoritative_urls: List[str],
        analysis_depth: str = "comprehensive"
    ) -> ResearchReport:

        # Phase 1: Parallel URL fetching (Worker pattern)
        fetch_tasks = [
            self.url_adapter.fetch_and_analyze(
                url=url,
                prompt=f"Extract all information relevant to: {research_topic}",
                compliance_mode=True
            )
            for url in authoritative_urls
        ]
        results = await asyncio.gather(*fetch_tasks)

        # Phase 2: Synthesis (Orchestrator)
        synthesis = await self.synthesize_research(
            topic=research_topic,
            source_analyses=results,
            depth=analysis_depth
        )

        # Phase 3: Quality validation (Evaluator-Optimizer)
        validated = await self.validate_and_refine(
            synthesis=synthesis,
            source_metadata=[r.url_metadata for r in results]
        )

        return validated
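
One design point worth illustrating: with a plain `asyncio.gather(*tasks)`, the first exception raised by any fetch propagates out of the whole call, so one unreachable source aborts the research task. The sketch below shows a possible variant that uses `return_exceptions=True` to separate successes from failures. `fake_fetch` and `fetch_all` are hypothetical stand-ins, not part of the Coditect adapter.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_fetch(url: str) -> str:
    """Stand-in for the adapter call; fails for one URL to show the pattern."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError(f"retrieval failed for {url}")
    return f"analysis of {url}"


async def fetch_all(urls: List[str]) -> Tuple[Dict[str, str], Dict[str, BaseException]]:
    """Fetch all URLs concurrently and partition the results by outcome."""
    # return_exceptions=True returns exceptions as values instead of raising.
    results = await asyncio.gather(*(fake_fetch(u) for u in urls), return_exceptions=True)
    ok: Dict[str, str] = {}
    failed: Dict[str, BaseException] = {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed


ok, failed = asyncio.run(
    fetch_all(["https://example.com/a", "https://example.com/broken"])
)
print(sorted(ok), sorted(failed))
```

Whether partial results are acceptable is a policy decision; for compliance research the orchestrator may prefer to fail the task and report which sources could not be retrieved.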

3. Data Flow Design

3.1 Standard Request Flow

1. Agent receives task requiring web research
2. Agent constructs URLContextRequest
3. Web Intelligence Layer:
   a. Check local FoundationDB cache
   b. If cache hit AND within TTL → Return cached content
   c. If cache miss OR expired:
      i.   Build REST API request to Gemini
      ii.  Include url_context tool configuration
      iii. Submit request with retry logic
      iv.  Parse response + url_context_metadata
      v.   Store in cache with appropriate TTL
      vi.  Generate audit record (if compliance_mode)
4. Return URLContextResponse to agent
5. Agent incorporates grounded content into task output

3.2 Compliance Document Research Flow

1. Compliance Agent receives regulatory research task
2. Identify authoritative source URLs (FDA.gov, HHS.gov, etc.)
3. For each URL:
   a. Fetch via URL Context with compliance_mode=True
   b. Verify retrieval status is SUCCESS
   c. Generate immutable audit record with:
      - Source URL
      - Retrieval timestamp
      - Content hash
      - Regulatory context
4. Synthesize compliance requirements from fetched content
5. Cross-reference with existing compliance database
6. Generate compliance report with full citation chain
7. Store complete audit trail in FoundationDB
8. Trigger human checkpoint for compliance gate review
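
Step 3b above (verifying retrieval status) might look roughly like the following. The field names and the `URL_RETRIEVAL_STATUS_SUCCESS` value follow the Gemini URL Context documentation as best understood here and should be treated as assumptions to verify against the current API reference. The parser accepts both camelCase and snake_case keys because the casing may differ between raw REST responses and SDK objects. `retrieval_statuses` and `failed_urls` are hypothetical helper names.

```python
from typing import Any, Dict, List

SUCCESS = "URL_RETRIEVAL_STATUS_SUCCESS"  # assumed status string


def retrieval_statuses(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each retrieved URL to its retrieval status."""
    statuses: Dict[str, str] = {}
    for candidate in response.get("candidates", []):
        meta = candidate.get("urlContextMetadata") or candidate.get("url_context_metadata") or {}
        for item in meta.get("urlMetadata") or meta.get("url_metadata") or []:
            url = item.get("retrievedUrl") or item.get("retrieved_url")
            status = item.get("urlRetrievalStatus") or item.get("url_retrieval_status")
            if url:
                statuses[url] = status or "UNKNOWN"
    return statuses


def failed_urls(requested: List[str], statuses: Dict[str, str]) -> List[str]:
    """URLs that were requested but not reported as successfully retrieved."""
    return [u for u in requested if statuses.get(u) != SUCCESS]


sample_response = {
    "candidates": [{
        "urlContextMetadata": {
            "urlMetadata": [
                {"retrievedUrl": "https://example.gov/a", "urlRetrievalStatus": SUCCESS},
                {"retrievedUrl": "https://example.gov/b",
                 "urlRetrievalStatus": "URL_RETRIEVAL_STATUS_ERROR"},
            ]
        }
    }]
}
statuses = retrieval_statuses(sample_response)
print(failed_urls(["https://example.gov/a", "https://example.gov/b"], statuses))
```

Treating a missing status the same as a failure is the conservative choice for compliance workflows, since an unverified source should not be cited.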

4. Error Handling Design

4.1 Error Categories

| Error | Detection | Recovery |
|---|---|---|
| URL retrieval failure | url_retrieval_status: ERROR | Retry with backoff; fallback to Google Search |
| Rate limit exceeded | HTTP 429 | Exponential backoff; queue requests |
| Content too large | 34 MB limit exceeded | Split request; fetch subsections |
| Timeout | No response within timeout | Retry; increase timeout; flag to orchestrator |
| Invalid URL | Malformed URL in request | Validate before submission; reject with clear error |
| Model unavailable | HTTP 503 | Failover to alternate model; queue for retry |
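
The "retry with backoff" recovery could be sketched as below. `RetryableError`, `call_with_backoff`, and the delay parameters are hypothetical; the base delay, cap, and 10% jitter are illustrative values rather than settings taken from this design. With `max_retries=3` the operation is attempted at most four times.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (HTTP 429/503, timeouts)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying RetryableError with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return operation()
        except RetryableError:
            if attempt >= max_retries:
                raise  # retries exhausted; let the caller classify the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter
            attempt += 1


# Usage: an operation that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}
delays = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("HTTP 429")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

Non-retryable failures such as a malformed URL should raise a different exception type so they bypass the retry loop entirely.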

4.2 Circuit Breaker Configuration

url_context_breaker = CircuitBreaker(
    failure_threshold=5,      # Open after 5 consecutive failures
    recovery_timeout=120.0,   # Try again after 2 minutes
    half_open_requests=2      # Allow 2 test requests before closing
)
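
The document does not say which `CircuitBreaker` implementation is used, so the following is only a minimal sketch of the state machine those three parameters imply. `SimpleCircuitBreaker` and its method names are hypothetical. As a simplification, it does not cap the number of concurrent trial requests while half-open.

```python
import time
from typing import Callable, Optional


class SimpleCircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 120.0,
        half_open_requests: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock
        self._failures = 0                       # consecutive failures while closed
        self._opened_at: Optional[float] = None  # None means the circuit is closed
        self._trial_successes = 0                # successes seen while half-open

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        if self.state == "half_open":
            self._trial_successes += 1
            if self._trial_successes >= self.half_open_requests:
                self._opened_at = None           # enough trial successes: close
                self._failures = 0
                self._trial_successes = 0
        else:
            self._failures = 0                   # a success resets the failure streak

    def record_failure(self) -> None:
        if self.state != "half_open":
            self._failures += 1
            if self._failures < self.failure_threshold:
                return
        # Threshold reached, or a trial request failed: (re)open the circuit.
        self._opened_at = self._clock()
        self._trial_successes = 0


now = [0.0]
breaker = SimpleCircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
state_after_failures = breaker.state      # open
now[0] += 120.0
state_after_timeout = breaker.state       # half-open
breaker.record_success()
breaker.record_success()
print(state_after_failures, state_after_timeout, breaker.state)
```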

5. Security Design

5.1 URL Allowlisting

For compliance-critical workflows, restrict fetchable URLs to approved domains:

from urllib.parse import urlparse

COMPLIANCE_APPROVED_DOMAINS = [
    "fda.gov", "www.fda.gov",
    "hhs.gov", "www.hhs.gov",
    "nist.gov", "www.nist.gov",
    "aicpa-cima.com",   # SOC 2 standards
    "iso.org",          # ISO standards
]

async def validate_url(url: str, compliance_mode: bool) -> bool:
    parsed = urlparse(url)
    # Reject non-web schemes in every mode, including compliance mode
    if parsed.scheme not in ("http", "https"):
        return False
    if compliance_mode:
        # Exact host match only; subdomains must be listed explicitly
        return parsed.hostname in COMPLIANCE_APPROVED_DOMAINS
    return True
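
Because the allowlist check is an exact hostname match, a host on an unlisted subdomain of an approved domain would be rejected. If subdomains should be accepted, one possible variant matches on a domain suffix with a leading dot so that lookalike hosts are still refused. `APPROVED_BASE_DOMAINS` and `is_approved_host` are hypothetical names, and whether subdomains should be trusted at all is a policy question this sketch does not settle.

```python
from urllib.parse import urlparse

APPROVED_BASE_DOMAINS = ("fda.gov", "hhs.gov", "nist.gov")  # illustrative subset


def is_approved_host(url: str) -> bool:
    """True if the URL is http(s) and its host is an approved domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname  # lowercased by urlparse; None if absent
    if not host:
        return False
    # The leading dot prevents "notfda.gov" from matching "fda.gov".
    return any(host == base or host.endswith("." + base) for base in APPROVED_BASE_DOMAINS)


for candidate in (
    "https://www.fda.gov/regulatory-information",
    "https://sub.fda.gov/page",
    "https://fda.gov.example.com/page",
    "https://notfda.gov/page",
    "ftp://fda.gov/file",
):
    print(candidate, is_approved_host(candidate))
```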

5.2 Content Sanitization

All fetched content passes through sanitization before agent processing:

  • Strip potential injection patterns from fetched HTML
  • Validate JSON/XML structure before parsing
  • Log content hashes for integrity verification
  • Quarantine unexpected content types

6. Monitoring & Observability

6.1 Metrics

| Metric | Type | Alert Threshold |
|---|---|---|
| url_context.request_count | Counter | N/A |
| url_context.cache_hit_ratio | Gauge | < 30% (investigate) |
| url_context.retrieval_success_rate | Gauge | < 90% (alert) |
| url_context.latency_p99 | Histogram | > 15s (alert) |
| url_context.token_consumption | Counter | > budget threshold |
| url_context.circuit_breaker_opens | Counter | > 3/hour (alert) |

6.2 Structured Logging

{
  "event": "url_context_request",
  "agent_id": "researcher-001",
  "urls_requested": 3,
  "urls_succeeded": 3,
  "cache_hits": 1,
  "total_tokens": 4521,
  "latency_ms": 2340,
  "model": "gemini-2.5-flash",
  "compliance_mode": true
}
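
A log line of this shape could be produced with the standard `logging` module and a small JSON formatter. This is a sketch, not the platform's logging setup: `JsonFormatter`, the logger name, and the convention of passing values under a `fields` key in `extra` are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        # Structured values are attached via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("web_intelligence.sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

logger.info(
    "url_context_request",
    extra={"fields": {
        "agent_id": "researcher-001",
        "urls_requested": 3,
        "urls_succeeded": 3,
        "cache_hits": 1,
        "compliance_mode": True,
    }},
)
print(stream.getvalue().strip())
```

Emitting one JSON object per line keeps the logs parseable by common aggregation tools without a custom grammar.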

7. Testing Strategy

7.1 Unit Tests

  • URL validation logic
  • Cache TTL calculations
  • Audit record generation
  • Error classification and recovery selection

7.2 Integration Tests

  • End-to-end URL Context API calls (with test URLs)
  • Cache read-through behavior
  • Circuit breaker state transitions
  • Retry logic under simulated failures

7.3 Compliance Tests

  • Audit trail completeness verification
  • URL allowlist enforcement
  • Content hash integrity checks
  • Regulatory document retrieval success rates

8. Deployment Considerations

8.1 Environment Configuration

web_intelligence:
  gemini_api:
    endpoint: "https://generativelanguage.googleapis.com/v1beta"
    preferred_model: "gemini-2.5-flash"
    compliance_model: "gemini-2.5-pro"   # Higher quality for regulatory
    max_retries: 3
    timeout_seconds: 30
    rate_limit_buffer: 0.8               # Use 80% of available rate limit

  cache:
    backend: "foundationdb"
    namespace: "web_intelligence"
    default_ttl_hours: 6
    compliance_ttl_hours: 24
    max_cached_size_mb: 100

  audit:
    enabled: true
    retention_days: 2555                 # 7 years for FDA compliance
    storage: "foundationdb"

Document maintained by Coditect Architecture Team
Review cycle: Monthly or on significant API changes