Software Design Document (SDD): Coditect Web Intelligence Layer

Document ID: SDD-2026-0204-003
Date: February 4, 2026
Version: 1.0
Status: Draft
Authors: Coditect Architecture Team

1. Introduction

1.1 Purpose

This SDD defines the design for integrating Gemini API URL Context capabilities into the Coditect autonomous development platform. The integration creates a Web Intelligence Layer that enables Coditect's multi-agent system to autonomously research, analyze, and ground decisions in live web content.

1.2 Scope

In Scope	Out of Scope
URL Context tool adapter for Coditect agents	Authenticated/private URL access
Compliance document research agent	Custom web scraping infrastructure
FoundationDB caching for fetched content	YouTube/video content processing
Audit trail for compliance grounding	Google Workspace document access
Multi-agent delegation patterns	Real-time streaming URL monitoring

1.3 Referenced Documents

ADR-2026-007: Adoption of Gemini URL Context for Web Intelligence
ADR-2026-008: REST API vs SDK Strategy for Gemini Integration
ADR-2026-009: Compliance Document Caching Architecture
EXEC-2026-0204-001: Executive Summary
TECH-2026-0204-002: Detailed Technical Analysis

2. System Architecture

2.1 High-Level Architecture

The Web Intelligence Layer sits between Coditect's Agent Orchestrator and the Gemini API, providing a unified interface for web content access with compliance-aware caching and audit capabilities.

┌─────────────────────────────────────────────────────────┐
│                  CODITECT PLATFORM                       │
│                                                         │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────┐  │
│  │ Orchestrator  │   │  Researcher  │   │ Compliance │  │
│  │    Agent      │   │   Agents     │   │   Agent    │  │
│  └──────┬───────┘   └──────┬───────┘   └─────┬──────┘  │
│         │                  │                  │         │
│  ┌──────┴──────────────────┴──────────────────┴──────┐  │
│  │          WEB INTELLIGENCE LAYER                    │  │
│  │                                                    │  │
│  │  ┌─────────────┐  ┌────────────┐  ┌────────────┐  │  │
│  │  │ URL Context  │  │  Content   │  │   Audit    │  │  │
│  │  │   Adapter    │  │   Cache    │  │   Trail    │  │  │
│  │  └──────┬──────┘  └─────┬──────┘  └─────┬──────┘  │  │
│  │         │               │               │         │  │
│  │  ┌──────┴───────────────┴───────────────┴──────┐  │  │
│  │  │           FoundationDB State Store           │  │  │
│  │  └─────────────────────┬───────────────────────┘  │  │
│  └────────────────────────┼──────────────────────────┘  │
└───────────────────────────┼─────────────────────────────┘
                            │
                   ┌────────┴────────┐
                   │   Gemini API    │
                   │  URL Context +  │
                   │ Google Search   │
                   └─────────────────┘

2.2 Component Design

2.2.1 URL Context Adapter

Responsibility: Wraps the Gemini API URL Context tool into Coditect's standardized tool interface.

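from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# CacheStrategy, URLMetadata, TokenUsage, and AuditRecord are Coditect types defined elsewhere.

@dataclass(frozen=True)
class URLContextRequest:
    """Request to fetch and analyze URL content."""
    urls: List[str]              # 1-20 URLs per request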
    analysis_prompt: str          # What to extract/analyze
    model: str = "gemini-2.5-flash"  # Model selection
    combine_with_search: bool = False  # Also enable Google Search
    compliance_mode: bool = False      # Enable audit trail
    cache_strategy: CacheStrategy = CacheStrategy.READ_THROUGH
    max_retries: int = 3
    timeout_seconds: float = 30.0

@dataclass
class URLContextResponse:
    """Response from URL Context analysis."""
    content: str                          # Generated analysis
    url_metadata: List[URLMetadata]       # Retrieval status per URL
    token_usage: TokenUsage               # Input/output tokens consumed
    cached: bool                          # Whether result was from local cache
    audit_record: Optional[AuditRecord]   # Compliance audit trail
    timestamp: datetime

2.2.2 Content Cache

Responsibility: FoundationDB-backed cache to reduce redundant API calls and maintain compliance document snapshots.

from datetime import timedelta

class ContentCache:
    """FoundationDB-backed URL content cache with TTL management."""
    
    NAMESPACE = "web_intelligence"
    
    # Cache TTL by content category
    TTL_MAP = {
        ContentCategory.REGULATORY: timedelta(hours=24),   # FDA/HIPAA docs
        ContentCategory.DOCUMENTATION: timedelta(hours=6),  # API docs
        ContentCategory.NEWS: timedelta(hours=1),           # News/blog posts
        ContentCategory.STATIC: timedelta(days=7),          # Reference material
        ContentCategory.VOLATILE: timedelta(minutes=30),    # Frequently updated
    }
    
    async def get_or_fetch(
        self,
        url: str,
        category: ContentCategory,
        force_refresh: bool = False
    ) -> CachedContent:
        """Read-through cache with TTL-based invalidation."""
        ...

2.2.3 Audit Trail

Responsibility: Records all URL Context usage for compliance evidence. Required for FDA 21 CFR Part 11, HIPAA, and SOC2 audit trails.

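from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# AgentRole and TokenUsage are Coditect types defined elsewhere.

@dataclass(frozen=True)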
class AuditRecord:
    """Immutable audit record for compliance."""
    record_id: str
    timestamp: datetime
    agent_id: str
    agent_role: AgentRole
    urls_requested: List[str]
    urls_retrieved: List[str]
    retrieval_statuses: Dict[str, str]
    model_used: str
    tokens_consumed: TokenUsage
    cache_hit: bool
    prompt_hash: str         # SHA-256 of the analysis prompt
    response_hash: str       # SHA-256 of the generated response
    compliance_context: Optional[str]  # e.g., "FDA 21 CFR Part 11.10(e)"
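
As an illustration of how the hash fields might be populated, the sketch below computes SHA-256 digests with the standard library and derives `urls_retrieved` from the per-URL statuses. `sha256_hex` and `build_audit_fields` are hypothetical helpers, and the check for a status ending in `SUCCESS` is an assumption about how retrieval statuses are spelled.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from typing import Dict, List


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of UTF-8 encoded text as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_fields(prompt: str, response: str,
                       retrieval_statuses: Dict[str, str]) -> dict:
    """Assemble the hash and retrieval fields of an audit record (illustrative)."""
    retrieved: List[str] = [
        url for url, status in retrieval_statuses.items()
        if status.endswith("SUCCESS")      # assumed status spelling
    ]
    return {
        "record_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls_requested": list(retrieval_statuses),
        "urls_retrieved": retrieved,
        "retrieval_statuses": dict(retrieval_statuses),
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
    }


fields = build_audit_fields(
    prompt="Summarize Part 11 signature requirements",
    response="(generated analysis)",
    retrieval_statuses={
        "https://www.fda.gov/a": "URL_RETRIEVAL_STATUS_SUCCESS",
        "https://www.fda.gov/b": "URL_RETRIEVAL_STATUS_ERROR",
    },
)
```

Storing hashes rather than the full prompt and response keeps the record small while still letting an auditor confirm that archived text matches what was actually sent and received.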

2.3 Agent Integration Patterns

2.3.1 Researcher Agent with URL Context

researcher_task = SubagentTask(
    objective="Research FDA 21 CFR Part 11 electronic signature requirements",
    output_format={
        "requirements": ["list of requirements"],
        "source_urls": ["urls referenced"],
        "confidence": "high|medium|low",
        "last_verified": "ISO timestamp"
    },
    tool_priorities=["url_context", "google_search"],
    boundaries=[
        "Only use authoritative sources (regulator and standards-body domains such as fda.gov or iso.org)",
        "Include URL retrieval metadata in response",
        "Flag any content that may be outdated"
    ],
    effort_budget=5,
    success_criteria=[
        "All requirements cited with source URLs",
        "Retrieval status verified for each source"
    ],
    role=AgentRole.RESEARCHER,
    regulatory_frameworks=["fda_21cfr11"],
    checkpoint_required=True
)

2.3.2 Orchestrator-Workers with URL Context

import asyncio
from typing import List

class WebIntelligenceOrchestrator:
    """Orchestrator pattern for multi-URL research tasks."""
    
    async def research_and_analyze(
        self,
        research_topic: str,
        authoritative_urls: List[str],
        analysis_depth: str = "comprehensive"
    ) -> ResearchReport:
        
        # Phase 1: Parallel URL fetching (Worker pattern)
        fetch_tasks = [
            self.url_adapter.fetch_and_analyze(
                url=url,
                prompt=f"Extract all information relevant to: {research_topic}",
                compliance_mode=True
            )
            for url in authoritative_urls
        ]
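        # return_exceptions=True keeps one failed fetch from discarding the others;
        # failed fetches should be logged and flagged before synthesis.
        outcomes = await asyncio.gather(*fetch_tasks, return_exceptions=True)
        results = [r for r in outcomes if not isinstance(r, BaseException)]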
        
        # Phase 2: Synthesis (Orchestrator)
        synthesis = await self.synthesize_research(
            topic=research_topic,
            source_analyses=results,
            depth=analysis_depth
        )
        
        # Phase 3: Quality validation (Evaluator-Optimizer)
        validated = await self.validate_and_refine(
            synthesis=synthesis,
            source_metadata=[r.url_metadata for r in results]
        )
        
        return validated
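
The fan-out phase of this pattern can be tried in isolation. The sketch below is a minimal, self-contained version, assuming a stub worker in place of the adapter: `fetch_stub` and `fan_out` are hypothetical names, and the stub fails for any URL containing "bad" purely to show how partial failures are separated from successes.

```python
import asyncio
from typing import List, Tuple


async def fetch_stub(url: str) -> str:
    """Hypothetical worker: stands in for the adapter's fetch_and_analyze."""
    await asyncio.sleep(0)                # yield control, as a network call would
    if "bad" in url:
        raise RuntimeError(f"retrieval failed for {url}")
    return f"analysis of {url}"


async def fan_out(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Fetch all URLs concurrently; return (analyses, failed_urls)."""
    outcomes = await asyncio.gather(
        *(fetch_stub(u) for u in urls), return_exceptions=True
    )
    # gather preserves input order, so outcomes can be zipped back to URLs.
    analyses = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [u for u, o in zip(urls, outcomes) if isinstance(o, BaseException)]
    return analyses, failed


analyses, failed = asyncio.run(
    fan_out(["https://www.fda.gov/a", "https://bad.example/b"])
)
```

Returning the failed URLs alongside the analyses lets the synthesis phase state which sources were unavailable instead of silently omitting them.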

3. Data Flow Design

3.1 Standard Request Flow

1. Agent receives task requiring web research
2. Agent constructs URLContextRequest
3. Web Intelligence Layer:
   a. Check local FoundationDB cache
   b. If cache hit AND within TTL → Return cached content; generate audit record with cache_hit=true (if compliance_mode)
   c. If cache miss OR expired:
      i.   Build REST API request to Gemini
      ii.  Include url_context tool configuration
      iii. Submit request with retry logic
      iv.  Parse response + url_context_metadata
      v.   Store in cache with appropriate TTL
      vi.  Generate audit record (if compliance_mode)
4. Return URLContextResponse to agent
5. Agent incorporates grounded content into task output
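
Steps 3.c.i and 3.c.ii can be sketched as a pure function that returns the endpoint and JSON body. The body shape (URLs carried in the prompt text, an empty `url_context` entry in `tools`, and an optional `google_search` entry) follows the publicly documented `generateContent` format as understood at the time of writing and should be verified against the current API reference. `build_generate_content_request` is a hypothetical helper, and authentication headers are deliberately omitted.

```python
from typing import List

GEMINI_BASE = "https://generativelanguage.googleapis.com/v1beta"


def build_generate_content_request(urls: List[str], analysis_prompt: str,
                                   model: str = "gemini-2.5-flash",
                                   combine_with_search: bool = False) -> dict:
    """Return endpoint and JSON body for one URL Context call (sketch)."""
    if not 1 <= len(urls) <= 20:
        raise ValueError("URL Context requests take between 1 and 20 URLs")
    # URLs travel inside the prompt text; the tool entry only enables fetching.
    prompt = analysis_prompt + "\n\nSources:\n" + "\n".join(urls)
    tools = [{"url_context": {}}]
    if combine_with_search:
        tools.append({"google_search": {}})
    return {
        "endpoint": f"{GEMINI_BASE}/models/{model}:generateContent",
        "body": {"contents": [{"parts": [{"text": prompt}]}], "tools": tools},
    }


request = build_generate_content_request(
    urls=["https://www.fda.gov/a"],
    analysis_prompt="List the electronic signature requirements.",
    combine_with_search=True,
)
```

Keeping request construction free of I/O makes it straightforward to unit test and to hash for the audit record before anything is sent.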

3.2 Compliance Document Research Flow

1. Compliance Agent receives regulatory research task
2. Identify authoritative source URLs (FDA.gov, HHS.gov, etc.)
3. For each URL:
   a. Fetch via URL Context with compliance_mode=True
   b. Verify retrieval status is SUCCESS
   c. Generate immutable audit record with:
      - Source URL
      - Retrieval timestamp
      - Content hash
      - Regulatory context
4. Synthesize compliance requirements from fetched content
5. Cross-reference with existing compliance database
6. Generate compliance report with full citation chain
7. Store complete audit trail in FoundationDB
8. Trigger human checkpoint for compliance gate review
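
Step 3b can be illustrated with a small filter over the per-URL retrieval metadata. The key names `retrieved_url` and `url_retrieval_status` and the value `URL_RETRIEVAL_STATUS_SUCCESS` are assumptions about the shape of the metadata returned by the API and should be checked against the actual response; `split_by_retrieval_status` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

SUCCESS_STATUS = "URL_RETRIEVAL_STATUS_SUCCESS"   # assumed enum spelling


def split_by_retrieval_status(
    url_metadata: List[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Return (succeeded_urls, rejected_urls) from per-URL retrieval metadata."""
    succeeded: List[str] = []
    rejected: List[str] = []
    for item in url_metadata:
        url = item.get("retrieved_url", "")
        status = item.get("url_retrieval_status", "")
        # Anything other than an explicit success is rejected, including a
        # missing status, so a gap is never treated as compliance evidence.
        if status == SUCCESS_STATUS:
            succeeded.append(url)
        else:
            rejected.append(url)
    return succeeded, rejected


ok, not_ok = split_by_retrieval_status([
    {"retrieved_url": "https://www.fda.gov/a",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_SUCCESS"},
    {"retrieved_url": "https://www.hhs.gov/b",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_ERROR"},
    {"retrieved_url": "https://www.nist.gov/c"},
])
```

Failing closed here means a source only enters the citation chain when the API has positively confirmed it was retrieved.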

4. Error Handling Design

4.1 Error Categories

Error	Detection	Recovery
URL retrieval failure	`url_retrieval_status: ERROR`	Retry with backoff; fallback to Google Search
Rate limit exceeded	HTTP 429	Exponential backoff; queue requests
Content too large	Fetched content exceeds the 34 MB per-URL limit	Split request; fetch subsections
Timeout	No response within timeout	Retry; increase timeout; flag to orchestrator
Invalid URL	Malformed URL in request	Validate before submission; reject with clear error
Model unavailable	HTTP 503	Failover to alternate model; queue for retry
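
The recovery column can be sketched as two small helpers: one that maps an HTTP status to a recovery action and one that computes an exponential backoff delay. The action names, the base delay of 1 second, and the 30-second cap are illustrative assumptions, as is the treatment of other 4xx and 5xx codes, which the table does not cover.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry `attempt` (0-based): exponential, capped, optional jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    if rng is None:
        return ceiling                    # deterministic upper bound, no jitter
    return rng.uniform(0.0, ceiling)      # full jitter spreads out retries


def classify_http_status(status: int) -> str:
    """Map an HTTP status to a recovery action (names are illustrative)."""
    if status == 429:
        return "backoff_and_queue"        # rate limit exceeded
    if status == 503:
        return "failover_model"           # model unavailable
    if 500 <= status < 600:
        return "retry"                    # assumption: other 5xx are transient
    if 400 <= status < 500:
        return "reject"                   # assumption: other 4xx are caller errors
    return "ok"


delays = [backoff_delay(n) for n in range(4)]
jittered = backoff_delay(3, rng=random.Random(0))
```

With `max_retries: 3`, the un-jittered schedule waits at most 1, 2, and 4 seconds before the final attempt; passing an `rng` adds jitter so that many agents hitting a rate limit do not retry in lockstep.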

4.2 Circuit Breaker Configuration

url_context_breaker = CircuitBreaker(
    failure_threshold=5,       # Open after 5 consecutive failures
    recovery_timeout=120.0,    # Try again after 2 minutes
    half_open_requests=2       # Allow 2 test requests before closing
)

5. Security Design

5.1 URL Allowlisting

For compliance-critical workflows, restrict fetchable URLs to approved domains:

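from urllib.parse import urlparse

COMPLIANCE_APPROVED_DOMAINS = [
    "fda.gov", "www.fda.gov",
    "hhs.gov", "www.hhs.gov",
    "nist.gov", "www.nist.gov",
    "aicpa-cima.com",           # SOC2 standards
    "iso.org",                   # ISO standards
]
# Matching is exact: www.aicpa-cima.com and www.iso.org are not listed,
# so URLs on those hosts are rejected until they are added.

async def validate_url(url: str, compliance_mode: bool) -> bool:
    """Return True if the URL may be fetched in the given mode."""
    parsed = urlparse(url)
    # Require a web scheme and a hostname in every mode, including compliance mode.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if compliance_mode:
        # urlparse lowercases the hostname, so the comparison is case-insensitive.
        return parsed.hostname in COMPLIANCE_APPROVED_DOMAINS
    return True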

5.2 Content Sanitization

All fetched content passes through sanitization before agent processing:

Strip potential injection patterns from fetched HTML
Validate JSON/XML structure before parsing
Log content hashes for integrity verification
Quarantine unexpected content types
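
A minimal sketch of three of these steps (content-type quarantine, hash logging, and stripping of script, style, and comment blocks) is shown below. The allowed content types are an assumed list, `sanitize_fetched` is a hypothetical helper, and regex stripping is a simplification: it is not a complete defence against prompt injection, and a production filter would more likely use an HTML parser.

```python
import hashlib
import re

# Assumed allowlist of media types; the real list is a policy decision.
ALLOWED_CONTENT_TYPES = {"text/html", "text/plain", "application/json",
                         "application/xml", "application/pdf"}

_SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                              re.IGNORECASE | re.DOTALL)
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def sanitize_fetched(content: str, content_type: str) -> dict:
    """Quarantine unexpected types; strip script/style/comment blocks from HTML."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    # The digest covers the content as received, before any stripping.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if media_type not in ALLOWED_CONTENT_TYPES:
        return {"quarantined": True, "sha256": digest, "content": None}
    if media_type == "text/html":
        content = _SCRIPT_OR_STYLE.sub("", content)
        content = _HTML_COMMENT.sub("", content)
    return {"quarantined": False, "sha256": digest, "content": content}


clean = sanitize_fetched("<p>ok</p><script>x()</script><!-- note -->",
                         "text/html; charset=utf-8")
held = sanitize_fetched("binary", "application/octet-stream")
```

Hashing before stripping means the stored digest can later be compared against the source document itself rather than against the sanitized derivative.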

6. Monitoring & Observability

6.1 Metrics

Metric	Type	Alert Threshold
`url_context.request_count`	Counter	N/A
`url_context.cache_hit_ratio`	Gauge	< 30% (investigate)
`url_context.retrieval_success_rate`	Gauge	< 90% (alert)
`url_context.latency_p99`	Histogram	> 15s (alert)
`url_context.token_consumption`	Counter	> budget threshold
`url_context.circuit_breaker_opens`	Counter	> 3/hour (alert)
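
The alert thresholds in the table can be expressed as a single check, sketched below. `evaluate_alerts` and its parameters are hypothetical; in practice these rules would more likely live in the monitoring system's own alert configuration. The token budget threshold is omitted because the table does not give a value.

```python
from typing import List


def evaluate_alerts(cache_hits: int, cache_lookups: int,
                    retrievals_ok: int, retrievals_total: int,
                    latency_p99_s: float,
                    breaker_opens_last_hour: int) -> List[str]:
    """Return the names of metrics that breach the thresholds in the table."""
    breached: List[str] = []
    # Ratios are only evaluated when there is at least one observation.
    if cache_lookups and cache_hits / cache_lookups < 0.30:
        breached.append("url_context.cache_hit_ratio")
    if retrievals_total and retrievals_ok / retrievals_total < 0.90:
        breached.append("url_context.retrieval_success_rate")
    if latency_p99_s > 15.0:
        breached.append("url_context.latency_p99")
    if breaker_opens_last_hour > 3:
        breached.append("url_context.circuit_breaker_opens")
    return breached


quiet = evaluate_alerts(5, 10, 95, 100, 2.3, 0)
noisy = evaluate_alerts(1, 10, 80, 100, 20.0, 4)
```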

6.2 Structured Logging

{
  "event": "url_context_request",
  "agent_id": "researcher-001",
  "urls_requested": 3,
  "urls_succeeded": 3,
  "cache_hits": 1,
  "total_tokens": 4521,
  "latency_ms": 2340,
  "model": "gemini-2.5-flash",
  "compliance_mode": true
}
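
An entry of this shape could be produced with the standard library alone, as in the sketch below. `url_context_event` is a hypothetical helper, and the logger name `web_intelligence` is an assumption borrowed from the cache namespace.

```python
import json
import logging
from typing import List

logger = logging.getLogger("web_intelligence")   # assumed logger name


def url_context_event(agent_id: str, statuses: List[bool], cache_hits: int,
                      total_tokens: int, latency_ms: int, model: str,
                      compliance_mode: bool) -> str:
    """Serialize one request as a single-line JSON log entry and log it."""
    event = {
        "event": "url_context_request",
        "agent_id": agent_id,
        "urls_requested": len(statuses),
        "urls_succeeded": sum(statuses),      # True counts as 1
        "cache_hits": cache_hits,
        "total_tokens": total_tokens,
        "latency_ms": latency_ms,
        "model": model,
        "compliance_mode": compliance_mode,
    }
    line = json.dumps(event, separators=(",", ":"))
    logger.info(line)
    return line


line = url_context_event("researcher-001", [True, True, True], 1,
                         4521, 2340, "gemini-2.5-flash", True)
parsed_event = json.loads(line)
```

Emitting one JSON object per line keeps the entries easy to ingest with common log pipelines.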

7. Testing Strategy

7.1 Unit Tests

URL validation logic
Cache TTL calculations
Audit record generation
Error classification and recovery selection

7.2 Integration Tests

End-to-end URL Context API calls (with test URLs)
Cache read-through behavior
Circuit breaker state transitions
Retry logic under simulated failures

7.3 Compliance Tests

Audit trail completeness verification
URL allowlist enforcement
Content hash integrity checks
Regulatory document retrieval success rates

8. Deployment Considerations

8.1 Environment Configuration

web_intelligence:
  gemini_api:
    endpoint: "https://generativelanguage.googleapis.com/v1beta"
    preferred_model: "gemini-2.5-flash"
    compliance_model: "gemini-2.5-pro"  # Higher quality for regulatory
    max_retries: 3
    timeout_seconds: 30
    rate_limit_buffer: 0.8  # Use 80% of available rate limit
  
  cache:
    backend: "foundationdb"
    namespace: "web_intelligence"
    default_ttl_hours: 6
    compliance_ttl_hours: 24
    max_cached_size_mb: 100
  
  audit:
    enabled: true
    retention_days: 2555  # 7 years for FDA compliance
    storage: "foundationdb"
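
As a worked example of `rate_limit_buffer`, the sketch below derives the request rate the layer would allow itself from an upstream quota. `effective_requests_per_minute` and the quota values are hypothetical; actual Gemini quotas depend on the model and account tier.

```python
import math


def effective_requests_per_minute(quota_per_minute: int,
                                  rate_limit_buffer: float = 0.8) -> int:
    """Requests per minute the layer may issue, rounded down to stay under quota."""
    if not 0.0 < rate_limit_buffer <= 1.0:
        raise ValueError("rate_limit_buffer must be in (0, 1]")
    return math.floor(quota_per_minute * rate_limit_buffer)


budget_small = effective_requests_per_minute(60)      # 60 * 0.8
budget_large = effective_requests_per_minute(1000)    # 1000 * 0.8
retention_days = 7 * 365                              # matches retention_days above
```

Note that `retention_days: 2555` corresponds to 7 × 365 and does not count leap days.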

Document maintained by Coditect Architecture Team
Review cycle: Monthly or on significant API changes
