Software Design Document (SDD): Coditect Web Intelligence Layer
Document ID: SDD-2026-0204-003
Date: February 4, 2026
Version: 1.0
Status: Draft
Authors: Coditect Architecture Team
1. Introduction
1.1 Purpose
This SDD defines the design for integrating Gemini API URL Context capabilities into the Coditect autonomous development platform. The integration creates a Web Intelligence Layer that enables Coditect's multi-agent system to autonomously research, analyze, and ground decisions in live web content.
1.2 Scope
| In Scope | Out of Scope |
|---|---|
| URL Context tool adapter for Coditect agents | Authenticated/private URL access |
| Compliance document research agent | Custom web scraping infrastructure |
| FoundationDB caching for fetched content | YouTube/video content processing |
| Audit trail for compliance grounding | Google Workspace document access |
| Multi-agent delegation patterns | Real-time streaming URL monitoring |
1.3 Referenced Documents
- ADR-2026-007: Adoption of Gemini URL Context for Web Intelligence
- ADR-2026-008: REST API vs SDK Strategy for Gemini Integration
- ADR-2026-009: Compliance Document Caching Architecture
- EXEC-2026-0204-001: Executive Summary
- TECH-2026-0204-002: Detailed Technical Analysis
2. System Architecture
2.1 High-Level Architecture
The Web Intelligence Layer sits between Coditect's Agent Orchestrator and the Gemini API, providing a unified interface for web content access with compliance-aware caching and audit capabilities.
┌─────────────────────────────────────────────────────────┐
│                    CODITECT PLATFORM                    │
│                                                         │
│  ┌──────────────┐   ┌──────────────┐    ┌────────────┐  │
│  │ Orchestrator │   │  Researcher  │    │ Compliance │  │
│  │    Agent     │   │    Agents    │    │   Agent    │  │
│  └──────┬───────┘   └──────┬───────┘    └─────┬──────┘  │
│         │                  │                  │         │
│  ┌──────┴──────────────────┴──────────────────┴──────┐  │
│  │              WEB INTELLIGENCE LAYER               │  │
│  │                                                   │  │
│  │  ┌─────────────┐  ┌────────────┐  ┌────────────┐  │  │
│  │  │ URL Context │  │  Content   │  │   Audit    │  │  │
│  │  │   Adapter   │  │   Cache    │  │   Trail    │  │  │
│  │  └──────┬──────┘  └─────┬──────┘  └─────┬──────┘  │  │
│  │         │               │               │         │  │
│  │  ┌──────┴───────────────┴───────────────┴──────┐  │  │
│  │  │          FoundationDB State Store           │  │  │
│  │  └─────────────────────┬───────────────────────┘  │  │
│  └────────────────────────┼──────────────────────────┘  │
└───────────────────────────┼─────────────────────────────┘
                            │
                   ┌────────┴────────┐
                   │   Gemini API    │
                   │  URL Context +  │
                   │  Google Search  │
                   └─────────────────┘
2.2 Component Design
2.2.1 URL Context Adapter
Responsibility: Wraps the Gemini API URL Context tool into Coditect's standardized tool interface.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# CacheStrategy, URLMetadata, and TokenUsage are not defined in this document;
# AuditRecord is specified in 2.2.3.

@dataclass(frozen=True)
class URLContextRequest:
    """Request to fetch and analyze URL content."""
    urls: List[str]                     # 1-20 URLs
    analysis_prompt: str                # What to extract/analyze
    model: str = "gemini-2.5-flash"     # Model selection
    combine_with_search: bool = False   # Also enable Google Search
    compliance_mode: bool = False       # Enable audit trail
    cache_strategy: CacheStrategy = CacheStrategy.READ_THROUGH
    max_retries: int = 3
    timeout_seconds: float = 30.0

@dataclass
class URLContextResponse:
    """Response from URL Context analysis."""
    content: str                         # Generated analysis
    url_metadata: List[URLMetadata]      # Retrieval status per URL
    token_usage: TokenUsage              # Input/output tokens consumed
    cached: bool                         # Whether result was from local cache
    audit_record: Optional[AuditRecord]  # Compliance audit trail
    timestamp: datetime
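The 1-20 URL bound noted on the `urls` field could be enforced when the request is constructed. The following is a minimal sketch under that assumption; `ValidatedURLContextRequest` and `MAX_URLS_PER_REQUEST` are hypothetical names used only for illustration, and the class carries a reduced set of fields.

```python
from dataclasses import dataclass
from typing import List

MAX_URLS_PER_REQUEST = 20  # taken from the "1-20 URLs" field comment


@dataclass(frozen=True)
class ValidatedURLContextRequest:
    """Hypothetical reduced request that rejects out-of-range input early."""
    urls: List[str]
    analysis_prompt: str

    def __post_init__(self) -> None:
        # Reading fields is allowed on a frozen dataclass; only assignment is blocked.
        if not 1 <= len(self.urls) <= MAX_URLS_PER_REQUEST:
            raise ValueError(
                f"expected 1-{MAX_URLS_PER_REQUEST} URLs, got {len(self.urls)}"
            )
        if not self.analysis_prompt.strip():
            raise ValueError("analysis_prompt must not be empty")
```

Failing at construction time keeps malformed requests from reaching the cache or the API.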
2.2.2 Content Cache
Responsibility: FoundationDB-backed cache to reduce redundant API calls and maintain compliance document snapshots.
from datetime import timedelta

# ContentCategory and CachedContent are not defined in this document.

class ContentCache:
    """FoundationDB-backed URL content cache with TTL management."""
    NAMESPACE = "web_intelligence"

    # Cache TTL by content category
    TTL_MAP = {
        ContentCategory.REGULATORY: timedelta(hours=24),     # FDA/HIPAA docs
        ContentCategory.DOCUMENTATION: timedelta(hours=6),   # API docs
        ContentCategory.NEWS: timedelta(hours=1),            # News/blog posts
        ContentCategory.STATIC: timedelta(days=7),           # Reference material
        ContentCategory.VOLATILE: timedelta(minutes=30),     # Frequently updated
    }

    async def get_or_fetch(
        self,
        url: str,
        category: ContentCategory,
        force_refresh: bool = False
    ) -> CachedContent:
        """Read-through cache with TTL-based invalidation."""
        ...
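The read-through behavior of `get_or_fetch` can be illustrated with an in-memory stand-in. This is a sketch, not the FoundationDB implementation: `InMemoryContentCache`, `CachedEntry`, and the injected `fetch` and `now` callables are hypothetical, and the method is synchronous to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict


@dataclass
class CachedEntry:
    content: str
    fetched_at: datetime


class InMemoryContentCache:
    """Illustrative dict-backed stand-in for the FoundationDB cache."""

    def __init__(
        self,
        ttl: timedelta,
        fetch: Callable[[str], str],
        now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    ) -> None:
        self._ttl = ttl
        self._fetch = fetch
        self._now = now
        self._store: Dict[str, CachedEntry] = {}

    def get_or_fetch(self, url: str, force_refresh: bool = False) -> CachedEntry:
        entry = self._store.get(url)
        # An entry is fresh while its age is strictly below the TTL.
        fresh = entry is not None and self._now() - entry.fetched_at < self._ttl
        if fresh and not force_refresh:
            return entry
        # Miss, expired, or forced: fetch and overwrite the stored entry.
        entry = CachedEntry(content=self._fetch(url), fetched_at=self._now())
        self._store[url] = entry
        return entry
```

Injecting the clock makes TTL expiry testable without sleeping.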
2.2.3 Audit Trail
Responsibility: Records every URL Context call as compliance evidence. Required for FDA 21 CFR Part 11, HIPAA, and SOC 2 audit trails.
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# AgentRole and TokenUsage are not defined in this document.

@dataclass(frozen=True)
class AuditRecord:
    """Immutable audit record for compliance."""
    record_id: str
    timestamp: datetime
    agent_id: str
    agent_role: AgentRole
    urls_requested: List[str]
    urls_retrieved: List[str]
    retrieval_statuses: Dict[str, str]
    model_used: str
    tokens_consumed: TokenUsage
    cache_hit: bool
    prompt_hash: str                   # SHA-256 of the analysis prompt
    response_hash: str                 # SHA-256 of the generated response
    compliance_context: Optional[str]  # e.g., "FDA 21 CFR Part 11.10(e)"
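The `prompt_hash` and `response_hash` fields can be computed with the standard library. A minimal sketch follows; the helper names `sha256_hex` and `audit_hashes` are hypothetical, and UTF-8 encoding of the text before hashing is an assumption.

```python
import hashlib
from typing import Dict


def sha256_hex(text: str) -> str:
    """Return the lowercase hex SHA-256 digest of UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_hashes(analysis_prompt: str, response_text: str) -> Dict[str, str]:
    """Build the two hash fields of an audit record (assumed UTF-8 input)."""
    return {
        "prompt_hash": sha256_hex(analysis_prompt),
        "response_hash": sha256_hex(response_text),
    }
```

Storing hashes instead of full text keeps records compact while still letting an auditor verify that an archived prompt or response is unmodified.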
2.3 Agent Integration Patterns
2.3.1 Researcher Agent with URL Context
researcher_task = SubagentTask(
    objective="Research FDA 21 CFR Part 11 electronic signature requirements",
    output_format={
        "requirements": ["list of requirements"],
        "source_urls": ["urls referenced"],
        "confidence": "high|medium|low",
        "last_verified": "ISO timestamp"
    },
    tool_priorities=["url_context", "google_search"],
    boundaries=[
        "Only use authoritative sources (.gov, .org)",
        "Include URL retrieval metadata in response",
        "Flag any content that may be outdated"
    ],
    effort_budget=5,
    success_criteria=[
        "All requirements cited with source URLs",
        "Retrieval status verified for each source"
    ],
    role=AgentRole.RESEARCHER,
    regulatory_frameworks=["fda_21cfr11"],
    checkpoint_required=True
)
2.3.2 Orchestrator-Workers with URL Context
import asyncio
from typing import List

class WebIntelligenceOrchestrator:
    """Orchestrator pattern for multi-URL research tasks."""

    async def research_and_analyze(
        self,
        research_topic: str,
        authoritative_urls: List[str],
        analysis_depth: str = "comprehensive"
    ) -> ResearchReport:
        # Phase 1: Parallel URL fetching (Worker pattern)
        fetch_tasks = [
            self.url_adapter.fetch_and_analyze(
                url=url,
                prompt=f"Extract all information relevant to: {research_topic}",
                compliance_mode=True
            )
            for url in authoritative_urls
        ]
        # gather() preserves input order and raises on the first failed fetch
        results = await asyncio.gather(*fetch_tasks)

        # Phase 2: Synthesis (Orchestrator)
        synthesis = await self.synthesize_research(
            topic=research_topic,
            source_analyses=results,
            depth=analysis_depth
        )

        # Phase 3: Quality validation (Evaluator-Optimizer)
        validated = await self.validate_and_refine(
            synthesis=synthesis,
            source_metadata=[r.url_metadata for r in results]
        )
        return validated
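Phase 1 fans out one fetch per URL with no concurrency limit, and a single failure aborts the whole gather. One possible variation, sketched here under the assumption that partial results are acceptable, bounds concurrency with a semaphore and collects failures instead of raising. `fetch_all`, `fetch_one`, and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Tuple[List[str], List[Tuple[str, BaseException]]]:
    """Fetch URLs concurrently; return (successes, [(url, error), ...])."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        # At most max_concurrency fetches run at the same time.
        async with semaphore:
            return await fetch_one(url)

    # return_exceptions=True places each exception in the result list
    # at the position of the URL that caused it.
    outcomes = await asyncio.gather(
        *(guarded(u) for u in urls), return_exceptions=True
    )
    successes = [o for o in outcomes if not isinstance(o, BaseException)]
    failures = [
        (u, o) for u, o in zip(urls, outcomes) if isinstance(o, BaseException)
    ]
    return successes, failures
```

The caller can then decide whether the failed URLs justify a retry, a fallback, or a flag on the final report.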
3. Data Flow Design
3.1 Standard Request Flow
1. Agent receives task requiring web research
2. Agent constructs URLContextRequest
3. Web Intelligence Layer:
   a. Check local FoundationDB cache
   b. If cache hit AND within TTL → Use cached content
   c. If cache miss OR expired:
      i. Build REST API request to Gemini
      ii. Include url_context tool configuration
      iii. Submit request with retry logic
      iv. Parse response + url_context_metadata
      v. Store in cache with appropriate TTL
   d. Generate audit record (if compliance_mode), with cache_hit set accordingly
4. Return URLContextResponse to agent
5. Agent incorporates grounded content into task output
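Steps 3.c.i and 3.c.ii can be sketched as a pure function that assembles the request without sending it. The function name and return shape are hypothetical. The `url_context` and `google_search` tool entries and the `generateContent` path reflect the public Gemini REST documentation as understood at the time of writing and should be checked against the current API reference. The URLs are placed in the prompt text, which is where the URL Context tool reads them from.

```python
from typing import Any, Dict, List

# Base endpoint as listed in the environment configuration (section 8.1).
GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_generate_content_request(
    urls: List[str],
    analysis_prompt: str,
    model: str = "gemini-2.5-flash",
    combine_with_search: bool = False,
) -> Dict[str, Any]:
    """Return the target URL and JSON body for a URL Context call (not sent)."""
    if not 1 <= len(urls) <= 20:
        raise ValueError("URL Context requests take 1-20 URLs")

    # The tool discovers URLs from the prompt itself.
    prompt = analysis_prompt + "\n\nSources:\n" + "\n".join(urls)

    tools: List[Dict[str, Any]] = [{"url_context": {}}]
    if combine_with_search:
        tools.append({"google_search": {}})

    return {
        "url": f"{GEMINI_BASE_URL}/models/{model}:generateContent",
        "body": {
            "contents": [{"parts": [{"text": prompt}]}],
            "tools": tools,
        },
    }
```

Keeping request construction separate from transport makes the payload easy to unit test and to log for audit purposes.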
3.2 Compliance Document Research Flow
1. Compliance Agent receives regulatory research task
2. Identify authoritative source URLs (FDA.gov, HHS.gov, etc.)
3. For each URL:
   a. Fetch via URL Context with compliance_mode=True
   b. Verify retrieval status is SUCCESS
   c. Generate immutable audit record with:
      - Source URL
      - Retrieval timestamp
      - Content hash
      - Regulatory context
4. Synthesize compliance requirements from fetched content
5. Cross-reference with existing compliance database
6. Generate compliance report with full citation chain
7. Store complete audit trail in FoundationDB
8. Trigger human checkpoint for compliance gate review
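Step 3.b can be sketched as a partition of the per-URL retrieval metadata, so that only successfully retrieved sources feed the synthesis step. The dictionary keys (`retrieved_url`, `url_retrieval_status`) and the status string are assumptions about how the adapter exposes the metadata, and `partition_by_status` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Assumed value reported for a successful retrieval.
SUCCESS_STATUS = "URL_RETRIEVAL_STATUS_SUCCESS"


def partition_by_status(
    url_metadata: List[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Split metadata entries into (retrieved_urls, failed_urls)."""
    retrieved: List[str] = []
    failed: List[str] = []
    for item in url_metadata:
        # Anything other than an explicit success counts as a failure,
        # including a missing status field.
        if item.get("url_retrieval_status") == SUCCESS_STATUS:
            retrieved.append(item["retrieved_url"])
        else:
            failed.append(item["retrieved_url"])
    return retrieved, failed
```

Treating unknown or missing statuses as failures is the conservative choice for a compliance gate: a source is only cited when its retrieval was positively confirmed.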
4. Error Handling Design
4.1 Error Categories
| Error | Detection | Recovery |
|---|---|---|
| URL retrieval failure | url_retrieval_status: URL_RETRIEVAL_STATUS_ERROR | Retry with backoff; fallback to Google Search |
| Rate limit exceeded | HTTP 429 | Exponential backoff; queue requests |
| Content too large | 34MB limit exceeded | Split request; fetch subsections |
| Timeout | No response within timeout | Retry; increase timeout; flag to orchestrator |
| Invalid URL | Malformed URL in request | Validate before submission; reject with clear error |
| Model unavailable | HTTP 503 | Failover to alternate model; queue for retry |
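
The detection column and the backoff recovery can be sketched as two small functions. This is an illustration only: `ErrorKind`, `classify_error`, and `backoff_delay` are hypothetical names, the base delay and cap are assumed values, and only the four error categories detectable from response signals are covered (invalid URLs and oversized content would be caught elsewhere).

```python
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    URL_RETRIEVAL_FAILURE = "url_retrieval_failure"
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"
    MODEL_UNAVAILABLE = "model_unavailable"


def classify_error(
    http_status: Optional[int] = None,
    retrieval_status: Optional[str] = None,
    timed_out: bool = False,
) -> Optional[ErrorKind]:
    """Map response signals to an error category, or None if no error."""
    if timed_out:
        return ErrorKind.TIMEOUT
    if http_status == 429:
        return ErrorKind.RATE_LIMITED
    if http_status == 503:
        return ErrorKind.MODEL_UNAVAILABLE
    if retrieval_status is not None and retrieval_status != "URL_RETRIEVAL_STATUS_SUCCESS":
        return ErrorKind.URL_RETRIEVAL_FAILURE
    return None


def backoff_delay(attempt: int, base_seconds: float = 1.0, cap_seconds: float = 30.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at cap_seconds."""
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

In production a random jitter is commonly added to the delay so that many agents hitting the same rate limit do not retry in lockstep.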
4.2 Circuit Breaker Configuration
url_context_breaker = CircuitBreaker(
    failure_threshold=5,     # Open after 5 consecutive failures
    recovery_timeout=120.0,  # Try again after 2 minutes
    half_open_requests=2     # Allow 2 test requests before closing
)
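The three parameters map onto a closed / open / half-open state machine. The sketch below is one possible reading, not the `CircuitBreaker` class used above: `SimpleCircuitBreaker` is a hypothetical name, and it interprets `half_open_requests` as the number of successful probes needed to close the circuit again.

```python
import time
from typing import Callable, Optional


class SimpleCircuitBreaker:
    """Minimal closed/open/half-open breaker with an injectable clock."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 120.0,
        half_open_requests: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._recovery_timeout = recovery_timeout
        self._half_open_requests = half_open_requests
        self._clock = clock
        self._failures = 0
        self._probe_successes = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._recovery_timeout:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        if self.state == "half_open":
            self._probe_successes += 1
            if self._probe_successes >= self._half_open_requests:
                # Enough successful probes: close the circuit.
                self._opened_at = None
                self._failures = 0
                self._probe_successes = 0
        elif self.state == "closed":
            self._failures = 0  # a success breaks the consecutive-failure run

    def record_failure(self) -> None:
        state = self.state
        if state == "closed":
            self._failures += 1
        if state == "half_open" or self._failures >= self._failure_threshold:
            # Trip (or re-trip) the breaker and restart the recovery timer.
            self._opened_at = self._clock()
            self._probe_successes = 0
```

A caller checks `allow_request()` before each Gemini call and reports the outcome with `record_success()` or `record_failure()`.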
5. Security Design
5.1 URL Allowlisting
For compliance-critical workflows, restrict fetchable URLs to approved domains:
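from urllib.parse import urlparse

COMPLIANCE_APPROVED_DOMAINS = [
    "fda.gov", "www.fda.gov",
    "hhs.gov", "www.hhs.gov",
    "nist.gov", "www.nist.gov",
    "aicpa-cima.com",  # SOC 2 standards
    "iso.org",         # ISO standards
]

async def validate_url(url: str, compliance_mode: bool) -> bool:
    """Return True if the URL may be fetched under the given mode."""
    parsed = urlparse(url)
    # Only web URLs with a host are fetchable, in either mode.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if compliance_mode:
        # Exact hostname match: subdomains must be listed explicitly.
        return parsed.hostname in COMPLIANCE_APPROVED_DOMAINS
    return True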
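Exact hostname matching requires every subdomain to be listed by hand. If the policy is instead to trust all subdomains of an approved domain, a suffix check on a label boundary is one option. This is a sketch under that assumption; `host_is_approved` and `is_compliance_fetchable` are hypothetical helpers, and requiring HTTPS in compliance mode is an added assumption rather than a stated requirement.

```python
from typing import Iterable
from urllib.parse import urlparse


def host_is_approved(hostname: str, approved_domains: Iterable[str]) -> bool:
    """True if hostname equals an approved domain or is a subdomain of one."""
    host = hostname.lower().rstrip(".")
    # The leading dot prevents "notfda.gov" from matching "fda.gov".
    return any(host == d or host.endswith("." + d) for d in approved_domains)


def is_compliance_fetchable(url: str, approved_domains: Iterable[str]) -> bool:
    """Allow only HTTPS URLs whose host falls under an approved domain."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    return host_is_approved(parsed.hostname, approved_domains)
```

With this variant the allowlist needs only the registrable domains, for example `["fda.gov", "hhs.gov", "nist.gov"]`.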
5.2 Content Sanitization
All fetched content passes through sanitization before agent processing:
- Strip potential injection patterns from fetched HTML
- Validate JSON/XML structure before parsing
- Log content hashes for integrity verification
- Quarantine unexpected content types
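Three of these checks (structure validation, hash logging, and quarantine by content type) can be sketched together, assuming the layer has access to a declared content type and the raw body. `screen_content` and the `ALLOWED_CONTENT_TYPES` set are hypothetical; the allowed types shown are placeholders, not an approved list.

```python
import hashlib
import json
from typing import Any, Dict

# Placeholder allowlist; the real set would come from security policy.
ALLOWED_CONTENT_TYPES = {"text/html", "text/plain", "application/json", "application/pdf"}


def screen_content(content_type: str, body: bytes) -> Dict[str, Any]:
    """Hash the body and decide whether it should be quarantined."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    result: Dict[str, Any] = {
        "media_type": media_type,
        "sha256": hashlib.sha256(body).hexdigest(),
        "quarantined": False,
        "reason": None,
    }
    if media_type not in ALLOWED_CONTENT_TYPES:
        result.update(quarantined=True, reason="unexpected content type")
    elif media_type == "application/json":
        try:
            json.loads(body.decode("utf-8"))
        except ValueError:
            # Covers both UnicodeDecodeError and json.JSONDecodeError.
            result.update(quarantined=True, reason="invalid JSON")
    return result
```

The hash is computed before any decision so that quarantined content can still be identified in the audit trail.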
6. Monitoring & Observability
6.1 Metrics
| Metric | Type | Alert Threshold |
|---|---|---|
| url_context.request_count | Counter | N/A |
| url_context.cache_hit_ratio | Gauge | < 30% (investigate) |
| url_context.retrieval_success_rate | Gauge | < 90% (alert) |
| url_context.latency_p99 | Histogram | > 15s (alert) |
| url_context.token_consumption | Counter | > budget threshold |
| url_context.circuit_breaker_opens | Counter | > 3/hour (alert) |
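
The ratio-based thresholds in the table can be evaluated from raw counters. The sketch below covers three of the rows; `evaluate_alerts` and its parameter names are hypothetical, and in practice this logic would usually live in the monitoring system's alert rules rather than in application code.

```python
from typing import List


def evaluate_alerts(
    cache_hits: int,
    cache_lookups: int,
    urls_succeeded: int,
    urls_requested: int,
    latency_p99_seconds: float,
) -> List[str]:
    """Return the names of the metrics whose thresholds are breached."""
    alerts: List[str] = []
    # Skip ratio checks when there is no traffic, to avoid dividing by zero.
    if cache_lookups and cache_hits / cache_lookups < 0.30:
        alerts.append("url_context.cache_hit_ratio")
    if urls_requested and urls_succeeded / urls_requested < 0.90:
        alerts.append("url_context.retrieval_success_rate")
    if latency_p99_seconds > 15.0:
        alerts.append("url_context.latency_p99")
    return alerts
```

A success rate of exactly 90% does not alert, matching the strict "less than" in the table.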
6.2 Structured Logging
{
  "event": "url_context_request",
  "agent_id": "researcher-001",
  "urls_requested": 3,
  "urls_succeeded": 3,
  "cache_hits": 1,
  "total_tokens": 4521,
  "latency_ms": 2340,
  "model": "gemini-2.5-flash",
  "compliance_mode": true
}
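An event of this shape can be emitted with the standard `json` and `logging` modules. This is a minimal sketch; the logger name `web_intelligence` and the helper `log_url_context_request` are assumptions, and a real deployment might use a dedicated structured-logging library instead.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("web_intelligence")  # assumed logger name


def log_url_context_request(**fields: Any) -> str:
    """Serialize one url_context_request event as a single JSON line."""
    # sort_keys gives a stable field order, which simplifies diffing and tests.
    line = json.dumps({"event": "url_context_request", **fields}, sort_keys=True)
    logger.info(line)
    return line
```

For example, `log_url_context_request(agent_id="researcher-001", urls_requested=3, compliance_mode=True)` produces one line that log pipelines can parse without custom patterns.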
7. Testing Strategy
7.1 Unit Tests
- URL validation logic
- Cache TTL calculations
- Audit record generation
- Error classification and recovery selection
7.2 Integration Tests
- End-to-end URL Context API calls (with test URLs)
- Cache read-through behavior
- Circuit breaker state transitions
- Retry logic under simulated failures
7.3 Compliance Tests
- Audit trail completeness verification
- URL allowlist enforcement
- Content hash integrity checks
- Regulatory document retrieval success rates
8. Deployment Considerations
8.1 Environment Configuration
web_intelligence:
  gemini_api:
    endpoint: "https://generativelanguage.googleapis.com/v1beta"
    preferred_model: "gemini-2.5-flash"
    compliance_model: "gemini-2.5-pro"  # Higher quality for regulatory
    max_retries: 3
    timeout_seconds: 30
    rate_limit_buffer: 0.8  # Use 80% of available rate limit
  cache:
    backend: "foundationdb"
    namespace: "web_intelligence"
    default_ttl_hours: 6
    compliance_ttl_hours: 24
    max_cached_size_mb: 100
  audit:
    enabled: true
    retention_days: 2555  # 7 years for FDA compliance
    storage: "foundationdb"
Document maintained by Coditect Architecture Team
Review cycle: Monthly or on significant API changes