ADR-020: CODITECT Server - Central Orchestration System (v4)
Table of Contents
- 1. Document Information
- 2. Purpose of this ADR
- 3. User Story Context
- 4. Executive Summary
- 5. Visual Overview
- 6. Background & Problem
- 7. Decision
- 8. Implementation Blueprint
- 9. Testing Strategy
- 10. Security Considerations
- 11. Performance Characteristics
- 12. Operational Considerations
- 13. Migration Strategy
- 14. Consequences
- 15. References & Standards
- 16. Appendix
- 17. Review & Approval
- 18. QA Review Block
1. Document Information 🔴 REQUIRED
| Field | Value |
|---|---|
| ADR Number | ADR-020 |
| Title | CODITECT Server - Central Orchestration System |
| Status | Draft |
| Date Created | 2025-08-30 |
| Last Modified | 2025-08-30 |
| Version | 1.0 |
| Decision Makers | Architecture Team |
| Stakeholders | Platform Team, DevOps, Security, All Users |
2. Purpose of this ADR 🔴 REQUIRED
This ADR serves dual purposes:
- Part 1 - Narrative (Layperson): Explains CODITECT Server as a hotel management system that assigns rooms (containers) to guests (users)
- Part 2 - Technical (Agentic): Provides exact implementation specifications for building the orchestration layer
3. User Story Context 🔴 REQUIRED
"As a CODITECT user, I want to log in once and instantly access my personalized development environment from any browser, with all my work automatically saved and my session surviving browser crashes, so that I can focus on coding without worrying about infrastructure or losing work."
Acceptance Criteria:
- Login to workspace in < 3 seconds
- Automatic session recovery on disconnect
- Zero data loss on container restart
- Seamless experience across devices
- No local installation required
4. Executive Summary 🔴 REQUIRED
Y-Statement
In the context of distributed development environments, facing the challenge of providing instant, isolated, cost-effective workspaces, we decided to create CODITECT Server as a central orchestration system to achieve instant container provisioning, automatic state persistence, and centralized management, accepting increased architectural complexity.
Key Points:
- Transforms ephemeral containers into persistent workspaces
- Manages entire lifecycle: auth → container → state → monitoring
- Enables 10,000+ concurrent users with sub-3-second startup
- Reduces infrastructure costs by roughly 70% through intelligent pooling (67% in the Appendix cost model)
5. Visual Overview 🔴 REQUIRED
6. Background & Problem 🔴 REQUIRED
6.1 Current Problems
- Fragmented Architecture: Separate API, WebSocket, Auth services
- Manual Container Management: No automated lifecycle handling
- No State Persistence: Work lost on container restart
- Poor Resource Utilization: Containers run idle
- Limited Observability: Difficult to debug issues
6.2 Root Causes
- Lack of central orchestration
- No unified session management
- Missing state persistence layer
- No resource optimization strategy
6.3 Impact
- High operational costs ($50K/month for 1000 users)
- Poor user experience (5-minute startup times)
- Data loss incidents
- Limited scalability
7. Decision 🔴 REQUIRED
7.1 Chosen Solution
Implement CODITECT Server as the central orchestration system ("mothership") that manages all platform operations through intelligent container orchestration, state persistence, and unified monitoring.
7.2 Core Components
7.2.1 Container Orchestrator
```rust
pub struct ContainerOrchestrator {
    pool_manager: ContainerPool,
    lifecycle_manager: LifecycleManager,
    health_checker: HealthChecker,
    cost_optimizer: CostOptimizer,
}
```
7.2.2 Session Manager
```rust
pub struct SessionManager {
    active_sessions: HashMap<UserId, Session>,
    persistence: FoundationDBClient,
    recovery_engine: RecoveryEngine,
}
```
7.2.3 Request Router
```rust
pub struct RequestRouter {
    routing_table: RoutingTable,
    load_balancer: LoadBalancer,
    failover_handler: FailoverHandler,
}
```
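Taken together, the three components form a simple per-request pipeline: resolve the user's session, ensure a container exists, then route the request to it. A minimal std-only sketch of that flow (all type and method names here are hypothetical stand-ins, not the real component APIs):

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the three components above.
struct SessionManager {
    active: HashMap<String, String>, // user_id -> container URL
}
struct ContainerOrchestrator;
struct RequestRouter;

impl ContainerOrchestrator {
    // Pretend to launch a container and return its URL.
    fn ensure_container(&self, user_id: &str) -> String {
        format!("https://user-{}.run.example", user_id)
    }
}

impl RequestRouter {
    // Route a request path to the user's assigned container URL.
    fn route(&self, container_url: &str, path: &str) -> String {
        format!("{}{}", container_url, path)
    }
}

fn handle_request(
    sessions: &mut SessionManager,
    orch: &ContainerOrchestrator,
    router: &RequestRouter,
    user_id: &str,
    path: &str,
) -> String {
    // 1. Resolve the session, launching a container on first contact.
    let url = sessions
        .active
        .entry(user_id.to_string())
        .or_insert_with(|| orch.ensure_container(user_id))
        .clone();
    // 2. Route the request to the assigned container.
    router.route(&url, path)
}

fn main() {
    let mut sessions = SessionManager { active: HashMap::new() };
    let target = handle_request(&mut sessions, &ContainerOrchestrator, &RequestRouter, "42", "/files");
    println!("{}", target); // https://user-42.run.example/files
}
```

A second request for the same user hits the cached session entry and skips the launch step, which is the property the pool and session layers below build on.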
7.3 Alternatives Considered
- Direct Container Access: Rejected - no central management
- Kubernetes Operator: Rejected - too complex for users
- Serverless Functions: Rejected - cold start issues
- Traditional VMs: Rejected - too expensive
8. Implementation Blueprint 🔴 REQUIRED
8.1 Architecture Components
8.1.1 Main Server Entry Point
```rust
// src/main.rs
use actix_web::{web, App, HttpServer};
use crate::orchestrator::ContainerOrchestrator;
use crate::session::SessionManager;
use crate::router::RequestRouter;

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Initialize components
    let orchestrator = ContainerOrchestrator::new().await?;
    let session_mgr = SessionManager::new().await?;
    let router = RequestRouter::new().await?;

    // Start HTTP server
    HttpServer::new(move || {
        App::new()
            .app_data(web::Data::new(orchestrator.clone()))
            .app_data(web::Data::new(session_mgr.clone()))
            .app_data(web::Data::new(router.clone()))
            .service(auth_routes())
            .service(container_routes())
            .service(session_routes())
            .service(admin_routes())
    })
    .bind("0.0.0.0:8080")?
    .run()
    .await
}
```
8.1.2 Container Lifecycle Management
```rust
// src/orchestrator/lifecycle.rs
impl ContainerOrchestrator {
    pub async fn launch_container(&self, user: &User) -> Result<Container> {
        // 1. Check the container pool for a pre-warmed instance
        if let Some(container) = self.pool_manager.get_available().await? {
            self.assign_to_user(&container, user).await?;
            return Ok(container);
        }

        // 2. Launch a new Cloud Run container
        let container = self.cloud_run_client
            .create_service(&ServiceConfig {
                name: format!("user-{}", user.id),
                image: "coditect/workspace:latest",
                memory: "4Gi",
                cpu: "2",
                env: vec![
                    ("USER_ID", &user.id),
                    ("FDB_CLUSTER", &self.fdb_cluster),
                ],
                max_instances: 1,
                min_instances: 0,
            })
            .await?;

        // 3. Wait until the container reports ready
        self.wait_for_ready(&container).await?;

        // 4. Initialize CODI inside the container
        self.init_codi(&container, user).await?;

        Ok(container)
    }
}
```
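The pool check in step 1 reduces to a FIFO of pre-warmed instances with a low watermark that triggers background refill. A std-only sketch (struct, field, and method names are hypothetical; the real pool would be async and backed by Cloud Run instances):

```rust
use std::collections::VecDeque;

// Hypothetical pre-warmed container pool.
struct ContainerPool {
    warm: VecDeque<String>, // URLs of idle, pre-warmed containers
    low_watermark: usize,   // refill trigger
}

impl ContainerPool {
    fn new(low_watermark: usize) -> Self {
        Self { warm: VecDeque::new(), low_watermark }
    }

    fn add_prewarmed(&mut self, url: String) {
        self.warm.push_back(url);
    }

    // Take the oldest pre-warmed container, if any, and report whether
    // a background refill should be kicked off.
    fn get_available(&mut self) -> (Option<String>, bool) {
        let container = self.warm.pop_front();
        let needs_refill = self.warm.len() < self.low_watermark;
        (container, needs_refill)
    }
}

fn main() {
    let mut pool = ContainerPool::new(2);
    for i in 0..3 {
        pool.add_prewarmed(format!("https://warm-{}.run.example", i));
    }
    let (container, refill) = pool.get_available();
    println!("{:?} refill={}", container, refill);
    // Some("https://warm-0.run.example") refill=false
}
```

Handing out the oldest instance first keeps every pooled container cycling, so none sits idle long enough to be reclaimed by the platform.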
8.1.3 Session State Persistence
```rust
// src/session/persistence.rs
impl SessionManager {
    pub async fn persist_state(&self, session: &Session) -> Result<()> {
        let key = format!("/{}/sessions/{}", session.tenant_id, session.id);
        let state = SessionState {
            user_id: session.user_id.clone(),
            container_url: session.container_url.clone(),
            workspace_state: session.workspace_state.clone(),
            last_activity: Utc::now(),
            metadata: session.metadata.clone(),
        };
        self.fdb.transact(|txn| {
            txn.set(&key, &serialize(&state)?);
            Ok(())
        }).await?;
        Ok(())
    }

    pub async fn recover_session(&self, session_id: &str) -> Result<Session> {
        // Use the same tenant-scoped key layout as persist_state
        // (assumes the manager carries its tenant context)
        let key = format!("/{}/sessions/{}", self.tenant_id, session_id);
        let mut state: SessionState = self.fdb
            .transact(|txn| {
                let value = txn.get(&key)?;
                Ok(deserialize(&value)?)
            })
            .await?;

        // Relaunch the container if the old one is gone or unhealthy
        if !self.is_container_healthy(&state.container_url).await? {
            let user = self.load_user(&state.user_id).await?;
            let container = self.orchestrator.launch_container(&user).await?;
            state.container_url = container.url;
        }

        Ok(Session::from(state))
    }
}
```
8.1.4 Monitoring Integration
```rust
// src/monitoring/collector.rs
impl LogCollector {
    pub async fn setup_container_logging(&self, container: &Container) -> Result<()> {
        // Point the container's log shipper at Loki
        container.set_env("LOKI_URL", &self.loki_url).await?;
        container.set_env("LOKI_LABELS", "app=coditect,user={user_id}").await?;

        // Start the Promtail agent. In practice the agent should be baked
        // into the workspace image; runtime installation (and systemctl
        // inside a container) is shown here only for illustration.
        container.exec("apt-get install -y promtail").await?;
        container.exec("systemctl start promtail").await?;
        Ok(())
    }
}
```
8.2 API Endpoints
```yaml
/api/v1/auth/login:
  post:
    description: Authenticate user and create session
    response: { session_id, workspace_url }

/api/v1/sessions/{id}:
  get:
    description: Get session details
  delete:
    description: End session and cleanup

/api/v1/containers:
  post:
    description: Launch new container
  get:
    description: List user's containers

/api/admin/metrics:
  get:
    description: Prometheus metrics endpoint

/api/admin/health:
  get:
    description: Health check for all components
```
8.3 Database Schema (FoundationDB)
```text
# Session State
/{tenant_id}/sessions/{session_id} → {
  user_id: string,
  container_url: string,
  created_at: timestamp,
  last_activity: timestamp,
  workspace_state: json
}

# Container Registry
/{tenant_id}/containers/{container_id} → {
  user_id: string,
  status: enum,
  health: enum,
  created_at: timestamp,
  cloud_run_url: string
}

# User → Container Mapping
/{tenant_id}/user_containers/{user_id}/{container_id} → {
  assigned_at: timestamp
}
```
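Centralizing this key layout in a handful of helpers keeps the persist and recover paths from drifting apart. A std-only sketch (function names are hypothetical):

```rust
// Hypothetical helpers that encode the FoundationDB key layout above,
// so every read and write agrees on the same tenant-scoped paths.
fn session_key(tenant_id: &str, session_id: &str) -> String {
    format!("/{}/sessions/{}", tenant_id, session_id)
}

fn container_key(tenant_id: &str, container_id: &str) -> String {
    format!("/{}/containers/{}", tenant_id, container_id)
}

fn user_container_key(tenant_id: &str, user_id: &str, container_id: &str) -> String {
    format!("/{}/user_containers/{}/{}", tenant_id, user_id, container_id)
}

fn main() {
    println!("{}", session_key("acme", "s-123")); // /acme/sessions/s-123
    println!("{}", user_container_key("acme", "u-7", "c-9")); // /acme/user_containers/u-7/c-9
}
```

Because keys are plain strings with a shared prefix per tenant, range reads over `/{tenant_id}/sessions/` enumerate all of a tenant's sessions without touching other tenants' data.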
9. Testing Strategy 🔴 REQUIRED
9.1 Unit Tests
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_container_launch_from_pool() {
        let orchestrator = ContainerOrchestrator::new_test();
        orchestrator.pool.add_prewarmed(5).await;

        let user = User::test_user();
        let container = orchestrator.launch_container(&user).await.unwrap();

        assert!(container.launch_time < Duration::from_secs(1));
        assert_eq!(orchestrator.pool.available_count(), 4);
    }

    #[tokio::test]
    async fn test_session_recovery_after_crash() {
        let session_mgr = SessionManager::new_test();
        let user = User::test_user();
        let session = session_mgr.create_session(&user).await.unwrap();

        // Simulate container crash
        session_mgr.kill_container(&session.container_id).await;

        // Attempt recovery
        let recovered = session_mgr.recover_session(&session.id).await.unwrap();

        assert_eq!(recovered.user_id, session.user_id);
        assert_ne!(recovered.container_id, session.container_id); // New container
        assert_eq!(recovered.workspace_state, session.workspace_state);
    }
}
```
9.2 Integration Tests
```rust
#[tokio::test]
async fn test_full_user_flow() {
    let server = TestServer::start().await;

    // 1. User login
    let login_resp = server.post("/api/v1/auth/login")
        .json(&json!({ "email": "test@example.com", "password": "test" }))
        .send()
        .await
        .unwrap();
    assert_eq!(login_resp.status(), 200);
    let session: SessionResponse = login_resp.json().await.unwrap();

    // 2. Verify container launched
    tokio::time::sleep(Duration::from_secs(3)).await;
    let container_resp = server.get(&format!("/api/v1/sessions/{}", session.id))
        .send()
        .await
        .unwrap();
    assert_eq!(container_resp.status(), 200);
    let details: SessionDetails = container_resp.json().await.unwrap();
    assert_eq!(details.container_status, "ready");
}
```
9.3 Load Tests
```rust
#[tokio::test]
async fn test_concurrent_user_logins() {
    // Arc lets each spawned task hold its own handle to the server
    let server = std::sync::Arc::new(TestServer::start().await);
    let users = generate_test_users(1000);
    let start = Instant::now();

    let handles: Vec<_> = users.into_iter()
        .map(|user| {
            let server = server.clone();
            tokio::spawn(async move {
                let resp = login_user(&server, &user).await;
                assert!(resp.is_ok());
            })
        })
        .collect();

    futures::future::join_all(handles).await;
    assert!(start.elapsed() < Duration::from_secs(30));
}
```
10. Security Considerations 🔴 REQUIRED
10.1 Authentication & Authorization
- JWT tokens with refresh mechanism
- OAuth2 integration for GitHub
- RBAC for admin operations
- API key management for service accounts
10.2 Container Isolation
- Each container runs in isolated namespace
- Network policies prevent container-to-container communication
- Resource quotas enforced
- No privileged operations allowed
10.3 Data Protection
- All FDB data encrypted at rest
- TLS 1.3 for all communications
- Secrets managed via Google Secret Manager
- Regular security scanning of container images
10.4 Audit Trail
```rust
impl AuditLogger {
    pub async fn log_event(&self, event: AuditEvent) -> Result<()> {
        let entry = AuditEntry {
            timestamp: Utc::now(),
            user_id: event.user_id,
            action: event.action,
            resource: event.resource,
            ip_address: event.ip_address,
            user_agent: event.user_agent,
            result: event.result,
        };
        self.fdb.set(
            &format!("audit/{}", entry.timestamp),
            &serialize(&entry)?
        ).await?;
        Ok(())
    }
}
```
11. Performance Characteristics 🔴 REQUIRED
11.1 Targets
- Container launch: < 3 seconds (from pool: < 500ms)
- Session recovery: < 1 second
- API response time: p99 < 100ms
- WebSocket latency: < 50ms
- Concurrent users: 10,000 per region
11.2 Optimization Strategies
- Container Pooling: Pre-warm 10% of peak capacity
- Caching: Redis for hot session data
- Geographic Distribution: Deploy in 3+ regions
- Async Operations: All heavy operations non-blocking
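The pre-warm rule above can be derived directly from observed peak concurrency, with a floor so small deployments still keep a few instances warm. A sketch (function name and floor value are illustrative):

```rust
// Hypothetical pool-sizing rule: pre-warm 10% of peak concurrent users,
// but never fewer than a fixed floor.
fn prewarm_target(peak_concurrent_users: u32, floor: u32) -> u32 {
    let ten_percent = peak_concurrent_users / 10;
    ten_percent.max(floor)
}

fn main() {
    println!("{}", prewarm_target(10_000, 10)); // 1000
    println!("{}", prewarm_target(50, 10));     // 10 (floor wins)
}
```

In practice this target would be recomputed periodically from recent peak metrics rather than fixed at deploy time.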
11.3 Resource Requirements
```yaml
Production_Deployment:
  CODITECT_Server:
    Instances: 3 (HA)
    CPU: 8 cores each
    Memory: 32GB each
    Disk: 100GB SSD
  Container_Pool:
    Pre-warmed: 100
    Max_Containers: 10,000
    Container_Size: 2 CPU, 4GB RAM
  FoundationDB:
    Nodes: 6
    Storage: 1TB each
    Replication: 3x
```
12. Operational Considerations 🔴 REQUIRED
12.1 Deployment
- Blue-green deployment for zero downtime
- Canary releases for new features
- Automated rollback on errors
- Health checks before traffic routing
12.2 Monitoring & Alerting
```yaml
Metrics:
  - container_launch_time
  - session_recovery_time
  - active_containers
  - pool_utilization
  - api_response_time
  - error_rate

Alerts:
  - Container launch > 5 seconds
  - Pool utilization > 80%
  - Error rate > 1%
  - FDB latency > 100ms
```
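Each alert above is a fixed threshold over a collected metric, so the evaluation loop is a handful of comparisons. A std-only sketch using the thresholds from the list (struct and field names are hypothetical):

```rust
// Hypothetical snapshot of the metrics listed above.
struct Metrics {
    container_launch_secs: f64,
    pool_utilization: f64, // 0.0..=1.0
    error_rate: f64,       // 0.0..=1.0
    fdb_latency_ms: f64,
}

// Compare a snapshot against the alert thresholds and collect firings.
fn evaluate_alerts(m: &Metrics) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if m.container_launch_secs > 5.0 {
        alerts.push("container launch > 5 seconds");
    }
    if m.pool_utilization > 0.80 {
        alerts.push("pool utilization > 80%");
    }
    if m.error_rate > 0.01 {
        alerts.push("error rate > 1%");
    }
    if m.fdb_latency_ms > 100.0 {
        alerts.push("FDB latency > 100ms");
    }
    alerts
}

fn main() {
    let m = Metrics {
        container_launch_secs: 6.2,
        pool_utilization: 0.5,
        error_rate: 0.002,
        fdb_latency_ms: 40.0,
    };
    println!("{:?}", evaluate_alerts(&m)); // ["container launch > 5 seconds"]
}
```

The production version would read these values from Prometheus and add hysteresis so a metric hovering at a threshold does not flap.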
12.3 Disaster Recovery
- FDB backups every 6 hours
- Session state replicated across regions
- RTO: 15 minutes
- RPO: 1 hour
13. Migration Strategy 🔴 REQUIRED
13.1 Phase 1: Foundation (Week 1-2)
- Deploy CODITECT Server infrastructure
- Implement basic container orchestration
- Connect to existing auth system
- Basic monitoring setup
13.2 Phase 2: Integration (Week 3-4)
- Migrate session management
- Implement state persistence
- Container pooling system
- Grafana dashboard creation
13.3 Phase 3: Optimization (Week 5-6)
- Multi-region deployment
- Advanced pooling strategies
- Cost optimization rules
- Full monitoring stack
13.4 Rollback Plan
Each phase can be rolled back independently:
- Keep existing systems running in parallel
- Gradual traffic migration
- Instant rollback via load balancer
14. Consequences 🔴 REQUIRED
14.1 Positive
- ✅ Instant workspace access (< 3 seconds)
- ✅ Automatic state persistence
- ✅ ~67% cost reduction through pooling (see Appendix cost analysis)
- ✅ Complete observability
- ✅ Seamless scaling to 10K+ users
14.2 Negative
- ❌ Increased architectural complexity
- ❌ Additional operational overhead
- ❌ Single point of failure risk
- ❌ Initial development investment
14.3 Risks
- Complexity Risk: Mitigated by phased implementation
- Performance Risk: Mitigated by extensive load testing
- Security Risk: Mitigated by defense-in-depth approach
- Cost Risk: Mitigated by usage-based scaling
15. References & Standards 🔴 REQUIRED
15.1 Related ADRs
- ADR-001: Foundation Architecture
- ADR-019: Prompt Engine Server Architecture
- ADR-052-v3: Hybrid Container Architecture (reference)
- ADR-066-v3: Ephemeral Workspace Integration (reference)
15.2 External Standards
15.3 Technologies
- Rust 1.75+
- Actix-web 4.0
- FoundationDB 7.1
- Google Cloud Run
- Grafana/Loki/Prometheus Stack
16. Appendix 🟡 OPTIONAL
16.1 Proof of Concept Results
- POC demonstrated 2.8 second container launch
- Successfully recovered 100 sessions after crash
- Handled 1000 concurrent users
16.2 Cost Analysis
Current Monthly Costs (1000 users):
- Always-on VMs: $50,000
- Storage: $5,000
- Total: $55,000
Projected with CODITECT Server:
- Container pool: $10,000
- Orchestration: $5,000
- Storage: $3,000
- Total: $18,000 (67% savings)
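The savings figure follows directly from the two monthly totals:

```rust
// Percentage saved going from the current to the projected monthly total.
fn savings_percent(current: f64, projected: f64) -> f64 {
    (current - projected) / current * 100.0
}

fn main() {
    // $55,000 current vs $18,000 projected, per the figures above.
    println!("{:.0}%", savings_percent(55_000.0, 18_000.0)); // 67%
}
```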
17. Review & Approval 🔴 REQUIRED
| Role | Name | Date | Approval |
|---|---|---|---|
| Lead Architect | TBD | - | [ ] |
| Security Lead | TBD | - | [ ] |
| Platform Lead | TBD | - | [ ] |
| DevOps Lead | TBD | - | [ ] |
18. QA Review Block 🔴 REQUIRED
```yaml
qa_review:
  template_version: "v4.0"
  review_date: "TBD"
  reviewer: "TBD"
  checklist:
    - "[ ] All required sections present"
    - "[ ] User story clearly defined"
    - "[ ] Technical implementation complete"
    - "[ ] Testing strategy comprehensive"
    - "[ ] Security addressed"
    - "[ ] Performance targets defined"
    - "[ ] Migration plan clear"
  scores:
    clarity: 0/5
    completeness: 0/5
    implementability: 0/5
    testing: 0/5
    total_score: 0/20
  status: "PENDING REVIEW"
```