
ADR-020: CODITECT Server - Central Orchestration System (v4)

1. Document Information 🔴 REQUIRED

| Field | Value |
|---|---|
| ADR Number | ADR-020 |
| Title | CODITECT Server - Central Orchestration System |
| Status | Draft |
| Date Created | 2025-08-30 |
| Last Modified | 2025-08-30 |
| Version | 1.0 |
| Decision Makers | Architecture Team |
| Stakeholders | Platform Team, DevOps, Security, All Users |

2. Purpose of this ADR 🔴 REQUIRED

This ADR serves dual purposes:

  • Part 1 - Narrative (Layperson): Explains CODITECT Server as a hotel management system that assigns rooms (containers) to guests (users)
  • Part 2 - Technical (Agentic): Provides exact implementation specifications for building the orchestration layer

3. User Story Context 🔴 REQUIRED

"As a CODITECT user, I want to log in once and instantly access my personalized development environment from any browser, with all my work automatically saved and my session surviving browser crashes, so that I can focus on coding without worrying about infrastructure or losing work."

Acceptance Criteria:

  • Login to workspace in < 3 seconds
  • Automatic session recovery on disconnect
  • Zero data loss on container restart
  • Seamless experience across devices
  • No local installation required

4. Executive Summary 🔴 REQUIRED

Y-Statement

In the context of distributed development environments, facing the challenge of providing instant, isolated, cost-effective workspaces, we decided to create CODITECT Server as a central orchestration system to achieve instant container provisioning, automatic state persistence, and centralized management, accepting increased architectural complexity.

Key Points:

  • Transforms ephemeral containers into persistent workspaces
  • Manages entire lifecycle: auth → container → state → monitoring
  • Enables 10,000+ concurrent users with sub-3-second startup
  • Reduces infrastructure costs by 70% through intelligent pooling

5. Visual Overview 🔴 REQUIRED

6. Background & Problem 🔴 REQUIRED

6.1 Current Problems

  1. Fragmented Architecture: Separate API, WebSocket, Auth services
  2. Manual Container Management: No automated lifecycle handling
  3. No State Persistence: Work lost on container restart
  4. Poor Resource Utilization: Containers run idle
  5. Limited Observability: Difficult to debug issues

6.2 Root Causes

  • Lack of central orchestration
  • No unified session management
  • Missing state persistence layer
  • No resource optimization strategy

6.3 Impact

  • High operational costs ($50K/month for 1000 users)
  • Poor user experience (5-minute startup times)
  • Data loss incidents
  • Limited scalability

7. Decision 🔴 REQUIRED

7.1 Chosen Solution

Implement CODITECT Server as the central orchestration system ("mothership") that manages all platform operations through intelligent container orchestration, state persistence, and unified monitoring.

7.2 Core Components

7.2.1 Container Orchestrator

pub struct ContainerOrchestrator {
    pool_manager: ContainerPool,
    lifecycle_manager: LifecycleManager,
    health_checker: HealthChecker,
    cost_optimizer: CostOptimizer,
}
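
The pool manager referenced above can be sketched with std types only. This is an illustrative stand-in, not the actual implementation: `PooledContainer`, `add_prewarmed`, and `get_available` are assumed names chosen to match the test cases later in this ADR.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a pre-warmed Cloud Run instance.
#[derive(Debug, Clone, PartialEq)]
pub struct PooledContainer {
    pub id: String,
}

// Minimal pool sketch: a FIFO queue of pre-warmed containers.
pub struct ContainerPool {
    idle: VecDeque<PooledContainer>,
}

impl ContainerPool {
    pub fn new() -> Self {
        Self { idle: VecDeque::new() }
    }

    // Pre-warm `n` containers so user launches can skip the cold start.
    pub fn add_prewarmed(&mut self, n: usize) {
        for i in 0..n {
            self.idle.push_back(PooledContainer { id: format!("prewarmed-{i}") });
        }
    }

    // Pop a ready container if one exists; the caller falls back to a cold launch.
    pub fn get_available(&mut self) -> Option<PooledContainer> {
        self.idle.pop_front()
    }

    pub fn available_count(&self) -> usize {
        self.idle.len()
    }
}
```

The real pool would additionally track container health and expiry; this sketch only captures the take-one-and-replenish contract the orchestrator relies on.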

7.2.2 Session Manager

pub struct SessionManager {
    active_sessions: HashMap<UserId, Session>,
    persistence: FoundationDBClient,
    recovery_engine: RecoveryEngine,
}

7.2.3 Request Router

pub struct RequestRouter {
    routing_table: RoutingTable,
    load_balancer: LoadBalancer,
    failover_handler: FailoverHandler,
}
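
One plausible policy for the load balancer inside RequestRouter is least-active-sessions. The ADR does not specify a balancing algorithm, so the `Backend` struct and policy below are assumptions for illustration:

```rust
// Hypothetical backend entry: a container URL plus its active session count.
#[derive(Debug, Clone)]
pub struct Backend {
    pub url: String,
    pub active_sessions: usize,
}

// Pick the backend with the fewest active sessions (least-loaded policy).
// Returns None when no backends are registered.
pub fn least_loaded(backends: &[Backend]) -> Option<&Backend> {
    backends.iter().min_by_key(|b| b.active_sessions)
}
```

A production router would also weight by region and health status; least-loaded is simply the easiest policy to reason about for a first implementation.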

7.3 Alternatives Considered

  1. Direct Container Access: Rejected - no central management
  2. Kubernetes Operator: Rejected - too complex for users
  3. Serverless Functions: Rejected - cold start issues
  4. Traditional VMs: Rejected - too expensive

8. Implementation Blueprint 🔴 REQUIRED

8.1 Architecture Components

8.1.1 Main Server Entry Point

// src/main.rs
use actix_web::{web, App, HttpServer};
use crate::orchestrator::ContainerOrchestrator;
use crate::session::SessionManager;
use crate::router::RequestRouter;

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Initialize components; their errors are not io::Error, so fail fast here
    let orchestrator = ContainerOrchestrator::new().await.expect("orchestrator init");
    let session_mgr = SessionManager::new().await.expect("session manager init");
    let router = RequestRouter::new().await.expect("router init");

    // Start HTTP server; each worker clones the shared components
    HttpServer::new(move || {
        App::new()
            .app_data(web::Data::new(orchestrator.clone()))
            .app_data(web::Data::new(session_mgr.clone()))
            .app_data(web::Data::new(router.clone()))
            .service(auth_routes())
            .service(container_routes())
            .service(session_routes())
            .service(admin_routes())
    })
    .bind("0.0.0.0:8080")?
    .run()
    .await
}

8.1.2 Container Lifecycle Management

// src/orchestrator/lifecycle.rs
impl ContainerOrchestrator {
    pub async fn launch_container(&self, user: &User) -> Result<Container> {
        // 1. Check container pool for a pre-warmed instance
        if let Some(container) = self.pool_manager.get_available().await? {
            self.assign_to_user(&container, user).await?;
            return Ok(container);
        }

        // 2. Launch a new Cloud Run container
        let container = self.cloud_run_client
            .create_service(&ServiceConfig {
                name: format!("user-{}", user.id),
                image: "coditect/workspace:latest",
                memory: "4Gi",
                cpu: "2",
                env: vec![
                    ("USER_ID", &user.id),
                    ("FDB_CLUSTER", &self.fdb_cluster),
                ],
                max_instances: 1,
                min_instances: 0,
            })
            .await?;

        // 3. Wait for the container to become ready
        self.wait_for_ready(&container).await?;

        // 4. Initialize CODI inside the container
        self.init_codi(&container, user).await?;

        Ok(container)
    }
}

8.1.3 Session State Persistence

// src/session/persistence.rs
impl SessionManager {
    pub async fn persist_state(&self, session: &Session) -> Result<()> {
        let key = format!("/{}/sessions/{}", session.tenant_id, session.id);

        let state = SessionState {
            user_id: session.user_id.clone(),
            container_url: session.container_url.clone(),
            workspace_state: session.workspace_state.clone(),
            last_activity: Utc::now(),
            metadata: session.metadata.clone(),
        };

        self.fdb.transact(|txn| {
            txn.set(&key, &serialize(&state)?);
            Ok(())
        }).await?;

        Ok(())
    }

    pub async fn recover_session(&self, session_id: &str) -> Result<Session> {
        // Simplified key; production keys are tenant-scoped per the schema in 8.3
        let key = format!("sessions/{}", session_id);

        // `state` must be mutable: the container URL is replaced on relaunch
        let mut state: SessionState = self.fdb
            .transact(|txn| {
                let value = txn.get(&key)?;
                Ok(deserialize(&value)?)
            })
            .await?;

        // Relaunch the container if the persisted one is gone or unhealthy
        if !self.is_container_healthy(&state.container_url).await? {
            // `load_user` is assumed here: launch_container takes a &User,
            // but SessionState only persists the user id
            let user = self.load_user(&state.user_id).await?;
            let container = self.orchestrator.launch_container(&user).await?;
            state.container_url = container.url;
        }

        Ok(Session::from(state))
    }
}

8.1.4 Monitoring Integration

// src/monitoring/collector.rs
impl LogCollector {
    pub async fn setup_container_logging(&self, container: &Container) -> Result<()> {
        // Configure container to send logs to Loki
        container.set_env("LOKI_URL", &self.loki_url).await?;
        container.set_env("LOKI_LABELS", "app=coditect,user={user_id}").await?;

        // Install Promtail agent
        container.exec("apt-get install -y promtail").await?;
        container.exec("systemctl start promtail").await?;

        Ok(())
    }
}

8.2 API Endpoints

/api/v1/auth/login:
  post:
    description: Authenticate user and create session
    response: { session_id, workspace_url }

/api/v1/sessions/{id}:
  get:
    description: Get session details
  delete:
    description: End session and cleanup

/api/v1/containers:
  post:
    description: Launch new container
  get:
    description: List user's containers

/api/admin/metrics:
  get:
    description: Prometheus metrics endpoint

/api/admin/health:
  get:
    description: Health check for all components

8.3 Database Schema (FoundationDB)

# Session State
/{tenant_id}/sessions/{session_id} → {
  user_id: string,
  container_url: string,
  created_at: timestamp,
  last_activity: timestamp,
  workspace_state: json
}

# Container Registry
/{tenant_id}/containers/{container_id} → {
  user_id: string,
  status: enum,
  health: enum,
  created_at: timestamp,
  cloud_run_url: string
}

# User → Container Mapping
/{tenant_id}/user_containers/{user_id}/{container_id} → {
  assigned_at: timestamp
}
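
The keyspace layout above can be exercised with small helpers. The function names are illustrative, not part of the actual codebase:

```rust
// Build the FoundationDB key for a session, following the
// /{tenant_id}/sessions/{session_id} layout above.
pub fn session_key(tenant_id: &str, session_id: &str) -> String {
    format!("/{tenant_id}/sessions/{session_id}")
}

// Build the user -> container mapping key, following the
// /{tenant_id}/user_containers/{user_id}/{container_id} layout.
pub fn user_container_key(tenant_id: &str, user_id: &str, container_id: &str) -> String {
    format!("/{tenant_id}/user_containers/{user_id}/{container_id}")
}
```

Centralizing key construction like this keeps the tenant prefix consistent between the persist and recover paths.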

9. Testing Strategy 🔴 REQUIRED

9.1 Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_container_launch_from_pool() {
        let orchestrator = ContainerOrchestrator::new_test();
        orchestrator.pool_manager.add_prewarmed(5).await;

        let user = User::test_user();
        let container = orchestrator.launch_container(&user).await.unwrap();

        assert!(container.launch_time < Duration::from_secs(1));
        assert_eq!(orchestrator.pool_manager.available_count(), 4);
    }

    #[tokio::test]
    async fn test_session_recovery_after_crash() {
        let session_mgr = SessionManager::new_test();
        let user = User::test_user();
        let session = session_mgr.create_session(&user).await.unwrap();

        // Simulate a container crash
        session_mgr.kill_container(&session.container_id).await;

        // Attempt recovery
        let recovered = session_mgr.recover_session(&session.id).await.unwrap();

        assert_eq!(recovered.user_id, session.user_id);
        assert_ne!(recovered.container_id, session.container_id); // New container
        assert_eq!(recovered.workspace_state, session.workspace_state);
    }
}

9.2 Integration Tests

#[tokio::test]
async fn test_full_user_flow() {
    let server = TestServer::start().await;

    // 1. User login
    let login_resp = server.post("/api/v1/auth/login")
        .json(&json!({ "email": "test@example.com", "password": "test" }))
        .send()
        .await
        .unwrap();

    assert_eq!(login_resp.status(), 200);
    let session: SessionResponse = login_resp.json().await.unwrap();

    // 2. Verify the container launched
    tokio::time::sleep(Duration::from_secs(3)).await;

    let container_resp = server.get(&format!("/api/v1/sessions/{}", session.id))
        .send()
        .await
        .unwrap();

    assert_eq!(container_resp.status(), 200);
    let details: SessionDetails = container_resp.json().await.unwrap();
    assert_eq!(details.container_status, "ready");
}

9.3 Load Tests

#[tokio::test]
async fn test_concurrent_user_logins() {
    let server = TestServer::start().await;
    let users = generate_test_users(1000);

    let start = Instant::now();
    let handles: Vec<_> = users.into_iter()
        .map(|user| {
            // Clone the handle so each spawned task owns its own reference
            // (assumes TestServer is cheaply cloneable)
            let server = server.clone();
            tokio::spawn(async move {
                let resp = login_user(&server, &user).await;
                assert!(resp.is_ok());
            })
        })
        .collect();

    futures::future::join_all(handles).await;

    assert!(start.elapsed() < Duration::from_secs(30));
}

10. Security Considerations 🔴 REQUIRED

10.1 Authentication & Authorization

  • JWT tokens with refresh mechanism
  • OAuth2 integration for GitHub
  • RBAC for admin operations
  • API key management for service accounts

10.2 Container Isolation

  • Each container runs in isolated namespace
  • Network policies prevent container-to-container communication
  • Resource quotas enforced
  • No privileged operations allowed

10.3 Data Protection

  • All FDB data encrypted at rest
  • TLS 1.3 for all communications
  • Secrets managed via Google Secret Manager
  • Regular security scanning of container images

10.4 Audit Trail

impl AuditLogger {
    pub async fn log_event(&self, event: AuditEvent) -> Result<()> {
        let entry = AuditEntry {
            timestamp: Utc::now(),
            user_id: event.user_id,
            action: event.action,
            resource: event.resource,
            ip_address: event.ip_address,
            user_agent: event.user_agent,
            result: event.result,
        };

        self.fdb.set(
            &format!("audit/{}", entry.timestamp),
            &serialize(&entry)?,
        ).await?;

        Ok(())
    }
}

11. Performance Characteristics 🔴 REQUIRED

11.1 Targets

  • Container launch: < 3 seconds (from pool: < 500ms)
  • Session recovery: < 1 second
  • API response time: p99 < 100ms
  • WebSocket latency: < 50ms
  • Concurrent users: 10,000 per region

11.2 Optimization Strategies

  1. Container Pooling: Pre-warm 10% of peak capacity
  2. Caching: Redis for hot session data
  3. Geographic Distribution: Deploy in 3+ regions
  4. Async Operations: All heavy operations non-blocking
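
The 10% pre-warm rule in strategy 1 translates to a one-line sizing function. Rounding up so that small deployments still keep at least one warm container is an assumption, not stated in this ADR:

```rust
// Pre-warm 10% of observed peak capacity, rounded up so that
// any non-zero deployment keeps at least one warm container.
pub fn prewarm_target(peak_concurrent_users: usize) -> usize {
    (peak_concurrent_users + 9) / 10
}
```

With the 1,000-user peak from Section 6.3, this yields the pool of 100 pre-warmed containers listed in Section 11.3.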

11.3 Resource Requirements

Production Deployment:
  CODITECT_Server:
    Instances: 3 (HA)
    CPU: 8 cores each
    Memory: 32GB each
    Disk: 100GB SSD

  Container_Pool:
    Pre-warmed: 100
    Max_Containers: 10,000
    Container_Size: 2 CPU, 4GB RAM

  FoundationDB:
    Nodes: 6
    Storage: 1TB each
    Replication: 3x

12. Operational Considerations 🔴 REQUIRED

12.1 Deployment

  • Blue-green deployment for zero downtime
  • Canary releases for new features
  • Automated rollback on errors
  • Health checks before traffic routing

12.2 Monitoring & Alerting

Metrics:
- container_launch_time
- session_recovery_time
- active_containers
- pool_utilization
- api_response_time
- error_rate

Alerts:
- Container launch > 5 seconds
- Pool utilization > 80%
- Error rate > 1%
- FDB latency > 100ms
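
The alert rules above can be expressed as a plain predicate over a metrics snapshot. The struct, field names, and alert labels below mirror the list but are illustrative assumptions:

```rust
// Snapshot of the metrics the alert rules reference.
pub struct MetricsSnapshot {
    pub container_launch_secs: f64,
    pub pool_utilization: f64, // fraction, 0.0..=1.0
    pub error_rate: f64,       // fraction, 0.0..=1.0
    pub fdb_latency_ms: f64,
}

// Return the names of the alert rules currently firing,
// using the thresholds listed above.
pub fn firing_alerts(m: &MetricsSnapshot) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if m.container_launch_secs > 5.0 { alerts.push("container_launch_slow"); }
    if m.pool_utilization > 0.80 { alerts.push("pool_utilization_high"); }
    if m.error_rate > 0.01 { alerts.push("error_rate_high"); }
    if m.fdb_latency_ms > 100.0 { alerts.push("fdb_latency_high"); }
    alerts
}
```

In production these thresholds would live in Prometheus alerting rules rather than application code; the function simply documents the intended semantics.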

12.3 Disaster Recovery

  • FDB backups every 6 hours
  • Session state replicated across regions
  • RTO: 15 minutes
  • RPO: 1 hour

13. Migration Strategy 🔴 REQUIRED

13.1 Phase 1: Foundation (Week 1-2)

  1. Deploy CODITECT Server infrastructure
  2. Implement basic container orchestration
  3. Connect to existing auth system
  4. Basic monitoring setup

13.2 Phase 2: Integration (Week 3-4)

  1. Migrate session management
  2. Implement state persistence
  3. Container pooling system
  4. Grafana dashboard creation

13.3 Phase 3: Optimization (Week 5-6)

  1. Multi-region deployment
  2. Advanced pooling strategies
  3. Cost optimization rules
  4. Full monitoring stack

13.4 Rollback Plan

Each phase can be rolled back independently:

  • Keep existing systems running in parallel
  • Gradual traffic migration
  • Instant rollback via load balancer

14. Consequences 🔴 REQUIRED

14.1 Positive

  • ✅ Instant workspace access (< 3 seconds)
  • ✅ Automatic state persistence
  • ✅ 70% cost reduction through pooling
  • ✅ Complete observability
  • ✅ Seamless scaling to 10K+ users

14.2 Negative

  • ❌ Increased architectural complexity
  • ❌ Additional operational overhead
  • ❌ Single point of failure risk
  • ❌ Initial development investment

14.3 Risks

  1. Complexity Risk: Mitigated by phased implementation
  2. Performance Risk: Mitigated by extensive load testing
  3. Security Risk: Mitigated by defense-in-depth approach
  4. Cost Risk: Mitigated by usage-based scaling

15. References & Standards 🔴 REQUIRED

15.1 Related ADRs

  • ADR-001: Foundation Architecture
  • ADR-019: Prompt Engine Server Architecture
  • ADR-052-v3: Hybrid Container Architecture (reference)
  • ADR-066-v3: Ephemeral Workspace Integration (reference)

15.2 Technologies

  • Rust 1.75+
  • Actix-web 4.0
  • FoundationDB 7.1
  • Google Cloud Run
  • Grafana/Loki/Prometheus Stack

16. Appendix 🟡 OPTIONAL

16.1 Proof of Concept Results

  • POC demonstrated 2.8 second container launch
  • Successfully recovered 100 sessions after crash
  • Handled 1000 concurrent users

16.2 Cost Analysis

Current Monthly Costs (1000 users):
- Always-on VMs: $50,000
- Storage: $5,000
- Total: $55,000

Projected with CODITECT Server:
- Container pool: $10,000
- Orchestration: $5,000
- Storage: $3,000
- Total: $18,000 (67% savings)
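
The savings figure follows directly from the two totals; a trivial check:

```rust
// Savings between the current and projected monthly totals,
// as a percentage of the current total.
pub fn savings_percent(current: f64, projected: f64) -> f64 {
    (1.0 - projected / current) * 100.0
}
```

For $55,000 current versus $18,000 projected this gives ~67.3%, matching the 67% quoted above (the executive summary's "70%" is the rounded headline figure).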

17. Review & Approval 🔴 REQUIRED

| Role | Name | Date | Approval |
|---|---|---|---|
| Lead Architect | TBD | - | [ ] |
| Security Lead | TBD | - | [ ] |
| Platform Lead | TBD | - | [ ] |
| DevOps Lead | TBD | - | [ ] |

18. QA Review Block 🔴 REQUIRED

qa_review:
  template_version: "v4.0"
  review_date: "TBD"
  reviewer: "TBD"
  checklist:
    - [ ] All required sections present
    - [ ] User story clearly defined
    - [ ] Technical implementation complete
    - [ ] Testing strategy comprehensive
    - [ ] Security addressed
    - [ ] Performance targets defined
    - [ ] Migration plan clear
  scores:
    clarity: 0/5
    completeness: 0/5
    implementability: 0/5
    testing: 0/5
  total_score: 0/20
  status: "PENDING REVIEW"