ADR-008-v4: Monitoring & Observability - Part 2 (Technical)

Document: ADR-008-v4-monitoring-observability-part2-technical
Version: 2.0.0
Purpose: Technical implementation for comprehensive CODITECT monitoring and observability
Audience: Engineering teams, AI agents, DevOps engineers, SRE teams
Date Created: 2025-08-31
Date Modified: 2025-09-03
Status: UPDATED_FOR_STATEFULSETS
Changes: Added StatefulSet monitoring, PVC metrics, and workspace observability


Dependencies

# Cargo.toml monitoring dependencies
[dependencies]
prometheus = "0.13.3"
tracing = "0.1.40"
tracing-subscriber = { version = "0.3.18", features = ["json", "env-filter"] }
tracing-actix-web = "0.7.9"
tracing-opentelemetry = "0.21.0"
opentelemetry = { version = "0.20.0", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.13.0", features = ["tonic"] }
serde = { version = "1.0.193", features = ["derive"] }
serde_json = "1.0.108"
tokio = { version = "1.35.0", features = ["full"] }
actix-web-prom = "0.7.0"
uuid = { version = "1.6.1", features = ["v4", "serde"] }
chrono = { version = "0.4.31", features = ["serde"] }
lazy_static = "1.4.0" # used by the metrics registry below
anyhow = "1.0.75"     # error type used by the monitoring collectors

# Kubernetes monitoring
k8s-openapi = { version = "0.20.0", features = ["v1_28"] }
kube = { version = "0.87.0", features = ["runtime", "derive"] }

Metrics Architecture

Core Metrics Collection

// Location: src/monitoring/metrics.rs
use lazy_static::lazy_static;
use prometheus::{Counter, Gauge, GaugeVec, Histogram, HistogramVec, IntGauge, Opts, Registry};

lazy_static! {
    // HTTP Request Metrics
    pub static ref HTTP_REQUESTS_TOTAL: Counter = Counter::new(
        "coditect_http_requests_total",
        "Total HTTP requests"
    ).unwrap();

    pub static ref HTTP_REQUEST_DURATION: Histogram = Histogram::with_opts(
        prometheus::HistogramOpts::new(
            "coditect_http_request_duration_seconds",
            "HTTP request latency distribution"
        ).buckets(vec![0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
    ).unwrap();

    // Business Metrics
    pub static ref ACTIVE_TENANTS: IntGauge = IntGauge::new(
        "coditect_active_tenants_total",
        "Number of active tenants"
    ).unwrap();

    pub static ref AI_REQUESTS: Counter = Counter::new(
        "coditect_ai_requests_total",
        "Total AI provider requests"
    ).unwrap();

    pub static ref WORKFLOW_EXECUTIONS: Counter = Counter::new(
        "coditect_workflow_executions_total",
        "Total workflow executions"
    ).unwrap();

    // StatefulSet Metrics
    pub static ref ACTIVE_WORKSPACES: IntGauge = IntGauge::new(
        "coditect_active_workspaces_total",
        "Number of active StatefulSet workspaces"
    ).unwrap();

    // HistogramVec, not Histogram: startup time is recorded per tenant/user label pair.
    pub static ref WORKSPACE_STARTUP_TIME: HistogramVec = HistogramVec::new(
        prometheus::HistogramOpts::new(
            "coditect_workspace_startup_seconds",
            "Time to start a StatefulSet workspace"
        ).buckets(vec![1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0]),
        &["tenant_id", "user_id"]
    ).unwrap();

    // GaugeVec, not Gauge: PVC usage is recorded per workspace/volume/tenant.
    pub static ref PVC_USAGE: GaugeVec = GaugeVec::new(
        Opts::new("coditect_pvc_usage_bytes", "PersistentVolumeClaim usage in bytes"),
        &["workspace_id", "volume", "tenant_id"]
    ).unwrap();

    pub static ref PVC_IOPS: Counter = Counter::new(
        "coditect_pvc_iops_total",
        "Total IOPS on PersistentVolumeClaims"
    ).unwrap();

    // System Health Metrics
    pub static ref DATABASE_CONNECTIONS: IntGauge = IntGauge::new(
        "coditect_database_connections",
        "Active FoundationDB connections"
    ).unwrap();

    pub static ref MEMORY_USAGE: Gauge = Gauge::new(
        "coditect_memory_usage_bytes",
        "Memory usage in bytes"
    ).unwrap();

    pub static ref CPU_USAGE: Gauge = Gauge::new(
        "coditect_cpu_usage_percent",
        "CPU usage percentage"
    ).unwrap();
}

pub fn init_metrics() -> Registry {
    let registry = Registry::new();
    register_all_metrics(&registry);
    registry
}

fn register_all_metrics(registry: &Registry) {
    registry.register(Box::new(HTTP_REQUESTS_TOTAL.clone())).unwrap();
    registry.register(Box::new(HTTP_REQUEST_DURATION.clone())).unwrap();
    registry.register(Box::new(ACTIVE_TENANTS.clone())).unwrap();
    registry.register(Box::new(AI_REQUESTS.clone())).unwrap();
    registry.register(Box::new(WORKFLOW_EXECUTIONS.clone())).unwrap();
    registry.register(Box::new(DATABASE_CONNECTIONS.clone())).unwrap();
    registry.register(Box::new(MEMORY_USAGE.clone())).unwrap();
    registry.register(Box::new(CPU_USAGE.clone())).unwrap();
    registry.register(Box::new(ACTIVE_WORKSPACES.clone())).unwrap();
    registry.register(Box::new(WORKSPACE_STARTUP_TIME.clone())).unwrap();
    registry.register(Box::new(PVC_USAGE.clone())).unwrap();
    registry.register(Box::new(PVC_IOPS.clone())).unwrap();
}
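Prometheus scrapes this registry through a /metrics endpoint, typically rendered with `prometheus::TextEncoder`. To make the wire format concrete, here is a stdlib-only sketch of how a single gauge sample is rendered in the text exposition format (`render_gauge` is an illustrative helper, not part of the codebase):

```rust
/// Render one gauge-style sample in the Prometheus text exposition format.
/// This mirrors what `prometheus::TextEncoder` emits for the registry above.
fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n")
}

fn main() {
    let body = render_gauge(
        "coditect_active_tenants_total",
        "Number of active tenants",
        42.0,
    );
    print!("{body}");
    // # HELP coditect_active_tenants_total Number of active tenants
    // # TYPE coditect_active_tenants_total gauge
    // coditect_active_tenants_total 42
}
```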

Custom Business Metrics

// Location: src/monitoring/business_metrics.rs
use prometheus::{CounterVec, HistogramVec};
use uuid::Uuid;

// Labelled metric families (CounterVec/HistogramVec), since the methods
// below record per-tenant and per-provider label values.
pub struct BusinessMetrics {
    pub revenue_events: CounterVec,     // labels: tenant_id, event_type
    pub user_actions: CounterVec,       // labels: tenant_id, action
    pub ai_cost_tracking: HistogramVec, // label: provider
}

impl BusinessMetrics {
    // `_amount_cents` is currently unused; a revenue-amount histogram would consume it.
    pub fn record_revenue_event(&self, tenant_id: &Uuid, _amount_cents: u64) {
        self.revenue_events
            .with_label_values(&[&tenant_id.to_string(), "subscription"])
            .inc();
    }

    pub fn record_ai_usage(&self, provider: &str, cost_cents: u64) {
        // Buckets are denominated in dollars; convert from cents.
        self.ai_cost_tracking
            .with_label_values(&[provider])
            .observe(cost_cents as f64 / 100.0);
    }
}

StatefulSet Workspace Monitoring

// Location: src/monitoring/workspace_metrics.rs
use std::collections::BTreeMap;

use anyhow::Result;
use k8s_openapi::api::{apps::v1::StatefulSet, core::v1::{PersistentVolumeClaim, Pod}};
use kube::{Api, Client};
use prometheus::Registry;
use tracing::warn;

use crate::monitoring::metrics::{ACTIVE_WORKSPACES, PVC_USAGE, WORKSPACE_STARTUP_TIME};

pub struct WorkspaceMonitor {
    k8s_client: Client,
    metrics_registry: Registry,
}

impl WorkspaceMonitor {
    pub async fn collect_workspace_metrics(&self) -> Result<()> {
        let statefulsets: Api<StatefulSet> = Api::namespaced(
            self.k8s_client.clone(),
            "coditect"
        );

        let ss_list = statefulsets.list(&Default::default()).await?;
        ACTIVE_WORKSPACES.set(ss_list.items.len() as i64);

        for ss in ss_list.items {
            let workspace_id = ss.metadata.name.unwrap_or_default();
            let labels = ss.metadata.labels.unwrap_or_default();

            // Collect pod metrics
            self.collect_pod_metrics(&workspace_id, &labels).await?;

            // Collect PVC metrics
            self.collect_pvc_metrics(&workspace_id, &labels).await?;
        }

        Ok(())
    }

    async fn collect_pod_metrics(&self, workspace_id: &str, labels: &BTreeMap<String, String>) -> Result<()> {
        let pods: Api<Pod> = Api::namespaced(self.k8s_client.clone(), "coditect");
        // StatefulSet pods are named <name>-<ordinal>; workspaces run a single replica.
        let pod_name = format!("{}-0", workspace_id);

        if let Ok(pod) = pods.get(&pod_name).await {
            if let Some(status) = pod.status {
                // Startup time: interval from pod start to the Ready condition.
                if let Some(start_time) = status.start_time {
                    if let Some(condition) = status.conditions
                        .unwrap_or_default()
                        .iter()
                        .find(|c| c.type_ == "Ready" && c.status == "True")
                    {
                        let startup_duration = condition.last_transition_time
                            .as_ref()
                            .and_then(|t| t.0.signed_duration_since(start_time.0).to_std().ok());

                        if let Some(duration) = startup_duration {
                            WORKSPACE_STARTUP_TIME
                                .with_label_values(&[
                                    labels.get("tenant-id").map(String::as_str).unwrap_or("unknown"),
                                    labels.get("user-id").map(String::as_str).unwrap_or("unknown"),
                                ])
                                .observe(duration.as_secs_f64());
                        }
                    }
                }

                // Log container restarts
                if let Some(containers) = status.container_statuses {
                    for container in containers {
                        if container.restart_count > 0 {
                            warn!(
                                workspace_id = %workspace_id,
                                container = %container.name,
                                restarts = container.restart_count,
                                "Container has restarted"
                            );
                        }
                    }
                }
            }
        }

        Ok(())
    }

    async fn collect_pvc_metrics(&self, workspace_id: &str, labels: &BTreeMap<String, String>) -> Result<()> {
        let pvcs: Api<PersistentVolumeClaim> = Api::namespaced(self.k8s_client.clone(), "coditect");

        // Check workspace PVC
        let workspace_pvc = format!("workspace-{}", workspace_id);
        if let Ok(pvc) = pvcs.get(&workspace_pvc).await {
            if let Some(status) = pvc.status {
                if let Some(capacity) = status.capacity {
                    if let Some(storage) = capacity.get("storage") {
                        // Note: status.capacity is the provisioned size, not live
                        // usage; live usage requires kubelet volume stats.
                        let bytes = parse_k8s_quantity(&storage.0);
                        PVC_USAGE
                            .with_label_values(&[
                                workspace_id,
                                "workspace",
                                labels.get("tenant-id").map(String::as_str).unwrap_or("unknown"),
                            ])
                            .set(bytes as f64);
                    }
                }
            }
        }

        Ok(())
    }
}

fn parse_k8s_quantity(quantity: &str) -> u64 {
    // Minimal parser for binary-suffix quantities like "10Gi", "500Mi".
    if quantity.ends_with("Gi") {
        quantity.trim_end_matches("Gi").parse::<u64>().unwrap_or(0) * 1024 * 1024 * 1024
    } else if quantity.ends_with("Mi") {
        quantity.trim_end_matches("Mi").parse::<u64>().unwrap_or(0) * 1024 * 1024
    } else if quantity.ends_with("Ki") {
        quantity.trim_end_matches("Ki").parse::<u64>().unwrap_or(0) * 1024
    } else {
        quantity.parse::<u64>().unwrap_or(0)
    }
}
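As a sanity check on the quantity parser (repeated here so the snippet is self-contained), the binary suffixes expand as follows. Note the limitation: decimal suffixes (K, M, G) and milli-quantities ("100m") fall through to the plain-integer branch and parse as 0.

```rust
// Same minimal parser as in workspace_metrics.rs, repeated for a runnable check.
fn parse_k8s_quantity(quantity: &str) -> u64 {
    if quantity.ends_with("Gi") {
        quantity.trim_end_matches("Gi").parse::<u64>().unwrap_or(0) * 1024 * 1024 * 1024
    } else if quantity.ends_with("Mi") {
        quantity.trim_end_matches("Mi").parse::<u64>().unwrap_or(0) * 1024 * 1024
    } else if quantity.ends_with("Ki") {
        quantity.trim_end_matches("Ki").parse::<u64>().unwrap_or(0) * 1024
    } else {
        quantity.parse::<u64>().unwrap_or(0)
    }
}

fn main() {
    assert_eq!(parse_k8s_quantity("10Gi"), 10_737_418_240); // 10 * 1024^3
    assert_eq!(parse_k8s_quantity("500Mi"), 524_288_000);   // 500 * 1024^2
    assert_eq!(parse_k8s_quantity("1024"), 1024);           // plain bytes
    assert_eq!(parse_k8s_quantity("bad"), 0);               // unparseable input maps to 0
}
```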

Logging Implementation

Structured Logging

// Location: src/monitoring/logging.rs
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use serde_json::json;
use std::time::Duration;
use tracing::{debug, error, info, instrument, warn};
use uuid::Uuid;

// extract_request_id, get_memory_usage, and get_cpu_usage are defined
// elsewhere in this module.

#[derive(Serialize, Deserialize, Debug)]
pub struct CoditectLogEntry {
    pub timestamp: DateTime<Utc>,
    pub level: LogLevel,
    pub component: String,
    pub action: String,
    pub tenant_id: Option<Uuid>,
    pub user_id: Option<Uuid>,
    pub request_id: Option<String>,
    pub workspace_id: Option<String>, // StatefulSet workspace ID
    pub pod_name: Option<String>,     // StatefulSet pod name
    pub details: serde_json::Value,
    pub metrics: LogMetrics,
}

#[derive(Serialize, Deserialize, Debug)]
pub enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct LogMetrics {
    pub duration_ms: Option<u64>,
    pub memory_used: Option<u64>,
    pub cpu_usage: Option<f64>,
    pub pvc_usage_bytes: Option<u64>, // PersistentVolume usage
    pub pvc_iops: Option<u64>,        // Storage IOPS
    pub pod_restarts: Option<u32>,    // Container restart count
}

#[instrument(skip(details))]
pub fn log_business_event(
    action: &str,
    tenant_id: Option<Uuid>,
    user_id: Option<Uuid>,
    details: serde_json::Value,
) {
    let entry = CoditectLogEntry {
        timestamp: Utc::now(),
        level: LogLevel::Info,
        component: "business".to_string(),
        action: action.to_string(),
        tenant_id,
        user_id,
        request_id: extract_request_id(),
        workspace_id: None, // business events are not tied to a workspace pod
        pod_name: None,
        details,
        metrics: LogMetrics {
            duration_ms: None,
            memory_used: Some(get_memory_usage()),
            cpu_usage: Some(get_cpu_usage()),
            pvc_usage_bytes: None,
            pvc_iops: None,
            pod_restarts: None,
        },
    };

    info!(
        tenant_id = ?tenant_id,
        user_id = ?user_id,
        action = %action,
        entry = ?entry,
        "Business event logged"
    );
}

#[instrument]
pub async fn log_api_request(
    method: &str,
    path: &str,
    status: u16,
    duration: Duration,
    tenant_id: Option<Uuid>,
    user_id: Option<Uuid>,
) {
    let request_id = extract_request_id();

    let entry = CoditectLogEntry {
        timestamp: Utc::now(),
        level: if status >= 500 {
            LogLevel::Error
        } else if status >= 400 {
            LogLevel::Warn
        } else {
            LogLevel::Info
        },
        component: "api".to_string(),
        action: format!("{} {}", method, path),
        tenant_id,
        user_id,
        request_id: request_id.clone(),
        workspace_id: None,
        pod_name: None,
        details: json!({
            "method": method,
            "path": path,
            "status": status,
        }),
        metrics: LogMetrics {
            duration_ms: Some(duration.as_millis() as u64),
            memory_used: Some(get_memory_usage()),
            cpu_usage: None,
            pvc_usage_bytes: None,
            pvc_iops: None,
            pod_restarts: None,
        },
    };

    // Emit the structured entry, then a level-specific event for readability.
    debug!(entry = ?entry, "API request log entry");

    match status {
        200..=299 => info!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            duration_ms = %duration.as_millis(),
            "API request successful"
        ),
        400..=499 => warn!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API client error"
        ),
        500..=599 => error!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API server error"
        ),
        _ => debug!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API request completed"
        ),
    }
}
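The status-to-level mapping used above can be factored into one helper. A sketch reusing the LogLevel enum (repeated so the snippet compiles on its own; the function name is illustrative):

```rust
#[derive(Debug, PartialEq)]
pub enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Map an HTTP status code to a log level: 5xx => Error, 4xx => Warn,
/// everything else (1xx-3xx) => Info, matching log_api_request above.
pub fn level_for_status(status: u16) -> LogLevel {
    if status >= 500 {
        LogLevel::Error
    } else if status >= 400 {
        LogLevel::Warn
    } else {
        LogLevel::Info
    }
}
```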

Distributed Tracing

OpenTelemetry Integration

// Location: src/monitoring/tracing.rs
use opentelemetry::sdk::propagation::TraceContextPropagator;
use opentelemetry::{global, trace::TraceError, KeyValue};
use opentelemetry_otlp::WithExportConfig;
use tracing::{error, info, instrument};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
use uuid::Uuid;

pub fn init_tracing() -> Result<(), TraceError> {
    // W3C trace-context propagation (the Jaeger-specific propagator lives in
    // the separate opentelemetry-jaeger crate, which is not a dependency here).
    global::set_text_map_propagator(TraceContextPropagator::new());

    // OTLP over gRPC; Jaeger's OTLP receiver listens on port 4317.
    let otlp_exporter = opentelemetry_otlp::new_exporter()
        .tonic()
        .with_endpoint("http://jaeger-collector:4317");

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(otlp_exporter)
        .with_trace_config(
            opentelemetry::sdk::trace::config()
                .with_resource(opentelemetry::sdk::Resource::new(vec![
                    KeyValue::new("service.name", "coditect-api"),
                    KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
                    KeyValue::new(
                        "deployment.environment",
                        std::env::var("ENVIRONMENT").unwrap_or_else(|_| "development".to_string()),
                    ),
                ]))
        )
        .install_batch(opentelemetry::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer().json())
        .with(telemetry)
        .init();

    Ok(())
}

// `Span::record` only sets fields that were declared on the span, so the
// dynamic fields are declared up front as Empty.
#[instrument(skip(tenant_id, operation), fields(
    tenant_id = tracing::field::Empty,
    operation = tracing::field::Empty,
    result = tracing::field::Empty,
    error = tracing::field::Empty,
))]
pub async fn trace_operation<T, F, Fut>(
    operation_name: &str,
    tenant_id: &Uuid,
    operation: F,
) -> Result<T, anyhow::Error>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<T, anyhow::Error>>,
{
    let span = tracing::Span::current();
    span.record("tenant_id", tenant_id.to_string().as_str());
    span.record("operation", operation_name);

    let start = std::time::Instant::now();
    let result = operation().await;
    let duration = start.elapsed();

    match &result {
        Ok(_) => {
            span.record("result", "success");
            info!(
                operation = %operation_name,
                tenant_id = %tenant_id,
                duration_ms = %duration.as_millis(),
                "Operation completed successfully"
            );
        }
        Err(e) => {
            span.record("result", "error");
            span.record("error", e.to_string().as_str());
            error!(
                operation = %operation_name,
                tenant_id = %tenant_id,
                duration_ms = %duration.as_millis(),
                error = %e,
                "Operation failed"
            );
        }
    }

    result
}
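Stripped of spans and metrics, the core of trace_operation is "time the operation, then branch on the Result". A synchronous stdlib sketch of that shape (`timed` is an illustrative helper, not part of the codebase):

```rust
use std::time::{Duration, Instant};

/// Time a fallible operation and return (result, elapsed), mirroring the
/// measure-then-branch structure of `trace_operation`.
fn timed<T, E, F: FnOnce() -> Result<T, E>>(op: F) -> (Result<T, E>, Duration) {
    let start = Instant::now();
    let result = op();
    (result, start.elapsed())
}

fn main() {
    let (result, elapsed) = timed(|| Ok::<_, String>(21 * 2));
    match &result {
        Ok(v) => println!("success: {v} in {:?}", elapsed),
        Err(e) => println!("error: {e} in {:?}", elapsed),
    }
    assert_eq!(result.unwrap(), 42);
}
```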

Alerting System

Alert Rules Configuration

# Location: deployment/prometheus/alerts.yml
groups:
  - name: coditect.critical
    rules:
      - alert: APIHighLatency
        expr: histogram_quantile(0.99, rate(coditect_http_request_duration_seconds_bucket[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API latency is too high"
          description: "99th percentile latency is {{ $value }}s for 2+ minutes"

      - alert: DatabaseConnectionFailure
        expr: coditect_database_connections == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"
          description: "No active FoundationDB connections"

      - alert: HighErrorRate
        expr: rate(coditect_api_errors_total[5m]) > 0.1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "Error rate is {{ $value }} errors/sec"

  - name: coditect.business
    rules:
      - alert: RevenueImpact
        # increase(), not rate(): the description counts events per hour,
        # whereas rate() yields a per-second average.
        expr: increase(coditect_revenue_events_total[1h]) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Revenue events below threshold"
          description: "Only {{ $value }} revenue events in past hour"
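The StatefulSet metrics defined earlier can feed the same rules file. A sketch of a workspace startup rule (the group name and the 15 s threshold are assumptions, to be tuned against the 1-30 s bucket range of the startup histogram):

```yaml
  - name: coditect.workspaces
    rules:
      - alert: SlowWorkspaceStartup
        expr: histogram_quantile(0.95, rate(coditect_workspace_startup_seconds_bucket[10m])) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Workspace startup is slow"
          description: "p95 workspace startup is {{ $value }}s over the last 10m"
```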

AlertManager Configuration

# Location: deployment/alertmanager/config.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'coditect-alerts'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m

receivers:
  - name: 'coditect-alerts'
    slack_configs:
      - channel: '#coditect-alerts'
        title: 'CODITECT Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#coditect-critical'
        title: '🚨 CRITICAL ALERT'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    webhook_configs:
      - url: 'https://api.coditect.com/alerts/webhook'
        send_resolved: true

Dashboard Configuration

Grafana Dashboard

{
  "dashboard": {
    "title": "CODITECT Platform Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(coditect_http_requests_total[5m])",
            "legendFormat": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time Distribution",
        "type": "histogram",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(coditect_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.99, rate(coditect_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Active Tenants",
        "type": "stat",
        "targets": [
          {
            "expr": "coditect_active_tenants_total",
            "legendFormat": "Active Tenants"
          }
        ]
      }
    ]
  }
}
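The StatefulSet metrics can be surfaced the same way; a stat panel for workspace count (panel placement and formatting assumed):

```json
{
  "title": "Active Workspaces",
  "type": "stat",
  "targets": [
    {
      "expr": "coditect_active_workspaces_total",
      "legendFormat": "Active Workspaces"
    }
  ]
}
```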

Testing Strategy

Monitoring Tests

// Location: tests/monitoring/metrics_tests.rs
// Note: the metric statics are process-global, so these tests interact if run
// concurrently; run with --test-threads=1 for deterministic values.

#[tokio::test]
async fn test_metrics_collection() {
    let registry = init_metrics();

    // Simulate API requests
    HTTP_REQUESTS_TOTAL.inc();
    HTTP_REQUEST_DURATION.observe(0.045);
    ACTIVE_TENANTS.set(42);

    // Verify metrics
    let metrics = registry.gather();
    assert!(!metrics.is_empty());

    let http_total = metrics.iter()
        .find(|m| m.get_name() == "coditect_http_requests_total")
        .unwrap();
    assert_eq!(http_total.get_metric()[0].get_counter().get_value(), 1.0);
}

#[tokio::test]
async fn test_alert_evaluation() {
    // Simulate a sustained high-latency condition
    for _ in 0..100 {
        HTTP_REQUEST_DURATION.observe(0.2); // 200ms, double the 100ms SLO
    }

    // calculate_percentile: test helper that estimates a quantile from the
    // histogram's bucket counts.
    let p99_latency = calculate_percentile(&HTTP_REQUEST_DURATION, 0.99);
    assert!(p99_latency > 0.1, "Alert should trigger for high latency");
}
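test_alert_evaluation assumes a calculate_percentile helper that is not shown in this document. A stdlib sketch of a nearest-rank percentile over raw samples conveys the intended estimate (the real helper would read the live histogram's buckets instead of a sample vector):

```rust
/// Nearest-rank percentile over raw observations; q in (0, 1].
fn percentile(samples: &mut [f64], q: f64) -> f64 {
    assert!(!samples.is_empty());
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: the ceil(q * n)-th smallest sample (1-based), clamped.
    let rank = ((q * samples.len() as f64).ceil() as usize).max(1) - 1;
    samples[rank.min(samples.len() - 1)]
}

fn main() {
    // 100 observations of 200 ms, as in the test above: every quantile is 0.2 s.
    let mut samples = vec![0.2_f64; 100];
    let p99 = percentile(&mut samples, 0.99);
    assert!(p99 > 0.1); // the APIHighLatency condition would fire
}
```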

Performance Requirements

// Location: src/monitoring/performance.rs
use std::time::Duration;

pub struct MonitoringPerformanceTargets {
    pub metrics_collection_overhead: Duration, // <1ms
    pub log_processing_latency: Duration,      // <5ms
    pub trace_overhead: Duration,              // <0.5ms
    pub dashboard_refresh: Duration,           // <2s
}

impl MonitoringPerformanceTargets {
    pub fn meets_sla(&self) -> bool {
        self.metrics_collection_overhead < Duration::from_millis(1)
            && self.log_processing_latency < Duration::from_millis(5)
            && self.trace_overhead < Duration::from_micros(500)
            && self.dashboard_refresh < Duration::from_secs(2)
    }
}
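A quick usage check of the SLA gate (struct and impl repeated so the snippet stands alone; the sample durations are illustrative):

```rust
use std::time::Duration;

// Repeated from src/monitoring/performance.rs so this snippet is self-contained.
pub struct MonitoringPerformanceTargets {
    pub metrics_collection_overhead: Duration,
    pub log_processing_latency: Duration,
    pub trace_overhead: Duration,
    pub dashboard_refresh: Duration,
}

impl MonitoringPerformanceTargets {
    pub fn meets_sla(&self) -> bool {
        self.metrics_collection_overhead < Duration::from_millis(1)
            && self.log_processing_latency < Duration::from_millis(5)
            && self.trace_overhead < Duration::from_micros(500)
            && self.dashboard_refresh < Duration::from_secs(2)
    }
}

fn main() {
    // Measured values comfortably inside every budget.
    let ok = MonitoringPerformanceTargets {
        metrics_collection_overhead: Duration::from_micros(400),
        log_processing_latency: Duration::from_millis(2),
        trace_overhead: Duration::from_micros(100),
        dashboard_refresh: Duration::from_millis(800),
    };
    assert!(ok.meets_sla());

    // One blown budget (trace overhead at 2ms) fails the whole SLA check.
    let slow = MonitoringPerformanceTargets {
        trace_overhead: Duration::from_millis(2),
        ..ok
    };
    assert!(!slow.meets_sla());
}
```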

Version History

Version | Date       | Changes                                                            | Author
1.0.0   | 2025-08-31 | Initial creation                                                   | Claude Code Session 3
2.0.0   | 2025-09-03 | Added StatefulSet monitoring, PVC metrics, workspace observability | SESSION16 DOCUMENT-DEV-4

QA Review Block

Status: AWAITING INDEPENDENT QA REVIEW

This section will be completed by an independent QA reviewer according to ADR-QA-REVIEW-GUIDE-v4.2.

Document ready for review as of: 2025-09-03
Version ready for review: 2.0.0


Changes in v2.0.0

  • Added StatefulSet workspace monitoring implementation (WorkspaceMonitor)
  • Introduced PVC usage and IOPS metrics collection
  • Added workspace-specific fields to CoditectLogEntry (workspace_id, pod_name)
  • Enhanced LogMetrics with StatefulSet-specific metrics (pvc_usage_bytes, pvc_iops, pod_restarts)
  • Added Kubernetes dependencies (k8s-openapi, kube) for native monitoring
  • Implemented workspace startup time tracking and container restart monitoring
  • Modified by: SESSION16 DOCUMENT-DEV-4 on 2025-09-03

Approval

Approval Signatures

Role           | Name         | Signature    | Date
Technical Lead | ____________ | ____________ | ____________
DevOps Lead    | ____________ | ____________ | ____________
SRE Lead       | ____________ | ____________ | ____________

This monitoring architecture provides complete observability for CODITECT's mission-critical AI development platform.
