ADR-008-v4: Monitoring & Observability - Part 2 (Technical)

Document: ADR-008-v4-monitoring-observability-part2-technical
Version: 2.0.0
Purpose: Technical implementation for comprehensive CODITECT monitoring and observability
Audience: Engineering teams, AI agents, DevOps engineers, SRE teams
Date Created: 2025-08-31
Date Modified: 2025-09-03
Status: UPDATED_FOR_STATEFULSETS
Changes: Added StatefulSet monitoring, PVC metrics, and workspace observability


Dependencies

# Cargo.toml monitoring dependencies
[dependencies]
prometheus = "0.13.3"
tracing = "0.1.40"
tracing-subscriber = { version = "0.3.18", features = ["json", "env-filter"] }
tracing-actix-web = "0.7.9"
tracing-opentelemetry = "0.21.0"
opentelemetry = { version = "0.20.0", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.13.0", features = ["tonic"] }
serde = { version = "1.0.193", features = ["derive"] }
serde_json = "1.0.108"
tokio = { version = "1.35.0", features = ["full"] }
actix-web-prom = "0.7.0"
uuid = { version = "1.6.1", features = ["v4", "serde"] }
chrono = { version = "0.4.31", features = ["serde"] }
lazy_static = "1.4.0" # used by the metrics registry below
anyhow = "1.0.75"     # error type used by the monitoring collectors

# Kubernetes monitoring
k8s-openapi = { version = "0.20.0", features = ["v1_28"] }
kube = { version = "0.87.0", features = ["runtime", "derive"] }

Metrics Architecture

Core Metrics Collection

// Location: src/monitoring/metrics.rs
use lazy_static::lazy_static;
use prometheus::{Counter, Gauge, GaugeVec, Histogram, HistogramVec, IntGauge, Opts, Registry};

lazy_static! {
    // HTTP Request Metrics
    pub static ref HTTP_REQUESTS_TOTAL: Counter = Counter::new(
        "coditect_http_requests_total",
        "Total HTTP requests"
    ).unwrap();

    pub static ref HTTP_REQUEST_DURATION: Histogram = Histogram::with_opts(
        prometheus::HistogramOpts::new(
            "coditect_http_request_duration_seconds",
            "HTTP request latency distribution"
        ).buckets(vec![0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
    ).unwrap();

    // Business Metrics
    pub static ref ACTIVE_TENANTS: IntGauge = IntGauge::new(
        "coditect_active_tenants_total",
        "Number of active tenants"
    ).unwrap();

    pub static ref AI_REQUESTS: Counter = Counter::new(
        "coditect_ai_requests_total",
        "Total AI provider requests"
    ).unwrap();

    pub static ref WORKFLOW_EXECUTIONS: Counter = Counter::new(
        "coditect_workflow_executions_total",
        "Total workflow executions"
    ).unwrap();

    // StatefulSet Metrics
    pub static ref ACTIVE_WORKSPACES: IntGauge = IntGauge::new(
        "coditect_active_workspaces_total",
        "Number of active StatefulSet workspaces"
    ).unwrap();

    // HistogramVec, not Histogram: startup time is recorded per tenant/user label pair.
    pub static ref WORKSPACE_STARTUP_TIME: HistogramVec = HistogramVec::new(
        prometheus::HistogramOpts::new(
            "coditect_workspace_startup_seconds",
            "Time to start a StatefulSet workspace"
        ).buckets(vec![1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0]),
        &["tenant_id", "user_id"]
    ).unwrap();

    // GaugeVec, not Gauge: PVC usage is recorded per workspace/volume/tenant.
    pub static ref PVC_USAGE: GaugeVec = GaugeVec::new(
        Opts::new("coditect_pvc_usage_bytes", "PersistentVolumeClaim usage in bytes"),
        &["workspace_id", "volume", "tenant_id"]
    ).unwrap();

    pub static ref PVC_IOPS: Counter = Counter::new(
        "coditect_pvc_iops_total",
        "Total IOPS on PersistentVolumeClaims"
    ).unwrap();

    // System Health Metrics
    pub static ref DATABASE_CONNECTIONS: IntGauge = IntGauge::new(
        "coditect_database_connections",
        "Active FoundationDB connections"
    ).unwrap();

    pub static ref MEMORY_USAGE: Gauge = Gauge::new(
        "coditect_memory_usage_bytes",
        "Memory usage in bytes"
    ).unwrap();

    pub static ref CPU_USAGE: Gauge = Gauge::new(
        "coditect_cpu_usage_percent",
        "CPU usage percentage"
    ).unwrap();
}

pub fn init_metrics() -> Registry {
    let registry = Registry::new();
    register_all_metrics(&registry);
    registry
}

fn register_all_metrics(registry: &Registry) {
    registry.register(Box::new(HTTP_REQUESTS_TOTAL.clone())).unwrap();
    registry.register(Box::new(HTTP_REQUEST_DURATION.clone())).unwrap();
    registry.register(Box::new(ACTIVE_TENANTS.clone())).unwrap();
    registry.register(Box::new(AI_REQUESTS.clone())).unwrap();
    registry.register(Box::new(WORKFLOW_EXECUTIONS.clone())).unwrap();
    registry.register(Box::new(DATABASE_CONNECTIONS.clone())).unwrap();
    registry.register(Box::new(MEMORY_USAGE.clone())).unwrap();
    registry.register(Box::new(CPU_USAGE.clone())).unwrap();
    registry.register(Box::new(ACTIVE_WORKSPACES.clone())).unwrap();
    registry.register(Box::new(WORKSPACE_STARTUP_TIME.clone())).unwrap();
    registry.register(Box::new(PVC_USAGE.clone())).unwrap();
    registry.register(Box::new(PVC_IOPS.clone())).unwrap();
}
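Prometheus scrapes this registry through a /metrics endpoint, typically rendered with `prometheus::TextEncoder`. To make the wire format concrete, here is a stdlib-only sketch of how a single gauge sample is rendered in the text exposition format (`render_gauge` is an illustrative helper, not part of the codebase):

```rust
/// Render one gauge-style sample in the Prometheus text exposition format.
/// This mirrors what `prometheus::TextEncoder` emits for the registry above.
fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n")
}

fn main() {
    let body = render_gauge(
        "coditect_active_tenants_total",
        "Number of active tenants",
        42.0,
    );
    print!("{body}");
    // # HELP coditect_active_tenants_total Number of active tenants
    // # TYPE coditect_active_tenants_total gauge
    // coditect_active_tenants_total 42
}
```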

Custom Business Metrics

// Location: src/monitoring/business_metrics.rs
use prometheus::{CounterVec, HistogramVec};
use uuid::Uuid;

// Labelled metric families (CounterVec/HistogramVec), since the methods
// below record per-tenant and per-provider label values.
pub struct BusinessMetrics {
    pub revenue_events: CounterVec,     // labels: tenant_id, event_type
    pub user_actions: CounterVec,       // labels: tenant_id, action
    pub ai_cost_tracking: HistogramVec, // label: provider
}

impl BusinessMetrics {
    // `_amount_cents` is currently unused; a revenue-amount histogram would consume it.
    pub fn record_revenue_event(&self, tenant_id: &Uuid, _amount_cents: u64) {
        self.revenue_events
            .with_label_values(&[&tenant_id.to_string(), "subscription"])
            .inc();
    }

    pub fn record_ai_usage(&self, provider: &str, cost_cents: u64) {
        // Buckets are denominated in dollars; convert from cents.
        self.ai_cost_tracking
            .with_label_values(&[provider])
            .observe(cost_cents as f64 / 100.0);
    }
}

StatefulSet Workspace Monitoring

// Location: src/monitoring/workspace_metrics.rs
use std::collections::BTreeMap;

use anyhow::Result;
use k8s_openapi::api::{apps::v1::StatefulSet, core::v1::{PersistentVolumeClaim, Pod}};
use kube::{Api, Client};
use prometheus::Registry;
use tracing::warn;

use crate::monitoring::metrics::{ACTIVE_WORKSPACES, PVC_USAGE, WORKSPACE_STARTUP_TIME};

pub struct WorkspaceMonitor {
    k8s_client: Client,
    metrics_registry: Registry,
}

impl WorkspaceMonitor {
    pub async fn collect_workspace_metrics(&self) -> Result<()> {
        let statefulsets: Api<StatefulSet> = Api::namespaced(
            self.k8s_client.clone(),
            "coditect"
        );

        let ss_list = statefulsets.list(&Default::default()).await?;
        ACTIVE_WORKSPACES.set(ss_list.items.len() as i64);

        for ss in ss_list.items {
            let workspace_id = ss.metadata.name.unwrap_or_default();
            let labels = ss.metadata.labels.unwrap_or_default();

            // Collect pod metrics
            self.collect_pod_metrics(&workspace_id, &labels).await?;

            // Collect PVC metrics
            self.collect_pvc_metrics(&workspace_id, &labels).await?;
        }

        Ok(())
    }

    async fn collect_pod_metrics(&self, workspace_id: &str, labels: &BTreeMap<String, String>) -> Result<()> {
        let pods: Api<Pod> = Api::namespaced(self.k8s_client.clone(), "coditect");
        // StatefulSet pods are named <name>-<ordinal>; workspaces run a single replica.
        let pod_name = format!("{}-0", workspace_id);

        if let Ok(pod) = pods.get(&pod_name).await {
            if let Some(status) = pod.status {
                // Startup time: interval from pod start to the Ready condition.
                if let Some(start_time) = status.start_time {
                    if let Some(condition) = status.conditions
                        .unwrap_or_default()
                        .iter()
                        .find(|c| c.type_ == "Ready" && c.status == "True")
                    {
                        let startup_duration = condition.last_transition_time
                            .as_ref()
                            .and_then(|t| t.0.signed_duration_since(start_time.0).to_std().ok());

                        if let Some(duration) = startup_duration {
                            WORKSPACE_STARTUP_TIME
                                .with_label_values(&[
                                    labels.get("tenant-id").map(String::as_str).unwrap_or("unknown"),
                                    labels.get("user-id").map(String::as_str).unwrap_or("unknown"),
                                ])
                                .observe(duration.as_secs_f64());
                        }
                    }
                }

                // Log container restarts
                if let Some(containers) = status.container_statuses {
                    for container in containers {
                        if container.restart_count > 0 {
                            warn!(
                                workspace_id = %workspace_id,
                                container = %container.name,
                                restarts = container.restart_count,
                                "Container has restarted"
                            );
                        }
                    }
                }
            }
        }

        Ok(())
    }

    async fn collect_pvc_metrics(&self, workspace_id: &str, labels: &BTreeMap<String, String>) -> Result<()> {
        let pvcs: Api<PersistentVolumeClaim> = Api::namespaced(self.k8s_client.clone(), "coditect");

        // Check workspace PVC
        let workspace_pvc = format!("workspace-{}", workspace_id);
        if let Ok(pvc) = pvcs.get(&workspace_pvc).await {
            if let Some(status) = pvc.status {
                if let Some(capacity) = status.capacity {
                    if let Some(storage) = capacity.get("storage") {
                        // Note: status.capacity is the provisioned size, not live
                        // usage; live usage requires kubelet volume stats.
                        let bytes = parse_k8s_quantity(&storage.0);
                        PVC_USAGE
                            .with_label_values(&[
                                workspace_id,
                                "workspace",
                                labels.get("tenant-id").map(String::as_str).unwrap_or("unknown"),
                            ])
                            .set(bytes as f64);
                    }
                }
            }
        }

        Ok(())
    }
}

fn parse_k8s_quantity(quantity: &str) -> u64 {
    // Minimal parser for binary-suffix quantities like "10Gi", "500Mi".
    if quantity.ends_with("Gi") {
        quantity.trim_end_matches("Gi").parse::<u64>().unwrap_or(0) * 1024 * 1024 * 1024
    } else if quantity.ends_with("Mi") {
        quantity.trim_end_matches("Mi").parse::<u64>().unwrap_or(0) * 1024 * 1024
    } else if quantity.ends_with("Ki") {
        quantity.trim_end_matches("Ki").parse::<u64>().unwrap_or(0) * 1024
    } else {
        quantity.parse::<u64>().unwrap_or(0)
    }
}
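As a sanity check on the quantity parser (repeated here so the snippet is self-contained), the binary suffixes expand as follows. Note the limitation: decimal suffixes (K, M, G) and milli-quantities ("100m") fall through to the plain-integer branch and parse as 0.

```rust
// Same minimal parser as in workspace_metrics.rs, repeated for a runnable check.
fn parse_k8s_quantity(quantity: &str) -> u64 {
    if quantity.ends_with("Gi") {
        quantity.trim_end_matches("Gi").parse::<u64>().unwrap_or(0) * 1024 * 1024 * 1024
    } else if quantity.ends_with("Mi") {
        quantity.trim_end_matches("Mi").parse::<u64>().unwrap_or(0) * 1024 * 1024
    } else if quantity.ends_with("Ki") {
        quantity.trim_end_matches("Ki").parse::<u64>().unwrap_or(0) * 1024
    } else {
        quantity.parse::<u64>().unwrap_or(0)
    }
}

fn main() {
    assert_eq!(parse_k8s_quantity("10Gi"), 10_737_418_240); // 10 * 1024^3
    assert_eq!(parse_k8s_quantity("500Mi"), 524_288_000);   // 500 * 1024^2
    assert_eq!(parse_k8s_quantity("1024"), 1024);           // plain bytes
    assert_eq!(parse_k8s_quantity("bad"), 0);               // unparseable input maps to 0
}
```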

Logging Implementation

Structured Logging

// Location: src/monitoring/logging.rs
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use serde_json::json;
use std::time::Duration;
use tracing::{debug, error, info, instrument, warn};
use uuid::Uuid;

// extract_request_id, get_memory_usage, and get_cpu_usage are defined
// elsewhere in this module.

#[derive(Serialize, Deserialize, Debug)]
pub struct CoditectLogEntry {
    pub timestamp: DateTime<Utc>,
    pub level: LogLevel,
    pub component: String,
    pub action: String,
    pub tenant_id: Option<Uuid>,
    pub user_id: Option<Uuid>,
    pub request_id: Option<String>,
    pub workspace_id: Option<String>, // StatefulSet workspace ID
    pub pod_name: Option<String>,     // StatefulSet pod name
    pub details: serde_json::Value,
    pub metrics: LogMetrics,
}

#[derive(Serialize, Deserialize, Debug)]
pub enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct LogMetrics {
    pub duration_ms: Option<u64>,
    pub memory_used: Option<u64>,
    pub cpu_usage: Option<f64>,
    pub pvc_usage_bytes: Option<u64>, // PersistentVolume usage
    pub pvc_iops: Option<u64>,        // Storage IOPS
    pub pod_restarts: Option<u32>,    // Container restart count
}

#[instrument(skip(details))]
pub fn log_business_event(
    action: &str,
    tenant_id: Option<Uuid>,
    user_id: Option<Uuid>,
    details: serde_json::Value,
) {
    let entry = CoditectLogEntry {
        timestamp: Utc::now(),
        level: LogLevel::Info,
        component: "business".to_string(),
        action: action.to_string(),
        tenant_id,
        user_id,
        request_id: extract_request_id(),
        workspace_id: None, // business events are not tied to a workspace pod
        pod_name: None,
        details,
        metrics: LogMetrics {
            duration_ms: None,
            memory_used: Some(get_memory_usage()),
            cpu_usage: Some(get_cpu_usage()),
            pvc_usage_bytes: None,
            pvc_iops: None,
            pod_restarts: None,
        },
    };

    info!(
        tenant_id = ?tenant_id,
        user_id = ?user_id,
        action = %action,
        entry = ?entry,
        "Business event logged"
    );
}

#[instrument]
pub async fn log_api_request(
    method: &str,
    path: &str,
    status: u16,
    duration: Duration,
    tenant_id: Option<Uuid>,
    user_id: Option<Uuid>,
) {
    let request_id = extract_request_id();

    let entry = CoditectLogEntry {
        timestamp: Utc::now(),
        level: if status >= 500 {
            LogLevel::Error
        } else if status >= 400 {
            LogLevel::Warn
        } else {
            LogLevel::Info
        },
        component: "api".to_string(),
        action: format!("{} {}", method, path),
        tenant_id,
        user_id,
        request_id: request_id.clone(),
        workspace_id: None,
        pod_name: None,
        details: json!({
            "method": method,
            "path": path,
            "status": status,
        }),
        metrics: LogMetrics {
            duration_ms: Some(duration.as_millis() as u64),
            memory_used: Some(get_memory_usage()),
            cpu_usage: None,
            pvc_usage_bytes: None,
            pvc_iops: None,
            pod_restarts: None,
        },
    };

    // Emit the structured entry, then a level-specific event for readability.
    debug!(entry = ?entry, "API request log entry");

    match status {
        200..=299 => info!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            duration_ms = %duration.as_millis(),
            "API request successful"
        ),
        400..=499 => warn!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API client error"
        ),
        500..=599 => error!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API server error"
        ),
        _ => debug!(
            request_id = ?request_id,
            method = %method,
            path = %path,
            status = %status,
            "API request completed"
        ),
    }
}
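The status-to-level mapping used above can be factored into one helper. A sketch reusing the LogLevel enum (repeated so the snippet compiles on its own; the function name is illustrative):

```rust
#[derive(Debug, PartialEq)]
pub enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Map an HTTP status code to a log level: 5xx => Error, 4xx => Warn,
/// everything else (1xx-3xx) => Info, matching log_api_request above.
pub fn level_for_status(status: u16) -> LogLevel {
    if status >= 500 {
        LogLevel::Error
    } else if status >= 400 {
        LogLevel::Warn
    } else {
        LogLevel::Info
    }
}
```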

Distributed Tracing

OpenTelemetry Integration

// Location: src/monitoring/tracing.rs
use opentelemetry::sdk::propagation::TraceContextPropagator;
use opentelemetry::{global, trace::TraceError, KeyValue};
use opentelemetry_otlp::WithExportConfig;
use tracing::{error, info, instrument};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
use uuid::Uuid;

pub fn init_tracing() -> Result<(), TraceError> {
    // W3C trace-context propagation (the Jaeger-specific propagator lives in
    // the separate opentelemetry-jaeger crate, which is not a dependency here).
    global::set_text_map_propagator(TraceContextPropagator::new());

    // OTLP over gRPC; Jaeger's OTLP receiver listens on port 4317.
    let otlp_exporter = opentelemetry_otlp::new_exporter()
        .tonic()
        .with_endpoint("http://jaeger-collector:4317");

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(otlp_exporter)
        .with_trace_config(
            opentelemetry::sdk::trace::config()
                .with_resource(opentelemetry::sdk::Resource::new(vec![
                    KeyValue::new("service.name", "coditect-api"),
                    KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
                    KeyValue::new(
                        "deployment.environment",
                        std::env::var("ENVIRONMENT").unwrap_or_else(|_| "development".to_string()),
                    ),
                ]))
        )
        .install_batch(opentelemetry::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer().json())
        .with(telemetry)
        .init();

    Ok(())
}

// `Span::record` only sets fields that were declared on the span, so the
// dynamic fields are declared up front as Empty.
#[instrument(skip(tenant_id, operation), fields(
    tenant_id = tracing::field::Empty,
    operation = tracing::field::Empty,
    result = tracing::field::Empty,
    error = tracing::field::Empty,
))]
pub async fn trace_operation<T, F, Fut>(
    operation_name: &str,
    tenant_id: &Uuid,
    operation: F,
) -> Result<T, anyhow::Error>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<T, anyhow::Error>>,
{
    let span = tracing::Span::current();
    span.record("tenant_id", tenant_id.to_string().as_str());
    span.record("operation", operation_name);

    let start = std::time::Instant::now();
    let result = operation().await;
    let duration = start.elapsed();

    match &result {
        Ok(_) => {
            span.record("result", "success");
            info!(
                operation = %operation_name,
                tenant_id = %tenant_id,
                duration_ms = %duration.as_millis(),
                "Operation completed successfully"
            );
        }
        Err(e) => {
            span.record("result", "error");
            span.record("error", e.to_string().as_str());
            error!(
                operation = %operation_name,
                tenant_id = %tenant_id,
                duration_ms = %duration.as_millis(),
                error = %e,
                "Operation failed"
            );
        }
    }

    result
}
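Stripped of spans and metrics, the core of trace_operation is "time the operation, then branch on the Result". A synchronous stdlib sketch of that shape (`timed` is an illustrative helper, not part of the codebase):

```rust
use std::time::{Duration, Instant};

/// Time a fallible operation and return (result, elapsed), mirroring the
/// measure-then-branch structure of `trace_operation`.
fn timed<T, E, F: FnOnce() -> Result<T, E>>(op: F) -> (Result<T, E>, Duration) {
    let start = Instant::now();
    let result = op();
    (result, start.elapsed())
}

fn main() {
    let (result, elapsed) = timed(|| Ok::<_, String>(21 * 2));
    match &result {
        Ok(v) => println!("success: {v} in {:?}", elapsed),
        Err(e) => println!("error: {e} in {:?}", elapsed),
    }
    assert_eq!(result.unwrap(), 42);
}
```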

Alerting System

Alert Rules Configuration

# Location: deployment/prometheus/alerts.yml
groups:
  - name: coditect.critical
    rules:
      - alert: APIHighLatency
        expr: histogram_quantile(0.99, rate(coditect_http_request_duration_seconds_bucket[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API latency is too high"
          description: "99th percentile latency is {{ $value }}s for 2+ minutes"

      - alert: DatabaseConnectionFailure
        expr: coditect_database_connections == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"
          description: "No active FoundationDB connections"

      - alert: HighErrorRate
        expr: rate(coditect_api_errors_total[5m]) > 0.1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "Error rate is {{ $value }} errors/sec"

  - name: coditect.business
    rules:
      - alert: RevenueImpact
        # increase(), not rate(): the description counts events per hour,
        # whereas rate() yields a per-second average.
        expr: increase(coditect_revenue_events_total[1h]) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Revenue events below threshold"
          description: "Only {{ $value }} revenue events in past hour"
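The StatefulSet metrics defined earlier can feed the same rules file. A sketch of a workspace startup rule (the group name and the 15 s threshold are assumptions, to be tuned against the 1-30 s bucket range of the startup histogram):

```yaml
  - name: coditect.workspaces
    rules:
      - alert: SlowWorkspaceStartup
        expr: histogram_quantile(0.95, rate(coditect_workspace_startup_seconds_bucket[10m])) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Workspace startup is slow"
          description: "p95 workspace startup is {{ $value }}s over the last 10m"
```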

AlertManager Configuration

# Location: deployment/alertmanager/config.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'coditect-alerts'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m

receivers:
  - name: 'coditect-alerts'
    slack_configs:
      - channel: '#coditect-alerts'
        title: 'CODITECT Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#coditect-critical'
        title: '🚨 CRITICAL ALERT'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    webhook_configs:
      - url: 'https://api.coditect.com/alerts/webhook'
        send_resolved: true

Dashboard Configuration

Grafana Dashboard

{
  "dashboard": {
    "title": "CODITECT Platform Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(coditect_http_requests_total[5m])",
            "legendFormat": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time Distribution",
        "type": "histogram",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(coditect_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.99, rate(coditect_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Active Tenants",
        "type": "stat",
        "targets": [
          {
            "expr": "coditect_active_tenants_total",
            "legendFormat": "Active Tenants"
          }
        ]
      }
    ]
  }
}
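The StatefulSet metrics can be surfaced the same way; a stat panel for workspace count (panel placement and formatting assumed):

```json
{
  "title": "Active Workspaces",
  "type": "stat",
  "targets": [
    {
      "expr": "coditect_active_workspaces_total",
      "legendFormat": "Active Workspaces"
    }
  ]
}
```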

Testing Strategy

Monitoring Tests

// Location: tests/monitoring/metrics_tests.rs
// Note: the metric statics are process-global, so these tests interact if run
// concurrently; run with --test-threads=1 for deterministic values.

#[tokio::test]
async fn test_metrics_collection() {
    let registry = init_metrics();

    // Simulate API requests
    HTTP_REQUESTS_TOTAL.inc();
    HTTP_REQUEST_DURATION.observe(0.045);
    ACTIVE_TENANTS.set(42);

    // Verify metrics
    let metrics = registry.gather();
    assert!(!metrics.is_empty());

    let http_total = metrics.iter()
        .find(|m| m.get_name() == "coditect_http_requests_total")
        .unwrap();
    assert_eq!(http_total.get_metric()[0].get_counter().get_value(), 1.0);
}

#[tokio::test]
async fn test_alert_evaluation() {
    // Simulate a sustained high-latency condition
    for _ in 0..100 {
        HTTP_REQUEST_DURATION.observe(0.2); // 200ms, double the 100ms SLO
    }

    // calculate_percentile: test helper that estimates a quantile from the
    // histogram's bucket counts.
    let p99_latency = calculate_percentile(&HTTP_REQUEST_DURATION, 0.99);
    assert!(p99_latency > 0.1, "Alert should trigger for high latency");
}
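test_alert_evaluation assumes a calculate_percentile helper that is not shown in this document. A stdlib sketch of a nearest-rank percentile over raw samples conveys the intended estimate (the real helper would read the live histogram's buckets instead of a sample vector):

```rust
/// Nearest-rank percentile over raw observations; q in (0, 1].
fn percentile(samples: &mut [f64], q: f64) -> f64 {
    assert!(!samples.is_empty());
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: the ceil(q * n)-th smallest sample (1-based), clamped.
    let rank = ((q * samples.len() as f64).ceil() as usize).max(1) - 1;
    samples[rank.min(samples.len() - 1)]
}

fn main() {
    // 100 observations of 200 ms, as in the test above: every quantile is 0.2 s.
    let mut samples = vec![0.2_f64; 100];
    let p99 = percentile(&mut samples, 0.99);
    assert!(p99 > 0.1); // the APIHighLatency condition would fire
}
```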

Performance Requirements

// Location: src/monitoring/performance.rs
use std::time::Duration;

pub struct MonitoringPerformanceTargets {
    pub metrics_collection_overhead: Duration, // <1ms
    pub log_processing_latency: Duration,      // <5ms
    pub trace_overhead: Duration,              // <0.5ms
    pub dashboard_refresh: Duration,           // <2s
}

impl MonitoringPerformanceTargets {
    pub fn meets_sla(&self) -> bool {
        self.metrics_collection_overhead < Duration::from_millis(1)
            && self.log_processing_latency < Duration::from_millis(5)
            && self.trace_overhead < Duration::from_micros(500)
            && self.dashboard_refresh < Duration::from_secs(2)
    }
}
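A quick usage check of the SLA gate (struct and impl repeated so the snippet stands alone; the sample durations are illustrative):

```rust
use std::time::Duration;

// Repeated from src/monitoring/performance.rs so this snippet is self-contained.
pub struct MonitoringPerformanceTargets {
    pub metrics_collection_overhead: Duration,
    pub log_processing_latency: Duration,
    pub trace_overhead: Duration,
    pub dashboard_refresh: Duration,
}

impl MonitoringPerformanceTargets {
    pub fn meets_sla(&self) -> bool {
        self.metrics_collection_overhead < Duration::from_millis(1)
            && self.log_processing_latency < Duration::from_millis(5)
            && self.trace_overhead < Duration::from_micros(500)
            && self.dashboard_refresh < Duration::from_secs(2)
    }
}

fn main() {
    // Measured values comfortably inside every budget.
    let ok = MonitoringPerformanceTargets {
        metrics_collection_overhead: Duration::from_micros(400),
        log_processing_latency: Duration::from_millis(2),
        trace_overhead: Duration::from_micros(100),
        dashboard_refresh: Duration::from_millis(800),
    };
    assert!(ok.meets_sla());

    // One blown budget (trace overhead at 2ms) fails the whole SLA check.
    let slow = MonitoringPerformanceTargets {
        trace_overhead: Duration::from_millis(2),
        ..ok
    };
    assert!(!slow.meets_sla());
}
```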

Version History

Version | Date       | Changes                                                            | Author
1.0.0   | 2025-08-31 | Initial creation                                                   | Claude Code Session 3
2.0.0   | 2025-09-03 | Added StatefulSet monitoring, PVC metrics, workspace observability | SESSION16 DOCUMENT-DEV-4

QA Review Block

Status: AWAITING INDEPENDENT QA REVIEW

This section will be completed by an independent QA reviewer according to ADR-QA-REVIEW-GUIDE-v4.2.

Document ready for review as of: 2025-09-03
Version ready for review: 2.0.0


Changes in v2.0.0

  • Added StatefulSet workspace monitoring implementation (WorkspaceMonitor)
  • Introduced PVC usage and IOPS metrics collection
  • Added workspace-specific fields to CoditectLogEntry (workspace_id, pod_name)
  • Enhanced LogMetrics with StatefulSet-specific metrics (pvc_usage_bytes, pvc_iops, pod_restarts)
  • Added Kubernetes dependencies (k8s-openapi, kube) for native monitoring
  • Implemented workspace startup time tracking and container restart monitoring
  • Modified by: SESSION16 DOCUMENT-DEV-4 on 2025-09-03

Approval

Approval Signatures

Role           | Name         | Signature    | Date
Technical Lead | ____________ | ____________ | ____________
DevOps Lead    | ____________ | ____________ | ____________
SRE Lead       | ____________ | ____________ | ____________

This monitoring architecture provides complete observability for CODITECT's mission-critical AI development platform.
