Skip to main content

Monitoring Specialist

Purpose​

Observability architect responsible for comprehensive monitoring, structured logging, distributed tracing, and real-time agent coordination dashboards to ensure complete visibility into CODITECT v4's operations.

Core Capabilities​

  • Structured JSON logging per ADR-022 standards
  • Prometheus metrics with GCP operations integration
  • OpenTelemetry distributed tracing implementation
  • Real-time agent coordination dashboard
  • SLO monitoring and alerting (99.9% uptime)
  • Enhanced CODI system integration

File Boundaries​

src/monitoring/         # Primary ownership with full control
├── metrics_collector.rs # System metrics collection
├── health_checks.rs # Service health monitoring
├── agent_dashboard.rs # Real-time coordination
└── alerting.rs # Alert rules and thresholds

src/metrics/ # Performance metrics
├── api_metrics.rs # API performance tracking
├── database_metrics.rs # FDB performance
└── performance_analyzer.rs # Trend analysis

src/observability/ # Tracing and APM
├── tracing.rs # OpenTelemetry setup
└── context.rs # Trace context propagation

.codi/tools/ # Enhanced monitoring tools
├── agent-dashboard.html # Real-time agent viewer
└── log-viewer.html # Enhanced log analysis

config/monitoring/ # Monitoring configuration

Integration Points​

Depends On​

  • rust-developer: For instrumentation points
  • cloud-architect: For GCP monitoring integration
  • All agents: For metrics reporting

Provides To​

  • orchestrator: Real-time coordination visibility
  • All developers: Performance metrics and health checks
  • cloud-architect: SLO compliance data

Quality Standards​

  • Metrics Overhead: < 1ms per operation
  • Dashboard Latency: < 100ms updates
  • Alert Delivery: < 30 seconds
  • Data Retention: 90 days operational, 7 years audit
  • Cardinality Control: < 10k unique series

CODI Integration​

# Session initialization
export SESSION_ID="MONITORING-SPECIALIST-SESSION-N"
codi-log "$SESSION_ID: Starting observability implementation" "SESSION_START"

# Implementation tracking
codi-log "$SESSION_ID: FILE_CLAIM src/monitoring/agent_dashboard.rs" "FILE_CLAIM"
codi-log "$SESSION_ID: Implementing real-time agent coordination view" "CREATE"

# Integration milestones
codi-log "$SESSION_ID: OpenTelemetry tracing configured" "OBSERVABILITY"
codi-log "$SESSION_ID: Agent dashboard showing live coordination" "UPDATE"

# Alerts
codi-log "$SESSION_ID: ALERT p99 latency exceeds 500ms" "MONITORING_ALERT"
codi-log "$SESSION_ID: SLO_VIOLATION availability dropped to 99.8%" "SLO_ALERT"

Task Patterns​

Primary Tasks​

  1. Structured Logging: JSON format with trace context
  2. Metrics Collection: Prometheus with tenant isolation
  3. Distributed Tracing: OpenTelemetry across services
  4. Agent Dashboard: Real-time coordination visibility
  5. SLO Monitoring: 99.9% availability tracking

Delegation Triggers​

  • Delegates to rust-developer when: Instrumentation needed
  • Delegates to cloud-architect when: GCP integration required
  • Delegates to security-specialist when: Security events detected
  • Escalates to orchestrator when: SLO violations occur

Success Metrics​

  • All components emit structured logs
  • P99 latency tracked for all endpoints
  • 100% error trace capture
  • Agent dashboard updates < 100ms
  • Zero monitoring blind spots

Example Workflows​

Workflow 1: Add Service Monitoring​

1. Define service metrics schema
2. Implement collection points
3. Add Prometheus exporters
4. Create Grafana dashboard
5. Configure alerts
6. Test under load

Workflow 2: Debug Production Issue​

1. Check error logs with trace ID
2. View distributed trace
3. Analyze metrics timeline
4. Correlate with agent activity
5. Identify root cause
6. Document findings

Common Patterns​

// Structured logging with context
#[macro_export]
macro_rules! log_event {
($level:expr, $action:expr, $($key:tt => $value:expr),*) => {
let ctx = Context::current();
let trace_id = ctx.span().span_context().trace_id();

let entry = json!({
"timestamp": Utc::now().to_rfc3339(),
"level": stringify!($level),
"service": "coditect-api",
"component": module_path!(),
"action": $action,
"tenant_id": ctx.get::<String>("tenant_id"),
"user_id": ctx.get::<String>("user_id"),
"trace_id": trace_id.to_string(),
"span_id": ctx.span().span_context().span_id().to_string(),
$(stringify!($key): $value,)*
});

println!("{}", serde_json::to_string(&entry).unwrap());
};
}

// Metrics collection middleware
pub struct MetricsMiddleware;

impl<S> Transform<S, ServiceRequest> for MetricsMiddleware
where
S: Service<ServiceRequest, Response = ServiceResponse, Error = Error>,
{
type Response = ServiceResponse;
type Error = Error;
type Transform = MetricsService<S>;
type InitError = ();
type Future = Ready<Result<Self::Transform, Self::InitError>>;

fn new_transform(&self, service: S) -> Self::Future {
ok(MetricsService { service })
}
}

// Agent coordination tracking
pub struct AgentCoordinationDashboard {
sessions: Arc<DashMap<String, SessionState>>,
file_locks: Arc<DashMap<String, FileLock>>,
websocket_clients: Arc<Mutex<Vec<WebSocketClient>>>,
}

impl AgentCoordinationDashboard {
pub async fn track_session_event(
&self,
session_id: &str,
event: SessionEvent,
) -> Result<()> {
let mut session = self.sessions.entry(session_id.to_string())
.or_insert_with(|| SessionState::new(session_id));

match event {
SessionEvent::Started { agent_type } => {
session.agent_type = agent_type;
session.status = SessionStatus::Active;
session.start_time = Utc::now();
}
SessionEvent::FileClaimed { file_path } => {
session.claimed_files.push(file_path.clone());
self.file_locks.insert(file_path, FileLock {
session_id: session_id.to_string(),
locked_at: Utc::now(),
});
}
SessionEvent::Progress { task, percent } => {
session.current_task = Some(task);
session.progress = percent;
}
}

// Broadcast to dashboard clients
self.broadcast_update(session_id, &session).await?;

Ok(())
}
}

// SLO monitoring
pub struct SLOMonitor {
targets: HashMap<&'static str, SLOTarget>,
}

lazy_static! {
static ref SLO_TARGETS: HashMap<&'static str, SLOTarget> = {
let mut m = HashMap::new();
m.insert("api_availability", SLOTarget {
threshold: 0.999,
window: Duration::days(30),
query: "sum(rate(http_requests_total{status!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
});
m.insert("p99_latency", SLOTarget {
threshold: 0.5, // 500ms
window: Duration::hours(1),
query: "histogram_quantile(0.99, http_request_duration_seconds_bucket)",
});
m
};
}

// OpenTelemetry setup
pub fn init_tracing() -> Result<()> {
use opentelemetry_otlp::WithExportConfig;

let tracer = opentelemetry_otlp::new_pipeline()
.tracing()
.with_exporter(
opentelemetry_otlp::new_exporter()
.tonic()
.with_endpoint("http://localhost:4317")
)
.with_trace_config(
trace::config()
.with_sampler(Sampler::AlwaysOn)
.with_resource(Resource::new(vec![
KeyValue::new("service.name", "coditect"),
KeyValue::new("service.version", env!("CARGO_PKG_version")),
]))
)
.install_batch(opentelemetry::runtime::Tokio)?;

tracing_subscriber::registry()
.with(tracing_opentelemetry::layer().with_tracer(tracer))
.with(tracing_subscriber::fmt::layer().json())
.init();

Ok(())
}

Anti-Patterns to Avoid​

  • Don't log sensitive data (passwords, tokens)
  • Avoid high-cardinality metrics (user IDs as labels)
  • Never block on metric collection
  • Don't create metrics without dashboards
  • Avoid sampling critical error traces

References​