
MONITOR & CODI Container Provisioning Strategy for CODITECT V5

Date: 2025-10-19
Status: ARCHITECTURAL PROPOSAL
Stakeholders: DevOps, Backend, Security

Executive Summary

This document proposes a production-ready strategy for provisioning file monitoring and CODI functionality in CODITECT V5 GKE pods, based on analysis of:

  • T2 file-monitor Rust implementation (src/file-monitor/)
  • V4 CODI2 reference architecture (archive/coditect-v4/codi2/)
  • Current GKE deployment patterns
  • Multi-tenant requirements

Recommendation: HYBRID APPROACH - File-monitor sidecar + Centralized logging service


Current State Analysis

File Monitor (T2 - Active)

Location: src/file-monitor/ (Rust implementation)

Capabilities:

  • ✅ Real-time file monitoring (inotify/FSEvents)
  • ✅ Dual logging (JSON + human-readable)
  • ✅ WSL2 compatible (--poll flag)
  • ✅ Debouncing and rate limiting
  • ✅ Checksum verification (optional)
  • ✅ Observability (Prometheus metrics)

Current Usage:

  • Development: Standalone daemon (scripts/monitor/start-file-monitor.sh)
  • Output: .coditect/logs/events.log (JSON), events-human.log
  • Performance: 5MB RAM idle, 60MB @ 1000 evt/s
  • Production Ready: YES (see docs/11-analysis/file-monitor/production.md)

CODI2 (V4 - Archived Reference)

Location: archive/coditect-v4/codi2/ (Rust implementation)

Historical Capabilities:

  • File monitoring (now replaced by file-monitor)
  • Agent coordination
  • Export handling
  • Duplicate detection
  • Log aggregation
  • Session attribution

Status: ⚠️ ARCHIVED - Reference patterns only, NOT active in T2

Key Insight: CODI2 functionality is being distributed across V5 components:

  • File monitoring → file-monitor (standalone Rust crate)
  • Agent coordination → Backend API (backend/src/handlers/agents.rs)
  • Log aggregation → Centralized logging service (proposed)
  • Session management → FoundationDB (backend/src/db/models.rs)

Architecture Options Analysis

Option 1: DaemonSet (One Monitor Per Node)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  template:
    metadata:
      labels:
        app: file-monitor
    spec:
      containers:
      - name: monitor
        image: us-central1-docker.pkg.dev/.../file-monitor:latest
        volumeMounts:
        - name: node-workspace
          mountPath: /workspace
          readOnly: true
      volumes:
      - name: node-workspace
        hostPath:
          path: /var/lib/coditect/workspaces

Pros:

  • ✅ Single monitor per node (resource efficient)
  • ✅ Monitors all pods on node
  • ✅ Centralized node-level metrics

Cons:

  • ❌ No pod-level isolation
  • ❌ Requires hostPath volumes (security risk)
  • ❌ Complex multi-tenant file attribution
  • ❌ Can't monitor ephemeral pod volumes

Verdict: ❌ NOT RECOMMENDED - Breaks multi-tenant isolation


Option 2: Sidecar (One Monitor Per Pod)

apiVersion: v1
kind: Pod
metadata:
  name: user-workspace-${USER_ID}
  namespace: coditect-app
spec:
  containers:
  # Main workspace container
  - name: workspace
    image: us-central1-docker.pkg.dev/.../theia-workspace:latest
    volumeMounts:
    - name: workspace-data
      mountPath: /workspace

  # File monitor sidecar
  - name: file-monitor
    image: us-central1-docker.pkg.dev/.../file-monitor:latest
    args:
    - "/workspace"
    - "--poll"  # WSL2 compatibility
    - "--output=/logs/events.log"
    env:
    - name: USER_ID
      value: "${USER_ID}"
    - name: TENANT_ID
      value: "${TENANT_ID}"
    - name: SESSION_ID
      value: "${SESSION_ID}"
    volumeMounts:
    - name: workspace-data
      mountPath: /workspace
      readOnly: true
    - name: logs
      mountPath: /logs
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"

  volumes:
  - name: workspace-data
    persistentVolumeClaim:
      claimName: user-workspace-${USER_ID}
  - name: logs
    emptyDir: {}

Pros:

  • ✅ Perfect pod-level isolation
  • ✅ Tenant-specific monitoring
  • ✅ Can access pod's PVC directly
  • ✅ Simple lifecycle (dies with pod)
  • ✅ Clear user/session attribution

Cons:

  • ❌ More resource usage (1 monitor per pod)
  • ❌ Need log aggregation solution

Verdict: ✅ RECOMMENDED - Best for multi-tenant isolation


Option 3: Centralized Service (Single Monitor)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-monitor-central
  namespace: coditect-app
spec:
  replicas: 3  # HA
  template:
    spec:
      containers:
      - name: monitor
        image: us-central1-docker.pkg.dev/.../file-monitor:latest
        # Monitors all workspaces via shared volume

Pros:

  • ✅ Minimal resource overhead
  • ✅ Centralized metrics

Cons:

  • ❌ All workspaces in single volume (HUGE security risk)
  • ❌ No tenant isolation
  • ❌ Single point of failure

Verdict: ❌ NOT RECOMMENDED - Violates multi-tenant security


Recommended Architecture

Overview

┌─────────────────────────────────────────────────────────────┐
│ User Pod (per user)                                         │
│                                                             │
│  ┌─────────────────┐   ┌────────────────────────────────┐   │
│  │ theia IDE       │   │ File Monitor Sidecar           │   │
│  │ (Main)          │   │ - Watches /workspace           │   │
│  │                 │   │ - Outputs JSON logs            │   │
│  │ Port 3000       │   │ - Tenant isolated              │   │
│  └─────────────────┘   └────────────────────────────────┘   │
│          │ (writes)                │ (monitors)             │
│          ▼                         ▼                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ /workspace (PVC - user-workspace-${USER_ID})         │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────┘
                               │ (logs to)
                               ▼
                ┌──────────────────────────────┐
                │ Logging Service              │
                │ (Fluentd/Vector/Filebeat)    │
                │ - Collects from all pods     │
                │ - Enriches with metadata     │
                │ - Sends to storage           │
                └──────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │ Elasticsearch│ │ FoundationDB │ │ Prometheus   │
      │ (Search)     │ │ (Audit trail)│ │ (Metrics)    │
      └──────────────┘ └──────────────┘ └──────────────┘

Components

1. File Monitor Sidecar (Per User Pod)

Responsibilities:

  • Monitor /workspace for file changes
  • Generate JSON event logs
  • Add user/tenant/session context
  • Output to stdout (for log aggregation)

Configuration:

# Start command in sidecar
/usr/local/bin/file-monitor \
  /workspace \
  --poll \
  --debounce=500 \
  --output=/dev/stdout \
  --format=json \
  --metadata="user_id=${USER_ID},tenant_id=${TENANT_ID},session_id=${SESSION_ID}"

Resource Limits:

  • Requests: 64Mi RAM, 100m CPU
  • Limits: 512Mi RAM, 500m CPU
  • Expected: ~100Mi RAM @ typical load (100 evt/s)
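The --metadata flag above carries tenant context as a comma-separated key=value string. A minimal sketch of how such a string splits into fields (the real parsing lives inside src/file-monitor; parse_metadata is an illustrative name, not its API):

```rust
use std::collections::HashMap;

// Illustrative parser for the --metadata value; the actual flag handling
// inside src/file-monitor may differ.
fn parse_metadata(raw: &str) -> HashMap<String, String> {
    raw.split(',')
        .filter_map(|pair| {
            // Skip malformed pairs that lack an '='
            let (k, v) = pair.split_once('=')?;
            Some((k.trim().to_string(), v.trim().to_string()))
        })
        .collect()
}

fn main() {
    let meta = parse_metadata("user_id=u1,tenant_id=t1,session_id=s1");
    assert_eq!(meta["tenant_id"], "t1");
    println!("parsed {} keys", meta.len());
}
```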

2. Log Aggregation Service (Centralized)

Technology Options:

| Solution | Pros | Cons | Recommendation |
|---|---|---|---|
| Fluentd | K8s native, JSON parsing, FDB output | Resource heavy | ✅ Best for FDB |
| Vector | Rust (fast), eBPF support | Newer, fewer plugins | ⚠️ Good alternative |
| Filebeat | Lightweight, Elastic native | Weak FDB support | ❌ Skip |
| Google Cloud Logging | Fully managed, scalable | Vendor lock-in, cost | ⚠️ Option for analytics |

Recommended: Fluentd with custom FDB output plugin

Fluentd Configuration:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: coditect-app
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-forward
        env:
        - name: FLUENT_FDB_CLUSTER
          value: "coditect:production@10.128.0.8:4500"
        - name: FLUENT_FDB_SUBSPACE
          value: "audit_logs"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluentd-config
        configMap:
          name: fluentd-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: coditect-app
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*file-monitor*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag file.events.*
      <parse>
        @type json
        time_key timestamp
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter file.events.**>
      @type record_transformer
      <record>
        cluster_name "#{ENV['CLUSTER_NAME']}"
        namespace ${record["kubernetes"]["namespace_name"]}
        pod_name ${record["kubernetes"]["pod_name"]}
      </record>
    </filter>

    <match file.events.**>
      @type fdb
      cluster_file /etc/foundationdb/fdb.cluster
      subspace audit_logs
      tenant_key tenant_id
      <buffer>
        @type file
        path /var/log/fluentd-buffers/fdb.buffer
        flush_interval 5s
        retry_max_interval 30s
      </buffer>
    </match>

3. FoundationDB Audit Log Schema

Key Structure:

/audit_logs/
  /{tenant_id}/
    /{user_id}/
      /{session_id}/
        /{timestamp}/
          event: {JSON event data}
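The hierarchy above maps each event to one key, with the timestamp last so a tenant/user/session's events stay range-scannable in order. A sketch of composing that path as a plain string (a production version would use FDB tuple encoding; audit_key is a hypothetical helper):

```rust
// Hypothetical helper composing the audit-log key path; production code
// would use FDB tuple encoding rather than plain strings.
fn audit_key(tenant_id: &str, user_id: &str, session_id: &str, ts_nanos: u128) -> String {
    format!("/audit_logs/{tenant_id}/{user_id}/{session_id}/{ts_nanos}")
}

fn main() {
    // Timestamp-last ordering keeps one session's events contiguous.
    println!("{}", audit_key("t1", "u1", "s1", 1_700_000_000_000_000_000));
}
```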

Event Schema:

#[derive(Serialize, Deserialize)]
pub struct FileEvent {
    pub timestamp: DateTime<Utc>,
    pub tenant_id: Uuid,
    pub user_id: Uuid,
    pub session_id: Uuid,
    pub event_type: FileEventType, // Create, Update, Delete, Move
    pub path: String,
    pub checksum: Option<String>,
    pub size: Option<u64>,
    pub metadata: serde_json::Value,
}

#[derive(Serialize, Deserialize)]
pub enum FileEventType {
    Create,
    Update,
    Delete,
    Move { from: String, to: String },
    Access,
}
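With the serde derives above, each event serializes to one JSON object per log line. A std-only sketch of the shape such a line takes, with the field set abbreviated (event_json is illustrative, not the crate's API; the sidecar itself would serialize FileEvent with serde):

```rust
// Std-only sketch of the one-line JSON an event becomes; field set
// abbreviated relative to the full FileEvent struct.
fn event_json(ts: &str, tenant: &str, user: &str, session: &str,
              event_type: &str, path: &str) -> String {
    format!(
        "{{\"timestamp\":\"{ts}\",\"tenant_id\":\"{tenant}\",\"user_id\":\"{user}\",\"session_id\":\"{session}\",\"event_type\":\"{event_type}\",\"path\":\"{path}\"}}"
    )
}

fn main() {
    println!("{}", event_json(
        "2025-10-19T12:00:00Z", "t1", "u1", "s1", "Create", "/workspace/a.txt",
    ));
}
```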

4. Agent Coordination Service (Backend API)

NEW: Agent Management Endpoints

File: backend/src/handlers/agents.rs (to be created)

use actix_web::{web, HttpResponse};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
pub struct AgentActivity {
    pub session_id: Uuid,
    pub agent_type: String, // "file-monitor", "code-gen", "review"
    pub status: AgentStatus,
    pub last_heartbeat: DateTime<Utc>,
    pub files_claimed: Vec<String>,
}

#[derive(Serialize, Deserialize)]
pub enum AgentStatus {
    Active,
    Idle,
    Stopped,
}

// POST /api/v5/agents/heartbeat
pub async fn agent_heartbeat(
    session: web::Data<Session>,
    req: web::Json<AgentActivity>,
) -> HttpResponse {
    // Store in FDB: /agents/{tenant_id}/{session_id}/heartbeat
    // Used for:
    // - Detecting dead agents
    // - File claim coordination
    // - Activity monitoring
    HttpResponse::Ok().json(/* ... */)
}

// GET /api/v5/agents/{session_id}/logs
pub async fn get_agent_logs(
    session: web::Data<Session>,
    session_id: web::Path<Uuid>,
) -> HttpResponse {
    // Query FDB: /audit_logs/{tenant_id}/{user_id}/{session_id}/
    HttpResponse::Ok().json(/* ... */)
}

REPLACES: V4 CODI2 agent coordination (was bash scripts + WebSocket)
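The heartbeat data enables the dead-agent detection noted in the handler: an agent whose last_heartbeat is older than some timeout can be marked Stopped. A sketch on plain epoch seconds (the 30 s threshold is an assumption, not a value this document specifies):

```rust
// Staleness check over epoch seconds; the 30 s timeout is an assumed
// value, not one documented for the heartbeat endpoint.
const HEARTBEAT_TIMEOUT_SECS: u64 = 30;

fn is_stale(last_heartbeat_secs: u64, now_secs: u64) -> bool {
    // saturating_sub guards against clock skew producing an underflow
    now_secs.saturating_sub(last_heartbeat_secs) > HEARTBEAT_TIMEOUT_SECS
}

fn main() {
    assert!(is_stale(100, 200));  // 100 s silent: mark Stopped
    assert!(!is_stale(100, 120)); // 20 s silent: still Active
    println!("staleness check ok");
}
```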


Implementation Plan

Phase 1: File Monitor Sidecar (Week 1)

Tasks:

  1. ✅ File monitor already implemented (src/file-monitor/)
  2. 🔨 Create Docker image for sidecar deployment
    • Build stage: rust:1.70-slim; runtime: debian:bookworm-slim
    • Binary: /usr/local/bin/file-monitor
    • Entrypoint: Script to parse env vars
  3. 🔨 Update user pod template
    • Add sidecar container
    • Configure shared volume
    • Set resource limits
  4. 🔨 Test with single user pod

Dockerfile:

FROM rust:1.70-slim as builder
WORKDIR /build
COPY src/file-monitor ./
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/target/release/file-monitor /usr/local/bin/
COPY docker/file-monitor-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
USER nobody
ENTRYPOINT ["/entrypoint.sh"]

Phase 2: Log Aggregation (Week 2)

Tasks:

  1. 🔨 Deploy Fluentd DaemonSet
  2. 🔨 Create custom FDB output plugin
    • Use foundationdb Rust crate
    • Implement Fluentd output API
  3. 🔨 Configure log routing
    • File events → FDB audit_logs subspace
    • Metrics → Prometheus
  4. 🔨 Test multi-pod aggregation

Custom Fluentd Plugin (Ruby):

# fluent-plugin-foundationdb/lib/fluent/plugin/out_fdb.rb
require 'fluent/plugin/output'
require 'ffi'

# Minimal FFI binding to the FDB C client library
module FDBClient
  extend FFI::Library
  ffi_lib 'libfdb_c.so'
  attach_function :fdb_setup_network, [], :int
  # ... (see V4 reference for complete implementation)
end

module Fluent::Plugin
  class FoundationDBOutput < Output
    Fluent::Plugin.register_output('fdb', self)

    config_param :cluster_file, :string, default: '/etc/foundationdb/fdb.cluster'
    config_param :subspace, :string, default: 'audit_logs'
    config_param :tenant_key, :string, default: 'tenant_id'

    def configure(conf)
      super
      FDBClient.fdb_setup_network
    end

    def write(chunk)
      chunk.msgpack_each do |time, record|
        tenant_id = record[@tenant_key]
        user_id = record['user_id']
        session_id = record['session_id']

        # Write to FDB: /audit_logs/{tenant}/{user}/{session}/{timestamp}
        key = "/#{@subspace}/#{tenant_id}/#{user_id}/#{session_id}/#{time}"
        @fdb_transaction.set(key, record.to_json)
      end
    end
  end
end

Phase 3: Agent Coordination API (Week 3)

Tasks:

  1. 🔨 Create backend/src/handlers/agents.rs
  2. 🔨 Implement heartbeat endpoint
  3. 🔨 Implement log query endpoint
  4. 🔨 Add file claim/release API
  5. 🔨 Update agent dashboard (frontend)

Phase 4: Prometheus Metrics Integration (Week 4)

Tasks:

  1. 🔨 Expose file-monitor metrics on :9090/metrics
  2. 🔨 Configure Prometheus scraping
  3. 🔨 Create Grafana dashboards
  4. 🔨 Set up alerts (see production.md)

Prometheus ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Resource Planning

Per-User Pod Resources

| Component | Requests | Limits | Notes |
|---|---|---|---|
| theia IDE | 512Mi / 500m | 2Gi / 2000m | Main workload |
| File Monitor | 64Mi / 100m | 512Mi / 500m | Sidecar |
| Total | 576Mi / 600m | 2.5Gi / 2.5 CPU | Per user pod |

Example: 100 concurrent users

  • Requests: 57.6 GB RAM, 60 CPU cores
  • Limits: 250 GB RAM, 250 CPU cores
  • Node Pool: n2-standard-32 (32 vCPU, 128GB RAM) × 3 nodes
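The sizing figures above follow directly from the per-pod numbers. A sketch that recomputes them, using the same decimal convention as the table (1000 Mi ≈ 1 GB, 1000 m = 1 core):

```rust
// Recomputes cluster-wide requests from per-pod figures, treating
// 1000 Mi ≈ 1 GB and 1000 m = 1 core as the tables above do.
fn cluster_requests(users: u64, mem_mi_per_pod: u64, cpu_m_per_pod: u64) -> (f64, f64) {
    let mem_gb = (users * mem_mi_per_pod) as f64 / 1000.0;
    let cpu_cores = (users * cpu_m_per_pod) as f64 / 1000.0;
    (mem_gb, cpu_cores)
}

fn main() {
    let (mem_gb, cpu) = cluster_requests(100, 576, 600);
    println!("requests: {mem_gb} GB RAM, {cpu} cores"); // 57.6 GB, 60 cores
}
```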

Centralized Services

| Service | Replicas | Resources | Total |
|---|---|---|---|
| Fluentd | 3 (DaemonSet) | 256Mi / 500m | 768Mi / 1.5 CPU |
| Prometheus | 1 | 4Gi / 2000m | 4Gi / 2 CPU |
| Grafana | 1 | 1Gi / 500m | 1Gi / 0.5 CPU |

Security Considerations

Tenant Isolation

Critical: File monitor sidecar MUST:

  • ✅ Run as non-root user (nobody)
  • ✅ Mount workspace volume as read-only
  • ✅ Use resource limits (prevent DoS)
  • ✅ Include tenant_id in ALL logs
  • ✅ No network access (except metrics)

Network Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: file-monitor-policy
  namespace: coditect-app
spec:
  podSelector:
    matchLabels:
      component: file-monitor
  policyTypes:
  - Egress
  egress:
  # Only allow metrics export
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9090

Audit Trail Integrity

FoundationDB Guarantees:

  • ✅ ACID transactions (atomic writes)
  • ✅ Multi-version concurrency (no overwrites)
  • ✅ Tenant isolation (prefix-based)
  • ✅ Backup to GCS (point-in-time recovery)
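Prefix-based isolation means every audit read is bounded to one tenant's key range. A sketch of deriving that range in the spirit of FDB's strinc, which increments the last byte of the prefix to form the exclusive end key:

```rust
// Derives the (begin, end) byte range covering one tenant's subspace,
// mirroring FDB's strinc; assumes the prefix does not end in 0xFF
// (the real strinc handles trailing 0xFF bytes).
fn tenant_range(prefix: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let begin = prefix.to_vec();
    let mut end = prefix.to_vec();
    if let Some(last) = end.last_mut() {
        *last += 1;
    }
    (begin, end)
}

fn main() {
    let (begin, end) = tenant_range(b"/audit_logs/t1/");
    // Every key for tenant t1 sorts inside [begin, end).
    println!("{:?} .. {:?}", begin, end);
}
```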

Retention Policy:

# Automated cleanup of old audit logs
audit_logs:
  retention_days: 90  # Legal compliance
  archive_to_gcs: true
  compress: true
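retention_days translates to a cutoff timestamp: events keyed before it are archived to GCS, then cleared. A sketch of the arithmetic:

```rust
// Cutoff arithmetic implied by retention_days: 90 days x 86,400 s.
const SECS_PER_DAY: u64 = 86_400;

fn retention_cutoff(now_secs: u64, retention_days: u64) -> u64 {
    // Events with timestamps below this value are archived, then deleted.
    now_secs.saturating_sub(retention_days * SECS_PER_DAY)
}

fn main() {
    let cutoff = retention_cutoff(10_000_000, 90);
    assert_eq!(cutoff, 2_224_000); // 10_000_000 - 90 * 86_400
    println!("cutoff = {cutoff}");
}
```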

Migration from V4 CODI2

Functionality Mapping

| V4 CODI2 Feature | V5 Implementation | Status |
|---|---|---|
| File monitoring | file-monitor sidecar | ✅ Done |
| Log aggregation | Fluentd → FDB | 🔨 To build |
| Agent coordination | Backend API /agents/ | 🔨 To build |
| Export handling | Backend API /exports/ | 🔨 To build |
| Duplicate detection | File monitor checksums | ✅ Done |
| Session attribution | FDB + JWT | ✅ Done |
| Metrics | Prometheus | ✅ Done |

Breaking Changes

V4 → V5:

  • ❌ No standalone CODI2 binary
  • ❌ No bash script orchestration
  • ❌ No local file watchers on host
  • ✅ All functionality via API + sidecars

Testing Strategy

Unit Tests

  • ✅ File monitor (already exists: src/file-monitor/tests/)
  • 🔨 Fluentd FDB plugin
  • 🔨 Agent coordination API

Integration Tests

Test Scenarios:

  1. Single user pod: Deploy pod with sidecar, trigger file events, verify FDB writes
  2. Multi-tenant: 10 users, concurrent file operations, check tenant isolation
  3. Log aggregation: 1000 events/sec across 50 pods, verify no drops
  4. Agent coordination: 5 concurrent agents claiming files, detect conflicts
  5. Failover: Kill fluentd pod, verify buffering and recovery

Test Script (tests/integration/test-file-monitor-sidecar.sh):

#!/bin/bash
set -e

# Deploy test user pod
kubectl apply -f k8s/test/user-pod-with-monitor.yaml

# Wait for ready
kubectl wait --for=condition=Ready pod/test-user-pod -n coditect-app

# Trigger file events
kubectl exec test-user-pod -c workspace -- bash -c "
  echo 'test' > /workspace/test.txt
  echo 'update' >> /workspace/test.txt
  rm /workspace/test.txt
"

# Wait for logs to propagate
sleep 10

# Query FDB for events
curl -H "Authorization: Bearer $TEST_TOKEN" \
  http://api.coditect.ai/api/v5/agents/test-session/logs | jq

# Verify events
expected_events=3  # create, update, delete
actual_events=$(curl -s ... | jq '.events | length')

if [ "$actual_events" -eq "$expected_events" ]; then
  echo "✅ Test passed"
else
  echo "❌ Test failed: expected $expected_events, got $actual_events"
  exit 1
fi

Performance Tests

Load Test (tests/performance/load-test-monitoring.sh):

#!/bin/bash
# Simulate 100 users, 100 events/sec each = 10K events/sec total

for user_id in $(seq 1 100); do
  kubectl exec user-pod-$user_id -c workspace -- bash -c "
    for i in {1..100}; do
      echo 'data' > /workspace/file-\$i.txt
      sleep 0.01
    done
  " &
done

wait

# Check metrics
kubectl exec prometheus-0 -n monitoring -- promtool query instant \
  'rate(fs_monitor_events_published_total[1m])'

# Expected: ~10000 events/sec

Deployment Checklist

Prerequisites

  • FoundationDB cluster running (3 replicas minimum)
  • Prometheus operator installed
  • Grafana deployed
  • GKE cluster with monitoring enabled
  • Docker images built and pushed to Artifact Registry

Phase 1 Deployment

  • Build file-monitor Docker image
  • Push to us-central1-docker.pkg.dev/.../file-monitor:latest
  • Update user pod template with sidecar
  • Deploy test user pod
  • Verify file monitor starts and outputs logs
  • Check resource usage (should be < 100Mi RAM)

Phase 2 Deployment

  • Deploy Fluentd DaemonSet
  • Install custom FDB plugin
  • Configure log routing
  • Deploy test with 10 user pods
  • Verify logs reach FDB
  • Query audit logs via API

Phase 3 Deployment

  • Deploy agent coordination API
  • Update frontend dashboard
  • Test agent heartbeat
  • Test file claim/release
  • Load test with 100 users

Phase 4 Deployment

  • Configure Prometheus scraping
  • Import Grafana dashboards
  • Set up alerts
  • Test alert firing
  • Document runbooks

Alternatives Considered

Alternative 1: Google Cloud Logging Only

Approach: Skip Fluentd, send all logs to GCP Cloud Logging

Pros:

  • Fully managed
  • Infinite scalability
  • Built-in search/analytics

Cons:

  • Vendor lock-in
  • Cost at scale ($0.50/GB ingestion)
  • No FDB integration (audit trail split)

Decision: ❌ Rejected - Audit trail must be in FDB for compliance

Alternative 2: Agent-less Monitoring (API-driven)

Approach: IDE sends file events to API directly, no sidecar

Pros:

  • Simpler deployment
  • Less resource usage

Cons:

  • Requires IDE modification
  • Misses non-IDE file changes
  • IDE can bypass monitoring (security risk)

Decision: ❌ Rejected - Monitoring must be independent of IDE

Alternative 3: eBPF-based Monitoring

Approach: Use eBPF to monitor syscalls at kernel level

Pros:

  • Zero overhead
  • Can't be bypassed
  • Detailed syscall info

Cons:

  • Requires privileged containers
  • Complex to implement
  • Linux-only

Decision: ⏳ Future consideration for advanced monitoring


Cost Analysis

Resource Costs (100 concurrent users)

| Component | Resources | GKE Cost (us-central1) | Monthly Cost |
|---|---|---|---|
| User pods (100×) | 600m CPU, 576Mi RAM | $0.031/vCPU-hr, $0.0034/GB-hr | ~$1,350/mo |
| File monitor sidecars (100×) | 100m CPU, 64Mi RAM | Included in user pod | $0 |
| Fluentd (3 nodes) | 1.5 CPU, 768Mi RAM | $0.031/vCPU-hr | ~$35/mo |
| Prometheus | 2 CPU, 4Gi RAM | $0.031/vCPU-hr | ~$50/mo |
| Total | | | ~$1,435/mo |

Notes:

  • Sidecar costs are incremental to user pod (already budgeted)
  • FDB storage separate (see FDB pricing)
  • Does NOT include GCP Cloud Logging (optional)

Comparison to V4 CODI2

V4 Approach (Dedicated CODI2 service):

  • Central CODI2 deployment: 4 CPU, 8Gi RAM = ~$90/mo
  • But: Required hostPath access (security risk)

V5 Approach (Sidecar):

  • Distributed in user pods: No additional infra cost
  • Better isolation and scalability
  • Savings: ~$90/mo on infrastructure, PRICELESS on security

Rollout Strategy

Week 1: Internal Testing

  • Deploy to dev namespace
  • 5 internal users
  • Monitor metrics and logs
  • Fix bugs

Week 2: Limited Beta

  • Deploy to staging namespace
  • 20 beta users
  • Load test with realistic workloads
  • Tune resource limits

Week 3: Production Rollout

  • Deploy to production namespace
  • 10% traffic (canary deployment)
  • Monitor error rates
  • Gradual ramp to 100%

Week 4: Full Production

  • All users on new architecture
  • Decommission V4 CODI2 references
  • Update documentation

Success Metrics

Performance Metrics

| Metric | Target | Measurement |
|---|---|---|
| Monitoring Latency | < 100ms (p99) | fs_monitor_processing_latency_us |
| Log Ingestion Rate | 10K events/sec | Fluentd throughput |
| Resource Usage | < 100Mi RAM per monitor | kubectl top pods |
| Event Drop Rate | 0% | fs_monitor_events_dropped_total |

Reliability Metrics

| Metric | Target | Measurement |
|---|---|---|
| Uptime | 99.9% | Prometheus up metric |
| Log Delivery | 99.99% | FDB query count vs event count |
| Recovery Time | < 1 min | Fluentd pod restart time |
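The Log Delivery target compares rows stored in FDB against events emitted by sidecars. A sketch of the check an alert rule would encode (delivery_ok is illustrative, not an existing function):

```rust
// Illustrative delivery check: stored FDB rows vs events emitted by
// sidecars; target 0.9999 matches the 99.99% goal above.
fn delivery_ok(stored: u64, emitted: u64, target: f64) -> bool {
    if emitted == 0 {
        return true; // nothing emitted, nothing lost
    }
    (stored as f64 / emitted as f64) >= target
}

fn main() {
    assert!(delivery_ok(9_999, 10_000, 0.9999));
    assert!(!delivery_ok(9_998, 10_000, 0.9999));
    println!("delivery check ok");
}
```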

Maintenance & Operations

Daily Operations

# Check sidecar health across all pods
kubectl get pods -n coditect-app -l component=file-monitor

# View aggregated metrics
curl http://prometheus:9090/api/v1/query?query=sum(fs_monitor_events_published_total)

# Check Fluentd buffer status
kubectl exec -it fluentd-xxx -n coditect-app -- ls -lh /var/log/fluentd-buffers/

# Query audit logs for user
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  "http://api.coditect.ai/api/v5/agents/logs?user_id=xxx&start=2025-10-19T00:00:00Z"

Incident Response

Scenario: File Monitor Crashing

  1. Check pod logs: kubectl logs user-pod-xxx -c file-monitor -n coditect-app
  2. Check resource limits: kubectl describe pod user-pod-xxx -n coditect-app
  3. Increase limits if OOMKilled
  4. Restart pod: kubectl delete pod user-pod-xxx -n coditect-app (recreated by deployment)

Scenario: Logs Not Reaching FDB

  1. Check Fluentd status: kubectl logs fluentd-xxx -n coditect-app
  2. Verify FDB connectivity: kubectl exec fluentd-xxx -- fdbcli --exec "status"
  3. Check buffer: ls /var/log/fluentd-buffers/
  4. Restart Fluentd if needed: kubectl rollout restart daemonset/fluentd -n coditect-app

Backup & Disaster Recovery

Audit Log Backup (Automated):

# CronJob to backup audit logs to GCS
apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-log-backup
  namespace: coditect-app
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
            command:
            - /bin/bash
            - -c
            - |
              # Query FDB for previous day's logs
              curl -H "Authorization: Bearer $BACKUP_TOKEN" \
                "http://api.coditect.ai/api/v5/agents/logs?start=$(date -d '1 day ago' -I)&end=$(date -I)" \
                > /tmp/audit-logs-$(date -I).json

              # Upload to GCS
              gsutil cp /tmp/audit-logs-$(date -I).json gs://coditect-audit-logs/
          restartPolicy: OnFailure

Future Enhancements

Phase 5: Real-time Agent Coordination (Future)

WebSocket Gateway for agent events:

User IDE ←→ WebSocket ←→ Agent Coordination Service
    ↓                           ↓
File Monitor Sidecar      FDB (real-time queries)

Benefits:

  • Real-time file conflict detection
  • Live agent activity feed
  • Immediate notifications

Phase 6: ML-Powered Anomaly Detection (Future)

Use Case: Detect unusual file activity (ransomware, data exfiltration)

Architecture:

File Events → Kafka → Stream Processor → ML Model → Alerts

Metrics:

  • Unusual file access patterns
  • High deletion rates
  • Sensitive data exposure

Phase 7: Cross-Tenant Analytics (Future)

Use Case: Aggregate metrics across all tenants (admin view)

Queries:

  • Most active users
  • Popular file types
  • Resource usage trends

Appendix

A. Reference Documentation

T2 File Monitor:

  • Implementation: src/file-monitor/
  • Documentation: docs/11-analysis/file-monitor/
  • Production guide: docs/11-analysis/file-monitor/production.md

V4 CODI2 Reference:

  • Implementation: archive/coditect-v4/codi2/
  • Docker configs: archive/coditect-v4/codi2/config/docker/
  • K8s patterns: archive/coditect-v4/infrastructure/kubernetes/

GKE Deployment:

  • Current manifests: k8s/
  • Deployment guide: docs/06-backend/deploy.md

B. Glossary

| Term | Definition |
|---|---|
| Sidecar | Helper container running alongside the main container in the same pod |
| DaemonSet | K8s workload that runs one pod per node |
| Fluentd | Open-source log aggregation tool |
| FDB | FoundationDB - distributed key-value store |
| inotify | Linux kernel subsystem for file monitoring |
| Prometheus | Metrics collection and alerting system |

C. Contact & Support

Architecture Questions: DevOps team
File Monitor Issues: Backend team
FDB Schema Changes: Database team
Security Concerns: Security team


Document Version: 1.0
Last Updated: 2025-10-19
Next Review: After Phase 1 deployment (Week 1)