
MONITOR & CODI Container Provisioning Strategy for CODITECT V5

Date: 2025-10-19
Status: ARCHITECTURAL PROPOSAL
Stakeholders: DevOps, Backend, Security

Executive Summary

This document proposes a production-ready strategy for provisioning file monitoring and CODI functionality in CODITECT V5 GKE pods, based on analysis of:

  • T2 file-monitor Rust implementation (src/file-monitor/)
  • V4 CODI2 reference architecture (archive/coditect-v4/codi2/)
  • Current GKE deployment patterns
  • Multi-tenant requirements

Recommendation: HYBRID APPROACH - File-monitor sidecar + Centralized logging service


Current State Analysis

File Monitor (T2 - Active)

Location: src/file-monitor/ (Rust implementation)

Capabilities:

  • ✅ Real-time file monitoring (inotify/FSEvents)
  • ✅ Dual logging (JSON + human-readable)
  • ✅ WSL2 compatible (--poll flag)
  • ✅ Debouncing and rate limiting
  • ✅ Checksum verification (optional)
  • ✅ Observability (Prometheus metrics)

Current Usage:

  • Development: Standalone daemon (scripts/monitor/start-file-monitor.sh)
  • Output: .coditect/logs/events.log (JSON), events-human.log
  • Performance: 5MB RAM idle, 60MB @ 1000 evt/s
  • Production Ready: YES (see docs/11-analysis/file-monitor/production.md)

CODI2 (V4 - Archived Reference)

Location: archive/coditect-v4/codi2/ (Rust implementation)

Historical Capabilities:

  • File monitoring (now replaced by file-monitor)
  • Agent coordination
  • Export handling
  • Duplicate detection
  • Log aggregation
  • Session attribution

Status: ⚠️ ARCHIVED - Reference patterns only, NOT active in T2

Key Insight: CODI2 functionality is being distributed across V5 components:

  • File monitoring → file-monitor (standalone Rust crate)
  • Agent coordination → Backend API (backend/src/handlers/agents.rs)
  • Log aggregation → Centralized logging service (proposed)
  • Session management → FoundationDB (backend/src/db/models.rs)

Architecture Options Analysis

Option 1: DaemonSet (One Monitor Per Node)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  template:
    metadata:
      labels:
        app: file-monitor
    spec:
      containers:
      - name: monitor
        image: us-central1-docker.pkg.dev/.../file-monitor:latest
        volumeMounts:
        - name: node-workspace
          mountPath: /workspace
          readOnly: true
      volumes:
      - name: node-workspace
        hostPath:
          path: /var/lib/coditect/workspaces

Pros:

  • ✅ Single monitor per node (resource efficient)
  • ✅ Monitors all pods on node
  • ✅ Centralized node-level metrics

Cons:

  • ❌ No pod-level isolation
  • ❌ Requires hostPath volumes (security risk)
  • ❌ Complex multi-tenant file attribution
  • ❌ Can't monitor ephemeral pod volumes

Verdict: ❌ NOT RECOMMENDED - Breaks multi-tenant isolation


Option 2: Sidecar (One Monitor Per Pod)

apiVersion: v1
kind: Pod
metadata:
  name: user-workspace-${USER_ID}
  namespace: coditect-app
spec:
  containers:
  # Main workspace container
  - name: workspace
    image: us-central1-docker.pkg.dev/.../theia-workspace:latest
    volumeMounts:
    - name: workspace-data
      mountPath: /workspace

  # File monitor sidecar
  - name: file-monitor
    image: us-central1-docker.pkg.dev/.../file-monitor:latest
    args:
    - "/workspace"
    - "--poll"  # WSL2 compatibility
    - "--output=/logs/events.log"
    env:
    - name: USER_ID
      value: "${USER_ID}"
    - name: TENANT_ID
      value: "${TENANT_ID}"
    - name: SESSION_ID
      value: "${SESSION_ID}"
    volumeMounts:
    - name: workspace-data
      mountPath: /workspace
      readOnly: true
    - name: logs
      mountPath: /logs
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"

  volumes:
  - name: workspace-data
    persistentVolumeClaim:
      claimName: user-workspace-${USER_ID}
  - name: logs
    emptyDir: {}

Pros:

  • ✅ Perfect pod-level isolation
  • ✅ Tenant-specific monitoring
  • ✅ Can access pod's PVC directly
  • ✅ Simple lifecycle (dies with pod)
  • ✅ Clear user/session attribution

Cons:

  • ❌ More resource usage (1 monitor per pod)
  • ❌ Need log aggregation solution

Verdict: ✅ RECOMMENDED - Best for multi-tenant isolation


Option 3: Centralized Service (Single Monitor)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-monitor-central
  namespace: coditect-app
spec:
  replicas: 3  # HA
  template:
    spec:
      containers:
      - name: monitor
        image: us-central1-docker.pkg.dev/.../file-monitor:latest
        # Monitors all workspaces via shared volume

Pros:

  • ✅ Minimal resource overhead
  • ✅ Centralized metrics

Cons:

  • ❌ All workspaces in single volume (HUGE security risk)
  • ❌ No tenant isolation
  • ❌ Single point of failure

Verdict: ❌ NOT RECOMMENDED - Violates multi-tenant security


Recommended Architecture

Overview

┌─────────────────────────────────────────────────────────────┐
│ User Pod (per user)                                         │
│                                                             │
│  ┌─────────────────┐   ┌────────────────────────────────┐   │
│  │ theia IDE       │   │ File Monitor Sidecar           │   │
│  │ (Main)          │   │ - Watches /workspace           │   │
│  │                 │   │ - Outputs JSON logs            │   │
│  │ Port 3000       │   │ - Tenant isolated              │   │
│  └─────────────────┘   └────────────────────────────────┘   │
│          │ (writes)                │ (monitors)             │
│          ▼                         ▼                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ /workspace (PVC - user-workspace-${USER_ID})         │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────┘
                               │ (logs to)
                               ▼
                ┌──────────────────────────────┐
                │ Logging Service              │
                │ (Fluentd/Vector/Filebeat)    │
                │ - Collects from all pods     │
                │ - Enriches with metadata     │
                │ - Sends to storage           │
                └──────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │ Elasticsearch│ │ FoundationDB │ │ Prometheus   │
      │ (Search)     │ │ (Audit trail)│ │ (Metrics)    │
      └──────────────┘ └──────────────┘ └──────────────┘

Components

1. File Monitor Sidecar (Per User Pod)

Responsibilities:

  • Monitor /workspace for file changes
  • Generate JSON event logs
  • Add user/tenant/session context
  • Output to stdout (for log aggregation)

Configuration:

# Start command in sidecar
/usr/local/bin/file-monitor \
  /workspace \
  --poll \
  --debounce=500 \
  --output=/dev/stdout \
  --format=json \
  --metadata="user_id=${USER_ID},tenant_id=${TENANT_ID},session_id=${SESSION_ID}"

Resource Limits:

  • Requests: 64Mi RAM, 100m CPU
  • Limits: 512Mi RAM, 500m CPU
  • Expected: ~100Mi RAM @ typical load (100 evt/s)
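The --metadata flag above carries tenant context as a comma-separated key=value string. A minimal sketch of how such a string splits into fields (the real parsing lives inside src/file-monitor; parse_metadata is an illustrative name, not its API):

```rust
use std::collections::HashMap;

// Illustrative parser for the --metadata value; the actual flag handling
// inside src/file-monitor may differ.
fn parse_metadata(raw: &str) -> HashMap<String, String> {
    raw.split(',')
        .filter_map(|pair| {
            // Skip malformed pairs that lack an '='
            let (k, v) = pair.split_once('=')?;
            Some((k.trim().to_string(), v.trim().to_string()))
        })
        .collect()
}

fn main() {
    let meta = parse_metadata("user_id=u1,tenant_id=t1,session_id=s1");
    assert_eq!(meta["tenant_id"], "t1");
    println!("parsed {} keys", meta.len());
}
```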

2. Log Aggregation Service (Centralized)

Technology Options:

| Solution | Pros | Cons | Recommendation |
|---|---|---|---|
| Fluentd | K8s native, JSON parsing, FDB output | Resource heavy | ✅ Best for FDB |
| Vector | Rust (fast), eBPF support | Newer, fewer plugins | ⚠️ Good alternative |
| Filebeat | Lightweight, Elastic native | Weak FDB support | ❌ Skip |
| Google Cloud Logging | Fully managed, scalable | Vendor lock-in, cost | ⚠️ Option for analytics |

Recommended: Fluentd with custom FDB output plugin

Fluentd Configuration:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: coditect-app
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-forward
        env:
        - name: FLUENT_FDB_CLUSTER
          value: "coditect:production@10.128.0.8:4500"
        - name: FLUENT_FDB_SUBSPACE
          value: "audit_logs"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluentd-config
        configMap:
          name: fluentd-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: coditect-app
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*file-monitor*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag file.events.*
      <parse>
        @type json
        time_key timestamp
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter file.events.**>
      @type record_transformer
      <record>
        cluster_name "#{ENV['CLUSTER_NAME']}"
        namespace ${record["kubernetes"]["namespace_name"]}
        pod_name ${record["kubernetes"]["pod_name"]}
      </record>
    </filter>

    <match file.events.**>
      @type fdb
      cluster_file /etc/foundationdb/fdb.cluster
      subspace audit_logs
      tenant_key tenant_id
      <buffer>
        @type file
        path /var/log/fluentd-buffers/fdb.buffer
        flush_interval 5s
        retry_max_interval 30s
      </buffer>
    </match>

3. FoundationDB Audit Log Schema

Key Structure:

/audit_logs/
  /{tenant_id}/
    /{user_id}/
      /{session_id}/
        /{timestamp}/
          event: {JSON event data}
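The hierarchy above maps each event to one key, with the timestamp last so a tenant/user/session's events stay range-scannable in order. A sketch of composing that path as a plain string (a production version would use FDB tuple encoding; audit_key is a hypothetical helper):

```rust
// Hypothetical helper composing the audit-log key path; production code
// would use FDB tuple encoding rather than plain strings.
fn audit_key(tenant_id: &str, user_id: &str, session_id: &str, ts_nanos: u128) -> String {
    format!("/audit_logs/{tenant_id}/{user_id}/{session_id}/{ts_nanos}")
}

fn main() {
    // Timestamp-last ordering keeps one session's events contiguous.
    println!("{}", audit_key("t1", "u1", "s1", 1_700_000_000_000_000_000));
}
```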

Event Schema:

#[derive(Serialize, Deserialize)]
pub struct FileEvent {
    pub timestamp: DateTime<Utc>,
    pub tenant_id: Uuid,
    pub user_id: Uuid,
    pub session_id: Uuid,
    pub event_type: FileEventType, // Create, Update, Delete, Move
    pub path: String,
    pub checksum: Option<String>,
    pub size: Option<u64>,
    pub metadata: serde_json::Value,
}

#[derive(Serialize, Deserialize)]
pub enum FileEventType {
    Create,
    Update,
    Delete,
    Move { from: String, to: String },
    Access,
}
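With the serde derives above, each event serializes to one JSON object per log line. A std-only sketch of the shape such a line takes, with the field set abbreviated (event_json is illustrative, not the crate's API; the sidecar itself would serialize FileEvent with serde):

```rust
// Std-only sketch of the one-line JSON an event becomes; field set
// abbreviated relative to the full FileEvent struct.
fn event_json(ts: &str, tenant: &str, user: &str, session: &str,
              event_type: &str, path: &str) -> String {
    format!(
        "{{\"timestamp\":\"{ts}\",\"tenant_id\":\"{tenant}\",\"user_id\":\"{user}\",\"session_id\":\"{session}\",\"event_type\":\"{event_type}\",\"path\":\"{path}\"}}"
    )
}

fn main() {
    println!("{}", event_json(
        "2025-10-19T12:00:00Z", "t1", "u1", "s1", "Create", "/workspace/a.txt",
    ));
}
```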

4. Agent Coordination Service (Backend API)

NEW: Agent Management Endpoints

File: backend/src/handlers/agents.rs (to be created)

use actix_web::{web, HttpResponse};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
pub struct AgentActivity {
    pub session_id: Uuid,
    pub agent_type: String, // "file-monitor", "code-gen", "review"
    pub status: AgentStatus,
    pub last_heartbeat: DateTime<Utc>,
    pub files_claimed: Vec<String>,
}

#[derive(Serialize, Deserialize)]
pub enum AgentStatus {
    Active,
    Idle,
    Stopped,
}

// POST /api/v5/agents/heartbeat
pub async fn agent_heartbeat(
    session: web::Data<Session>,
    req: web::Json<AgentActivity>,
) -> HttpResponse {
    // Store in FDB: /agents/{tenant_id}/{session_id}/heartbeat
    // Used for:
    // - Detecting dead agents
    // - File claim coordination
    // - Activity monitoring
    HttpResponse::Ok().json(/* ... */)
}

// GET /api/v5/agents/{session_id}/logs
pub async fn get_agent_logs(
    session: web::Data<Session>,
    session_id: web::Path<Uuid>,
) -> HttpResponse {
    // Query FDB: /audit_logs/{tenant_id}/{user_id}/{session_id}/
    HttpResponse::Ok().json(/* ... */)
}

REPLACES: V4 CODI2 agent coordination (was bash scripts + WebSocket)
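The heartbeat data enables the dead-agent detection noted in the handler: an agent whose last_heartbeat is older than some timeout can be marked Stopped. A sketch on plain epoch seconds (the 30 s threshold is an assumption, not a value this document specifies):

```rust
// Staleness check over epoch seconds; the 30 s timeout is an assumed
// value, not one documented for the heartbeat endpoint.
const HEARTBEAT_TIMEOUT_SECS: u64 = 30;

fn is_stale(last_heartbeat_secs: u64, now_secs: u64) -> bool {
    // saturating_sub guards against clock skew producing an underflow
    now_secs.saturating_sub(last_heartbeat_secs) > HEARTBEAT_TIMEOUT_SECS
}

fn main() {
    assert!(is_stale(100, 200));  // 100 s silent: mark Stopped
    assert!(!is_stale(100, 120)); // 20 s silent: still Active
    println!("staleness check ok");
}
```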


Implementation Plan

Phase 1: File Monitor Sidecar (Week 1)

Tasks:

  1. ✅ File monitor already implemented (src/file-monitor/)
  2. 🔨 Create Docker image for sidecar deployment
    • Build stage: rust:1.70-slim; runtime: debian:bookworm-slim
    • Binary: /usr/local/bin/file-monitor
    • Entrypoint: Script to parse env vars
  3. 🔨 Update user pod template
    • Add sidecar container
    • Configure shared volume
    • Set resource limits
  4. 🔨 Test with single user pod

Dockerfile:

FROM rust:1.70-slim as builder
WORKDIR /build
COPY src/file-monitor ./
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/target/release/file-monitor /usr/local/bin/
COPY docker/file-monitor-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
USER nobody
ENTRYPOINT ["/entrypoint.sh"]

Phase 2: Log Aggregation (Week 2)

Tasks:

  1. 🔨 Deploy Fluentd DaemonSet
  2. 🔨 Create custom FDB output plugin
    • Use foundationdb Rust crate
    • Implement Fluentd output API
  3. 🔨 Configure log routing
    • File events → FDB audit_logs subspace
    • Metrics → Prometheus
  4. 🔨 Test multi-pod aggregation

Custom Fluentd Plugin (Ruby):

# fluent-plugin-foundationdb/lib/fluent/plugin/out_fdb.rb
require 'fluent/plugin/output'
require 'ffi'

# Minimal FFI binding to the FDB C client library
module FDBClient
  extend FFI::Library
  ffi_lib 'libfdb_c.so'
  attach_function :fdb_setup_network, [], :int
  # ... (see V4 reference for complete implementation)
end

module Fluent::Plugin
  class FoundationDBOutput < Output
    Fluent::Plugin.register_output('fdb', self)

    config_param :cluster_file, :string, default: '/etc/foundationdb/fdb.cluster'
    config_param :subspace, :string, default: 'audit_logs'
    config_param :tenant_key, :string, default: 'tenant_id'

    def configure(conf)
      super
      FDBClient.fdb_setup_network
    end

    def write(chunk)
      chunk.msgpack_each do |time, record|
        tenant_id = record[@tenant_key]
        user_id = record['user_id']
        session_id = record['session_id']

        # Write to FDB: /audit_logs/{tenant}/{user}/{session}/{timestamp}
        key = "/#{@subspace}/#{tenant_id}/#{user_id}/#{session_id}/#{time}"
        @fdb_transaction.set(key, record.to_json)
      end
    end
  end
end

Phase 3: Agent Coordination API (Week 3)

Tasks:

  1. 🔨 Create backend/src/handlers/agents.rs
  2. 🔨 Implement heartbeat endpoint
  3. 🔨 Implement log query endpoint
  4. 🔨 Add file claim/release API
  5. 🔨 Update agent dashboard (frontend)

Phase 4: Prometheus Metrics Integration (Week 4)

Tasks:

  1. 🔨 Expose file-monitor metrics on :9090/metrics
  2. 🔨 Configure Prometheus scraping
  3. 🔨 Create Grafana dashboards
  4. 🔨 Set up alerts (see production.md)

Prometheus ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Resource Planning

Per-User Pod Resources

| Component | Requests | Limits | Notes |
|---|---|---|---|
| theia IDE | 512Mi / 500m | 2Gi / 2000m | Main workload |
| File Monitor | 64Mi / 100m | 512Mi / 500m | Sidecar |
| Total | 576Mi / 600m | 2.5Gi / 2.5 CPU | Per user pod |

Example: 100 concurrent users

  • Requests: 57.6 GB RAM, 60 CPU cores
  • Limits: 250 GB RAM, 250 CPU cores
  • Node Pool: n2-standard-32 (32 vCPU, 128GB RAM) × 3 nodes
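The sizing figures above follow directly from the per-pod numbers. A sketch that recomputes them, using the same decimal convention as the table (1000 Mi ≈ 1 GB, 1000 m = 1 core):

```rust
// Recomputes cluster-wide requests from per-pod figures, treating
// 1000 Mi ≈ 1 GB and 1000 m = 1 core as the tables above do.
fn cluster_requests(users: u64, mem_mi_per_pod: u64, cpu_m_per_pod: u64) -> (f64, f64) {
    let mem_gb = (users * mem_mi_per_pod) as f64 / 1000.0;
    let cpu_cores = (users * cpu_m_per_pod) as f64 / 1000.0;
    (mem_gb, cpu_cores)
}

fn main() {
    let (mem_gb, cpu) = cluster_requests(100, 576, 600);
    println!("requests: {mem_gb} GB RAM, {cpu} cores"); // 57.6 GB, 60 cores
}
```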

Centralized Services

| Service | Replicas | Resources | Total |
|---|---|---|---|
| Fluentd | 3 (DaemonSet) | 256Mi / 500m | 768Mi / 1.5 CPU |
| Prometheus | 1 | 4Gi / 2000m | 4Gi / 2 CPU |
| Grafana | 1 | 1Gi / 500m | 1Gi / 0.5 CPU |

Security Considerations

Tenant Isolation

Critical: File monitor sidecar MUST:

  • ✅ Run as non-root user (nobody)
  • ✅ Mount workspace volume as read-only
  • ✅ Use resource limits (prevent DoS)
  • ✅ Include tenant_id in ALL logs
  • ✅ No network access (except metrics)

Network Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: file-monitor-policy
  namespace: coditect-app
spec:
  podSelector:
    matchLabels:
      component: file-monitor
  policyTypes:
  - Egress
  egress:
  # Only allow metrics export
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9090

Audit Trail Integrity

FoundationDB Guarantees:

  • ✅ ACID transactions (atomic writes)
  • ✅ Multi-version concurrency (no overwrites)
  • ✅ Tenant isolation (prefix-based)
  • ✅ Backup to GCS (point-in-time recovery)
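Prefix-based isolation means every audit read is bounded to one tenant's key range. A sketch of deriving that range in the spirit of FDB's strinc, which increments the last byte of the prefix to form the exclusive end key:

```rust
// Derives the (begin, end) byte range covering one tenant's subspace,
// mirroring FDB's strinc; assumes the prefix does not end in 0xFF
// (the real strinc handles trailing 0xFF bytes).
fn tenant_range(prefix: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let begin = prefix.to_vec();
    let mut end = prefix.to_vec();
    if let Some(last) = end.last_mut() {
        *last += 1;
    }
    (begin, end)
}

fn main() {
    let (begin, end) = tenant_range(b"/audit_logs/t1/");
    // Every key for tenant t1 sorts inside [begin, end).
    println!("{:?} .. {:?}", begin, end);
}
```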

Retention Policy:

# Automated cleanup of old audit logs
audit_logs:
  retention_days: 90  # Legal compliance
  archive_to_gcs: true
  compress: true
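retention_days translates to a cutoff timestamp: events keyed before it are archived to GCS, then cleared. A sketch of the arithmetic:

```rust
// Cutoff arithmetic implied by retention_days: 90 days x 86,400 s.
const SECS_PER_DAY: u64 = 86_400;

fn retention_cutoff(now_secs: u64, retention_days: u64) -> u64 {
    // Events with timestamps below this value are archived, then deleted.
    now_secs.saturating_sub(retention_days * SECS_PER_DAY)
}

fn main() {
    let cutoff = retention_cutoff(10_000_000, 90);
    assert_eq!(cutoff, 2_224_000); // 10_000_000 - 90 * 86_400
    println!("cutoff = {cutoff}");
}
```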

Migration from V4 CODI2

Functionality Mapping

| V4 CODI2 Feature | V5 Implementation | Status |
|---|---|---|
| File monitoring | file-monitor sidecar | ✅ Done |
| Log aggregation | Fluentd → FDB | 🔨 To build |
| Agent coordination | Backend API /agents/ | 🔨 To build |
| Export handling | Backend API /exports/ | 🔨 To build |
| Duplicate detection | File monitor checksums | ✅ Done |
| Session attribution | FDB + JWT | ✅ Done |
| Metrics | Prometheus | ✅ Done |

Breaking Changes

V4 → V5:

  • ❌ No standalone CODI2 binary
  • ❌ No bash script orchestration
  • ❌ No local file watchers on host
  • ✅ All functionality via API + sidecars

Testing Strategy

Unit Tests

  • ✅ File monitor (already exists: src/file-monitor/tests/)
  • 🔨 Fluentd FDB plugin
  • 🔨 Agent coordination API

Integration Tests

Test Scenarios:

  1. Single user pod: Deploy pod with sidecar, trigger file events, verify FDB writes
  2. Multi-tenant: 10 users, concurrent file operations, check tenant isolation
  3. Log aggregation: 1000 events/sec across 50 pods, verify no drops
  4. Agent coordination: 5 concurrent agents claiming files, detect conflicts
  5. Failover: Kill fluentd pod, verify buffering and recovery

Test Script (tests/integration/test-file-monitor-sidecar.sh):

#!/bin/bash
set -e

# Deploy test user pod
kubectl apply -f k8s/test/user-pod-with-monitor.yaml

# Wait for ready
kubectl wait --for=condition=Ready pod/test-user-pod -n coditect-app

# Trigger file events
kubectl exec test-user-pod -c workspace -- bash -c "
  echo 'test' > /workspace/test.txt
  echo 'update' >> /workspace/test.txt
  rm /workspace/test.txt
"

# Wait for logs to propagate
sleep 10

# Query FDB for events
curl -H "Authorization: Bearer $TEST_TOKEN" \
  http://api.coditect.ai/api/v5/agents/test-session/logs | jq

# Verify events
expected_events=3  # create, update, delete
actual_events=$(curl -s ... | jq '.events | length')

if [ "$actual_events" -eq "$expected_events" ]; then
  echo "✅ Test passed"
else
  echo "❌ Test failed: expected $expected_events, got $actual_events"
  exit 1
fi

Performance Tests

Load Test (tests/performance/load-test-monitoring.sh):

#!/bin/bash
# Simulate 100 users, 100 events/sec each = 10K events/sec total

for user_id in $(seq 1 100); do
  kubectl exec user-pod-$user_id -c workspace -- bash -c "
    for i in {1..100}; do
      echo 'data' > /workspace/file-\$i.txt
      sleep 0.01
    done
  " &
done

wait

# Check metrics
kubectl exec prometheus-0 -n monitoring -- promtool query instant \
  'rate(fs_monitor_events_published_total[1m])'

# Expected: ~10000 events/sec

Deployment Checklist

Prerequisites

  • FoundationDB cluster running (3 replicas minimum)
  • Prometheus operator installed
  • Grafana deployed
  • GKE cluster with monitoring enabled
  • Docker images built and pushed to Artifact Registry

Phase 1 Deployment

  • Build file-monitor Docker image
  • Push to us-central1-docker.pkg.dev/.../file-monitor:latest
  • Update user pod template with sidecar
  • Deploy test user pod
  • Verify file monitor starts and outputs logs
  • Check resource usage (should be < 100Mi RAM)

Phase 2 Deployment

  • Deploy Fluentd DaemonSet
  • Install custom FDB plugin
  • Configure log routing
  • Deploy test with 10 user pods
  • Verify logs reach FDB
  • Query audit logs via API

Phase 3 Deployment

  • Deploy agent coordination API
  • Update frontend dashboard
  • Test agent heartbeat
  • Test file claim/release
  • Load test with 100 users

Phase 4 Deployment

  • Configure Prometheus scraping
  • Import Grafana dashboards
  • Set up alerts
  • Test alert firing
  • Document runbooks

Alternatives Considered

Alternative 1: Google Cloud Logging Only

Approach: Skip Fluentd, send all logs to GCP Cloud Logging

Pros:

  • Fully managed
  • Infinite scalability
  • Built-in search/analytics

Cons:

  • Vendor lock-in
  • Cost at scale ($0.50/GB ingestion)
  • No FDB integration (audit trail split)

Decision: ❌ Rejected - Audit trail must be in FDB for compliance

Alternative 2: Agent-less Monitoring (API-driven)

Approach: IDE sends file events to API directly, no sidecar

Pros:

  • Simpler deployment
  • Less resource usage

Cons:

  • Requires IDE modification
  • Misses non-IDE file changes
  • IDE can bypass monitoring (security risk)

Decision: ❌ Rejected - Monitoring must be independent of IDE

Alternative 3: eBPF-based Monitoring

Approach: Use eBPF to monitor syscalls at kernel level

Pros:

  • Zero overhead
  • Can't be bypassed
  • Detailed syscall info

Cons:

  • Requires privileged containers
  • Complex to implement
  • Linux-only

Decision: ⏳ Future consideration for advanced monitoring


Cost Analysis

Resource Costs (100 concurrent users)

| Component | Resources | GKE Cost (us-central1) | Monthly Cost |
|---|---|---|---|
| User pods (100×) | 600m CPU, 576Mi RAM | $0.031/vCPU-hr, $0.0034/GB-hr | ~$1,350/mo |
| File monitor sidecars (100×) | 100m CPU, 64Mi RAM | Included in user pod | $0 |
| Fluentd (3 nodes) | 1.5 CPU, 768Mi RAM | $0.031/vCPU-hr | ~$35/mo |
| Prometheus | 2 CPU, 4Gi RAM | $0.031/vCPU-hr | ~$50/mo |
| Total | | | ~$1,435/mo |

Notes:

  • Sidecar costs are incremental to user pod (already budgeted)
  • FDB storage separate (see FDB pricing)
  • Does NOT include GCP Cloud Logging (optional)

Comparison to V4 CODI2

V4 Approach (Dedicated CODI2 service):

  • Central CODI2 deployment: 4 CPU, 8Gi RAM = ~$90/mo
  • But: Required hostPath access (security risk)

V5 Approach (Sidecar):

  • Distributed in user pods: No additional infra cost
  • Better isolation and scalability
  • Savings: ~$90/mo on infrastructure, PRICELESS on security

Rollout Strategy

Week 1: Internal Testing

  • Deploy to dev namespace
  • 5 internal users
  • Monitor metrics and logs
  • Fix bugs

Week 2: Limited Beta

  • Deploy to staging namespace
  • 20 beta users
  • Load test with realistic workloads
  • Tune resource limits

Week 3: Production Rollout

  • Deploy to production namespace
  • 10% traffic (canary deployment)
  • Monitor error rates
  • Gradual ramp to 100%

Week 4: Full Production

  • All users on new architecture
  • Decommission V4 CODI2 references
  • Update documentation

Success Metrics

Performance Metrics

| Metric | Target | Measurement |
|---|---|---|
| Monitoring Latency | < 100ms (p99) | fs_monitor_processing_latency_us |
| Log Ingestion Rate | 10K events/sec | Fluentd throughput |
| Resource Usage | < 100Mi RAM per monitor | kubectl top pods |
| Event Drop Rate | 0% | fs_monitor_events_dropped_total |

Reliability Metrics

| Metric | Target | Measurement |
|---|---|---|
| Uptime | 99.9% | Prometheus up metric |
| Log Delivery | 99.99% | FDB query count vs event count |
| Recovery Time | < 1 min | Fluentd pod restart time |
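The Log Delivery target compares rows stored in FDB against events emitted by sidecars. A sketch of the check an alert rule would encode (delivery_ok is illustrative, not an existing function):

```rust
// Illustrative delivery check: stored FDB rows vs events emitted by
// sidecars; target 0.9999 matches the 99.99% goal above.
fn delivery_ok(stored: u64, emitted: u64, target: f64) -> bool {
    if emitted == 0 {
        return true; // nothing emitted, nothing lost
    }
    (stored as f64 / emitted as f64) >= target
}

fn main() {
    assert!(delivery_ok(9_999, 10_000, 0.9999));
    assert!(!delivery_ok(9_998, 10_000, 0.9999));
    println!("delivery check ok");
}
```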

Maintenance & Operations

Daily Operations

# Check sidecar health across all pods
kubectl get pods -n coditect-app -l component=file-monitor

# View aggregated metrics
curl http://prometheus:9090/api/v1/query?query=sum(fs_monitor_events_published_total)

# Check Fluentd buffer status
kubectl exec -it fluentd-xxx -n coditect-app -- ls -lh /var/log/fluentd-buffers/

# Query audit logs for user
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  "http://api.coditect.ai/api/v5/agents/logs?user_id=xxx&start=2025-10-19T00:00:00Z"

Incident Response

Scenario: File Monitor Crashing

  1. Check pod logs: kubectl logs user-pod-xxx -c file-monitor -n coditect-app
  2. Check resource limits: kubectl describe pod user-pod-xxx -n coditect-app
  3. Increase limits if OOMKilled
  4. Restart pod: kubectl delete pod user-pod-xxx -n coditect-app (recreated by deployment)

Scenario: Logs Not Reaching FDB

  1. Check Fluentd status: kubectl logs fluentd-xxx -n coditect-app
  2. Verify FDB connectivity: kubectl exec fluentd-xxx -- fdbcli --exec "status"
  3. Check buffer: ls /var/log/fluentd-buffers/
  4. Restart Fluentd if needed: kubectl rollout restart daemonset/fluentd -n coditect-app

Backup & Disaster Recovery

Audit Log Backup (Automated):

# CronJob to backup audit logs to GCS
apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-log-backup
  namespace: coditect-app
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
            command:
            - /bin/bash
            - -c
            - |
              # Query FDB for previous day's logs
              curl -H "Authorization: Bearer $BACKUP_TOKEN" \
                "http://api.coditect.ai/api/v5/agents/logs?start=$(date -d '1 day ago' -I)&end=$(date -I)" \
                > /tmp/audit-logs-$(date -I).json

              # Upload to GCS
              gsutil cp /tmp/audit-logs-$(date -I).json gs://coditect-audit-logs/
          restartPolicy: OnFailure

Future Enhancements

Phase 5: Real-time Agent Coordination (Future)

WebSocket Gateway for agent events:

User IDE ←→ WebSocket ←→ Agent Coordination Service
    ↓                           ↓
File Monitor Sidecar      FDB (real-time queries)

Benefits:

  • Real-time file conflict detection
  • Live agent activity feed
  • Immediate notifications

Phase 6: ML-Powered Anomaly Detection (Future)

Use Case: Detect unusual file activity (ransomware, data exfiltration)

Architecture:

File Events → Kafka → Stream Processor → ML Model → Alerts

Metrics:

  • Unusual file access patterns
  • High deletion rates
  • Sensitive data exposure

Phase 7: Cross-Tenant Analytics (Future)

Use Case: Aggregate metrics across all tenants (admin view)

Queries:

  • Most active users
  • Popular file types
  • Resource usage trends

Appendix

A. Reference Documentation

T2 File Monitor:

  • Implementation: src/file-monitor/
  • Documentation: docs/11-analysis/file-monitor/
  • Production guide: docs/11-analysis/file-monitor/production.md

V4 CODI2 Reference:

  • Implementation: archive/coditect-v4/codi2/
  • Docker configs: archive/coditect-v4/codi2/config/docker/
  • K8s patterns: archive/coditect-v4/infrastructure/kubernetes/

GKE Deployment:

  • Current manifests: k8s/
  • Deployment guide: docs/06-backend/deploy.md

B. Glossary

| Term | Definition |
|---|---|
| Sidecar | Helper container running alongside the main container in the same pod |
| DaemonSet | K8s workload that runs one pod per node |
| Fluentd | Open-source log aggregation tool |
| FDB | FoundationDB - distributed key-value store |
| inotify | Linux kernel subsystem for file monitoring |
| Prometheus | Metrics collection and alerting system |

C. Contact & Support

Architecture Questions: DevOps team
File Monitor Issues: Backend team
FDB Schema Changes: Database team
Security Concerns: Security team


Document Version: 1.0
Last Updated: 2025-10-19
Next Review: After Phase 1 deployment (Week 1)