MONITOR & CODI Container Provisioning Strategy for CODITECT V5
Date: 2025-10-19 | Status: ARCHITECTURAL PROPOSAL | Stakeholders: DevOps, Backend, Security
Executive Summary
This document proposes a production-ready strategy for provisioning file monitoring and CODI functionality in CODITECT V5 GKE pods, based on analysis of:
- T2 file-monitor Rust implementation (`src/file-monitor/`)
- V4 CODI2 reference architecture (`archive/coditect-v4/codi2/`)
- Current GKE deployment patterns
- Multi-tenant requirements
Recommendation: HYBRID APPROACH - File-monitor sidecar + Centralized logging service
Current State Analysis
File Monitor (T2 - Active)
Location: src/file-monitor/ (Rust implementation)
Capabilities:
- ✅ Real-time file monitoring (inotify/FSEvents)
- ✅ Dual logging (JSON + human-readable)
- ✅ WSL2 compatible (--poll flag)
- ✅ Debouncing and rate limiting
- ✅ Checksum verification (optional)
- ✅ Observability (Prometheus metrics)
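The debouncing behavior listed above can be illustrated with a per-path suppression window. This is a hedged sketch, not the actual `file-monitor` implementation; the `Debouncer` type and `accept` method are hypothetical names:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative debouncer: suppresses repeat events for the same path
/// that arrive within `window` of the last accepted event.
/// (Sketch only -- the real file-monitor logic may differ.)
pub struct Debouncer {
    window: Duration,
    last_seen: HashMap<String, Instant>,
}

impl Debouncer {
    pub fn new(window: Duration) -> Self {
        Self { window, last_seen: HashMap::new() }
    }

    /// Returns true if the event should be published, false if suppressed.
    pub fn accept(&mut self, path: &str, now: Instant) -> bool {
        match self.last_seen.get(path) {
            Some(&prev) if now.duration_since(prev) < self.window => false,
            _ => {
                self.last_seen.insert(path.to_string(), now);
                true
            }
        }
    }
}
```

With a 500ms window (matching the `--debounce=500` flag used later in this document), rapid repeat writes to the same file collapse into one event while distinct paths pass through independently.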
Current Usage:
- Development: Standalone daemon (`scripts/monitor/start-file-monitor.sh`)
- Output: `.coditect/logs/events.log` (JSON), `events-human.log`
- Performance: 5MB RAM idle, 60MB @ 1000 evt/s
- Production Ready: YES (see `docs/11-analysis/file-monitor/production.md`)
CODI2 (V4 - Archived Reference)
Location: archive/coditect-v4/codi2/ (Rust implementation)
Historical Capabilities:
- File monitoring (now replaced by file-monitor)
- Agent coordination
- Export handling
- Duplicate detection
- Log aggregation
- Session attribution
Status: ⚠️ ARCHIVED - Reference patterns only, NOT active in T2
Key Insight: CODI2 functionality is being distributed across V5 components:
- File monitoring → `file-monitor` (standalone Rust crate)
- Agent coordination → Backend API (`backend/src/handlers/agents.rs`)
- Log aggregation → Centralized logging service (proposed)
- Session management → FoundationDB (`backend/src/db/models.rs`)
Architecture Options Analysis
Option 1: DaemonSet (One Monitor Per Node)
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  template:
    metadata:
      labels:
        app: file-monitor
    spec:
      containers:
        - name: monitor
          image: us-central1-docker.pkg.dev/.../file-monitor:latest
          volumeMounts:
            - name: node-workspace
              mountPath: /workspace
              readOnly: true
      volumes:
        - name: node-workspace
          hostPath:
            path: /var/lib/coditect/workspaces
```
Pros:
- ✅ Single monitor per node (resource efficient)
- ✅ Monitors all pods on node
- ✅ Centralized node-level metrics
Cons:
- ❌ No pod-level isolation
- ❌ Requires hostPath volumes (security risk)
- ❌ Complex multi-tenant file attribution
- ❌ Can't monitor ephemeral pod volumes
Verdict: ❌ NOT RECOMMENDED - Breaks multi-tenant isolation
Option 2: Sidecar (One Monitor Per Pod)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: user-workspace-${USER_ID}
  namespace: coditect-app
spec:
  containers:
    # Main workspace container
    - name: workspace
      image: us-central1-docker.pkg.dev/.../theia-workspace:latest
      volumeMounts:
        - name: workspace-data
          mountPath: /workspace
    # File monitor sidecar
    - name: file-monitor
      image: us-central1-docker.pkg.dev/.../file-monitor:latest
      args:
        - "/workspace"
        - "--poll"  # WSL2 compatibility
        - "--output=/logs/events.log"
      env:
        - name: USER_ID
          value: "${USER_ID}"
        - name: TENANT_ID
          value: "${TENANT_ID}"
        - name: SESSION_ID
          value: "${SESSION_ID}"
      volumeMounts:
        - name: workspace-data
          mountPath: /workspace
          readOnly: true
        - name: logs
          mountPath: /logs
      resources:
        requests:
          memory: "64Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"
  volumes:
    - name: workspace-data
      persistentVolumeClaim:
        claimName: user-workspace-${USER_ID}
    - name: logs
      emptyDir: {}
```
Pros:
- ✅ Perfect pod-level isolation
- ✅ Tenant-specific monitoring
- ✅ Can access pod's PVC directly
- ✅ Simple lifecycle (dies with pod)
- ✅ Clear user/session attribution
Cons:
- ❌ More resource usage (1 monitor per pod)
- ❌ Need log aggregation solution
Verdict: ✅ RECOMMENDED - Best for multi-tenant isolation
Option 3: Centralized Service (Single Monitor)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-monitor-central
  namespace: coditect-app
spec:
  replicas: 3  # HA
  template:
    spec:
      containers:
        - name: monitor
          image: us-central1-docker.pkg.dev/.../file-monitor:latest
          # Monitors all workspaces via shared volume
```
Pros:
- ✅ Minimal resource overhead
- ✅ Centralized metrics
Cons:
- ❌ All workspaces in single volume (HUGE security risk)
- ❌ No tenant isolation
- ❌ Single point of failure
Verdict: ❌ NOT RECOMMENDED - Violates multi-tenant security
Recommended Architecture: HYBRID APPROACH
Overview
```
┌─────────────────────────────────────────────────────────────┐
│ User Pod (per user)                                         │
│                                                             │
│  ┌─────────────────┐   ┌────────────────────────────────┐   │
│  │ theia IDE       │   │ File Monitor Sidecar           │   │
│  │ (Main)          │   │ - Watches /workspace           │   │
│  │                 │   │ - Outputs JSON logs            │   │
│  │ Port 3000       │   │ - Tenant isolated              │   │
│  └─────────────────┘   └────────────────────────────────┘   │
│          │                         │                        │
│          │ (writes)                │ (monitors)             │
│          ▼                         ▼                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ /workspace (PVC - user-workspace-${USER_ID})         │   │
│  └──────────────────────────────────────────────────────┘   │
│                              │                              │
└──────────────────────────────┼──────────────────────────────┘
                               │ (logs to)
                               ▼
                ┌──────────────────────────────┐
                │ Logging Service              │
                │ (Fluentd/Vector/Filebeat)    │
                │ - Collects from all pods     │
                │ - Enriches with metadata     │
                │ - Sends to storage           │
                └──────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
       ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
       │ Elasticsearch│ │ FoundationDB │ │ Prometheus   │
       │ (Search)     │ │ (Audit trail)│ │ (Metrics)    │
       └──────────────┘ └──────────────┘ └──────────────┘
```
Components
1. File Monitor Sidecar (Per User Pod)
Responsibilities:
- Monitor `/workspace` for file changes
- Generate JSON event logs
- Add user/tenant/session context
- Output to stdout (for log aggregation)
Configuration:
```bash
# Start command in sidecar
/usr/local/bin/file-monitor \
  /workspace \
  --poll \
  --debounce=500 \
  --output=/dev/stdout \
  --format=json \
  --metadata="user_id=${USER_ID},tenant_id=${TENANT_ID},session_id=${SESSION_ID}"
```
Resource Limits:
- Requests: 64Mi RAM, 100m CPU
- Limits: 512Mi RAM, 500m CPU
- Expected: ~100Mi RAM @ typical load (100 evt/s)
2. Log Aggregation Service (Centralized)
Technology Options:
| Solution | Pros | Cons | Recommendation |
|---|---|---|---|
| Fluentd | K8s native, JSON parsing, FDB output | Resource heavy | ✅ Best for FDB |
| Vector | Rust (fast), eBPF support | Newer, less plugins | ⚠️ Good alternative |
| Filebeat | Lightweight, Elastic native | Weak FDB support | ❌ Skip |
| Google Cloud Logging | Fully managed, scalable | Vendor lock-in, cost | ⚠️ Option for analytics |
Recommended: Fluentd with custom FDB output plugin
Fluentd Configuration:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-forward
          env:
            - name: FLUENT_FDB_CLUSTER
              value: "coditect:production@10.128.0.8:4500"
            - name: FLUENT_FDB_SUBSPACE
              value: "audit_logs"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluentd-config
              mountPath: /fluentd/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluentd-config
          configMap:
            name: fluentd-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: coditect-app
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*file-monitor*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag file.events.*
      <parse>
        @type json
        time_key timestamp
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter file.events.**>
      @type record_transformer
      <record>
        cluster_name "#{ENV['CLUSTER_NAME']}"
        namespace ${record["kubernetes"]["namespace_name"]}
        pod_name ${record["kubernetes"]["pod_name"]}
      </record>
    </filter>

    <match file.events.**>
      @type fdb
      cluster_file /etc/foundationdb/fdb.cluster
      subspace audit_logs
      tenant_key tenant_id
      <buffer>
        @type file
        path /var/log/fluentd-buffers/fdb.buffer
        flush_interval 5s
        retry_max_interval 30s
      </buffer>
    </match>
```
3. FoundationDB Audit Log Schema
Key Structure:
```
/audit_logs/
  /{tenant_id}/
    /{user_id}/
      /{session_id}/
        /{timestamp}/
          event: {JSON event data}
```
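The hierarchy above can be sketched as a key-building helper. This is an illustrative sketch only: a production FDB layout would use the tuple layer for order-preserving binary encoding rather than plain strings, and `audit_log_key` is a hypothetical name:

```rust
/// Builds an audit-log key mirroring the subspace layout
/// /audit_logs/{tenant_id}/{user_id}/{session_id}/{timestamp}.
/// The timestamp is zero-padded so lexicographic key order matches
/// chronological order, which is what range scans over a session rely on.
pub fn audit_log_key(tenant_id: &str, user_id: &str, session_id: &str, timestamp_ns: u64) -> String {
    format!(
        "/audit_logs/{}/{}/{}/{:020}",
        tenant_id, user_id, session_id, timestamp_ns
    )
}
```

Zero-padding matters: without it, key `.../9` would sort after `.../10` and time-ordered scans would break.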
Event Schema:
```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
pub struct FileEvent {
    pub timestamp: DateTime<Utc>,
    pub tenant_id: Uuid,
    pub user_id: Uuid,
    pub session_id: Uuid,
    pub event_type: FileEventType, // Create, Update, Delete, Move
    pub path: String,
    pub checksum: Option<String>,
    pub size: Option<u64>,
    pub metadata: serde_json::Value,
}

#[derive(Serialize, Deserialize)]
pub enum FileEventType {
    Create,
    Update,
    Delete,
    Move { from: String, to: String },
    Access,
}
```
4. Agent Coordination Service (Backend API)
NEW: Agent Management Endpoints
File: backend/src/handlers/agents.rs (to be created)
```rust
use actix_web::{web, HttpResponse};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
pub struct AgentActivity {
    pub session_id: Uuid,
    pub agent_type: String, // "file-monitor", "code-gen", "review"
    pub status: AgentStatus,
    pub last_heartbeat: DateTime<Utc>,
    pub files_claimed: Vec<String>,
}

#[derive(Serialize, Deserialize)]
pub enum AgentStatus {
    Active,
    Idle,
    Stopped,
}

// POST /api/v5/agents/heartbeat
pub async fn agent_heartbeat(
    session: web::Data<Session>,
    req: web::Json<AgentActivity>,
) -> HttpResponse {
    // Store in FDB: /agents/{tenant_id}/{session_id}/heartbeat
    // Used for:
    // - Detecting dead agents
    // - File claim coordination
    // - Activity monitoring
    HttpResponse::Ok().json(/* ... */)
}

// GET /api/v5/agents/{session_id}/logs
pub async fn get_agent_logs(
    session: web::Data<Session>,
    session_id: web::Path<Uuid>,
) -> HttpResponse {
    // Query FDB: /audit_logs/{tenant_id}/{user_id}/{session_id}/
    HttpResponse::Ok().json(/* ... */)
}
```
REPLACES: V4 CODI2 agent coordination (was bash scripts + WebSocket)
Implementation Plan
Phase 1: File Monitor Sidecar (Week 1)
Tasks:
- ✅ File monitor already implemented (`src/file-monitor/`)
- 🔨 Create Docker image for sidecar deployment
  - Base: `rust:1.70-slim` → `debian:bookworm-slim`
  - Binary: `/usr/local/bin/file-monitor`
  - Entrypoint: Script to parse env vars
- 🔨 Update user pod template
  - Add sidecar container
  - Configure shared volume
  - Set resource limits
- 🔨 Test with single user pod
Dockerfile:
```dockerfile
FROM rust:1.70-slim AS builder
WORKDIR /build
COPY src/file-monitor ./
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/target/release/file-monitor /usr/local/bin/
COPY docker/file-monitor-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
USER nobody
ENTRYPOINT ["/entrypoint.sh"]
```
Phase 2: Log Aggregation (Week 2)
Tasks:
- 🔨 Deploy Fluentd DaemonSet
- 🔨 Create custom FDB output plugin
  - Use `foundationdb` Rust crate
  - Implement Fluentd output API
- 🔨 Configure log routing
  - File events → FDB audit_logs subspace
  - Metrics → Prometheus
- 🔨 Test multi-pod aggregation
Custom Fluentd Plugin (Ruby):
```ruby
# fluent-plugin-foundationdb/lib/fluent/plugin/out_fdb.rb
require 'fluent/plugin/output'
require 'ffi'

# Minimal FFI binding to the FoundationDB C client library
module FDBClient
  extend FFI::Library
  ffi_lib 'libfdb_c.so'
  attach_function :fdb_setup_network, [], :int
  # ... (see V4 reference for complete implementation)
end

module Fluent::Plugin
  class FoundationDBOutput < Output
    Fluent::Plugin.register_output('fdb', self)

    config_param :cluster_file, :string, default: '/etc/foundationdb/fdb.cluster'
    config_param :subspace, :string, default: 'audit_logs'
    config_param :tenant_key, :string, default: 'tenant_id'

    def configure(conf)
      super
      # Initialize the FDB network thread and open the database here
      # (via FDBClient; see V4 reference for complete implementation)
    end

    def write(chunk)
      chunk.msgpack_each do |time, record|
        tenant_id = record[@tenant_key]
        user_id = record['user_id']
        session_id = record['session_id']
        # Write to FDB: /audit_logs/{tenant}/{user}/{session}/{timestamp}
        key = "/#{@subspace}/#{tenant_id}/#{user_id}/#{session_id}/#{time}"
        @fdb_transaction.set(key, record.to_json)
      end
    end
  end
end
```
Phase 3: Agent Coordination API (Week 3)
Tasks:
- 🔨 Create `backend/src/handlers/agents.rs`
- 🔨 Implement heartbeat endpoint
- 🔨 Implement log query endpoint
- 🔨 Add file claim/release API
- 🔨 Update agent dashboard (frontend)
Phase 4: Prometheus Metrics Integration (Week 4)
Tasks:
- 🔨 Expose file-monitor metrics on `:9090/metrics`
- 🔨 Configure Prometheus scraping
- 🔨 Create Grafana dashboards
- 🔨 Set up alerts (see production.md)
Prometheus ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: file-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: file-monitor
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Resource Planning
Per-User Pod Resources
| Component | Requests | Limits | Notes |
|---|---|---|---|
| theia IDE | 512Mi / 500m | 2Gi / 2000m | Main workload |
| File Monitor | 64Mi / 100m | 512Mi / 500m | Sidecar |
| Total | 576Mi / 600m | 2.5Gi / 2.5 CPU | Per user pod |
Example: 100 concurrent users
- Requests: 57.6 GB RAM, 60 CPU cores
- Limits: 250 GB RAM, 250 CPU cores
- Node Pool: n2-standard-32 (32 vCPU, 128GB RAM) × 3 nodes
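The sizing arithmetic above can be sanity-checked programmatically. This is an illustrative sketch using the per-pod figures from the table (the function name `cluster_totals` is hypothetical; the document rounds 57,600 MiB to "57.6 GB"):

```rust
/// Per-pod resource figures from the table above (MiB RAM, millicores CPU).
const REQ_MEM_MIB: u64 = 576;  // theia 512Mi + monitor 64Mi
const REQ_CPU_M: u64 = 600;    // theia 500m + monitor 100m
const LIM_MEM_MIB: u64 = 2560; // theia 2Gi + monitor 512Mi
const LIM_CPU_M: u64 = 2500;   // theia 2000m + monitor 500m

/// Total cluster demand for `users` concurrent user pods.
/// Returns (request MiB, request millicores, limit MiB, limit millicores).
pub fn cluster_totals(users: u64) -> (u64, u64, u64, u64) {
    (
        REQ_MEM_MIB * users,
        REQ_CPU_M * users,
        LIM_MEM_MIB * users,
        LIM_CPU_M * users,
    )
}
```

For 100 users this reproduces the quoted totals: 57,600 MiB of requested RAM, 60 requested cores, and 250 GiB / 250 cores at the limits.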
Centralized Services
| Service | Replicas | Resources | Total |
|---|---|---|---|
| Fluentd | 3 (DaemonSet) | 256Mi / 500m | 768Mi / 1.5 CPU |
| Prometheus | 1 | 4Gi / 2000m | 4Gi / 2 CPU |
| Grafana | 1 | 1Gi / 500m | 1Gi / 0.5 CPU |
Security Considerations
Tenant Isolation
Critical: File monitor sidecar MUST:
- ✅ Run as non-root user (`nobody`)
- ✅ Mount workspace volume as read-only
- ✅ Use resource limits (prevent DoS)
- ✅ Include tenant_id in ALL logs
- ✅ No network access (except metrics)
Network Policy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: file-monitor-policy
  namespace: coditect-app
spec:
  podSelector:
    matchLabels:
      component: file-monitor
  policyTypes:
    - Egress
  egress:
    # Only allow metrics export
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 9090
```
Audit Trail Integrity
FoundationDB Guarantees:
- ✅ ACID transactions (atomic writes)
- ✅ Multi-version concurrency (no overwrites)
- ✅ Tenant isolation (prefix-based)
- ✅ Backup to GCS (point-in-time recovery)
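Prefix-based tenant isolation means every tenant's keys live under a single prefix, so one range read covers exactly one tenant and can never cross into another. The bound computation can be sketched as follows; this is an illustrative sketch (real clients would use the FDB bindings' own prefix-range helpers), and `prefix_range` is a hypothetical name:

```rust
/// Computes the half-open key range [begin, end) covering every key that
/// starts with `prefix`, analogous to the prefix-range helpers in
/// FoundationDB client bindings.
pub fn prefix_range(prefix: &[u8]) -> Option<(Vec<u8>, Vec<u8>)> {
    let begin = prefix.to_vec();
    let mut end = prefix.to_vec();
    // Increment the last byte that is not 0xFF, dropping trailing 0xFF bytes.
    while let Some(&last) = end.last() {
        if last == 0xFF {
            end.pop();
        } else {
            *end.last_mut().unwrap() = last + 1;
            return Some((begin, end));
        }
    }
    None // prefix was all 0xFF bytes: no finite upper bound exists
}
```

A scan over `prefix_range(b"/audit_logs/{tenant_id}/")` returns that tenant's audit events and nothing else, which is the isolation property the guarantee list relies on.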
Retention Policy:
```yaml
# Automated cleanup of old audit logs
audit_logs:
  retention_days: 90  # Legal compliance
  archive_to_gcs: true
  compress: true
```
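The retention decision itself reduces to a cutoff comparison. This is a hedged sketch of the 90-day policy above; the cleanup job and the `is_expired` helper are hypothetical, not existing code:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Returns true if an event timestamp (seconds since the Unix epoch) falls
/// outside the retention window and should be archived and deleted.
pub fn is_expired(event_ts_secs: u64, now: SystemTime, retention_days: u64) -> bool {
    // Cutoff = now minus the retention window.
    let cutoff = now - Duration::from_secs(retention_days * 24 * 60 * 60);
    let cutoff_secs = cutoff.duration_since(UNIX_EPOCH).unwrap().as_secs();
    event_ts_secs < cutoff_secs
}
```

A scheduled job would scan each tenant's audit-log prefix, archive expired entries to GCS, then clear them in a transaction.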
Migration from V4 CODI2
Functionality Mapping
| V4 CODI2 Feature | V5 Implementation | Status |
|---|---|---|
| File monitoring | file-monitor sidecar | ✅ Done |
| Log aggregation | Fluentd → FDB | 🔨 To build |
| Agent coordination | Backend API /agents/ | 🔨 To build |
| Export handling | Backend API /exports/ | 🔨 To build |
| Duplicate detection | File monitor checksums | ✅ Done |
| Session attribution | FDB + JWT | ✅ Done |
| Metrics | Prometheus | ✅ Done |
Breaking Changes
V4 → V5:
- ❌ No standalone CODI2 binary
- ❌ No bash script orchestration
- ❌ No local file watchers on host
- ✅ All functionality via API + sidecars
Testing Strategy
Unit Tests
- ✅ File monitor (already exists: `src/file-monitor/tests/`)
- 🔨 Fluentd FDB plugin
- 🔨 Agent coordination API
Integration Tests
Test Scenarios:
- Single user pod: Deploy pod with sidecar, trigger file events, verify FDB writes
- Multi-tenant: 10 users, concurrent file operations, check tenant isolation
- Log aggregation: 1000 events/sec across 50 pods, verify no drops
- Agent coordination: 5 concurrent agents claiming files, detect conflicts
- Failover: Kill fluentd pod, verify buffering and recovery
Test Script (tests/integration/test-file-monitor-sidecar.sh):
```bash
#!/bin/bash
set -e

# Deploy test user pod
kubectl apply -f k8s/test/user-pod-with-monitor.yaml

# Wait for ready
kubectl wait --for=condition=Ready pod/test-user-pod -n coditect-app

# Trigger file events
kubectl exec test-user-pod -c workspace -- bash -c "
  echo 'test' > /workspace/test.txt
  echo 'update' >> /workspace/test.txt
  rm /workspace/test.txt
"

# Wait for logs to propagate
sleep 10

# Query FDB for events
curl -H "Authorization: Bearer $TEST_TOKEN" \
  http://api.coditect.ai/api/v5/agents/test-session/logs | jq

# Verify events
expected_events=3 # create, update, delete
actual_events=$(curl -s ... | jq '.events | length')
if [ "$actual_events" -eq "$expected_events" ]; then
  echo "✅ Test passed"
else
  echo "❌ Test failed: expected $expected_events, got $actual_events"
  exit 1
fi
```
Performance Tests
Load Test (tests/performance/load-test-monitoring.sh):
```bash
#!/bin/bash
# Simulate 100 users, 100 events/sec each = 10K events/sec total
for user_id in $(seq 1 100); do
  kubectl exec user-pod-$user_id -c workspace -- bash -c "
    for i in {1..100}; do
      echo 'data' > /workspace/file-\$i.txt
      sleep 0.01
    done
  " &
done
wait

# Check metrics
kubectl exec prometheus-0 -n monitoring -- promtool query instant \
  'rate(fs_monitor_events_published_total[1m])'
# Expected: ~10000 events/sec
```
Deployment Checklist
Prerequisites
- FoundationDB cluster running (3 replicas minimum)
- Prometheus operator installed
- Grafana deployed
- GKE cluster with monitoring enabled
- Docker images built and pushed to Artifact Registry
Phase 1 Deployment
- Build file-monitor Docker image
- Push to `us-central1-docker.pkg.dev/.../file-monitor:latest`
- Update user pod template with sidecar
- Deploy test user pod
- Verify file monitor starts and outputs logs
- Check resource usage (should be < 100Mi RAM)
Phase 2 Deployment
- Deploy Fluentd DaemonSet
- Install custom FDB plugin
- Configure log routing
- Deploy test with 10 user pods
- Verify logs reach FDB
- Query audit logs via API
Phase 3 Deployment
- Deploy agent coordination API
- Update frontend dashboard
- Test agent heartbeat
- Test file claim/release
- Load test with 100 users
Phase 4 Deployment
- Configure Prometheus scraping
- Import Grafana dashboards
- Set up alerts
- Test alert firing
- Document runbooks
Alternatives Considered
Alternative 1: Google Cloud Logging Only
Approach: Skip Fluentd, send all logs to GCP Cloud Logging
Pros:
- Fully managed
- Infinite scalability
- Built-in search/analytics
Cons:
- Vendor lock-in
- Cost at scale ($0.50/GB ingestion)
- No FDB integration (audit trail split)
Decision: ❌ Rejected - Audit trail must be in FDB for compliance
Alternative 2: Agent-less Monitoring (API-driven)
Approach: IDE sends file events to API directly, no sidecar
Pros:
- Simpler deployment
- Less resource usage
Cons:
- Requires IDE modification
- Misses non-IDE file changes
- IDE can bypass monitoring (security risk)
Decision: ❌ Rejected - Monitoring must be independent of IDE
Alternative 3: eBPF-based Monitoring
Approach: Use eBPF to monitor syscalls at kernel level
Pros:
- Zero overhead
- Can't be bypassed
- Detailed syscall info
Cons:
- Requires privileged containers
- Complex to implement
- Linux-only
Decision: ⏳ Future consideration for advanced monitoring
Cost Analysis
Resource Costs (100 concurrent users)
| Component | Resources | GKE Cost (us-central1) | Monthly Cost |
|---|---|---|---|
| User pods (100×) | 600m CPU, 576Mi RAM | $0.031/vCPU-hr, $0.0034/GB-hr | ~$1,350/mo |
| File monitor sidecars (100×) | 100m CPU, 64Mi RAM | Included in user pod | $0 |
| Fluentd (3 nodes) | 1.5 CPU, 768Mi RAM | $0.031/vCPU-hr | ~$35/mo |
| Prometheus | 2 CPU, 4Gi RAM | $0.031/vCPU-hr | ~$50/mo |
| Total | | | ~$1,435/mo |
Notes:
- Sidecar costs are incremental to user pod (already budgeted)
- FDB storage separate (see FDB pricing)
- Does NOT include GCP Cloud Logging (optional)
Comparison to V4 CODI2
V4 Approach (Dedicated CODI2 service):
- Central CODI2 deployment: 4 CPU, 8Gi RAM = ~$90/mo
- But: Required hostPath access (security risk)
V5 Approach (Sidecar):
- Distributed in user pods: No additional infra cost
- Better isolation and scalability
- Savings: ~$90/mo on infrastructure, PRICELESS on security
Rollout Strategy
Week 1: Internal Testing
- Deploy to `dev` namespace
- 5 internal users
- Monitor metrics and logs
- Fix bugs
Week 2: Limited Beta
- Deploy to `staging` namespace
- 20 beta users
- Load test with realistic workloads
- Tune resource limits
Week 3: Production Rollout
- Deploy to `production` namespace
- 10% traffic (canary deployment)
- Monitor error rates
- Gradual ramp to 100%
Week 4: Full Production
- All users on new architecture
- Decommission V4 CODI2 references
- Update documentation
Success Metrics
Performance Metrics
| Metric | Target | Measurement |
|---|---|---|
| Monitoring Latency | < 100ms (p99) | fs_monitor_processing_latency_us |
| Log Ingestion Rate | 10K events/sec | Fluentd throughput |
| Resource Usage | < 100Mi RAM per monitor | kubectl top pods |
| Event Drop Rate | 0% | fs_monitor_events_dropped_total |
Reliability Metrics
| Metric | Target | Measurement |
|---|---|---|
| Uptime | 99.9% | Prometheus up metric |
| Log Delivery | 99.99% | FDB query count vs event count |
| Recovery Time | < 1 min | Fluentd pod restart time |
Maintenance & Operations
Daily Operations
```bash
# Check sidecar health across all pods
kubectl get pods -n coditect-app -l component=file-monitor

# View aggregated metrics
curl http://prometheus:9090/api/v1/query?query=sum(fs_monitor_events_published_total)

# Check Fluentd buffer status
kubectl exec -it fluentd-xxx -n coditect-app -- ls -lh /var/log/fluentd-buffers/

# Query audit logs for user
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  "http://api.coditect.ai/api/v5/agents/logs?user_id=xxx&start=2025-10-19T00:00:00Z"
```
Incident Response
Scenario: File Monitor Crashing
- Check pod logs: `kubectl logs user-pod-xxx -c file-monitor -n coditect-app`
- Check resource limits: `kubectl describe pod user-pod-xxx -n coditect-app`
- Increase limits if OOMKilled
- Restart pod: `kubectl delete pod user-pod-xxx -n coditect-app` (recreated by deployment)
Scenario: Logs Not Reaching FDB
- Check Fluentd status: `kubectl logs fluentd-xxx -n coditect-app`
- Verify FDB connectivity: `kubectl exec fluentd-xxx -- fdbcli --exec "status"`
- Check buffer: `ls /var/log/fluentd-buffers/`
- Restart Fluentd if needed: `kubectl rollout restart daemonset/fluentd -n coditect-app`
Backup & Disaster Recovery
Audit Log Backup (Automated):
```yaml
# CronJob to backup audit logs to GCS
apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-log-backup
  namespace: coditect-app
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
              command:
                - /bin/bash
                - -c
                - |
                  # Query FDB for previous day's logs
                  curl -H "Authorization: Bearer $BACKUP_TOKEN" \
                    "http://api.coditect.ai/api/v5/agents/logs?start=$(date -d '1 day ago' -I)&end=$(date -I)" \
                    > /tmp/audit-logs-$(date -I).json
                  # Upload to GCS
                  gsutil cp /tmp/audit-logs-$(date -I).json gs://coditect-audit-logs/
          restartPolicy: OnFailure
```
Future Enhancements
Phase 5: Real-time Agent Coordination (Future)
WebSocket Gateway for agent events:
```
User IDE ←→ WebSocket ←→ Agent Coordination Service
     ↓                              ↓
File Monitor Sidecar      FDB (real-time queries)
```
Benefits:
- Real-time file conflict detection
- Live agent activity feed
- Immediate notifications
Phase 6: ML-Powered Anomaly Detection (Future)
Use Case: Detect unusual file activity (ransomware, data exfiltration)
Architecture:
File Events → Kafka → Stream Processor → ML Model → Alerts
Metrics:
- Unusual file access patterns
- High deletion rates
- Sensitive data exposure
Phase 7: Cross-Tenant Analytics (Future)
Use Case: Aggregate metrics across all tenants (admin view)
Queries:
- Most active users
- Popular file types
- Resource usage trends
Appendix
A. Reference Documentation
T2 File Monitor:
- Implementation: `src/file-monitor/`
- Documentation: `docs/11-analysis/file-monitor/`
- Production guide: `docs/11-analysis/file-monitor/production.md`
V4 CODI2 Reference:
- Implementation: `archive/coditect-v4/codi2/`
- Docker configs: `archive/coditect-v4/codi2/config/docker/`
- K8s patterns: `archive/coditect-v4/infrastructure/kubernetes/`
GKE Deployment:
- Current manifests: `k8s/`
- Deployment guide: `docs/06-backend/deploy.md`
B. Glossary
| Term | Definition |
|---|---|
| Sidecar | Helper container running alongside main container in same pod |
| DaemonSet | K8s workload that runs one pod per node |
| Fluentd | Open-source log aggregation tool |
| FDB | FoundationDB - distributed key-value store |
| inotify | Linux kernel subsystem for file monitoring |
| Prometheus | Metrics collection and alerting system |
C. Contact & Support
- Architecture Questions: DevOps team
- File Monitor Issues: Backend team
- FDB Schema Changes: Database team
- Security Concerns: Security team
Document Version: 1.0 | Last Updated: 2025-10-19 | Next Review: After Phase 1 deployment (Week 1)