Operations Guide

Day-to-day operations and administration guide for CODITECT DMS.


Daily Operations

Morning Checklist

  1. Check System Health

    # Health status
    curl https://dms-api.coditect.ai/health/ready

    # Pod status
    kubectl get pods -n coditect-dms
  2. Review Alerts

    • Check Grafana dashboards
    • Review PagerDuty incidents
    • Check error logs
  3. Monitor Processing Queue

    # Check tasks currently executing on the Celery workers
    kubectl exec -it deployment/dms-worker -n coditect-dms -- \
    celery -A src.backend.tasks inspect active
  4. Verify Backups

    gcloud sql backups list --instance=coditect-dms-db | head -5
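The checklist above can be scripted. The following is a minimal sketch that assembles the commands from steps 1, 3, and 4 as argument lists and reports which ones exited cleanly. The names `morning_checks` and `run_checks` are hypothetical, the `-it` flags are dropped because the script is non-interactive, and actually running it assumes `curl`, `kubectl`, and `gcloud` are installed and authenticated.

```python
import subprocess

NAMESPACE = "coditect-dms"


def morning_checks():
    """Return the checklist commands as (label, argv) pairs, in checklist order."""
    return [
        ("health", ["curl", "--fail", "--silent",
                    "https://dms-api.coditect.ai/health/ready"]),
        ("pods", ["kubectl", "get", "pods", "-n", NAMESPACE]),
        ("queue", ["kubectl", "exec", "deployment/dms-worker", "-n", NAMESPACE, "--",
                   "celery", "-A", "src.backend.tasks", "inspect", "active"]),
        ("backups", ["gcloud", "sql", "backups", "list",
                     "--instance=coditect-dms-db"]),
    ]


def run_checks(runner=subprocess.run):
    """Run every command and map its label to True when the exit code is 0."""
    results = {}
    for label, argv in morning_checks():
        completed = runner(argv, capture_output=True, text=True)
        results[label] = completed.returncode == 0
    return results
```

A zero exit code only shows that the command ran; the pod list and backup list still need a human look.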

Monitoring

Grafana Dashboards

Access Grafana at: https://monitoring.coditect.ai

Key Dashboards:

  • API Overview: Request rate, latency, errors
  • Resource Utilization: CPU, memory, disk
  • Business Metrics: Documents, searches, embeddings
  • SLO Dashboard: Availability, latency SLOs

Key Metrics

Metric           Warning   Critical   Action
API Error Rate   >5%       >10%       Check logs, scale
P95 Latency      >1s       >2s        Scale, optimize
CPU Usage        >70%      >85%       Scale out
Memory Usage     >80%      >90%       Scale up/out
Queue Depth      >100      >500       Scale workers
Disk Usage       >70%      >85%       Expand storage
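The thresholds in this table can be encoded directly, which helps when checking a reading by hand or from a script. This is a sketch only: the metric keys and the `classify` function are hypothetical names, and the comparison is strictly "greater than" to match the `>` signs in the table.

```python
# (warning, critical) thresholds copied from the Key Metrics table.
THRESHOLDS = {
    "api_error_rate_pct": (5, 10),
    "p95_latency_s": (1, 2),
    "cpu_usage_pct": (70, 85),
    "memory_usage_pct": (80, 90),
    "queue_depth": (100, 500),
    "disk_usage_pct": (70, 85),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a single metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, `classify("cpu_usage_pct", 72)` falls in the warning band, so the table's action is to scale out.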

Alert Response

High Error Rate

# 1. Check recent logs
kubectl logs -f deployment/coditect-dms-api -n coditect-dms --since=10m | grep ERROR

# 2. Check pod health
kubectl describe pods -l app=dms-api -n coditect-dms

# 3. Scale if needed
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms

High Latency

# 1. Check database connections
kubectl exec -it deployment/coditect-dms-api -n coditect-dms -- \
python -c "from src.backend.database import get_stats; print(get_stats())"

# 2. Check Redis latency
redis-cli -h $REDIS_HOST --latency

# 3. Scale API pods
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms

Common Operations

Scaling

Scale API Pods

# Manual scale
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms

# Check HPA status
kubectl get hpa -n coditect-dms

Scale Workers

kubectl scale deployment dms-worker --replicas=10 -n coditect-dms

Adjust HPA Limits

kubectl patch hpa dms-api-hpa -n coditect-dms \
-p '{"spec":{"maxReplicas":30}}'
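Hand-written JSON patches are easy to get wrong inside shell quotes. As a sketch, the patch can be generated with `json.dumps` and quoted with `shlex.join`. The function name `hpa_patch_command` is hypothetical; the HPA name and namespace are the ones used above.

```python
import json
import shlex


def hpa_patch_command(max_replicas, name="dms-api-hpa", namespace="coditect-dms"):
    """Build the kubectl command that sets maxReplicas on an HPA."""
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    # Compact separators reproduce the patch exactly as written in the guide.
    patch = json.dumps({"spec": {"maxReplicas": max_replicas}}, separators=(",", ":"))
    argv = ["kubectl", "patch", "hpa", name, "-n", namespace, "-p", patch]
    return shlex.join(argv)
```

`hpa_patch_command(30)` yields the same command line shown above, ready to paste into a shell.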

Deployments

Rolling Update

# Update image
kubectl set image deployment/coditect-dms-api \
api=gcr.io/coditect-prod/dms-api:1.0.1 \
-n coditect-dms

# Monitor rollout
kubectl rollout status deployment/coditect-dms-api -n coditect-dms

Rollback

# Rollback to previous
kubectl rollout undo deployment/coditect-dms-api -n coditect-dms

# Rollback to specific revision
kubectl rollout undo deployment/coditect-dms-api -n coditect-dms --to-revision=5

Restart All Pods

kubectl rollout restart deployment/coditect-dms-api -n coditect-dms

Database Operations

Connect to Database

# Start proxy
cloud_sql_proxy -instances=coditect-prod:us-central1:coditect-dms-db=tcp:5432 &

# Connect
psql "postgresql://dms_user:PASSWORD@localhost:5432/dms"
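If the password contains characters such as `@`, `/`, or `:`, the connection URL above must percent-encode them or `psql` will misparse it. A small sketch, assuming the same user, port, and database as above (`build_dsn` is a hypothetical helper):

```python
from urllib.parse import quote


def build_dsn(user, password, host="localhost", port=5432, database="dms"):
    """Return a PostgreSQL connection URL with user and password percent-encoded."""
    # safe="" forces reserved characters like '/' and '@' to be encoded too.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )
```

The result can be passed to `psql` in place of the literal URL.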

Run Migrations

# Check current version
alembic current

# Run pending migrations
alembic upgrade head

# Create new migration
alembic revision --autogenerate -m "description"

Database Maintenance

-- Vacuum tables
VACUUM ANALYZE document;
VACUUM ANALYZE chunk;

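-- Reindex (blocks writes to the table while it runs; on PostgreSQL 12+ REINDEX INDEX CONCURRENTLY avoids that)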
REINDEX INDEX idx_chunk_embedding;

-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Cache Management

Clear Redis Cache

# Connect to Redis
redis-cli -h $REDIS_HOST

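# Clear specific keys (--scan iterates incrementally; KEYS can block Redis on a large keyspace)
redis-cli -h "$REDIS_HOST" --scan --pattern "search:*" | xargs redis-cli -h "$REDIS_HOST" DEL

# Flush the currently selected database (DANGEROUS: deletes every key in it)
redis-cli -h "$REDIS_HOST" FLUSHDB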
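Pattern-based deletion can also be done from Python in batches, using incremental SCAN so Redis is never blocked for long. The sketch below assumes a client object that exposes `scan_iter(match=..., count=...)` and `delete(*keys)`, as redis-py clients do; `delete_by_pattern` itself is a hypothetical helper.

```python
def delete_by_pattern(client, pattern, batch_size=500):
    """Delete all keys matching pattern in batches; return how many were removed."""
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        # Flush the final partial batch.
        deleted += client.delete(*batch)
    return deleted
```

With redis-py installed, a call might look like `delete_by_pattern(redis.Redis(host=...), "search:*")`.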

Check Cache Stats

redis-cli -h $REDIS_HOST INFO stats
redis-cli -h $REDIS_HOST INFO memory

Log Management

View Logs

# API logs
kubectl logs -f deployment/coditect-dms-api -n coditect-dms

# Worker logs
kubectl logs -f deployment/dms-worker -n coditect-dms

# Filter errors
kubectl logs deployment/coditect-dms-api -n coditect-dms | grep ERROR

# Export logs
kubectl logs deployment/coditect-dms-api -n coditect-dms --since=1h > api-logs.txt

Cloud Logging Query

resource.type="k8s_container"
resource.labels.namespace_name="coditect-dms"
severity>=ERROR
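In the Cloud Logging query language, clauses on separate lines are combined with AND. As a sketch, the query above can be generated for other namespaces or severities. `build_filter` is a hypothetical helper, and the optional `container_name` label is an addition not present in the original query.

```python
SEVERITIES = ["DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
              "ERROR", "CRITICAL", "ALERT", "EMERGENCY"]


def build_filter(namespace="coditect-dms", min_severity="ERROR", container=None):
    """Return a Cloud Logging filter; each line is an implicitly ANDed clause."""
    if min_severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {min_severity}")
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.namespace_name="{namespace}"',
        f"severity>={min_severity}",
    ]
    if container:
        # Optional narrowing to one container (assumption, not in the guide's query).
        clauses.append(f'resource.labels.container_name="{container}"')
    return "\n".join(clauses)
```

The returned string can be pasted into the Logs Explorer or passed to `gcloud logging read`.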

User Management

Add User to Tenant

Using API:

curl -X POST https://dms-api.coditect.ai/api/v1/tenants/me/users \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"email": "new.user@company.com",
"name": "New User",
"role": "editor"
}'

Reset User Password

# Generate password reset link
curl -X POST https://dms-api.coditect.ai/api/v1/auth/password-reset \
-H "Content-Type: application/json" \
-d '{"email": "user@company.com"}'

Deactivate User

curl -X PATCH https://dms-api.coditect.ai/api/v1/tenants/me/users/{user_id} \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"is_active": false}'

Rotate API Keys

# List current keys
curl https://dms-api.coditect.ai/api/v1/tenants/me/api-keys \
-H "Authorization: Bearer $ADMIN_TOKEN"

# Create new key
curl -X POST https://dms-api.coditect.ai/api/v1/tenants/me/api-keys \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "Rotated Key", "scopes": ["read", "search", "write"]}'

# Delete old key
curl -X DELETE https://dms-api.coditect.ai/api/v1/tenants/me/api-keys/{key_id} \
-H "Authorization: Bearer $ADMIN_TOKEN"
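The create and delete steps can be prepared as request objects before anything is sent, which makes the rotation easier to review. This is a sketch using only the standard library; `rotation_requests` is a hypothetical helper, and the endpoint, headers, and payload mirror the curl commands above.

```python
import json
from urllib.request import Request

BASE_URL = "https://dms-api.coditect.ai/api/v1/tenants/me/api-keys"


def rotation_requests(admin_token, old_key_id, name="Rotated Key",
                      scopes=("read", "search", "write")):
    """Return [create_request, delete_request] for rotating one API key."""
    auth = {"Authorization": f"Bearer {admin_token}"}
    body = json.dumps({"name": name, "scopes": list(scopes)}).encode("utf-8")
    create = Request(BASE_URL, data=body, method="POST",
                     headers={**auth, "Content-Type": "application/json"})
    delete = Request(f"{BASE_URL}/{old_key_id}", method="DELETE", headers=auth)
    return [create, delete]
```

Each request can be sent with `urllib.request.urlopen`. Sending the DELETE only after clients have switched to the new key avoids an outage window.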

Tenant Management

View Tenant Usage

curl https://dms-api.coditect.ai/api/v1/tenants/me/usage \
-H "Authorization: Bearer $ADMIN_TOKEN"

Update Tenant Tier

-- Connect to database
UPDATE tenant
SET subscription_tier = 'enterprise'
WHERE id = 'tenant-uuid';

Suspend Tenant

UPDATE tenant
SET status = 'suspended'
WHERE id = 'tenant-uuid';

Document Reprocessing

Reprocess Single Document

curl -X POST https://dms-api.coditect.ai/api/v1/documents/{doc_id}/reprocess \
-H "X-API-Key: $API_KEY"

Bulk Reprocess

-- Mark documents for reprocessing
UPDATE document
SET status = 'pending', updated_at = NOW()
WHERE tenant_id = 'tenant-uuid'
AND document_type = 'guide';

Clear Failed Jobs

# Purge ALL waiting tasks from the queue, not only failed ones (-f skips the confirmation prompt)
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks purge -f

Maintenance Windows

Planned Maintenance

  1. Notify Users (24h before)

    • Update status page
    • Send email notification
  2. Pre-Maintenance

    # Scale down API
    kubectl scale deployment coditect-dms-api --replicas=1 -n coditect-dms

    # Stop workers (tasks already queued stay in the broker until workers return)
    kubectl scale deployment dms-worker --replicas=0 -n coditect-dms
  3. Perform Maintenance

    • Database maintenance
    • Infrastructure updates
    • Security patches
  4. Post-Maintenance

    # Verify database
    alembic current

    # Scale up
    kubectl scale deployment coditect-dms-api --replicas=3 -n coditect-dms
    kubectl scale deployment dms-worker --replicas=3 -n coditect-dms

    # Verify health
    curl https://dms-api.coditect.ai/health/ready
  5. Update Status Page
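After scaling back up, pods need time to pass readiness, so a single health request right away can fail spuriously. The sketch below polls until a caller-supplied check succeeds. `wait_until_ready` is a hypothetical helper, and because the format of the `/health/ready` response is not documented here, the check itself is left to the caller.

```python
import time


def wait_until_ready(check, attempts=10, delay_s=5.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            # Wait before the next try, but not after the last one.
            sleep(delay_s)
    return None
```

A `check` could, for instance, wrap an HTTP GET to the readiness URL and return True on a 200 response.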


Security Operations

Rotate Secrets

# Generate new JWT secret (openssl wraps base64 output at 64 characters, so strip the embedded newline)
NEW_SECRET=$(openssl rand -base64 64 | tr -d '\n')

# Update in Secret Manager
echo -n "$NEW_SECRET" | \
gcloud secrets versions add dms-jwt-secret --data-file=-

# Restart pods to pick up new secret
kubectl rollout restart deployment/coditect-dms-api -n coditect-dms
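Where `openssl` is not available, an equivalent secret can be produced with the Python standard library. A minimal sketch (`new_jwt_secret` is a hypothetical name): 64 random bytes from the OS CSPRNG, base64-encoded onto a single line.

```python
import base64
import secrets


def new_jwt_secret(num_bytes=64):
    """Return num_bytes of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")
```

The output contains no line breaks, so it can be piped straight into `gcloud secrets versions add`.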

Review Audit Logs

# Cloud Audit Logs
gcloud logging read 'logName:"cloudaudit.googleapis.com"' \
--project=coditect-prod \
--limit=100

Security Incident Response

See disaster-recovery-runbook.md for security incident procedures.


Backup Verification

Weekly Backup Test

  1. Create Test Database

    # Clone the production instance (a clone copies current data rather than a stored backup)
    gcloud sql instances clone coditect-dms-db dms-backup-test
  2. Verify Data

    # Connect and check
    cloud_sql_proxy -instances=coditect-prod:us-central1:dms-backup-test=tcp:5433 &

    psql "postgresql://dms_user:PASSWORD@localhost:5433/dms" -c "SELECT COUNT(*) FROM document;"
  3. Cleanup

    gcloud sql instances delete dms-backup-test
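A single `COUNT(*)` is easier to judge when compared against production. As a sketch, assuming row counts per table have been collected from both instances into dictionaries, the hypothetical helper below flags tables whose restored count is missing or deviates by more than a tolerance (1% by default, an arbitrary choice).

```python
def count_mismatches(expected, restored, tolerance=0.01):
    """Return {table: (expected_count, restored_count)} for tables that differ.

    A table is flagged when it is missing from the restored counts or when the
    absolute difference exceeds tolerance * expected_count.
    """
    mismatches = {}
    for table, want in expected.items():
        got = restored.get(table)
        if got is None or abs(got - want) > tolerance * max(want, 1):
            mismatches[table] = (want, got)
    return mismatches
```

An empty result means every table is within tolerance. Some drift is normal when production keeps receiving writes after the copy was taken.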

Troubleshooting

Pod CrashLoopBackOff

# Check logs
kubectl logs <pod-name> -n coditect-dms --previous

# Check events
kubectl describe pod <pod-name> -n coditect-dms

# Common causes:
# - Database connection issues
# - Missing secrets
# - OOM killed (increase memory)

Database Connection Issues

# Test connection from pod
kubectl exec -it deployment/coditect-dms-api -n coditect-dms -- \
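python -c "
import asyncio
import os
import asyncpg

async def test():
    # Assumes the pod exposes the connection string as DATABASE_URL
    conn = await asyncpg.connect(os.environ['DATABASE_URL'])
    try:
        print(await conn.fetchval('SELECT 1'))
    finally:
        await conn.close()

asyncio.run(test())
"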

# Check Cloud SQL instance state (per-connection detail is available from pg_stat_activity)
gcloud sql instances describe coditect-dms-db \
--format="value(state)"

Slow Searches

-- Check embedding index
EXPLAIN ANALYZE
SELECT id, embedding <-> '[0.1, 0.2, ...]'::vector AS distance
FROM chunk
ORDER BY distance
LIMIT 10;

-- Rebuild index if needed
REINDEX INDEX CONCURRENTLY idx_chunk_embedding;
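Assuming the `vector` type comes from the pgvector extension, `<->` is Euclidean (L2) distance. A brute-force reference in Python can be used to sanity-check what the index returns on a small sample. This is a sketch with hypothetical names, not the service's search code.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.dist(a, b)


def nearest(query, chunks, limit=10):
    """Return the `limit` closest (chunk_id, embedding) pairs, nearest first."""
    return sorted(chunks, key=lambda item: l2_distance(query, item[1]))[:limit]
```

If an approximate index returns very different IDs from this exact ranking on the same sample, its build or query parameters may need tuning before a rebuild.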

Queue Backlog

# Check tasks reserved (prefetched) by workers but not yet started
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks inspect reserved

# Scale workers
kubectl scale deployment dms-worker --replicas=10 -n coditect-dms

# Purge all waiting tasks (irreversible; this also discards healthy queued work)
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks purge -f

Runbooks


Support Contacts

Role               Contact           When
On-Call Engineer   PagerDuty         24/7 for P0/P1
Engineering Lead   eng-lead@az1.ai   Business hours
Security Lead      security@az1.ai   Security issues
CTO                1@az1.ai          Escalations