Operations Guide
Day-to-day operations and administration guide for CODITECT DMS.
Daily Operations
Morning Checklist
1. Check System Health
# Health status
curl https://dms-api.coditect.ai/health/ready
# Pod status
kubectl get pods -n coditect-dms
2. Review Alerts
- Check Grafana dashboards
- Review PagerDuty incidents
- Check error logs
3. Monitor Processing Queue
# List tasks currently executing on the Celery workers
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks inspect active
4. Verify Backups
gcloud sql backups list --instance=coditect-dms-db | head -5
Monitoring
Grafana Dashboards
Access Grafana at: https://monitoring.coditect.ai
Key Dashboards:
- API Overview: Request rate, latency, errors
- Resource Utilization: CPU, memory, disk
- Business Metrics: Documents, searches, embeddings
- SLO Dashboard: Availability, latency SLOs
Key Metrics
| Metric | Warning | Critical | Action |
|---|---|---|---|
| API Error Rate | >5% | >10% | Check logs, scale |
| P95 Latency | >1s | >2s | Scale, optimize |
| CPU Usage | >70% | >85% | Scale out |
| Memory Usage | >80% | >90% | Scale up/out |
| Queue Depth | >100 | >500 | Scale workers |
| Disk Usage | >70% | >85% | Expand storage |
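The thresholds in the table can also be encoded for scripted checks. The following is a minimal sketch, not part of the DMS codebase: the metric key names and the `classify` helper are hypothetical, and only the numeric thresholds and actions come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    action: str


# Thresholds copied from the Key Metrics table.
# The dictionary keys are illustrative names, not real metric identifiers.
THRESHOLDS = {
    "api_error_rate_pct": Threshold(5, 10, "Check logs, scale"),
    "p95_latency_s": Threshold(1, 2, "Scale, optimize"),
    "cpu_usage_pct": Threshold(70, 85, "Scale out"),
    "memory_usage_pct": Threshold(80, 90, "Scale up/out"),
    "queue_depth": Threshold(100, 500, "Scale workers"),
    "disk_usage_pct": Threshold(70, 85, "Expand storage"),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single metric reading."""
    threshold = THRESHOLDS[metric]
    if value > threshold.critical:
        return "critical"
    if value > threshold.warning:
        return "warning"
    return "ok"
```

The comparisons are strict, matching the `>` signs in the table, so a reading exactly at a threshold stays in the lower severity.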
Alert Response
High Error Rate
# 1. Check recent logs
kubectl logs -f deployment/coditect-dms-api -n coditect-dms --since=10m | grep ERROR
# 2. Check pod health
kubectl describe pods -l app=dms-api -n coditect-dms
# 3. Scale if needed
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms
High Latency
# 1. Check database connections
kubectl exec -it deployment/coditect-dms-api -n coditect-dms -- \
python -c "from src.backend.database import get_stats; print(get_stats())"
# 2. Check Redis latency
redis-cli -h $REDIS_HOST --latency
# 3. Scale API pods
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms
Common Operations
Scaling
Scale API Pods
# Manual scale
kubectl scale deployment coditect-dms-api --replicas=10 -n coditect-dms
# Check HPA status
kubectl get hpa -n coditect-dms
Scale Workers
kubectl scale deployment dms-worker --replicas=10 -n coditect-dms
Adjust HPA Limits
kubectl patch hpa dms-api-hpa -n coditect-dms \
-p '{"spec":{"maxReplicas":30}}'
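To see how `maxReplicas` bounds autoscaling, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation (ceiling of current replicas times the ratio of observed to target metric), then clamps it to the configured limits. It ignores the HPA tolerance band and stabilization windows, and the numbers in the usage note are hypothetical, not the values configured for `dms-api-hpa`.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Approximate the HPA replica count, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Core HPA rule: scale in proportion to how far the metric is from its target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, with 10 replicas at 90% CPU against a hypothetical 60% target, the result is 15. If the raw value exceeds `max_replicas`, the HPA stops at the limit, which is the case where raising `maxReplicas` helps.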
Deployments
Rolling Update
# Update image
kubectl set image deployment/coditect-dms-api \
api=gcr.io/coditect-prod/dms-api:1.0.1 \
-n coditect-dms
# Monitor rollout
kubectl rollout status deployment/coditect-dms-api -n coditect-dms
Rollback
# Rollback to previous
kubectl rollout undo deployment/coditect-dms-api -n coditect-dms
# Rollback to specific revision
kubectl rollout undo deployment/coditect-dms-api -n coditect-dms --to-revision=5
Restart All Pods
kubectl rollout restart deployment/coditect-dms-api -n coditect-dms
Database Operations
Connect to Database
# Start proxy
cloud_sql_proxy -instances=coditect-prod:us-central1:coditect-dms-db=tcp:5432 &
# Connect
psql "postgresql://dms_user:PASSWORD@localhost:5432/dms"
Run Migrations
# Check current version
alembic current
# Run pending migrations
alembic upgrade head
# Create new migration
alembic revision --autogenerate -m "description"
Database Maintenance
-- Vacuum tables
VACUUM ANALYZE document;
VACUUM ANALYZE chunk;
-- Reindex (CONCURRENTLY avoids blocking writes; requires PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_chunk_embedding;
-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
Cache Management
Clear Redis Cache
# Connect to Redis
redis-cli -h $REDIS_HOST
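# Clear specific keys (uses SCAN, which does not block Redis the way KEYS does)
redis-cli -h $REDIS_HOST --scan --pattern "search:*" | xargs redis-cli -h $REDIS_HOST DEL
# Flush the currently selected database (DANGEROUS: deletes every key in it)
redis-cli -h $REDIS_HOST FLUSHDB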
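When keys need to be cleared from a script rather than the shell, the same pattern-based deletion can be done in batches. This is a sketch under assumptions: `delete_by_pattern` is a hypothetical helper, and `client` is expected to behave like a redis-py client (`scan_iter(match=..., count=...)` and `delete(*names)`).

```python
from typing import List


def delete_by_pattern(client, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in batches and return how many were removed.

    `client` is assumed to expose redis-py style `scan_iter` and `delete`.
    """
    deleted = 0
    batch: List[str] = []
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    # Flush the final partial batch, if any.
    if batch:
        deleted += client.delete(*batch)
    return deleted


# Usage (requires the redis package and a reachable server):
#   import os, redis
#   client = redis.Redis(host=os.environ["REDIS_HOST"])
#   print(delete_by_pattern(client, "search:*"))
```

Batching keeps each `DEL` call small, which limits how long any single command occupies the server.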
Check Cache Stats
redis-cli -h $REDIS_HOST INFO stats
redis-cli -h $REDIS_HOST INFO memory
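A common figure to derive from `INFO stats` is the cache hit ratio, computed from the `keyspace_hits` and `keyspace_misses` fields. The sketch below parses the text output of the command; the helper names are hypothetical.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse `redis-cli INFO` output into a dict, skipping section headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def hit_ratio(info_text: str) -> Optional[float]:
    """Return hits / (hits + misses), or None when there have been no lookups."""
    fields = parse_info(info_text)
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

The input would typically come from capturing the output of `redis-cli -h $REDIS_HOST INFO stats`.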
Log Management
View Logs
# API logs
kubectl logs -f deployment/coditect-dms-api -n coditect-dms
# Worker logs
kubectl logs -f deployment/dms-worker -n coditect-dms
# Filter errors
kubectl logs deployment/coditect-dms-api -n coditect-dms | grep ERROR
# Export logs
kubectl logs deployment/coditect-dms-api -n coditect-dms --since=1h > api-logs.txt
Cloud Logging Query
resource.type="k8s_container"
resource.labels.namespace_name="coditect-dms"
severity>=ERROR
User Management
Add User to Tenant
Using API:
curl -X POST https://dms-api.coditect.ai/api/v1/tenants/me/users \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"email": "new.user@company.com",
"name": "New User",
"role": "editor"
}'
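For scripted onboarding, the same request can be built with the Python standard library. This is a sketch that mirrors the curl call above; `build_add_user_request` is a hypothetical helper, and the response format is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "https://dms-api.coditect.ai/api/v1"


def build_add_user_request(
    token: str, email: str, name: str, role: str
) -> urllib.request.Request:
    """Build the POST request that adds a user to the caller's tenant."""
    body = json.dumps({"email": email, "name": name, "role": role}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/tenants/me/users",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Sending the request (network access and a valid admin token required):
#   import os
#   req = build_add_user_request(os.environ["ADMIN_TOKEN"],
#                                "new.user@company.com", "New User", "editor")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```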
Reset User Password
# Generate password reset link
curl -X POST https://dms-api.coditect.ai/api/v1/auth/password-reset \
-H "Content-Type: application/json" \
-d '{"email": "user@company.com"}'
Deactivate User
curl -X PATCH https://dms-api.coditect.ai/api/v1/tenants/me/users/{user_id} \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"is_active": false}'
Rotate API Keys
# List current keys
curl https://dms-api.coditect.ai/api/v1/tenants/me/api-keys \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Create new key
curl -X POST https://dms-api.coditect.ai/api/v1/tenants/me/api-keys \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "Rotated Key", "scopes": ["read", "search", "write"]}'
# Delete old key
curl -X DELETE https://dms-api.coditect.ai/api/v1/tenants/me/api-keys/{key_id} \
-H "Authorization: Bearer $ADMIN_TOKEN"
Tenant Management
View Tenant Usage
curl https://dms-api.coditect.ai/api/v1/tenants/me/usage \
-H "Authorization: Bearer $ADMIN_TOKEN"
Update Tenant Tier
-- Run from a psql session (see Connect to Database)
UPDATE tenant
SET subscription_tier = 'enterprise'
WHERE id = 'tenant-uuid';
Suspend Tenant
UPDATE tenant
SET status = 'suspended'
WHERE id = 'tenant-uuid';
Document Reprocessing
Reprocess Single Document
curl -X POST https://dms-api.coditect.ai/api/v1/documents/{doc_id}/reprocess \
-H "X-API-Key: $API_KEY"
Bulk Reprocess
-- Mark documents for reprocessing
UPDATE document
SET status = 'pending', updated_at = NOW()
WHERE tenant_id = 'tenant-uuid'
AND document_type = 'guide';
Clear Failed Jobs
# Purge all waiting tasks from the queue (irreversible; removes every queued task, not only failed ones)
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks purge -f
Maintenance Windows
Planned Maintenance
1. Notify Users (24h before)
- Update status page
- Send email notification
2. Pre-Maintenance
# Scale down API
kubectl scale deployment coditect-dms-api --replicas=1 -n coditect-dms
# Stop workers (tasks already queued stay in the broker until workers are scaled back up)
kubectl scale deployment dms-worker --replicas=0 -n coditect-dms
3. Perform Maintenance
- Database maintenance
- Infrastructure updates
- Security patches
4. Post-Maintenance
# Verify database migration version
alembic current
# Scale up
kubectl scale deployment coditect-dms-api --replicas=3 -n coditect-dms
kubectl scale deployment dms-worker --replicas=3 -n coditect-dms
# Verify health
curl https://dms-api.coditect.ai/health/ready
5. Update Status Page
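After scaling back up, pods may take some time to pass readiness, so a single health request can fail spuriously. The sketch below polls until the readiness endpoint responds or a timeout expires. The helper names, the 5-minute timeout, and the assumption that readiness is signalled by HTTP 200 are illustrative, not taken from the DMS implementation.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if `url` answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError, HTTPError, and socket timeouts.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage:
#   ok = wait_until_ready(lambda: http_ready("https://dms-api.coditect.ai/health/ready"))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.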
Security Operations
Rotate Secrets
# Generate new JWT secret (tr strips the line break openssl inserts into long base64 output)
NEW_SECRET=$(openssl rand -base64 64 | tr -d '\n')
# Update in Secret Manager
echo -n "$NEW_SECRET" | \
gcloud secrets versions add dms-jwt-secret --data-file=-
# Restart pods to pick up new secret
kubectl rollout restart deployment/coditect-dms-api -n coditect-dms
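If the secret is generated from Python tooling instead of openssl, an equivalent value can be produced with the standard library. This is a minimal sketch; `generate_secret` is a hypothetical helper, and the 64-byte size simply matches the openssl command above.

```python
import base64
import secrets


def generate_secret(num_bytes: int = 64) -> str:
    """Return `num_bytes` of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")
```

The result is a single line with no embedded newline, so it can be piped to `gcloud secrets versions add` unchanged.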
Review Audit Logs
# Cloud Audit Logs
gcloud logging read 'logName:"cloudaudit.googleapis.com"' \
--project=coditect-prod \
--limit=100
Security Incident Response
See disaster-recovery-runbook.md for security incident procedures.
Backup Verification
Weekly Backup Test
1. Create Test Database
# Clone the source instance (this copies its current data; it does not restore a stored backup)
gcloud sql instances clone coditect-dms-db dms-backup-test
2. Verify Data
# Connect and check
cloud_sql_proxy -instances=coditect-prod:us-central1:dms-backup-test=tcp:5433 &
psql "postgresql://dms_user:PASSWORD@localhost:5433/dms" -c "SELECT COUNT(*) FROM document;"
3. Cleanup
gcloud sql instances delete dms-backup-test
Troubleshooting
Pod CrashLoopBackOff
# Check logs
kubectl logs <pod-name> -n coditect-dms --previous
# Check events
kubectl describe pod <pod-name> -n coditect-dms
# Common causes:
# - Database connection issues
# - Missing secrets
# - OOM killed (increase memory)
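To confirm the OOM case without reading through `describe` output, the pod's JSON can be inspected for containers whose last termination reason is `OOMKilled`. This sketch assumes the input comes from `kubectl get pod <pod-name> -n coditect-dms -o json`; `oom_killed_containers` is a hypothetical helper.

```python
import json
from typing import List


def oom_killed_containers(pod_json: str) -> List[str]:
    """Return names of containers whose previous run ended with reason OOMKilled."""
    pod = json.loads(pod_json)
    names: List[str] = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present after a container has exited.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names
```

An empty result points toward the other causes listed above, such as database connectivity or missing secrets.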
Database Connection Issues
# Test connection from pod
kubectl exec -it deployment/coditect-dms-api -n coditect-dms -- \
python -c "
import asyncio
import os
import asyncpg

async def test():
    # Assumes the pod exposes the connection string as DATABASE_URL
    conn = await asyncpg.connect(os.environ['DATABASE_URL'])
    try:
        print(await conn.fetchval('SELECT 1'))
    finally:
        await conn.close()

asyncio.run(test())
"
# Check Cloud SQL instance state (RUNNABLE means the instance is up)
gcloud sql instances describe coditect-dms-db \
--format="value(state)"
Slow Searches
-- Check embedding index
EXPLAIN ANALYZE
SELECT id, embedding <-> '[0.1, 0.2, ...]'::vector AS distance
FROM chunk
ORDER BY distance
LIMIT 10;
-- Rebuild index if needed
REINDEX INDEX CONCURRENTLY idx_chunk_embedding;
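For reference, the pgvector `<->` operator used in the query above is Euclidean (L2) distance. The sketch below computes the same ordering by brute force over in-memory vectors, which can help when sanity-checking results on a small sample. The function names and the `(id, vector)` tuple layout are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    limit: int = 10,
) -> List[str]:
    """Return chunk ids ordered by ascending distance to `query`, like ORDER BY ... LIMIT."""
    ranked = sorted(chunks, key=lambda item: l2_distance(query, item[1]))
    return [chunk_id for chunk_id, _ in ranked[:limit]]
```

This scans every vector, which is what the database falls back to when the embedding index is missing or unused; the `EXPLAIN ANALYZE` output shows which path was taken.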
Queue Backlog
# Check tasks reserved (prefetched) by workers; messages still waiting in the broker are not listed
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks inspect reserved
# Scale workers
kubectl scale deployment dms-worker --replicas=10 -n coditect-dms
# Purge all waiting tasks (irreversible; removes every queued task, not only stuck ones)
kubectl exec -it deployment/dms-worker -n coditect-dms -- \
celery -A src.backend.tasks purge -f
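Before scaling or purging, it can help to see how reserved tasks are spread across workers. Celery's Python API returns this as a mapping of worker name to task list via `app.control.inspect().reserved()`, or `None` when no worker replies. The sketch below summarizes such a mapping; `backlog_by_worker` is a hypothetical helper, and the import path of the Celery app object in the usage note is an assumption.

```python
from typing import Dict, List, Optional


def backlog_by_worker(reserved: Optional[Dict[str, List[dict]]]) -> Dict[str, int]:
    """Count reserved tasks per worker; an empty dict means no worker replied."""
    if not reserved:
        return {}
    return {worker: len(tasks) for worker, tasks in reserved.items()}


# Usage inside a worker pod (the app import path is assumed, adjust to the project):
#   from src.backend.tasks import app
#   print(backlog_by_worker(app.control.inspect().reserved()))
```

A heavily uneven distribution suggests a few long-running tasks holding prefetched work, in which case adding workers helps less than the totals imply.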
Runbooks
Support Contacts
| Role | Contact | When |
|---|---|---|
| On-Call Engineer | PagerDuty | 24/7 for P0/P1 |
| Engineering Lead | eng-lead@az1.ai | Business hours |
| Security Lead | security@az1.ai | Security issues |
| CTO | 1@az1.ai | Escalations |