Post-Deploy Canary Monitor
Purpose
- Continuously monitor canary deployment metrics against baseline (previous production version)
- Run statistical tests (Mann-Whitney U, Kolmogorov-Smirnov) to detect performance/error regressions
- Auto-advance traffic to the canary when tests detect no significant regression (p > 0.05)
- Trigger automatic rollback if critical metrics degrade with high significance (p < 0.001)
- Generate detailed canary analysis report with statistical evidence and recommendations
Trigger
| Property | Value |
|---|---|
| Event | post-deploy-canary, continuous during soak period |
| Blocking | No (can trigger rollback via side effect) |
| Timeout | 1800s (30 minutes) |
| Failure Mode | If regression detected: Auto-rollback (requires ops confirmation) |
| Canary Soak Period | Configurable (default: 5 minutes @ 5% traffic, 5 minutes @ 25%, 5 minutes @ 100%) |
Behavior
When Triggered
The hook executes continuously during the canary soak period. It:
1. Collects Metrics: Samples from canary and baseline simultaneously
   - Response latency (p50, p95, p99 percentiles)
   - Error rate (4xx, 5xx by endpoint)
   - Throughput (requests/sec)
   - Resource utilization (CPU, memory, disk I/O)
2. Statistical Analysis: Applies hypothesis tests
   - Mann-Whitney U Test: Compares latency distributions (non-parametric)
   - Kolmogorov-Smirnov Test: Detects distribution shape changes
   - Chi-Square Test: Analyzes error rate differences
   - Threshold: p-value > 0.05 for "no significant difference"
3. Traffic Shifting Decisions:
   - Phase 1 (5% traffic, 5 min): Collect baseline samples
   - Phase 2 (25% traffic, 5 min): Run statistical tests
   - Phase 3 (100% traffic if passing, 5 min): Final validation or rollback
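The Phase 2 tests above can be sketched with SciPy. This is a minimal illustration, not the hook's implementation; the sample data and totals below are invented, and real inputs come from the metrics store:

```python
# Minimal sketch of the Phase 2 statistical comparison using SciPy.
# Sample data is synthetic; the pass threshold mirrors the config default.
from scipy import stats

def compare_latency(baseline_ms, canary_ms, pass_p=0.05):
    """Mann-Whitney U test on two latency samples (non-parametric)."""
    _, p = stats.mannwhitneyu(baseline_ms, canary_ms, alternative="two-sided")
    p = float(p)
    return p, p > pass_p  # p > 0.05 means "no significant difference"

def compare_error_rates(base_err, base_total, can_err, can_total, pass_p=0.05):
    """Chi-square test on a 2x2 table of error vs. success counts."""
    table = [[base_err, base_total - base_err],
             [can_err, can_total - can_err]]
    _, p, _, _ = stats.chi2_contingency(table)
    p = float(p)
    return p, p > pass_p

baseline = [42, 40, 45, 43, 41, 44, 42, 46, 39, 43]
canary = [41, 42, 44, 40, 43, 42, 45, 41, 40, 44]
p, ok = compare_latency(baseline, canary)
print(f"latency p={p:.3f} pass={ok}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which is conservative when error counts are small.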
Configuration
Create .coditect/config/canary-monitor-hook.json:
{
"enabled": true,
"timeout_seconds": 1800,
"canary_phases": [
{
"phase": 1,
"traffic_percentage": 5,
"duration_seconds": 300,
"min_samples": 100,
"purpose": "baseline-collection"
},
{
"phase": 2,
"traffic_percentage": 25,
"duration_seconds": 300,
"min_samples": 500,
"purpose": "statistical-comparison"
},
{
"phase": 3,
"traffic_percentage": 100,
"duration_seconds": 300,
"min_samples": 1000,
"purpose": "full-validation"
}
],
"metrics": [
{
"name": "response-latency",
"type": "histogram",
"percentiles": [50, 95, 99],
"threshold_percent_increase": 10,
"statistical_test": "mann-whitney-u",
"critical": true
},
{
"name": "error-rate",
"type": "counter",
"threshold_percent_increase": 5,
"statistical_test": "chi-square",
"critical": true
},
{
"name": "throughput",
"type": "gauge",
"threshold_percent_decrease": 10,
"statistical_test": "mann-whitney-u",
"critical": false
},
{
"name": "cpu-usage",
"type": "gauge",
"threshold_percent_increase": 15,
"statistical_test": "mann-whitney-u",
"critical": false
}
],
"statistical_thresholds": {
"pass_p_value": 0.05,
"warning_p_value": 0.01,
"rollback_p_value": 0.001,
"min_effect_size": 0.2
},
"traffic_advancing": {
"auto_advance_on_pass": true,
"hold_before_full_traffic": 120
},
"auto_rollback": {
"enabled": true,
"requires_ops_confirmation": true,
"rollback_timeout_seconds": 60
},
"notifications": {
"on_phase_complete": ["slack-deployments"],
"on_regression_detected": ["slack-ops", "pagerduty"],
"on_rollback": ["slack-ops", "email-team-lead"]
}
}
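Given the `statistical_thresholds` block above, the per-metric verdict reduces to a small decision function. The function name, signature, and verdict strings below are illustrative assumptions, not the hook's API:

```python
# Illustrative decision logic for the statistical_thresholds config above.
# Names and return values are assumptions for this sketch only.
def verdict(p_value, effect_size, critical,
            pass_p=0.05, warning_p=0.01, rollback_p=0.001, min_effect=0.2):
    if p_value > pass_p:
        return "PASS"       # no statistically significant difference
    if abs(effect_size) < min_effect:
        return "PASS"       # significant, but effect too small to matter
    if p_value < rollback_p and critical:
        return "ROLLBACK"   # highly significant regression on a critical metric
    return "WARNING"        # significant and meaningful, but not rollback-worthy

print(verdict(0.087, 0.05, critical=True))   # PASS (p > 0.05)
print(verdict(0.00003, 2.8, critical=True))  # ROLLBACK
```

This encodes the "effect size matters" principle: even a tiny p-value does not trigger action unless the measured difference exceeds `min_effect_size`.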
Integration
The hook integrates with:
- Skill: canary-analysis-patterns - Statistical test logic and metric collection
- Metrics Store: Queries Prometheus/DataDog for metric samples
- Traffic Control: Updates ingress/load balancer rules for traffic shifting
- Rollback System: Triggers rollback if regressions detected
- Monitoring Dashboard: Feeds canary analysis results to dashboard
- Notifications: Alerts on-call team for critical decisions
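The Metrics Store integration might pull samples via Prometheus's HTTP range-query API, as in this sketch. The Prometheus URL and any metric names are assumptions; only the `/api/v1/query_range` endpoint and its response shape are standard Prometheus:

```python
# Sketch of fetching metric samples from Prometheus's HTTP API.
# The server URL and queries used with it are assumptions for illustration.
import json
import urllib.parse
import urllib.request

def build_range_url(prom_url, promql, start, end, step="15s"):
    """Build a /api/v1/query_range URL for the given PromQL query."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step})
    return f"{prom_url}/api/v1/query_range?{params}"

def parse_range_response(body):
    """Flatten a range-query response into raw [timestamp, value] pairs."""
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return [v for series in body["data"]["result"] for v in series["values"]]

def fetch_samples(prom_url, promql, start, end, step="15s"):
    url = build_range_url(prom_url, promql, start, end, step)
    with urllib.request.urlopen(url) as resp:
        return parse_range_response(json.load(resp))
```

A DataDog backend would swap `fetch_samples` for the equivalent timeseries query; the statistical layer only needs the flattened sample list.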
Output
Phase 1: Baseline Collection (5% Traffic, 5min)
[14:32:15] Canary Phase 1: Baseline Collection
Traffic: 5% to canary
Duration: 5 minutes
Collecting metrics from baseline...
Samples collected: 487
- Response latency p50: 42ms
- Response latency p95: 185ms
- Error rate: 0.12%
Baseline ready for comparison. Phase 2 starting...
Phase 2: Statistical Comparison (25% Traffic, 5min)
[14:37:20] Canary Phase 2: Statistical Comparison
Traffic: 25% to canary
Duration: 5 minutes
Running statistical tests...
RESULTS:
✓ Response Latency (Mann-Whitney U):
Canary p50: 41ms (vs baseline 42ms)
Canary p95: 183ms (vs baseline 185ms)
p-value: 0.087 (PASS - p > 0.05, no significant difference)
✓ Error Rate (Chi-Square):
Canary: 0.11% (vs baseline 0.12%)
p-value: 0.412 (PASS - no significant difference)
✓ Throughput (Mann-Whitney U):
Canary: 8,450 req/s (vs baseline 8,420 req/s)
p-value: 0.654 (PASS - no significant difference)
⚠ CPU Usage (Mann-Whitney U):
Canary: 62% (vs baseline 58%)
p-value: 0.032 (WARNING - slight increase, p < 0.05)
Impact: Non-critical, within acceptable thresholds
VERDICT: PASS - All critical metrics passing
Advancing to Phase 3 (100% traffic)...
Phase 3: Full Validation (100% Traffic, 5min) - PASS
[14:42:25] Canary Phase 3: Full Validation
Traffic: 100% shifted to canary
Duration: 5 minutes
Final validation...
RESULTS:
✓ Response Latency (Mann-Whitney U): p-value 0.091 (PASS)
✓ Error Rate (Chi-Square): p-value 0.387 (PASS)
✓ Throughput (Mann-Whitney U): p-value 0.621 (PASS)
⚠ CPU Usage (Mann-Whitney U): p-value 0.048 (WARNING - non-critical, within thresholds)
FINAL VERDICT: PASS
Canary deployment successful!
Summary:
- No regressions detected
- All metrics within acceptable ranges
- Deployment stable for 15+ minutes
- Monitoring continues...
Phase 2 or 3: FAIL - Regression Detected
✗ REGRESSION DETECTED: Triggering rollback
REGRESSION ANALYSIS:
1. Response Latency (Mann-Whitney U):
Canary p95: 387ms (vs baseline 185ms) - 109% increase!
p-value: 0.00003 (FAIL - p < 0.001, highly significant)
Impact: CRITICAL - Users experiencing slow responses
2. Error Rate (Chi-Square):
Canary: 3.2% (vs baseline 0.12%) - 2600% increase!
p-value: 0.00000001 (FAIL - p < 0.001, highly significant)
Common errors: 500 Internal Server Error (85%), 502 Bad Gateway (15%)
Impact: CRITICAL - Service degradation
Statistical Evidence:
- Effect size: 2.8 (VERY LARGE - outside acceptable range)
- Confidence: 99.999% that regression is real
- Samples: 2,845 canary requests vs 2,812 baseline
Automatic Rollback Initiated:
Current: django:v1.22.0-context-api (canary)
Rollback to: django:v1.21.8-stable (baseline)
Status: Rolling back 5/5 pods...
Timeline:
14:37:20 - Phase 2 started (25% traffic)
14:37:45 - Error rate spike detected
14:37:50 - Regression confirmed by statistical tests
14:38:00 - Rollback started
14:38:45 - Rollback complete
Notification sent to @on-call team
Incident ticket created: INC-2026-0543
Recommended: Review error logs, check database migration compatibility
Failure Handling
| Scenario | Action | Rollback |
|---|---|---|
| Regression detected (p < 0.001) | Auto-rollback (ops confirm) | Yes |
| Metrics collection timeout | Wait for next sample window | Conditional |
| Statistical test error | Log error, mark as warning | No |
| Traffic shifting fails | Alert ops, pause canary | No |
| Baseline collection incomplete | Extend Phase 1, delay Phase 2 | No |
| Multiple regressions | Immediate rollback | Yes |
Error Recovery:
# Check canary metrics in detail
kubectl port-forward svc/prometheus 9090:9090 -n monitoring
# Query: rate(http_request_duration_seconds_bucket[5m])
# Manually advance canary traffic (if safe)
kubectl patch service django -p '{"spec":{"selector":{"version":"canary"}}}'
# Manual rollback
kubectl rollout undo deployment/django -n prod
# Re-run canary after fix
kubectl set image deployment/django django=django:v1.22.0-context-api-fixed -n prod
# Canary monitor will restart automatically
Related Hooks
| Hook | Timing | Relationship | Purpose |
|---|---|---|---|
| post-deploy-smoke-test.md | Post-deploy (before canary) | Upstream | Validates basic health before canary monitoring |
| pre-deploy-release-gate.md | Pre-deploy | Upstream | Quality gate; canary validates gate was correct |
| post-deploy-metric-dashboard.md | Post-deploy (continuous) | Parallel | Displays canary metrics in real-time |
| post-deploy-incident-detection.md | Post-deploy (continuous) | Parallel | Detects anomalies beyond statistical analysis |
Principles
- Statistical Rigor: Uses well-established hypothesis tests (Mann-Whitney U, KS, Chi-Square)
- Effect Size Matters: Not just p-values; requires meaningful difference (effect size > 0.2)
- Graduated Risk: Progressive traffic increases (5% → 25% → 100%) reduce blast radius
- Auto-Recovery: Automatic rollback on critical regressions respects human oversight
- Transparent Evidence: All statistical results logged; full analysis available
- Phase Flexibility: Configurable phase durations and traffic percentages per deployment
- Skill-Driven: Statistical models and metric thresholds managed by canary-analysis-patterns skill
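To make the "effect size matters" principle concrete: a common effect-size measure paired with the Mann-Whitney U test is the rank-biserial correlation, r = 1 - 2U/(n1*n2). Which measure the skill actually uses is not specified here (the sample report's value of 2.8 exceeds rank-biserial's range, suggesting something like Cohen's d), so treat this as one possible sketch:

```python
# Rank-biserial correlation as an effect size for Mann-Whitney U:
#   r = 1 - 2U / (n1 * n2), in [-1, 1]; |r| < 0.2 treated as negligible
# to mirror min_effect_size. Whether the hook uses this measure is an
# assumption of this sketch.
from scipy.stats import mannwhitneyu

def rank_biserial(sample_a, sample_b):
    u, _ = mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    return 1.0 - 2.0 * float(u) / (len(sample_a) * len(sample_b))

baseline = [42, 40, 45, 43, 41]
regressed = [80, 95, 102, 88, 110]  # clearly slower canary
r = rank_biserial(baseline, regressed)
print(abs(r) >= 0.2)  # True: a large effect, worth acting on
```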
Related Documentation:
- ADR-183 - Governance hook architecture
- ADR-060 - MoE verification layer
- skills/canary-analysis-patterns/SKILL.md - Statistical analysis
- Mann-Whitney U Test: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
- Kolmogorov-Smirnov Test: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test