
Post-Deploy Canary Monitor

Purpose

  1. Continuously monitor canary deployment metrics against baseline (previous production version)
  2. Run statistical tests (Mann-Whitney U, Kolmogorov-Smirnov) to detect performance/error regressions
  3. Auto-advance traffic to the canary if statistical tests show no significant difference from baseline (p > 0.05)
  4. Trigger automatic rollback if critical metrics degrade significantly (p < 0.001)
  5. Generate detailed canary analysis report with statistical evidence and recommendations

Trigger

| Property | Value |
| --- | --- |
| Event | post-deploy-canary, continuous during soak period |
| Blocking | No (can trigger rollback via side effect) |
| Timeout | 1800s (30 minutes) |
| Failure Mode | If regression detected: auto-rollback (requires ops confirmation) |
| Canary Soak Period | Configurable (default: 5 minutes @ 5% traffic, 5 minutes @ 25%, 5 minutes @ 100%) |

Behavior

When Triggered

The hook executes continuously during the canary soak period. It:

  • Collects Metrics: Samples from canary and baseline simultaneously

    • Response latency (p50, p95, p99 percentiles)
    • Error rate (4xx, 5xx by endpoint)
    • Throughput (requests/sec)
    • Resource utilization (CPU, memory, disk I/O)
  • Statistical Analysis: Applies hypothesis tests

    • Mann-Whitney U Test: Compares latency distributions (non-parametric)
    • Kolmogorov-Smirnov Test: Detects distribution shape changes
    • Chi-Square Test: Analyzes error rate differences
    • Threshold: p-value > 0.05 for "no significant difference"
  • Traffic Shifting Decisions:

    • Phase 1 (5% traffic, 5min): Collect baseline sample
    • Phase 2 (25% traffic, 5min): Run statistical tests
    • Phase 3 (100% traffic if passing, 5min): Final validation or rollback
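
The per-metric test-and-verdict step can be sketched with SciPy's implementations of the tests named above. This is a minimal sketch that assumes raw latency samples and error/success counts are already collected; the function names are illustrative, not the hook's actual API.

```python
# Sketch of the Phase 2 decision logic. Thresholds mirror the
# statistical_thresholds values in the configuration; the helper
# names (compare_latency, etc.) are illustrative only.
from scipy.stats import mannwhitneyu, chi2_contingency

PASS_P_VALUE = 0.05       # p above this: "no significant difference"
ROLLBACK_P_VALUE = 0.001  # p below this: critical regression

def compare_latency(baseline_ms, canary_ms):
    """Non-parametric comparison of two latency samples (Mann-Whitney U)."""
    _, p = mannwhitneyu(baseline_ms, canary_ms, alternative="two-sided")
    return p

def compare_error_rate(baseline_errors, baseline_total, canary_errors, canary_total):
    """Chi-square test on a 2x2 contingency table of error/success counts."""
    table = [
        [baseline_errors, baseline_total - baseline_errors],
        [canary_errors, canary_total - canary_errors],
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

def verdict(p_value):
    """Map a p-value onto the hook's three-tier outcome."""
    if p_value > PASS_P_VALUE:
        return "PASS"
    if p_value > ROLLBACK_P_VALUE:
        return "WARNING"
    return "ROLLBACK"
```

For example, the Phase 2 CPU result (p = 0.032) maps to WARNING, while the latency result (p = 0.087) maps to PASS.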

Configuration

Create .coditect/config/canary-monitor-hook.json:

{
  "enabled": true,
  "timeout_seconds": 1800,
  "canary_phases": [
    {
      "phase": 1,
      "traffic_percentage": 5,
      "duration_seconds": 300,
      "min_samples": 100,
      "purpose": "baseline-collection"
    },
    {
      "phase": 2,
      "traffic_percentage": 25,
      "duration_seconds": 300,
      "min_samples": 500,
      "purpose": "statistical-comparison"
    },
    {
      "phase": 3,
      "traffic_percentage": 100,
      "duration_seconds": 300,
      "min_samples": 1000,
      "purpose": "full-validation"
    }
  ],
  "metrics": [
    {
      "name": "response-latency",
      "type": "histogram",
      "percentiles": [50, 95, 99],
      "threshold_percent_increase": 10,
      "statistical_test": "mann-whitney-u",
      "critical": true
    },
    {
      "name": "error-rate",
      "type": "counter",
      "threshold_percent_increase": 5,
      "statistical_test": "chi-square",
      "critical": true
    },
    {
      "name": "throughput",
      "type": "gauge",
      "threshold_percent_decrease": 10,
      "statistical_test": "mann-whitney-u",
      "critical": false
    },
    {
      "name": "cpu-usage",
      "type": "gauge",
      "threshold_percent_increase": 15,
      "statistical_test": "mann-whitney-u",
      "critical": false
    }
  ],
  "statistical_thresholds": {
    "pass_p_value": 0.05,
    "warning_p_value": 0.01,
    "rollback_p_value": 0.001,
    "min_effect_size": 0.2
  },
  "traffic_advancing": {
    "auto_advance_on_pass": true,
    "hold_before_full_traffic": 120
  },
  "auto_rollback": {
    "enabled": true,
    "requires_ops_confirmation": true,
    "rollback_timeout_seconds": 60
  },
  "notifications": {
    "on_phase_complete": ["slack-deployments"],
    "on_regression_detected": ["slack-ops", "pagerduty"],
    "on_rollback": ["slack-ops", "email-team-lead"]
  }
}
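
The `canary_phases` list drives a simple sequential loop. Below is a minimal runner sketch, where `shift_traffic`, `collect_samples`, and `run_tests` are hypothetical callbacks into the traffic controller, metrics store, and statistical tests; the real hook's interfaces may differ.

```python
import time

def run_canary(config, shift_traffic, collect_samples, run_tests):
    """Walk the configured phases in order; return "PASS" or "ROLLBACK"."""
    for phase in config["canary_phases"]:
        shift_traffic(phase["traffic_percentage"])
        deadline = time.time() + phase["duration_seconds"]
        samples = collect_samples(phase["min_samples"], deadline)
        if phase["purpose"] == "baseline-collection":
            continue  # Phase 1 only establishes the baseline
        if not run_tests(samples):
            return "ROLLBACK"  # regression detected: stop advancing traffic
    return "PASS"
```

With `auto_advance_on_pass` enabled, each successful iteration advances traffic; a failed `run_tests` short-circuits the loop, which is where the auto-rollback path would take over.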

Integration

The hook integrates with:

  • Skill: canary-analysis-patterns - Statistical test logic and metric collection
  • Metrics Store: Queries Prometheus/DataDog for metric samples
  • Traffic Control: Updates ingress/load balancer rules for traffic shifting
  • Rollback System: Triggers rollback if regressions detected
  • Monitoring Dashboard: Feeds canary analysis results to dashboard
  • Notifications: Alerts on-call team for critical decisions
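
As an illustration of the metrics-store integration, an instant PromQL query can be issued against Prometheus' documented `/api/v1/query` HTTP endpoint and flattened into samples. The base URL and query string are assumptions; the response parsing follows Prometheus' instant-query JSON format.

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def query_prometheus(base_url, promql):
    """Issue an instant query against Prometheus' HTTP API."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    with urlopen(url) as resp:
        return json.load(resp)

def extract_values(response):
    """Flatten an instant-query response into [(labels, value), ...]."""
    if response.get("status") != "success":
        raise RuntimeError("Prometheus query failed")
    return [(r["metric"], float(r["value"][1]))
            for r in response["data"]["result"]]
```

For example, `query_prometheus("http://prometheus:9090", "rate(http_requests_total[5m])")` (hypothetical host and metric name) would return a vector whose per-series values `extract_values` pulls out for the statistical comparison.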

Output

Phase 1: Baseline Collection (5% Traffic, 5min)

[14:32:15] Canary Phase 1: Baseline Collection
Traffic: 5% to canary
Duration: 5 minutes
Collecting metrics from baseline...

Samples collected: 487
- Response latency p50: 42ms
- Response latency p95: 185ms
- Error rate: 0.12%

Baseline ready for comparison. Phase 2 starting...

Phase 2: Statistical Comparison (25% Traffic, 5min)

[14:37:20] Canary Phase 2: Statistical Comparison
Traffic: 25% to canary
Duration: 5 minutes
Running statistical tests...

RESULTS:
✓ Response Latency (Mann-Whitney U):
Canary p50: 41ms (vs baseline 42ms)
Canary p95: 183ms (vs baseline 185ms)
p-value: 0.087 (PASS - p > 0.05, no significant difference)

✓ Error Rate (Chi-Square):
Canary: 0.11% (vs baseline 0.12%)
p-value: 0.412 (PASS - no significant difference)

✓ Throughput (Mann-Whitney U):
Canary: 8,450 req/s (vs baseline 8,420 req/s)
p-value: 0.654 (PASS - no significant difference)

⚠ CPU Usage (Mann-Whitney U):
Canary: 62% (vs baseline 58%)
p-value: 0.032 (WARNING - slight increase, p < 0.05)
Impact: Non-critical, within acceptable thresholds

VERDICT: PASS - All critical metrics passing
Advancing to Phase 3 (100% traffic)...

Phase 3: Full Validation (100% Traffic, 5min) - PASS

[14:42:25] Canary Phase 3: Full Validation
Traffic: 100% shifted to canary
Duration: 5 minutes
Final validation...

RESULTS:
✓ Response Latency (Mann-Whitney U): p-value 0.091 (PASS)
✓ Error Rate (Chi-Square): p-value 0.387 (PASS)
✓ Throughput (Mann-Whitney U): p-value 0.621 (PASS)
⚠ CPU Usage (Mann-Whitney U): p-value 0.048 (WARNING - non-critical, p < 0.05)

FINAL VERDICT: PASS
Canary deployment successful!

Summary:
- No regressions detected
- All metrics within acceptable ranges
- Deployment stable for 15+ minutes
- Monitoring continues...

Phase 2 or 3: FAIL - Regression Detected

✗ REGRESSION DETECTED: Triggering rollback

REGRESSION ANALYSIS:
1. Response Latency (Mann-Whitney U):
Canary p95: 387ms (vs baseline 185ms) - 109% increase!
p-value: 0.00003 (FAIL - p < 0.001, highly significant)
Impact: CRITICAL - Users experiencing slow responses

2. Error Rate (Chi-Square):
Canary: 3.2% (vs baseline 0.12%) - 2600% increase!
p-value: 0.00000001 (FAIL - p < 0.001, highly significant)
Common errors: 500 Internal Server Error (85%), 502 Bad Gateway (15%)
Impact: CRITICAL - Service degradation

Statistical Evidence:
- Effect size: 2.8 (VERY LARGE - outside acceptable range)
- Confidence: 99.999% that regression is real
- Samples: 2,845 canary requests vs 2,812 baseline

Automatic Rollback Initiated:
Current: django:v1.22.0-context-api (canary)
Rollback to: django:v1.21.8-stable (baseline)
Status: Rolling back 5/5 pods...

Timeline:
14:37:20 - Phase 2 started (25% traffic)
14:37:45 - Error rate spike detected
14:37:50 - Regression confirmed by statistical tests
14:38:00 - Rollback started
14:38:45 - Rollback complete

Notification sent to @on-call team
Incident ticket created: INC-2026-0543
Recommended: Review error logs, check database migration compatibility

Failure Handling

| Scenario | Action | Rollback |
| --- | --- | --- |
| Regression detected (p < 0.001) | Auto-rollback (ops confirm) | Yes |
| Metrics collection timeout | Wait for next sample window | Conditional |
| Statistical test error | Log error, mark as warning | No |
| Traffic shifting fails | Alert ops, pause canary | No |
| Baseline collection incomplete | Extend Phase 1, delay Phase 2 | No |
| Multiple regressions | Immediate rollback | Yes |

Error Recovery:

# Check canary metrics in detail
kubectl port-forward svc/prometheus 9090:9090 -n monitoring
# Query: rate(http_request_duration_seconds_bucket[5m])

# Manually advance canary traffic (if safe)
kubectl patch service django -p '{"spec":{"selector":{"version":"canary"}}}'

# Manual rollback
kubectl rollout undo deployment/django -n prod

# Re-run canary after fix
kubectl set image deployment/django django=django:v1.22.0-context-api-fixed -n prod
# Canary monitor will restart automatically

Related Hooks

| Hook | Timing | Relationship | Purpose |
| --- | --- | --- | --- |
| post-deploy-smoke-test.md | Post-deploy (before canary) | Upstream | Validates basic health before canary monitoring |
| pre-deploy-release-gate.md | Pre-deploy | Upstream | Quality gate; canary validates gate was correct |
| post-deploy-metric-dashboard.md | Post-deploy (continuous) | Parallel | Displays canary metrics in real-time |
| post-deploy-incident-detection.md | Post-deploy (continuous) | Parallel | Detects anomalies beyond statistical analysis |

Principles

  1. Statistical Rigor: Uses well-established hypothesis tests (Mann-Whitney U, KS, Chi-Square)
  2. Effect Size Matters: Not just p-values; requires meaningful difference (effect size > 0.2)
  3. Graduated Risk: Progressive traffic increases (5% → 25% → 100%) reduce blast radius
  4. Auto-Recovery: Automatic rollback on critical regressions respects human oversight
  5. Transparent Evidence: All statistical results logged; full analysis available
  6. Phase Flexibility: Configurable phase durations and traffic percentages per deployment
  7. Skill-Driven: Statistical models and metric thresholds managed by canary-analysis-patterns skill
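
The effect-size gate in principle 2 can be illustrated with the rank-biserial correlation, a common effect-size companion to the Mann-Whitney U test. This is only a sketch; the canary-analysis-patterns skill may use a different measure (the report above shows effect sizes above 1, which suggests a standardized mean difference rather than rank-biserial).

```python
# Rank-biserial effect size derived from the Mann-Whitney U statistic.
# Assumes raw metric samples; min_effect_size mirrors the config value.
from scipy.stats import mannwhitneyu

def rank_biserial(baseline, canary):
    """Return r = 1 - 2U/(n1*n2), in [-1, 1]; |r| near 0 is negligible."""
    u, _ = mannwhitneyu(baseline, canary, alternative="two-sided")
    n1, n2 = len(baseline), len(canary)
    return 1.0 - (2.0 * u) / (n1 * n2)

def meaningful_difference(baseline, canary, min_effect_size=0.2):
    """True only when the difference is large enough to act on."""
    return abs(rank_biserial(baseline, canary)) >= min_effect_size
```

This is why a tiny p-value alone never triggers a rollback: with large sample counts, even a negligible shift can be "statistically significant", so the hook also requires the effect size to clear `min_effect_size`.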

Related Documentation: