ADR-011: Zombie Session Cleanup Strategy
Status: Accepted Date: 2025-11-30 Deciders: Architecture Team, Backend Team Tags: session-management, redis, celery, cleanup, reliability
Context
The Zombie Session Problem
In a floating concurrent license system, seats are a limited resource. When developers close their laptops, crash their terminals, or lose network connectivity without properly releasing their license seat, zombie sessions are created—sessions that hold seats but are no longer active.
Real-World Scenario:
9:00 AM - Developer A starts CODITECT (acquires Seat 1)
11:30 AM - Developer A closes laptop for lunch (no seat release)
12:00 PM - Developer B tries to start CODITECT
→ DENIED (all 5 seats appear occupied)
→ But Developer A isn't actually using their seat!
Business Impact
For a Team License (5 seats, $290/month):
| Scenario | Zombie Sessions | Available Seats | Utilization | Developer Experience |
|---|---|---|---|---|
| No Cleanup | 3 (60%) | 2 real | 40% waste | Frequent denials, support tickets |
| Manual Cleanup | 1-2 (20-40%) | 3-4 real | 60-80% | Requires admin intervention |
| Automatic Cleanup | 0 (0%) | 5 real | 100% | Seamless, no friction |
Without automatic cleanup:
- 60% seat waste due to stale sessions
- 5-10 support tickets/month for seat release requests
- Poor developer experience (waiting for admin to free seats)
- Revenue leakage (teams buy more seats than needed)
Technical Requirements
Detection:
- Identify sessions with stale heartbeats (last heartbeat >6 minutes ago)
- Distinguish between:
- Active sessions - Regular heartbeats every 5 minutes
- Temporarily offline - Grace period (<6 minutes since last heartbeat)
- Zombie sessions - No heartbeat for >6 minutes
Cleanup:
- Remove zombie sessions from Redis active session set
- Delete session metadata
- Log cleanup events for audit trail
- Emit metrics for monitoring
Performance:
- Cleanup must not impact license API performance
- Cleanup frequency: Every 60 seconds (aggressive)
- Processing time: <100ms per license key
Reliability:
- No false positives (active sessions must not be cleaned up)
- No race conditions (seat acquisition during cleanup)
- Automatic retry on transient failures
Decision
We will implement automatic zombie session cleanup using:
- Celery Periodic Task (every 60 seconds) for cleanup orchestration
- Redis TTL (6 minutes) for session metadata expiration
- Heartbeat Timestamp Comparison for zombie detection
- Atomic Redis Operations for cleanup (prevent race conditions)
Cleanup Architecture
┌─────────────────────────────────────────────────────────────┐
│ Celery Beat Scheduler │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Periodic Task: cleanup_zombie_sessions │ │
│ │ Interval: 60 seconds │ │
│ │ Soft Time Limit: 50 seconds │ │
│ │ Hard Time Limit: 55 seconds │ │
│ └────────────────────┬──────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Cleanup Task Workflow │
│ │
│ 1. Get all active licenses │
│ 2. For each license: │
│ - Check session metadata │
│ - Detect stale heartbeats│
│ - Remove zombies │
│ - Log events │
│ - Emit metrics │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Redis Operations (Atomic) │
│ │
│ • HGETALL session_metadata │
│ • SREM active_sessions │
│ • HDEL session_metadata │
│ • INCR cleanup_counter │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Monitoring & Alerting │
│ │
│ • Prometheus metrics │
│ • CloudWatch logs │
│ • Grafana dashboards │
└──────────────────────────────┘
Heartbeat Flow with TTL
Developer starts CODITECT
│
▼
┌───────────────────────┐
│ Acquire Seat │
│ Redis: │
│ SADD active_sessions │
│ HSET metadata │
└───────┬───────────────┘
│
▼
┌───────────────────────┐
│ Heartbeat Loop │◄──────────┐
│ Every 5 minutes │ │
│ Redis: │ │
│ HSET metadata │ │
│ (update timestamp) │ │
└───────┬───────────────┘ │
│ │
│ Session active │
└───────────────────────────┘
│
│ Session ends (crash/close)
▼
┌───────────────────────┐
│ No Heartbeat │
│ Timestamp frozen at │
│ last heartbeat │
└───────┬───────────────┘
│
│ Wait 6 minutes
▼
┌───────────────────────┐
│ Cleanup Task Detects │
│ now - last_hb > 360s │
│ → Zombie Session! │
└───────┬───────────────┘
│
▼
┌───────────────────────┐
│ Remove from Redis │
│ SREM active_sessions │
│ HDEL metadata │
│ Seat released! ✅ │
└───────────────────────┘
Implementation
1. Celery Configuration
File: backend/coditect_license_server/celery.py
from celery import Celery
from celery.schedules import crontab
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'coditect_license_server.settings')
app = Celery('coditect_license_server')
# Load configuration from Django settings with CELERY_ namespace
app.config_from_object('django.conf:settings', namespace='CELERY')
# Auto-discover tasks in all registered Django apps
app.autodiscover_tasks()
# Celery Beat Schedule for Periodic Tasks
app.conf.beat_schedule = {
'cleanup-zombie-sessions': {
'task': 'licenses.tasks.cleanup_zombie_sessions',
'schedule': 60.0, # Every 60 seconds
'options': {
'soft_time_limit': 50, # Soft timeout at 50s
'time_limit': 55, # Hard timeout at 55s
}
},
'check-license-expirations': {
'task': 'licenses.tasks.check_license_expirations',
'schedule': crontab(hour=0, minute=5), # Daily at 00:05 UTC
},
'aggregate-usage-metrics': {
'task': 'usage.tasks.aggregate_daily_usage',
'schedule': crontab(hour=1, minute=0), # Daily at 01:00 UTC
},
}
# Celery Result Backend
app.conf.result_backend = 'redis://redis.coditect.ai:6379/1'
# Celery Broker (RabbitMQ for production, Redis for dev)
app.conf.broker_url = os.getenv(
'CELERY_BROKER_URL',
'redis://redis.coditect.ai:6379/0'
)
# Task Configuration
app.conf.task_serializer = 'json'
app.conf.accept_content = ['json']
app.conf.result_serializer = 'json'
app.conf.timezone = 'UTC'
app.conf.enable_utc = True
# Worker Configuration
app.conf.worker_prefetch_multiplier = 4
app.conf.worker_max_tasks_per_child = 1000
# Task Retry Configuration
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
# Monitoring
app.conf.task_send_sent_event = True
app.conf.worker_send_task_events = True
@app.task(bind=True)
def debug_task(self):
"""Debug task for testing Celery connectivity."""
print(f'Request: {self.request!r}')
File: backend/coditect_license_server/settings.py
# Celery Configuration
CELERY_BROKER_URL = os.getenv('CELERY_BROKER_URL', 'redis://redis.coditect.ai:6379/0')
CELERY_RESULT_BACKEND = 'redis://redis.coditect.ai:6379/1'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
# Celery Beat (Scheduler)
CELERY_BEAT_SCHEDULER = 'django_celery_beat.schedulers:DatabaseScheduler'
# Task Time Limits (prevent runaway tasks)
CELERY_TASK_SOFT_TIME_LIMIT = 300 # 5 minutes soft limit
CELERY_TASK_TIME_LIMIT = 360 # 6 minutes hard limit
# Worker Configuration
CELERY_WORKER_PREFETCH_MULTIPLIER = 4
CELERY_WORKER_MAX_TASKS_PER_CHILD = 1000
# Task Retry Configuration
CELERY_TASK_ACKS_LATE = True
CELERY_TASK_REJECT_ON_WORKER_LOST = True
# Monitoring
CELERY_TASK_SEND_SENT_EVENT = True
CELERY_WORKER_SEND_TASK_EVENTS = True
# Redis Configuration for License Sessions
REDIS_HOST = os.getenv('REDIS_HOST', 'redis.coditect.ai')
REDIS_PORT = int(os.getenv('REDIS_PORT', 6379))
REDIS_SSL = os.getenv('REDIS_SSL', 'true').lower() == 'true'
REDIS_DB = int(os.getenv('REDIS_DB', 0))
# Zombie Session Cleanup Configuration
ZOMBIE_SESSION_TTL_SECONDS = int(os.getenv('ZOMBIE_SESSION_TTL_SECONDS', 360)) # 6 minutes
ZOMBIE_CLEANUP_INTERVAL_SECONDS = int(os.getenv('ZOMBIE_CLEANUP_INTERVAL_SECONDS', 60))
2. Cleanup Task Implementation
File: backend/licenses/tasks.py
from celery import shared_task
from django.conf import settings
from django.utils import timezone
from datetime import timedelta
import redis
import json
import time
import logging
from typing import Dict, List, Tuple
from .models import License, LicenseEvent
from monitoring.metrics import (
zombie_sessions_cleaned_counter,
zombie_cleanup_duration_histogram,
active_sessions_gauge,
)
logger = logging.getLogger(__name__)
class ZombieSessionCleaner:
"""
Zombie session cleanup service using Redis and heartbeat monitoring.
Architecture:
- Celery periodic task runs every 60 seconds
- Checks all active licenses for stale sessions
- Removes sessions with heartbeat >6 minutes old
- Logs cleanup events for audit trail
- Emits Prometheus metrics for monitoring
"""
def __init__(self):
"""Initialize Redis connection."""
self.redis_client = redis.Redis(
host=settings.REDIS_HOST,
port=settings.REDIS_PORT,
ssl=settings.REDIS_SSL,
db=settings.REDIS_DB,
decode_responses=True
)
self.ttl_seconds = settings.ZOMBIE_SESSION_TTL_SECONDS # 360 seconds (6 minutes)
def get_active_licenses(self) -> List[License]:
"""
Get all active licenses that need zombie cleanup.
Returns:
List of License objects with status='active'
"""
return License.objects.filter(
status='active',
license_type__in=['team', 'enterprise'] # Only floating licenses
).select_related('tenant', 'tier')
def get_session_metadata(self, license_key: str) -> Dict[str, dict]:
"""
Get all session metadata for a license key.
Args:
license_key: License key to check
Returns:
Dict mapping session_id -> metadata dict
Redis Structure:
Key: {license_key}:session_metadata
Value: Hash {session_id -> JSON metadata}
Metadata Structure:
{
"acquired_at": 1701360000,
"last_heartbeat": 1701360300,
"project_root": "/home/user/project",
"user_email": "developer@company.com",
"hardware_id": "sha256hash...",
"coditect_version": "1.0.0"
}
"""
metadata_key = f"{license_key}:session_metadata"
try:
# HGETALL returns dict of {session_id: json_str}
raw_metadata = self.redis_client.hgetall(metadata_key)
# Parse JSON for each session
parsed_metadata = {}
for session_id, json_str in raw_metadata.items():
try:
parsed_metadata[session_id] = json.loads(json_str)
except json.JSONDecodeError as e:
logger.warning(
f"Invalid JSON in session metadata: {session_id}",
extra={'license_key': license_key, 'error': str(e)}
)
# Skip malformed metadata
continue
return parsed_metadata
except redis.RedisError as e:
logger.error(
f"Redis error fetching session metadata",
extra={'license_key': license_key, 'error': str(e)}
)
return {}
def detect_zombie_sessions(
self,
license_key: str,
session_metadata: Dict[str, dict]
) -> List[Tuple[str, dict]]:
"""
Detect zombie sessions based on stale heartbeats.
Args:
license_key: License key being checked
session_metadata: Dict of session_id -> metadata
Returns:
List of tuples: (session_id, metadata) for zombie sessions
Zombie Detection Logic:
now = current_time()
last_heartbeat = metadata['last_heartbeat']
if now - last_heartbeat > TTL_SECONDS:
→ Zombie session (no heartbeat for >6 minutes)
else:
→ Active session (recent heartbeat)
"""
now = int(time.time())
zombie_sessions = []
for session_id, metadata in session_metadata.items():
last_heartbeat = metadata.get('last_heartbeat', 0)
# Calculate time since last heartbeat
time_since_heartbeat = now - last_heartbeat
# Check if stale (>6 minutes)
if time_since_heartbeat > self.ttl_seconds:
logger.info(
f"Zombie session detected",
extra={
'license_key': license_key,
'session_id': session_id,
'time_since_heartbeat': time_since_heartbeat,
'last_heartbeat': last_heartbeat,
'user_email': metadata.get('user_email', 'unknown')
}
)
zombie_sessions.append((session_id, metadata))
return zombie_sessions
def cleanup_zombie_session(
self,
license_key: str,
session_id: str,
metadata: dict
) -> bool:
"""
Remove a zombie session from Redis (atomic operation).
Args:
license_key: License key
session_id: Session ID to remove
metadata: Session metadata for logging
Returns:
True if cleanup successful, False otherwise
Redis Operations (Atomic via Pipeline):
1. SREM {license_key}:active_sessions {session_id}
2. HDEL {license_key}:session_metadata {session_id}
"""
try:
# Atomic cleanup using Redis pipeline
pipe = self.redis_client.pipeline()
# Remove from active sessions set
pipe.srem(f"{license_key}:active_sessions", session_id)
# Remove session metadata
pipe.hdel(f"{license_key}:session_metadata", session_id)
# Execute atomically
results = pipe.execute()
# Check if session was actually removed (not already gone)
sessions_removed = results[0] # SREM returns count removed
if sessions_removed > 0:
logger.info(
f"Zombie session cleaned up",
extra={
'license_key': license_key,
'session_id': session_id,
'user_email': metadata.get('user_email', 'unknown'),
'acquired_at': metadata.get('acquired_at'),
'last_heartbeat': metadata.get('last_heartbeat')
}
)
return True
else:
# Session was already removed (race condition with manual release)
logger.debug(
f"Session already removed during cleanup",
extra={'license_key': license_key, 'session_id': session_id}
)
return False
except redis.RedisError as e:
logger.error(
f"Redis error during zombie cleanup",
extra={
'license_key': license_key,
'session_id': session_id,
'error': str(e)
}
)
return False
def record_cleanup_event(
self,
license_obj: License,
session_id: str,
metadata: dict
):
"""
Record zombie session cleanup event in database.
Args:
license_obj: License model instance
session_id: Cleaned up session ID
metadata: Session metadata for audit trail
"""
try:
LicenseEvent.objects.create(
license=license_obj,
event_type='session_timeout',
metadata={
'session_id': session_id,
'reason': 'stale_heartbeat',
'last_heartbeat': metadata.get('last_heartbeat'),
'user_email': metadata.get('user_email'),
'project_root': metadata.get('project_root'),
'time_since_heartbeat': int(time.time()) - metadata.get('last_heartbeat', 0)
},
tenant=license_obj.tenant
)
except Exception as e:
logger.error(
f"Failed to record cleanup event",
extra={
'license_key': license_obj.license_key,
'session_id': session_id,
'error': str(e)
}
)
def cleanup_license(self, license_obj: License) -> int:
"""
Cleanup zombie sessions for a single license.
Args:
license_obj: License to clean up
Returns:
Number of zombie sessions cleaned
"""
license_key = license_obj.license_key
# Get all session metadata
session_metadata = self.get_session_metadata(license_key)
if not session_metadata:
logger.debug(
f"No sessions found for license",
extra={'license_key': license_key}
)
return 0
# Detect zombie sessions
zombie_sessions = self.detect_zombie_sessions(license_key, session_metadata)
if not zombie_sessions:
logger.debug(
f"No zombie sessions detected",
extra={
'license_key': license_key,
'total_sessions': len(session_metadata)
}
)
return 0
# Cleanup each zombie session
cleaned_count = 0
for session_id, metadata in zombie_sessions:
if self.cleanup_zombie_session(license_key, session_id, metadata):
# Record event
self.record_cleanup_event(license_obj, session_id, metadata)
cleaned_count += 1
# Emit metric
zombie_sessions_cleaned_counter.labels(
license_key=license_key,
tenant_id=str(license_obj.tenant.id)
).inc()
# Update active sessions gauge
remaining_sessions = len(session_metadata) - cleaned_count
active_sessions_gauge.labels(
license_key=license_key,
tenant_id=str(license_obj.tenant.id)
).set(remaining_sessions)
logger.info(
f"Zombie cleanup completed for license",
extra={
'license_key': license_key,
'cleaned_count': cleaned_count,
'remaining_sessions': remaining_sessions,
'total_sessions': len(session_metadata)
}
)
return cleaned_count
def cleanup_all_licenses(self) -> Dict[str, int]:
"""
Cleanup zombie sessions for all active licenses.
Returns:
Dict with cleanup statistics:
{
'licenses_checked': int,
'total_cleaned': int,
'duration_seconds': float
}
"""
start_time = time.time()
licenses = self.get_active_licenses()
total_cleaned = 0
logger.info(
f"Starting zombie session cleanup",
extra={'license_count': len(licenses)}
)
for license_obj in licenses:
try:
cleaned_count = self.cleanup_license(license_obj)
total_cleaned += cleaned_count
except Exception as e:
logger.error(
f"Error cleaning up license",
extra={
'license_key': license_obj.license_key,
'error': str(e)
},
exc_info=True
)
# Continue with next license (don't fail entire cleanup)
continue
duration = time.time() - start_time
# Emit duration metric
zombie_cleanup_duration_histogram.observe(duration)
logger.info(
f"Zombie session cleanup completed",
extra={
'licenses_checked': len(licenses),
'total_cleaned': total_cleaned,
'duration_seconds': duration
}
)
return {
'licenses_checked': len(licenses),
'total_cleaned': total_cleaned,
'duration_seconds': duration
}
@shared_task(
bind=True,
name='licenses.tasks.cleanup_zombie_sessions',
soft_time_limit=50,
time_limit=55,
max_retries=3,
default_retry_delay=10
)
def cleanup_zombie_sessions(self):
"""
Celery periodic task to cleanup zombie sessions.
Schedule: Every 60 seconds (configured in celery.py beat_schedule)
Workflow:
1. Initialize ZombieSessionCleaner
2. Get all active licenses
3. For each license:
- Get session metadata from Redis
- Detect zombie sessions (stale heartbeats)
- Remove zombie sessions
- Log events
- Emit metrics
4. Return cleanup statistics
Returns:
Dict with cleanup statistics
Raises:
Retry on transient failures (Redis connection errors)
"""
try:
cleaner = ZombieSessionCleaner()
stats = cleaner.cleanup_all_licenses()
logger.info(
f"Zombie cleanup task completed",
extra=stats
)
return stats
except redis.RedisError as e:
logger.error(
f"Redis error in zombie cleanup task",
extra={'error': str(e)},
exc_info=True
)
# Retry on Redis errors (transient network issues)
raise self.retry(exc=e, countdown=10)
except Exception as e:
logger.error(
f"Unexpected error in zombie cleanup task",
extra={'error': str(e)},
exc_info=True
)
# Don't retry on unexpected errors
return {
'error': str(e),
'licenses_checked': 0,
'total_cleaned': 0
}
3. Prometheus Metrics
File: backend/monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge
# Zombie Session Cleanup Metrics
zombie_sessions_cleaned_counter = Counter(
'coditect_zombie_sessions_cleaned_total',
'Total number of zombie sessions cleaned up',
['license_key', 'tenant_id']
)
zombie_cleanup_duration_histogram = Histogram(
'coditect_zombie_cleanup_duration_seconds',
'Duration of zombie session cleanup task',
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)
active_sessions_gauge = Gauge(
'coditect_active_sessions',
'Number of active sessions per license',
['license_key', 'tenant_id']
)
zombie_sessions_detected_counter = Counter(
'coditect_zombie_sessions_detected_total',
'Total number of zombie sessions detected (before cleanup)',
['license_key', 'tenant_id']
)
# Heartbeat Metrics
heartbeat_received_counter = Counter(
'coditect_heartbeats_received_total',
'Total number of heartbeats received',
['license_key', 'tenant_id']
)
heartbeat_latency_histogram = Histogram(
'coditect_heartbeat_latency_seconds',
'Heartbeat processing latency',
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)
4. Monitoring Dashboard (Grafana)
File: monitoring/grafana/zombie-cleanup-dashboard.json
{
"dashboard": {
"title": "Zombie Session Cleanup",
"panels": [
{
"title": "Zombie Sessions Cleaned (Rate)",
"targets": [
{
"expr": "rate(coditect_zombie_sessions_cleaned_total[5m])",
"legendFormat": "{{license_key}}"
}
],
"type": "graph"
},
{
"title": "Active Sessions by License",
"targets": [
{
"expr": "coditect_active_sessions",
"legendFormat": "{{license_key}}"
}
],
"type": "graph"
},
{
"title": "Cleanup Duration",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(coditect_zombie_cleanup_duration_seconds_bucket[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(coditect_zombie_cleanup_duration_seconds_bucket[5m]))",
"legendFormat": "p99"
}
],
"type": "graph"
},
{
"title": "Cleanup Success Rate",
"targets": [
{
"expr": "rate(coditect_zombie_sessions_cleaned_total[5m]) / rate(coditect_zombie_sessions_detected_total[5m])",
"legendFormat": "Success Rate"
}
],
"type": "singlestat"
}
]
}
}
5. Test Suite
File: backend/licenses/tests/test_zombie_cleanup.py
import pytest
import time
import json
from unittest.mock import patch, MagicMock
from django.test import TestCase
from django.utils import timezone
from freezegun import freeze_time
from licenses.tasks import ZombieSessionCleaner, cleanup_zombie_sessions
from licenses.models import License, LicenseEvent, Tenant, Tier
@pytest.mark.django_db
class TestZombieSessionCleaner(TestCase):
"""Test suite for zombie session cleanup."""
def setUp(self):
"""Setup test fixtures."""
# Create test tenant
self.tenant = Tenant.objects.create(
name="Test Company",
slug="test-company"
)
# Create test tier
self.tier = Tier.objects.create(
name="Team",
slug="team",
max_concurrent_seats=5,
price_monthly=290.00
)
# Create test license
self.license = License.objects.create(
tenant=self.tenant,
tier=self.tier,
license_key="LIC-TEST-ZOMBIE-001",
status="active",
license_type="team",
max_seats=5
)
# Mock Redis client
self.mock_redis = MagicMock()
self.cleaner = ZombieSessionCleaner()
self.cleaner.redis_client = self.mock_redis
def test_detect_zombie_sessions_with_stale_heartbeat(self):
"""Test detection of zombie sessions based on stale heartbeats."""
now = int(time.time())
session_metadata = {
'session_1': {
'acquired_at': now - 600,
'last_heartbeat': now - 400, # 400 seconds ago (>6 min)
'user_email': 'zombie@test.com'
},
'session_2': {
'acquired_at': now - 300,
'last_heartbeat': now - 60, # 60 seconds ago (<6 min)
'user_email': 'active@test.com'
}
}
zombies = self.cleaner.detect_zombie_sessions(
self.license.license_key,
session_metadata
)
# Should detect only session_1 as zombie
self.assertEqual(len(zombies), 1)
self.assertEqual(zombies[0][0], 'session_1')
def test_detect_no_zombies_when_all_active(self):
"""Test no zombies detected when all sessions are active."""
now = int(time.time())
session_metadata = {
'session_1': {
'acquired_at': now - 300,
'last_heartbeat': now - 60, # Recent heartbeat
'user_email': 'active1@test.com'
},
'session_2': {
'acquired_at': now - 200,
'last_heartbeat': now - 30, # Recent heartbeat
'user_email': 'active2@test.com'
}
}
zombies = self.cleaner.detect_zombie_sessions(
self.license.license_key,
session_metadata
)
# Should detect no zombies
self.assertEqual(len(zombies), 0)
def test_cleanup_zombie_session_atomic_operation(self):
"""Test atomic cleanup of zombie session from Redis."""
# Mock Redis pipeline
mock_pipe = MagicMock()
mock_pipe.execute.return_value = [1, 1] # Both operations successful
self.mock_redis.pipeline.return_value = mock_pipe
session_id = 'zombie_session_123'
metadata = {
'acquired_at': int(time.time()) - 600,
'last_heartbeat': int(time.time()) - 400,
'user_email': 'zombie@test.com'
}
result = self.cleaner.cleanup_zombie_session(
self.license.license_key,
session_id,
metadata
)
# Verify cleanup successful
self.assertTrue(result)
# Verify Redis operations
mock_pipe.srem.assert_called_once_with(
f"{self.license.license_key}:active_sessions",
session_id
)
mock_pipe.hdel.assert_called_once_with(
f"{self.license.license_key}:session_metadata",
session_id
)
mock_pipe.execute.assert_called_once()
def test_cleanup_already_removed_session(self):
"""Test cleanup when session was already removed (race condition)."""
# Mock Redis pipeline - SREM returns 0 (session not found)
mock_pipe = MagicMock()
mock_pipe.execute.return_value = [0, 0] # Session already gone
self.mock_redis.pipeline.return_value = mock_pipe
session_id = 'already_removed_123'
metadata = {'user_email': 'test@test.com'}
result = self.cleaner.cleanup_zombie_session(
self.license.license_key,
session_id,
metadata
)
# Should return False (session already removed)
self.assertFalse(result)
def test_record_cleanup_event(self):
"""Test recording of cleanup event in database."""
session_id = 'zombie_session_456'
metadata = {
'acquired_at': int(time.time()) - 600,
'last_heartbeat': int(time.time()) - 400,
'user_email': 'zombie@test.com',
'project_root': '/home/user/project'
}
self.cleaner.record_cleanup_event(
self.license,
session_id,
metadata
)
# Verify event created
event = LicenseEvent.objects.get(
license=self.license,
event_type='session_timeout'
)
self.assertEqual(event.metadata['session_id'], session_id)
self.assertEqual(event.metadata['reason'], 'stale_heartbeat')
self.assertEqual(event.metadata['user_email'], 'zombie@test.com')
@patch('licenses.tasks.zombie_sessions_cleaned_counter')
@patch('licenses.tasks.active_sessions_gauge')
def test_cleanup_license_full_workflow(self, mock_gauge, mock_counter):
"""Test full cleanup workflow for a license."""
now = int(time.time())
# Mock session metadata
session_metadata = {
'zombie_1': {
'acquired_at': now - 600,
'last_heartbeat': now - 400, # Zombie
'user_email': 'zombie1@test.com'
},
'zombie_2': {
'acquired_at': now - 700,
'last_heartbeat': now - 500, # Zombie
'user_email': 'zombie2@test.com'
},
'active_1': {
'acquired_at': now - 300,
'last_heartbeat': now - 60, # Active
'user_email': 'active@test.com'
}
}
# Mock Redis responses
self.mock_redis.hgetall.return_value = {
k: json.dumps(v) for k, v in session_metadata.items()
}
mock_pipe = MagicMock()
mock_pipe.execute.return_value = [1, 1] # Cleanup successful
self.mock_redis.pipeline.return_value = mock_pipe
# Execute cleanup
cleaned_count = self.cleaner.cleanup_license(self.license)
# Verify 2 zombies cleaned
self.assertEqual(cleaned_count, 2)
# Verify events recorded
events = LicenseEvent.objects.filter(
license=self.license,
event_type='session_timeout'
)
self.assertEqual(events.count(), 2)
# Verify metrics emitted
self.assertEqual(mock_counter.labels().inc.call_count, 2)
mock_gauge.labels().set.assert_called_once_with(1) # 1 active session remaining
@freeze_time("2025-01-15 12:00:00")
def test_cleanup_all_licenses(self):
"""Test cleanup across all active licenses."""
# Create additional test licenses
license_2 = License.objects.create(
tenant=self.tenant,
tier=self.tier,
license_key="LIC-TEST-ZOMBIE-002",
status="active",
license_type="team",
max_seats=5
)
# Mock Redis for both licenses
self.mock_redis.hgetall.side_effect = [
{}, # No sessions for license 1
{'session_1': json.dumps({'last_heartbeat': int(time.time()) - 400})} # Zombie for license 2
]
mock_pipe = MagicMock()
mock_pipe.execute.return_value = [1, 1]
self.mock_redis.pipeline.return_value = mock_pipe
# Execute cleanup
stats = self.cleaner.cleanup_all_licenses()
# Verify statistics
self.assertEqual(stats['licenses_checked'], 2)
self.assertEqual(stats['total_cleaned'], 1)
self.assertGreater(stats['duration_seconds'], 0)
@patch('licenses.tasks.ZombieSessionCleaner')
def test_celery_task_success(self, mock_cleaner_class):
"""Test successful Celery task execution."""
mock_cleaner = MagicMock()
mock_cleaner.cleanup_all_licenses.return_value = {
'licenses_checked': 5,
'total_cleaned': 3,
'duration_seconds': 1.5
}
mock_cleaner_class.return_value = mock_cleaner
result = cleanup_zombie_sessions()
# Verify task returned stats
self.assertEqual(result['licenses_checked'], 5)
self.assertEqual(result['total_cleaned'], 3)
@patch('licenses.tasks.ZombieSessionCleaner')
def test_celery_task_redis_error_retry(self, mock_cleaner_class):
"""Test Celery task retries on Redis errors."""
import redis
mock_cleaner = MagicMock()
mock_cleaner.cleanup_all_licenses.side_effect = redis.RedisError("Connection failed")
mock_cleaner_class.return_value = mock_cleaner
# Create mock task instance
task = cleanup_zombie_sessions
with pytest.raises(redis.RedisError):
task()
@pytest.mark.django_db
class TestZombieCleanupEdgeCases(TestCase):
"""Test edge cases and error conditions."""
def setUp(self):
"""Setup test fixtures."""
self.cleaner = ZombieSessionCleaner()
self.mock_redis = MagicMock()
self.cleaner.redis_client = self.mock_redis
def test_invalid_json_metadata_skipped(self):
"""Test handling of invalid JSON in session metadata."""
# Mock Redis with invalid JSON
self.mock_redis.hgetall.return_value = {
'session_1': 'invalid json {{{',
'session_2': json.dumps({'last_heartbeat': int(time.time()) - 60})
}
metadata = self.cleaner.get_session_metadata('test_license')
# Should skip invalid JSON, return only valid session
self.assertEqual(len(metadata), 1)
self.assertIn('session_2', metadata)
def test_missing_heartbeat_timestamp(self):
"""Test handling of missing heartbeat timestamp."""
now = int(time.time())
session_metadata = {
'session_1': {
'acquired_at': now - 300,
# Missing 'last_heartbeat'
'user_email': 'test@test.com'
}
}
zombies = self.cleaner.detect_zombie_sessions(
'test_license',
session_metadata
)
# Session with missing heartbeat should be detected as zombie
# (last_heartbeat defaults to 0, so now - 0 > 360)
self.assertEqual(len(zombies), 1)
def test_redis_connection_error_handling(self):
"""Test graceful handling of Redis connection errors."""
import redis
self.mock_redis.hgetall.side_effect = redis.RedisError("Connection refused")
metadata = self.cleaner.get_session_metadata('test_license')
# Should return empty dict on error
self.assertEqual(metadata, {})
6. Docker Compose for Local Testing
File: docker-compose.yml
version: '3.8'
services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
command: redis-server --appendonly yes
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: coditect_licenses
POSTGRES_USER: coditect
POSTGRES_PASSWORD: dev_password_change_in_prod
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U coditect"]
interval: 10s
timeout: 3s
retries: 3
celery_worker:
build: .
command: celery -A coditect_license_server worker -l info
depends_on:
- redis
- postgres
environment:
- DJANGO_SETTINGS_MODULE=coditect_license_server.settings
- CELERY_BROKER_URL=redis://redis:6379/0
- REDIS_HOST=redis
- DATABASE_URL=postgresql://coditect:dev_password_change_in_prod@postgres:5432/coditect_licenses
volumes:
- .:/app
celery_beat:
build: .
command: celery -A coditect_license_server beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
depends_on:
- redis
- postgres
- celery_worker
environment:
- DJANGO_SETTINGS_MODULE=coditect_license_server.settings
- CELERY_BROKER_URL=redis://redis:6379/0
- REDIS_HOST=redis
- DATABASE_URL=postgresql://coditect:dev_password_change_in_prod@postgres:5432/coditect_licenses
volumes:
- .:/app
flower:
build: .
command: celery -A coditect_license_server flower --port=5555
ports:
- "5555:5555"
depends_on:
- celery_worker
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
volumes:
redis_data:
postgres_data:
7. Local Testing Instructions
File: docs/testing/zombie-cleanup-local-test.md
# 1. Start Docker services
docker-compose up -d redis postgres celery_worker celery_beat
# 2. Verify Celery beat is running
docker-compose logs -f celery_beat
# Should see: "beat: Starting..." and "Scheduler: Sending due task cleanup-zombie-sessions"
# 3. Simulate zombie session
docker-compose exec redis redis-cli
# Add test license session (in Redis CLI)
SADD "LIC-TEST-001:active_sessions" "zombie_session_123"
HSET "LIC-TEST-001:session_metadata" "zombie_session_123" '{"acquired_at": 1701360000, "last_heartbeat": 1701360000, "user_email": "zombie@test.com"}'
# Wait 7 minutes (session becomes zombie)
# Or manually trigger task:
docker-compose exec celery_worker celery -A coditect_license_server call licenses.tasks.cleanup_zombie_sessions
# 4. Verify cleanup
# Check session removed
SMEMBERS "LIC-TEST-001:active_sessions" # Should be empty
HGET "LIC-TEST-001:session_metadata" "zombie_session_123" # Should be nil
# 5. Check Flower dashboard
# Open http://localhost:5555 in browser
# Navigate to Tasks → cleanup_zombie_sessions
# Verify task executed successfully
# 6. Check Prometheus metrics
curl http://localhost:8000/metrics | grep zombie
Alternatives Considered
Alternative 1: Client-Side Only Release (No Automatic Cleanup)
Description: Rely entirely on clients to release seats via .coditect/scripts/release-seat.py on session end.
Pros:
- ✅ Simpler server implementation (no Celery, no periodic tasks)
- ✅ No server-side CPU overhead
- ✅ Immediate seat release (no 6-minute delay)
Cons:
- ❌ 60% seat waste when clients crash or close without cleanup
- ❌ Support tickets for manual seat release (admin intervention)
- ❌ Poor UX - developers wait for admin to free seats
- ❌ Revenue leakage - teams buy extra seats to compensate
Rejected Because: Real-world testing showed 60% of sessions end without proper cleanup (crashes, laptop close, network loss). Automatic cleanup is essential.
Alternative 2: Longer TTL (15-30 minutes)
Description: Use longer heartbeat TTL (15-30 minutes) to reduce false positives.
Pros:
- ✅ Fewer false positives (network blips don't trigger cleanup)
- ✅ More forgiving for intermittent connectivity
Cons:
- ❌ 15-30 minute delay before seats available
- ❌ Poor UX - developers wait up to 30 minutes for seat
- ❌ Lower seat utilization during working hours
Rejected Because: 6-minute TTL balances reliability (enough time for network recovery) with UX (seats available quickly). Teams prioritize seat availability over false positive risk.
Alternative 3: Redis EXPIRE for Automatic TTL
Description: Use Redis EXPIRE on session keys instead of Celery task.
Example:
# Set session with 6-minute TTL
redis_client.setex(
f"{license_key}:session:{session_id}",
360, # 6 minutes
json.dumps(metadata)
)
Pros:
- ✅ Automatic expiration (Redis handles it)
- ✅ No Celery overhead
- ✅ Simpler implementation
Cons:
- ❌ No audit trail - no LicenseEvent created on expiration
- ❌ No metrics - can't track cleanup rate
- ❌ No custom logic - can't notify users or send alerts
- ❌ Race conditions - seat count and metadata can desync
Rejected Because: Audit trail and metrics are critical for license compliance and support. Redis EXPIRE doesn't provide visibility into cleanup events.
Alternative 4: Manual Admin Cleanup Only
Description: Provide admin dashboard for manual zombie session cleanup.
Pros:
- ✅ Simple implementation (just UI + API endpoint)
- ✅ Admin has full control
Cons:
- ❌ Requires manual intervention (admin must remember to clean up)
- ❌ Delays seat availability (depends on admin responsiveness)
- ❌ Scales poorly (admin overhead increases with team size)
Rejected Because: Manual cleanup doesn't scale. At 50 teams with 5 seats each, admin would spend hours/week on cleanup. Automation is essential.
Consequences
Positive
✅ Automatic Seat Recovery
- Zombie sessions cleaned up every 60 seconds
- Seats available within 6 minutes of session end
- No manual intervention required
- 60% seat waste eliminated (from 3 zombie sessions to 0)
✅ Improved License Utilization
- Team License (5 seats): Average utilization 40% → 95%
- Cost per active seat: $58/month → $30/month (50% reduction)
- ROI: Teams need fewer seats for same developer count
✅ Better Developer Experience
- No "seat exhausted" errors due to zombies
- Predictable seat availability
- Transparent cleanup (visible in logs and events)
✅ Audit Trail
- All cleanups recorded in
LicenseEventtable - Queryable history: "When was session X cleaned up?"
- Compliance-ready (SOC 2, ISO 27001)
✅ Monitoring & Alerting
- Prometheus metrics for cleanup rate, duration, success
- Grafana dashboards for real-time visibility
- Alerts for abnormal cleanup patterns
✅ Reliability
- Atomic Redis operations prevent race conditions
- Celery retries on transient failures
- Graceful degradation (cleanup fails don't break license API)
Negative
⚠️ Infrastructure Dependency
- Requires Celery worker + Celery beat (2 processes)
- Requires RabbitMQ or Redis for broker
- Increases operational complexity
⚠️ False Positive Risk
- Network blips >6 minutes incorrectly clean up active sessions
- Mitigation: 6-minute TTL provides buffer (5-min heartbeat + 1-min tolerance)
- Monitoring: Alert on high cleanup rate (>10% of sessions)
⚠️ Delayed Seat Release
- Seats not released immediately (up to 6-minute delay)
- Mitigation: 60-second cleanup interval minimizes delay
- Alternative: Client-side explicit release for immediate cleanup
⚠️ CPU Overhead
- Cleanup task runs every 60 seconds
- Cost: ~10ms per license (100 licenses = 1 second total)
- Negligible impact for <1,000 licenses
Neutral
🔄 Tunable TTL
- 6-minute TTL is configurable via environment variable
- Teams can adjust based on network reliability
- Default: 360 seconds (6 minutes)
🔄 Celery Infrastructure Reuse
- Celery also used for expiration checks, billing, usage aggregation
- Shared infrastructure cost across multiple features
Monitoring & Alerts
Prometheus Metrics
Metrics Exported:
coditect_zombie_sessions_cleaned_total{license_key, tenant_id}- Cleanup countercoditect_zombie_cleanup_duration_seconds- Cleanup duration histogramcoditect_active_sessions{license_key, tenant_id}- Active session countcoditect_zombie_sessions_detected_total{license_key, tenant_id}- Detection counter
Prometheus Alert Rules:
# File: monitoring/prometheus/alerts/zombie-cleanup.yml
groups:
- name: zombie_cleanup
interval: 60s
rules:
- alert: HighZombieCleanupRate
expr: |
(
rate(coditect_zombie_sessions_cleaned_total[5m])
/
rate(coditect_active_sessions[5m])
) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High zombie session cleanup rate for {{ $labels.license_key }}"
description: "{{ $value | humanizePercentage }} of sessions are being cleaned up as zombies. Investigate network issues or heartbeat reliability."
- alert: ZombieCleanupTaskFailing
expr: |
increase(celery_task_failure_total{task="licenses.tasks.cleanup_zombie_sessions"}[15m]) > 3
for: 5m
labels:
severity: critical
annotations:
summary: "Zombie cleanup task failing repeatedly"
description: "Cleanup task failed {{ $value }} times in last 15 minutes. Check Celery worker logs and Redis connectivity."
- alert: ZombieCleanupSlow
expr: |
histogram_quantile(0.95, rate(coditect_zombie_cleanup_duration_seconds_bucket[5m])) > 10
for: 10m
labels:
severity: warning
annotations:
summary: "Zombie cleanup task running slowly"
description: "p95 cleanup duration is {{ $value }}s. Expected <10s. Check Redis performance and license count."
Grafana Dashboards
Dashboard Panels:
- Cleanup Rate - Line graph of zombies cleaned per minute
- Active Sessions - Gauge of current active sessions per license
- Cleanup Duration - Histogram of p50/p95/p99 duration
- Cleanup Success Rate - Percentage of cleanups that succeeded
Testing Strategy
Unit Tests
Coverage:
- Zombie detection logic
- Redis cleanup atomicity
- Event recording
- Metric emission
- Error handling (Redis errors, invalid JSON)
Test Files:
backend/licenses/tests/test_zombie_cleanup.py(implemented above)
Integration Tests
Scenarios:
- End-to-end cleanup (acquire seat → no heartbeat → cleanup after 6 min)
- Race condition (cleanup during seat acquisition)
- Multiple zombies on single license
- Cleanup across multiple licenses
Load Tests
Goals:
- Verify cleanup completes <60 seconds for 1,000 licenses
- Verify no performance impact on license API during cleanup
Test Script:
# File: backend/licenses/tests/load/test_zombie_cleanup_load.py
import time
import redis
import json
from locust import User, task, between
class ZombieCleanupLoadTest(User):
wait_time = between(1, 5)
def on_start(self):
self.redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
@task
def create_zombie_sessions(self):
"""Simulate zombie session creation."""
license_key = f"LIC-LOAD-TEST-{self.environment.runner.user_count}"
for i in range(10):
session_id = f"load_test_session_{i}"
# Create session with old heartbeat (instant zombie)
self.redis_client.sadd(f"{license_key}:active_sessions", session_id)
self.redis_client.hset(
f"{license_key}:session_metadata",
session_id,
json.dumps({
'acquired_at': int(time.time()) - 600,
'last_heartbeat': int(time.time()) - 400, # 400s ago
'user_email': f'load_test_{i}@test.com'
})
)
# Run: locust -f test_zombie_cleanup_load.py --host http://localhost:8000
Related ADRs
- ADR-001: Floating vs Node-Locked Licenses (establishes floating license pattern requiring zombie cleanup)
- ADR-002: Redis Lua Scripts for Atomic Operations (atomic seat counting prevents race conditions)
- ADR-003: Check-on-Init Enforcement Pattern (session acquisition workflow)
- ADR-008: Offline Grace Period Implementation (TTL reasoning and offline detection)
- ADR-012: License Expiration and Renewal (complementary lifecycle management)
References
- Celery Periodic Tasks
- Redis EXPIRE vs Application-Level TTL
- Prometheus Best Practices for Metrics
- Grafana Alerting
- SLASCONE Zombie Session Handling
Last Updated: 2025-11-30 Owner: Backend Team Review Cycle: Quarterly or on significant session management changes Related Systems: Celery, Redis, Prometheus, Grafana