
Deployment Troubleshooting - Lessons Learned

Date: December 2, 2025
Context: End-to-end user registration deployment to staging

Summary

This document captures the issues found while deploying the complete user registration flow to staging. They surfaced one at a time while attempting to apply Django migrations, underscoring the importance of looking for shared root causes rather than fixing symptoms in isolation.


Issues Discovered and Resolved

1. Frontend Field Name Mismatch

Status: ✅ Fixed
Location: frontend/src/types/index.ts, frontend/src/pages/Dashboard.tsx

Problem:

  • Frontend sending name and company, backend expecting full_name and company_name
  • Result: HTTP 400 errors on registration

Root Cause: Frontend TypeScript types didn't match Django model field names

Solution: Updated all frontend types and components to use snake_case field names matching backend

Files Changed:

  • src/types/index.ts - Updated RegisterFormData and User interfaces
  • src/pages/Register.tsx - Updated form field names
  • src/pages/Dashboard.tsx - Updated display field names

Prevention:

  • Create contract tests between frontend/backend
  • Generate TypeScript types from Django models automatically
  • Add API schema validation
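
The contract-test idea above can be sketched as a minimal payload check. The field names mirror the ones from this incident; a real test would validate against the actual backend serializer rather than a hardcoded set.

```python
# Minimal contract-style check: the payload the frontend sends must use the
# exact snake_case field names the backend expects. Field set is illustrative.
BACKEND_REGISTRATION_FIELDS = {"full_name", "company_name", "email", "password"}

def validate_registration_payload(payload: dict) -> list[str]:
    """Return the payload keys the backend would reject as unknown."""
    return sorted(set(payload) - BACKEND_REGISTRATION_FIELDS)

# The buggy frontend payload from issue #1:
bad = {"name": "Ada", "company": "Acme", "email": "a@x.io", "password": "s3cret"}
# The corrected payload:
good = {"full_name": "Ada", "company_name": "Acme", "email": "a@x.io", "password": "s3cret"}

assert validate_registration_payload(bad) == ["company", "name"]
assert validate_registration_payload(good) == []
```

A CI job running this kind of assertion against the real serializer would have turned the HTTP 400 into a build failure.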

2. Database Schema Missing Fields

Status: ⏸️ Blocked by migration issues
Location: Backend database, tenants/models.py

Problem:

  • Database missing stripe_customer_id column on organizations table
  • Result: HTTP 500 errors on registration

Root Cause: Django migrations not applied to staging database

Files Changed:

  • Created scripts/apply-migrations.sh - Migration deployment script
  • Created k8s/migration-job.yaml - Kubernetes Job for migrations

3. Wrong Nodepool Name in Migration Job

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration Job pod stuck in "Pending" status
  • Error: "node(s) didn't match Pod's node affinity/selector"

Root Cause: Node affinity specified "default-pool" but cluster uses "primary-node-pool"

Solution: Removed node affinity entirely (unnecessary for migration job)

Lesson Learned: Always check actual cluster node labels before hardcoding affinity rules

Investigation Commands:

# Check actual nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Describe pod to see scheduling failures
kubectl describe pod -n NAMESPACE POD_NAME

4. Wrong Docker Registry

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Initially tried gcr.io/... (Google Container Registry)
  • Actual images in us-central1-docker.pkg.dev/... (Artifact Registry)

Root Cause: Assumed GCR when project uses Artifact Registry

Solution: Checked deployment's actual image location and matched it

Lesson Learned: Always verify actual image location from running deployment, don't assume

Investigation Commands:

# Get actual image from deployment
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

5. Wrong Secret Key Names

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration job referenced database-user and database-password
  • Actual secret keys are db-user and db-password
  • Error: "couldn't find key database-user in Secret"

Root Cause: Guessed secret key names instead of checking actual secret

Solution: Listed secret keys and matched deployment configuration exactly

Lesson Learned: Always inspect actual Kubernetes secrets before referencing them

Investigation Commands:

# List actual secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o json | jq -r '.data | keys[]'

# Check deployment's secret references
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 5 "secretKeyRef"

Correct Environment Variables:

  • DB_NAME (from secret key db-name)
  • DB_USER (from secret key db-user)
  • DB_PASSWORD (from secret key db-password)
  • DB_HOST (from secret key db-host)
  • DB_PORT (hardcoded "5432")
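
A sketch of how a settings module might consume the variables listed above (the real settings code in license_platform may differ; the required/default split matches this list, with DB_PORT falling back to the hardcoded "5432"):

```python
# Sketch: assemble database settings from the env vars the migration job
# must provide. Failing fast on missing vars reproduces the kind of error
# seen in issue #6, but at startup with a clear message.
def build_db_settings(env: dict[str, str]) -> dict[str, str]:
    required = ["DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST"]
    missing = [k for k in required if k not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {k: env[k] for k in required} | {"DB_PORT": env.get("DB_PORT", "5432")}

settings = build_db_settings(
    {"DB_NAME": "app", "DB_USER": "svc", "DB_PASSWORD": "x", "DB_HOST": "10.0.0.5"}
)
assert settings["DB_PORT"] == "5432"
```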

6. Missing DJANGO_SECRET_KEY

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration job failed with "DJANGO_SECRET_KEY environment variable must be set"
  • Deployment has it, but migration job didn't

Root Cause: Copied only database-related env vars, missed Django-required ones

Solution: Added DJANGO_SECRET_KEY from secret (key: django-secret-key)

Lesson Learned: Copy ALL environment variables from working deployment to migration job

Prevention: Create a shared environment variable ConfigMap for common values


7. Wrong Django Settings Module

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Initially used DJANGO_SETTINGS_MODULE: "core.settings"
  • Deployment uses "license_platform.settings.staging"

Root Cause: Guessed settings module name

Solution: Matched deployment's exact DJANGO_SETTINGS_MODULE value

Lesson Learned: Django settings module must match exactly across all components


8. Django Migration Conflict

Status: ⏸️ In Progress
Location: licenses/migrations/

Problem:

  • Two migrations numbered 0002_: 0002_initial.py and 0002_add_renewal_fields.py
  • Created parallel branches in migration graph
  • Error: "Conflicting migrations detected; multiple leaf nodes"

Root Cause: Migrations created in parallel (different branches or developers) without coordination

Solution: Created merge migration 0007_merge_20251202_0703.py using:

python manage.py makemigrations --merge --noinput licenses

Next Steps:

  1. Commit merge migration
  2. Build new Docker image
  3. Update migration job to use new image
  4. Run migrations

Lesson Learned:

  • Always pull latest migrations before creating new ones
  • Use migration conflict detection in CI/CD
  • Consider migration locking mechanism for parallel development

Prevention:

  • Pre-commit hook to check for conflicting migrations
  • CI check that fails if multiple leaf nodes exist
  • Team communication about migration creation
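
The CI check proposed above ("fail if multiple leaf nodes exist") reduces to finding migrations nobody depends on. A sketch, with a toy dependency map standing in for Django's real migration loader:

```python
# CI-style check: a migration graph should have exactly one leaf per app
# (a migration no other migration depends on). Two leaves means a conflict
# like the 0002_initial / 0002_add_renewal_fields pair above.
def find_leaf_migrations(deps: dict[str, list[str]]) -> list[str]:
    """deps maps migration name -> list of migrations it depends on."""
    depended_on = {parent for parents in deps.values() for parent in parents}
    return sorted(m for m in deps if m not in depended_on)

# The conflicting graph from issue #8 (simplified):
conflicted = {
    "0001_initial": [],
    "0002_initial": ["0001_initial"],
    "0002_add_renewal_fields": ["0001_initial"],
}
assert find_leaf_migrations(conflicted) == ["0002_add_renewal_fields", "0002_initial"]

# After the merge migration, the graph has a single leaf again:
merged = conflicted | {"0007_merge": ["0002_initial", "0002_add_renewal_fields"]}
assert find_leaf_migrations(merged) == ["0007_merge"]
```

In practice a pre-commit hook can skip the graph walk entirely: running `makemigrations --check` fails on a conflicted graph before the merge exists.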

Holistic Issue Grouping

Configuration Consistency Issues

Related issues that share a root cause:

  • Wrong secret key names (#5)
  • Missing DJANGO_SECRET_KEY (#6)
  • Wrong Django settings module (#7)

Underlying Problem: Migration job configuration not matching deployment configuration

Holistic Solution:

  1. Create a script that generates migration job YAML from deployment YAML
  2. Extract common environment variables into shared ConfigMap
  3. Use Helm charts or Kustomize for consistent configuration
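
The validation in step 1 can be sketched as a set comparison over env var names. In practice the two dicts would come from parsing the rendered manifests (e.g. `kubectl get ... -o json`); here they are inlined for illustration:

```python
# Sketch of the "migration job env must match deployment env" validation
# that would have caught issues #5-#7 before the job ever ran.
def env_drift(deployment_env: dict[str, str], job_env: dict[str, str]) -> dict[str, list[str]]:
    return {
        "missing_in_job": sorted(set(deployment_env) - set(job_env)),
        "extra_in_job": sorted(set(job_env) - set(deployment_env)),
    }

deployment = {
    "DB_USER": "...", "DB_PASSWORD": "...",
    "DJANGO_SECRET_KEY": "...",
    "DJANGO_SETTINGS_MODULE": "license_platform.settings.staging",
}
job = {"DB_USER": "...", "DB_PASSWORD": "..."}  # the state behind issues #6 and #7

drift = env_drift(deployment, job)
assert drift["missing_in_job"] == ["DJANGO_SECRET_KEY", "DJANGO_SETTINGS_MODULE"]
assert drift["extra_in_job"] == []
```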

Infrastructure Assumptions

Related issues that stem from assumptions:

  • Wrong nodepool name (#3)
  • Wrong Docker registry (#4)

Underlying Problem: Hardcoded infrastructure assumptions instead of dynamic discovery

Holistic Solution:

  1. Always inspect actual cluster state before creating manifests
  2. Document actual infrastructure (nodepool names, registry locations)
  3. Use terraform/OpenTofu outputs for infrastructure values

Code Synchronization Issues

Related issues from parallel development:

  • Frontend field name mismatch (#1)
  • Django migration conflict (#8)

Underlying Problem: Lack of synchronization between frontend/backend and parallel migration creation

Holistic Solution:

  1. Generate frontend types from backend schemas
  2. Implement pre-commit hooks for migration conflict detection
  3. Add contract tests between frontend and backend
  4. Establish migration creation workflow (single source of truth)
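
Step 1 above (generating frontend types from backend schemas) can be sketched as a small code generator. The field map and type table are illustrative; a real generator would introspect the Django model or an OpenAPI schema rather than take a hand-written dict:

```python
# Sketch: emit a TypeScript interface from a backend field specification so
# frontend types can never drift from the snake_case names (issue #1).
TS_TYPES = {"CharField": "string", "EmailField": "string", "BooleanField": "boolean"}

def to_ts_interface(name: str, fields: dict[str, str]) -> str:
    lines = [f"export interface {name} {{"]
    lines += [f"  {field}: {TS_TYPES.get(kind, 'unknown')};" for field, kind in fields.items()]
    lines.append("}")
    return "\n".join(lines)

ts = to_ts_interface("RegisterFormData", {
    "full_name": "CharField",
    "company_name": "CharField",
    "email": "EmailField",
})
assert "full_name: string;" in ts
assert ts.startswith("export interface RegisterFormData {")
```

Running this in CI and diffing the output against src/types/index.ts turns a field rename into a failing build instead of an HTTP 400 in staging.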

Prevention Strategies

1. Configuration Management

  • Extract common environment variables into ConfigMap
  • Create script to generate migration job from deployment
  • Use Helm charts for configuration consistency
  • Add validation that migration job env matches deployment env

2. Testing & Validation

  • Add contract tests between frontend and backend
  • Add migration conflict detection to CI/CD
  • Add pre-commit hook to check for migration conflicts
  • Test migrations in isolated environment before deployment

3. Documentation

  • Document actual infrastructure (nodepools, registries, secret keys)
  • Create deployment runbook with verification steps
  • Document Django settings modules for each environment
  • Maintain this troubleshooting guide

4. Development Workflow

  • Establish migration creation workflow
  • Implement migration locking for parallel development
  • Generate TypeScript types from Django models
  • Add API schema validation

Quick Reference: Investigation Commands

# Check nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Get deployment's actual image
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

# List secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o json | jq -r '.data | keys[]'

# Check deployment environment variables
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 10 "env:"

# Check for Django migration conflicts (errors if multiple leaf nodes exist)
python manage.py makemigrations --check --dry-run

# List migrations and their applied status
python manage.py showmigrations --list

# Describe failing pod
kubectl describe pod POD_NAME -n NAMESPACE

# Get pod logs
kubectl logs POD_NAME -n NAMESPACE --tail=100

Current Status

Resolved Issues

  1. ✅ Frontend field name mismatch
  2. ✅ Migration job nodepool affinity
  3. ✅ Docker registry location
  4. ✅ Secret key names
  5. ✅ Missing DJANGO_SECRET_KEY
  6. ✅ Django settings module name
  7. ✅ Merge migration created

Remaining Work

  1. ⏸️ Commit merge migration
  2. ⏸️ Build new Docker image with merge migration
  3. ⏸️ Push image to Artifact Registry
  4. ⏸️ Update migration job to use new image
  5. ⏸️ Run migration job successfully
  6. ⏸️ Test end-to-end registration flow
  7. ⏸️ Create backend/frontend contract tests

Appendix: All Files Modified

Backend Files

  • scripts/apply-migrations.sh (created)
  • k8s/migration-job.yaml (created, fixed 6 times)
  • licenses/migrations/0007_merge_20251202_0703.py (created)

Frontend Files (from previous session)

  • src/types/index.ts
  • src/pages/Register.tsx
  • src/pages/Dashboard.tsx

Documentation Files

  • docs/troubleshooting-deployment-issues.md (this file)

Conclusion: This holistic documentation captures not just individual fixes, but the patterns and underlying causes that led to multiple related issues. By grouping related problems and documenting prevention strategies, we ensure these issues don't recur.

Next Session: Use this document as a reference when encountering similar deployment issues.