
Deployment Troubleshooting - Lessons Learned

Date: December 2, 2025
Context: End-to-end user registration deployment to staging

Summary

This document captures the issues found while deploying the complete user registration flow to staging. They surfaced one at a time while attempting to apply Django migrations, underscoring the importance of looking for shared root causes rather than fixing symptoms in isolation.


Issues Discovered and Resolved

1. Frontend Field Name Mismatch

Status: ✅ Fixed
Location: frontend/src/types/index.ts, frontend/src/pages/Dashboard.tsx

Problem:

  • Frontend sending name and company, backend expecting full_name and company_name
  • Result: HTTP 400 errors on registration

Root Cause: Frontend TypeScript types didn't match Django model field names

Solution: Updated all frontend types and components to use snake_case field names matching backend

Files Changed:

  • src/types/index.ts - Updated RegisterFormData and User interfaces
  • src/pages/Register.tsx - Updated form field names
  • src/pages/Dashboard.tsx - Updated display field names

Prevention:

  • Create contract tests between frontend/backend
  • Generate TypeScript types from Django models automatically
  • Add API schema validation
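
The contract-test idea above can be sketched as a minimal payload check. The field names mirror the ones from this incident; a real test would validate against the actual backend serializer rather than a hardcoded set.

```python
# Minimal contract-style check: the payload the frontend sends must use the
# exact snake_case field names the backend expects. Field set is illustrative.
BACKEND_REGISTRATION_FIELDS = {"full_name", "company_name", "email", "password"}

def validate_registration_payload(payload: dict) -> list[str]:
    """Return the payload keys the backend would reject as unknown."""
    return sorted(set(payload) - BACKEND_REGISTRATION_FIELDS)

# The buggy frontend payload from issue #1:
bad = {"name": "Ada", "company": "Acme", "email": "a@x.io", "password": "s3cret"}
# The corrected payload:
good = {"full_name": "Ada", "company_name": "Acme", "email": "a@x.io", "password": "s3cret"}

assert validate_registration_payload(bad) == ["company", "name"]
assert validate_registration_payload(good) == []
```

A CI job running this kind of assertion against the real serializer would have turned the HTTP 400 into a build failure.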

2. Database Schema Missing Fields

Status: ⏸️ Blocked by migration issues
Location: Backend database, tenants/models.py

Problem:

  • Database missing stripe_customer_id column on organizations table
  • Result: HTTP 500 errors on registration

Root Cause: Django migrations not applied to staging database

Files Changed:

  • Created scripts/apply-migrations.sh - Migration deployment script
  • Created k8s/migration-job.yaml - Kubernetes Job for migrations

3. Wrong Nodepool Name in Migration Job

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration Job pod stuck in "Pending" status
  • Error: "node(s) didn't match Pod's node affinity/selector"

Root Cause: Node affinity specified "default-pool" but cluster uses "primary-node-pool"

Solution: Removed node affinity entirely (unnecessary for migration job)

Lesson Learned: Always check actual cluster node labels before hardcoding affinity rules

Investigation Commands:

# Check actual nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Describe pod to see scheduling failures
kubectl describe pod -n NAMESPACE POD_NAME

4. Wrong Docker Registry

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Initially tried gcr.io/... (Google Container Registry)
  • Actual images in us-central1-docker.pkg.dev/... (Artifact Registry)

Root Cause: Assumed GCR when project uses Artifact Registry

Solution: Checked deployment's actual image location and matched it

Lesson Learned: Always verify actual image location from running deployment, don't assume

Investigation Commands:

# Get actual image from deployment
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

5. Wrong Secret Key Names

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration job referenced database-user and database-password
  • Actual secret keys are db-user and db-password
  • Error: "couldn't find key database-user in Secret"

Root Cause: Guessed secret key names instead of checking actual secret

Solution: Listed secret keys and matched deployment configuration exactly

Lesson Learned: Always inspect actual Kubernetes secrets before referencing them

Investigation Commands:

# List actual secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o json | jq -r '.data | keys[]'

# Check deployment's secret references
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 5 "secretKeyRef"

Correct Environment Variables:

  • DB_NAME (from secret key db-name)
  • DB_USER (from secret key db-user)
  • DB_PASSWORD (from secret key db-password)
  • DB_HOST (from secret key db-host)
  • DB_PORT (hardcoded "5432")
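
A sketch of how a settings module might consume the variables listed above (the real settings code in license_platform may differ; the required/default split matches this list, with DB_PORT falling back to the hardcoded "5432"):

```python
# Sketch: assemble database settings from the env vars the migration job
# must provide. Failing fast on missing vars reproduces the kind of error
# seen in issue #6, but at startup with a clear message.
def build_db_settings(env: dict[str, str]) -> dict[str, str]:
    required = ["DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST"]
    missing = [k for k in required if k not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {k: env[k] for k in required} | {"DB_PORT": env.get("DB_PORT", "5432")}

settings = build_db_settings(
    {"DB_NAME": "app", "DB_USER": "svc", "DB_PASSWORD": "x", "DB_HOST": "10.0.0.5"}
)
assert settings["DB_PORT"] == "5432"
```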

6. Missing DJANGO_SECRET_KEY

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Migration job failed with "DJANGO_SECRET_KEY environment variable must be set"
  • Deployment has it, but migration job didn't

Root Cause: Copied only database-related env vars, missed Django-required ones

Solution: Added DJANGO_SECRET_KEY from secret (key: django-secret-key)

Lesson Learned: Copy ALL environment variables from working deployment to migration job

Prevention: Create a shared environment variable ConfigMap for common values


7. Wrong Django Settings Module

Status: ✅ Fixed
Location: k8s/migration-job.yaml

Problem:

  • Initially used DJANGO_SETTINGS_MODULE: "core.settings"
  • Deployment uses "license_platform.settings.staging"

Root Cause: Guessed settings module name

Solution: Matched deployment's exact DJANGO_SETTINGS_MODULE value

Lesson Learned: Django settings module must match exactly across all components


8. Django Migration Conflict

Status: ⏸️ In Progress
Location: licenses/migrations/

Problem:

  • Two migrations numbered 0002_: 0002_initial.py and 0002_add_renewal_fields.py
  • Created parallel branches in migration graph
  • Error: "Conflicting migrations detected; multiple leaf nodes"

Root Cause: Migrations created in parallel (different branches or developers) without coordination

Solution: Created merge migration 0007_merge_20251202_0703.py using:

python manage.py makemigrations --merge --noinput licenses

Next Steps:

  1. Commit merge migration
  2. Build new Docker image
  3. Update migration job to use new image
  4. Run migrations

Lesson Learned:

  • Always pull latest migrations before creating new ones
  • Use migration conflict detection in CI/CD
  • Consider migration locking mechanism for parallel development

Prevention:

  • Pre-commit hook to check for conflicting migrations
  • CI check that fails if multiple leaf nodes exist
  • Team communication about migration creation
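
The CI check proposed above ("fail if multiple leaf nodes exist") reduces to finding migrations nobody depends on. A sketch, with a toy dependency map standing in for Django's real migration loader:

```python
# CI-style check: a migration graph should have exactly one leaf per app
# (a migration no other migration depends on). Two leaves means a conflict
# like the 0002_initial / 0002_add_renewal_fields pair above.
def find_leaf_migrations(deps: dict[str, list[str]]) -> list[str]:
    """deps maps migration name -> list of migrations it depends on."""
    depended_on = {parent for parents in deps.values() for parent in parents}
    return sorted(m for m in deps if m not in depended_on)

# The conflicting graph from issue #8 (simplified):
conflicted = {
    "0001_initial": [],
    "0002_initial": ["0001_initial"],
    "0002_add_renewal_fields": ["0001_initial"],
}
assert find_leaf_migrations(conflicted) == ["0002_add_renewal_fields", "0002_initial"]

# After the merge migration, the graph has a single leaf again:
merged = conflicted | {"0007_merge": ["0002_initial", "0002_add_renewal_fields"]}
assert find_leaf_migrations(merged) == ["0007_merge"]
```

In practice a pre-commit hook can skip the graph walk entirely: running `makemigrations --check` fails on a conflicted graph before the merge exists.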

Holistic Issue Grouping

Configuration Consistency Issues

Related issues that share a root cause:

  • Wrong secret key names (#5)
  • Missing DJANGO_SECRET_KEY (#6)
  • Wrong Django settings module (#7)

Underlying Problem: Migration job configuration not matching deployment configuration

Holistic Solution:

  1. Create a script that generates migration job YAML from deployment YAML
  2. Extract common environment variables into shared ConfigMap
  3. Use Helm charts or Kustomize for consistent configuration
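
The validation in step 1 can be sketched as a set comparison over env var names. In practice the two dicts would come from parsing the rendered manifests (e.g. `kubectl get ... -o json`); here they are inlined for illustration:

```python
# Sketch of the "migration job env must match deployment env" validation
# that would have caught issues #5-#7 before the job ever ran.
def env_drift(deployment_env: dict[str, str], job_env: dict[str, str]) -> dict[str, list[str]]:
    return {
        "missing_in_job": sorted(set(deployment_env) - set(job_env)),
        "extra_in_job": sorted(set(job_env) - set(deployment_env)),
    }

deployment = {
    "DB_USER": "...", "DB_PASSWORD": "...",
    "DJANGO_SECRET_KEY": "...",
    "DJANGO_SETTINGS_MODULE": "license_platform.settings.staging",
}
job = {"DB_USER": "...", "DB_PASSWORD": "..."}  # the state behind issues #6 and #7

drift = env_drift(deployment, job)
assert drift["missing_in_job"] == ["DJANGO_SECRET_KEY", "DJANGO_SETTINGS_MODULE"]
assert drift["extra_in_job"] == []
```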

Infrastructure Assumptions

Related issues that stem from assumptions:

  • Wrong nodepool name (#3)
  • Wrong Docker registry (#4)

Underlying Problem: Hardcoded infrastructure assumptions instead of dynamic discovery

Holistic Solution:

  1. Always inspect actual cluster state before creating manifests
  2. Document actual infrastructure (nodepool names, registry locations)
  3. Use terraform/OpenTofu outputs for infrastructure values

Code Synchronization Issues

Related issues from parallel development:

  • Frontend field name mismatch (#1)
  • Django migration conflict (#8)

Underlying Problem: Lack of synchronization between frontend/backend and parallel migration creation

Holistic Solution:

  1. Generate frontend types from backend schemas
  2. Implement pre-commit hooks for migration conflict detection
  3. Add contract tests between frontend and backend
  4. Establish migration creation workflow (single source of truth)
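
Step 1 above (generating frontend types from backend schemas) can be sketched as a small code generator. The field map and type table are illustrative; a real generator would introspect the Django model or an OpenAPI schema rather than take a hand-written dict:

```python
# Sketch: emit a TypeScript interface from a backend field specification so
# frontend types can never drift from the snake_case names (issue #1).
TS_TYPES = {"CharField": "string", "EmailField": "string", "BooleanField": "boolean"}

def to_ts_interface(name: str, fields: dict[str, str]) -> str:
    lines = [f"export interface {name} {{"]
    lines += [f"  {field}: {TS_TYPES.get(kind, 'unknown')};" for field, kind in fields.items()]
    lines.append("}")
    return "\n".join(lines)

ts = to_ts_interface("RegisterFormData", {
    "full_name": "CharField",
    "company_name": "CharField",
    "email": "EmailField",
})
assert "full_name: string;" in ts
assert ts.startswith("export interface RegisterFormData {")
```

Running this in CI and diffing the output against src/types/index.ts turns a field rename into a failing build instead of an HTTP 400 in staging.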

Prevention Strategies

1. Configuration Management

  • Extract common environment variables into ConfigMap
  • Create script to generate migration job from deployment
  • Use Helm charts for configuration consistency
  • Add validation that migration job env matches deployment env

2. Testing & Validation

  • Add contract tests between frontend and backend
  • Add migration conflict detection to CI/CD
  • Add pre-commit hook to check for migration conflicts
  • Test migrations in isolated environment before deployment

3. Documentation

  • Document actual infrastructure (nodepools, registries, secret keys)
  • Create deployment runbook with verification steps
  • Document Django settings modules for each environment
  • Maintain this troubleshooting guide

4. Development Workflow

  • Establish migration creation workflow
  • Implement migration locking for parallel development
  • Generate TypeScript types from Django models
  • Add API schema validation

Quick Reference: Investigation Commands

# Check nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Get deployment's actual image
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

# List secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o json | jq -r '.data | keys[]'

# Check deployment environment variables
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 10 "env:"

# Check for Django migration conflicts (errors if multiple leaf nodes exist)
python manage.py makemigrations --check --dry-run

# List migrations and their applied status
python manage.py showmigrations --list

# Describe failing pod
kubectl describe pod POD_NAME -n NAMESPACE

# Get pod logs
kubectl logs POD_NAME -n NAMESPACE --tail=100

Current Status

Resolved Issues

  1. ✅ Frontend field name mismatch
  2. ✅ Migration job nodepool affinity
  3. ✅ Docker registry location
  4. ✅ Secret key names
  5. ✅ Missing DJANGO_SECRET_KEY
  6. ✅ Django settings module name
  7. ✅ Merge migration created

Remaining Work

  1. ⏸️ Commit merge migration
  2. ⏸️ Build new Docker image with merge migration
  3. ⏸️ Push image to Artifact Registry
  4. ⏸️ Update migration job to use new image
  5. ⏸️ Run migration job successfully
  6. ⏸️ Test end-to-end registration flow
  7. ⏸️ Create backend/frontend contract tests

Appendix: All Files Modified

Backend Files

  • scripts/apply-migrations.sh (created)
  • k8s/migration-job.yaml (created, fixed 6 times)
  • licenses/migrations/0007_merge_20251202_0703.py (created)

Frontend Files (from previous session)

  • src/types/index.ts
  • src/pages/Register.tsx
  • src/pages/Dashboard.tsx

Documentation Files

  • docs/troubleshooting-deployment-issues.md (this file)

Conclusion: This holistic documentation captures not just individual fixes, but the patterns and underlying causes that led to multiple related issues. By grouping related problems and documenting prevention strategies, we ensure these issues don't recur.

Next Session: Use this document as a reference when encountering similar deployment issues.