Deployment Troubleshooting - Lessons Learned
Date: December 2, 2025
Context: End-to-end user registration deployment to staging
Summary
This document captures all issues discovered while deploying the complete user registration flow to staging. The issues surfaced incrementally while attempting to apply Django migrations, which is why they are grouped here by underlying cause rather than treated as isolated fixes.
Issues Discovered and Resolved
1. Frontend Field Name Mismatch
Status: ✅ Fixed
Location: `frontend/src/types/index.ts`, `frontend/src/pages/Dashboard.tsx`
Problem:
- Frontend sending `name` and `company`; backend expecting `full_name` and `company_name`
- Result: HTTP 400 errors on registration
Root Cause: Frontend TypeScript types didn't match Django model field names
Solution: Updated all frontend types and components to use snake_case field names matching backend
Files Changed:
- `src/types/index.ts` - Updated RegisterFormData and User interfaces
- `src/pages/Register.tsx` - Updated form field names
- `src/pages/Dashboard.tsx` - Updated display field names
Prevention:
- Create contract tests between frontend/backend
- Generate TypeScript types from Django models automatically
- Add API schema validation
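A minimal contract check can catch this class of mismatch in CI before it reaches staging. The sketch below is illustrative: in a real setup the backend set would come from the Django serializer and the frontend set from the TypeScript types, rather than being hardcoded.

```python
# Hypothetical contract-test sketch: compare the field names the frontend
# sends against the field names the backend expects. The field sets below
# reproduce the exact mismatch from issue #1.

def contract_mismatches(frontend_fields: set[str], backend_fields: set[str]) -> dict:
    """Return fields present on one side of the contract but not the other."""
    return {
        "frontend_only": sorted(frontend_fields - backend_fields),
        "backend_only": sorted(backend_fields - frontend_fields),
    }

frontend = {"email", "password", "name", "company"}
backend = {"email", "password", "full_name", "company_name"}

mismatches = contract_mismatches(frontend, backend)

# A CI contract test would fail on either non-empty list:
assert mismatches["frontend_only"] == ["company", "name"]
assert mismatches["backend_only"] == ["company_name", "full_name"]
```

Wiring the real serializer fields and generated TypeScript types into this check turns an HTTP 400 in staging into a failing test on the pull request.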
2. Database Schema Missing Fields
Status: ⏸️ Blocked by migration issues
Location: Backend database, `tenants/models.py`
Problem:
- Database missing `stripe_customer_id` column on the organizations table
- Result: HTTP 500 errors on registration
Root Cause: Django migrations not applied to staging database
Files Changed:
- Created `scripts/apply-migrations.sh` - Migration deployment script
- Created `k8s/migration-job.yaml` - Kubernetes Job for migrations
3. Wrong Nodepool Name in Migration Job
Status: ✅ Fixed
Location: `k8s/migration-job.yaml`
Problem:
- Migration Job pod stuck in "Pending" status
- Error: "node(s) didn't match Pod's node affinity/selector"
Root Cause: Node affinity specified "default-pool" but cluster uses "primary-node-pool"
Solution: Removed node affinity entirely (unnecessary for migration job)
Lesson Learned: Always check actual cluster node labels before hardcoding affinity rules
Investigation Commands:
```bash
# Check actual nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Describe pod to see scheduling failures
kubectl describe pod -n NAMESPACE POD_NAME
```
4. Wrong Docker Registry
Status: ✅ Fixed
Location: `k8s/migration-job.yaml`
Problem:
- Initially tried `gcr.io/...` (Google Container Registry)
- Actual images are in `us-central1-docker.pkg.dev/...` (Artifact Registry)
Root Cause: Assumed GCR when project uses Artifact Registry
Solution: Checked deployment's actual image location and matched it
Lesson Learned: Always verify the actual image location from the running deployment; don't assume the registry
Investigation Commands:
```bash
# Get actual image from deployment
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'
```
5. Wrong Secret Key Names
Status: ✅ Fixed
Location: `k8s/migration-job.yaml`
Problem:
- Migration job referenced `database-user` and `database-password`
- Actual secret keys are `db-user` and `db-password`
- Error: "couldn't find key database-user in Secret"
Root Cause: Guessed secret key names instead of checking actual secret
Solution: Listed secret keys and matched deployment configuration exactly
Lesson Learned: Always inspect actual Kubernetes secrets before referencing them
Investigation Commands:
```bash
# List actual secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o jsonpath='{.data}' | jq -r 'keys[]'

# Check deployment's secret references
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 5 "secretKeyRef"
```
Correct Environment Variables:
- `DB_NAME` (from secret key `db-name`)
- `DB_USER` (from secret key `db-user`)
- `DB_PASSWORD` (from secret key `db-password`)
- `DB_HOST` (from secret key `db-host`)
- `DB_PORT` (hardcoded `"5432"`)
6. Missing DJANGO_SECRET_KEY
Status: ✅ Fixed
Location: `k8s/migration-job.yaml`
Problem:
- Migration job failed with "DJANGO_SECRET_KEY environment variable must be set"
- Deployment has it, but migration job didn't
Root Cause: Copied only database-related env vars, missed Django-required ones
Solution:
Added DJANGO_SECRET_KEY from secret (key: django-secret-key)
Lesson Learned: Copy ALL environment variables from the working deployment to the migration job, not just the ones that look relevant
Prevention: Create a shared environment variable ConfigMap for common values
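Until a shared ConfigMap exists, the "did I copy every env var?" question can be automated. The sketch below diffs the env lists of two containers in the shape `kubectl get ... -o json` returns under `.spec.template.spec.containers[0].env`; the env entries are illustrative, mirroring this incident.

```python
# Sketch of a check that the migration job defines every env var the
# deployment defines. Input dicts follow the standard Kubernetes container
# env layout; the concrete entries below are examples from this incident.

def missing_env_vars(deployment_env: list[dict], job_env: list[dict]) -> list[str]:
    """Names set on the deployment's container but absent from the job's."""
    job_names = {e["name"] for e in job_env}
    return sorted(e["name"] for e in deployment_env if e["name"] not in job_names)

deployment_env = [
    {"name": "DB_USER", "valueFrom": {"secretKeyRef": {"key": "db-user"}}},
    {"name": "DB_PASSWORD", "valueFrom": {"secretKeyRef": {"key": "db-password"}}},
    {"name": "DJANGO_SECRET_KEY", "valueFrom": {"secretKeyRef": {"key": "django-secret-key"}}},
]
job_env = [
    {"name": "DB_USER", "valueFrom": {"secretKeyRef": {"key": "db-user"}}},
    {"name": "DB_PASSWORD", "valueFrom": {"secretKeyRef": {"key": "db-password"}}},
]

# Exactly the gap that caused issue #6:
assert missing_env_vars(deployment_env, job_env) == ["DJANGO_SECRET_KEY"]
```

Run against the live manifests (e.g. piped from `kubectl ... -o json`), this turns a crashed migration job into a pre-deploy validation failure.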
7. Wrong Django Settings Module
Status: ✅ Fixed
Location: `k8s/migration-job.yaml`
Problem:
- Initially used `DJANGO_SETTINGS_MODULE: "core.settings"`
- Deployment uses `"license_platform.settings.staging"`
Root Cause: Guessed settings module name
Solution: Matched deployment's exact DJANGO_SETTINGS_MODULE value
Lesson Learned: Django settings module must match exactly across all components
8. Django Migration Conflict
Status: ⏸️ In Progress
Location: `licenses/migrations/`
Problem:
- Two migrations numbered `0002_`: `0002_initial.py` and `0002_add_renewal_fields.py`
- Created parallel branches in the migration graph
- Error: "Conflicting migrations detected; multiple leaf nodes"
Root Cause: Migrations created in parallel (different branches or developers) without coordination
Solution:
Created merge migration `0007_merge_20251202_0703.py` using:

```bash
python manage.py makemigrations --merge --noinput licenses
```
Next Steps:
- Commit merge migration
- Build new Docker image
- Update migration job to use new image
- Run migrations
Lesson Learned:
- Always pull latest migrations before creating new ones
- Use migration conflict detection in CI/CD
- Consider migration locking mechanism for parallel development
Prevention:
- Pre-commit hook to check for conflicting migrations
- CI check that fails if multiple leaf nodes exist
- Team communication about migration creation
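The CI leaf-node check suggested above is simple to sketch: Django reports a conflict when an app's migration graph has more than one migration that nothing depends on. Modeling the graph as `{migration: [dependencies]}`, a minimal illustrative version (not Django's actual loader) looks like:

```python
# Sketch of a "fail CI on multiple leaf nodes" check. A leaf is a migration
# that no other migration depends on; two leaves in one app means a conflict.
# This models the graph as plain dicts rather than using Django's loader.

def leaf_migrations(graph: dict[str, list[str]]) -> list[str]:
    """Migrations that nothing else depends on (the graph's leaf nodes)."""
    depended_on = {dep for deps in graph.values() for dep in deps}
    return sorted(m for m in graph if m not in depended_on)

# The conflicting state from issue #8: two parallel 0002_ branches.
conflicted = {
    "0001_initial": [],
    "0002_initial": ["0001_initial"],
    "0002_add_renewal_fields": ["0001_initial"],
}
assert len(leaf_migrations(conflicted)) == 2  # CI should fail here

# After `makemigrations --merge`, the merge migration depends on both leaves:
merged = dict(conflicted)
merged["0007_merge_20251202_0703"] = ["0002_initial", "0002_add_renewal_fields"]
assert leaf_migrations(merged) == ["0007_merge_20251202_0703"]  # single leaf again
```

In practice the same signal is available without custom code: running `python manage.py makemigrations --check` in CI exits non-zero on conflicting or missing migrations.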
Holistic Issue Grouping
Configuration Consistency Issues
Related issues that share a root cause:
- Wrong secret key names (#5)
- Missing DJANGO_SECRET_KEY (#6)
- Wrong Django settings module (#7)
Underlying Problem: Migration job configuration not matching deployment configuration
Holistic Solution:
- Create a script that generates migration job YAML from deployment YAML
- Extract common environment variables into shared ConfigMap
- Use Helm charts or Kustomize for consistent configuration
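Generating the migration Job from the deployment manifest makes issues #4-#7 structurally impossible, because image, env vars, and secret references are copied rather than retyped. A hedged sketch (field paths follow the standard Deployment/Job layout; the sample deployment dict is illustrative):

```python
# Sketch: derive the migration Job spec from the deployment spec so the
# image, env, and secret references can never drift apart (issues #4-#7).

def migration_job_from_deployment(deployment: dict, command: list[str]) -> dict:
    """Build a Job manifest reusing the deployment's container config."""
    container = deployment["spec"]["template"]["spec"]["containers"][0]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{deployment['metadata']['name']}-migrate"},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "migrate",
                        "image": container["image"],      # same registry and tag
                        "env": container.get("env", []),  # ALL env vars, verbatim
                        "command": command,
                    }],
                },
            },
        },
    }

deployment = {
    "metadata": {"name": "backend"},
    "spec": {"template": {"spec": {"containers": [{
        "name": "backend",
        "image": "us-central1-docker.pkg.dev/PROJECT/repo/backend:abc123",
        "env": [{"name": "DJANGO_SETTINGS_MODULE",
                 "value": "license_platform.settings.staging"}],
    }]}}},
}
job = migration_job_from_deployment(deployment, ["python", "manage.py", "migrate"])
assert job["spec"]["template"]["spec"]["containers"][0]["image"].startswith(
    "us-central1-docker.pkg.dev")
```

Feeding this from `kubectl get deployment ... -o json` and serializing the result to YAML would replace the hand-maintained `k8s/migration-job.yaml` entirely.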
Infrastructure Assumptions
Related issues that stem from assumptions:
- Wrong nodepool name (#3)
- Wrong Docker registry (#4)
Underlying Problem: Hardcoded infrastructure assumptions instead of dynamic discovery
Holistic Solution:
- Always inspect actual cluster state before creating manifests
- Document actual infrastructure (nodepool names, registry locations)
- Use terraform/OpenTofu outputs for infrastructure values
Code Synchronization Issues
Related issues from parallel development:
- Frontend field name mismatch (#1)
- Django migration conflict (#8)
Underlying Problem: Lack of synchronization between frontend/backend and parallel migration creation
Holistic Solution:
- Generate frontend types from backend schemas
- Implement pre-commit hooks for migration conflict detection
- Add contract tests between frontend and backend
- Establish migration creation workflow (single source of truth)
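The "generate frontend types from backend schemas" idea can be as small as emitting a TypeScript interface from the backend's field definitions, so snake_case names flow into the frontend automatically. The field-to-type mapping and field list below are illustrative, mirroring issue #1's registration form:

```python
# Sketch of generating a TypeScript interface from Django-style field kinds.
# The mapping and field list are examples; a real generator would introspect
# the serializer or use an OpenAPI schema.

DJANGO_TO_TS = {
    "CharField": "string",
    "EmailField": "string",
    "BooleanField": "boolean",
    "IntegerField": "number",
}

def ts_interface(name: str, fields: dict[str, str]) -> str:
    lines = [f"export interface {name} {{"]
    lines += [f"  {field}: {DJANGO_TO_TS[kind]};" for field, kind in fields.items()]
    lines.append("}")
    return "\n".join(lines)

register_fields = {
    "email": "EmailField",
    "password": "CharField",
    "full_name": "CharField",
    "company_name": "CharField",
}

print(ts_interface("RegisterFormData", register_fields))
```

Checking the generated file into `src/types/` (or failing CI when it differs from the committed version) removes the manual renaming step that caused the original 400s.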
Prevention Strategies
1. Configuration Management
- Extract common environment variables into ConfigMap
- Create script to generate migration job from deployment
- Use Helm charts for configuration consistency
- Add validation that migration job env matches deployment env
2. Testing & Validation
- Add contract tests between frontend and backend
- Add migration conflict detection to CI/CD
- Add pre-commit hook to check for migration conflicts
- Test migrations in isolated environment before deployment
3. Documentation
- Document actual infrastructure (nodepools, registries, secret keys)
- Create deployment runbook with verification steps
- Document Django settings modules for each environment
- Maintain this troubleshooting guide
4. Development Workflow
- Establish migration creation workflow
- Implement migration locking for parallel development
- Generate TypeScript types from Django models
- Add API schema validation
Quick Reference: Investigation Commands
```bash
# Check nodepool names
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}'

# Get deployment's actual image
kubectl get deployment NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

# List secret keys
kubectl get secret SECRET_NAME -n NAMESPACE -o jsonpath='{.data}' | jq -r 'keys[]'

# Check deployment environment variables
kubectl get deployment NAME -n NAMESPACE -o yaml | grep -A 10 "env:"

# Check for Django migration conflicts
python manage.py showmigrations --list

# Describe failing pod
kubectl describe pod POD_NAME -n NAMESPACE

# Get pod logs
kubectl logs POD_NAME -n NAMESPACE --tail=100
```
Current Status
Resolved Issues
- ✅ Frontend field name mismatch
- ✅ Migration job nodepool affinity
- ✅ Docker registry location
- ✅ Secret key names
- ✅ Missing DJANGO_SECRET_KEY
- ✅ Django settings module name
- ✅ Merge migration created
Remaining Work
- ⏸️ Commit merge migration
- ⏸️ Build new Docker image with merge migration
- ⏸️ Push image to Artifact Registry
- ⏸️ Update migration job to use new image
- ⏸️ Run migration job successfully
- ⏸️ Test end-to-end registration flow
- ⏸️ Create backend/frontend contract tests
Appendix: All Files Modified
Backend Files
- `scripts/apply-migrations.sh` (created)
- `k8s/migration-job.yaml` (created, fixed 6 times)
- `licenses/migrations/0007_merge_20251202_0703.py` (created)
Frontend Files (from previous session)
- `src/types/index.ts`
- `src/pages/Register.tsx`
- `src/pages/Dashboard.tsx`
Documentation Files
- `docs/troubleshooting-deployment-issues.md` (this file)
Conclusion: This holistic documentation captures not just individual fixes, but the patterns and underlying causes that led to multiple related issues. By grouping related problems and documenting prevention strategies, we ensure these issues don't recur.
Next Session: Use this document as a reference when encountering similar deployment issues.