
Staging Deployment Status - CODITECT Cloud Backend

Date: November 30, 2025
Status: BLOCKED - GCR Permission Issue
Progress: 90% Complete


✅ Successfully Completed

  1. ✅ Docker image built (Python 3.12.12)

    • Image: coditect-cloud-backend:test-v1.0.0
    • Size: 737MB disk, 136MB content
  2. ✅ Image pushed to GCR

    • Repository: gcr.io/coditect-cloud-infra/coditect-cloud-backend
    • Tag: v1.0.0-staging
    • Digest: sha256:ebca8fb332ffcbcbb6125f6a2b121d5ece38ac47ea97a90872f0c9cbaa3baa69
  3. ✅ Kubernetes manifests created and configured

    • Namespace: coditect-staging (created)
    • Service Account: coditect-cloud-backend (created)
    • Secrets: backend-secrets (created with random values)
    • Services: LoadBalancer + ClusterIP (created)
  4. ✅ GKE cluster verified

    • Cluster: coditect-cluster in us-central1
    • Nodes: 3x n1-standard-2 (RUNNING)
    • Credentials: configured locally

❌ Current Blocker: GCR Image Pull Permissions

Issue Description

Kubernetes pods cannot pull the Docker image from Google Container Registry due to authentication failures.

Error Messages:

403 Forbidden: failed to authorize: failed to fetch oauth token
401 Unauthorized: failed to authorize: failed to fetch oauth token
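
The errors above come from pod events. A minimal sketch for listing them, assuming kubectl is pointed at the staging cluster; the helper name pull_failures is hypothetical and not part of the deployment tooling:

```shell
# Hypothetical helper: print pod name and message for failed image-pull events.
# The kubelet records failed pulls as events with reason "Failed"; that reason
# also covers other container failures, so the output is narrowed to lines
# mentioning "pull".
pull_failures() (
  ns="${1:-coditect-staging}"
  kubectl get events -n "$ns" \
    --field-selector reason=Failed \
    -o jsonpath='{range .items[*]}{.involvedObject.name}{"\t"}{.message}{"\n"}{end}' \
    | grep -i 'pull'
)

# Usage:
#   pull_failures coditect-staging
```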

Attempts Made

  1. Granted Storage Object Viewer role to Compute Engine service account

    gcloud projects add-iam-policy-binding coditect-cloud-infra \
    --member="serviceAccount:374018874256-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

    Result: ❌ Failed - Still getting 403

  2. Created imagePullSecret with user credentials

    kubectl create secret docker-registry gcr-json-key \
    --docker-server=gcr.io \
    --docker-username=_json_key \
    --docker-password="$(gcloud auth print-access-token)" \
    -n coditect-staging

    Result: ❌ Failed - Getting 401/403. The _json_key username expects a service account JSON key as the password, not an access token.
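
A token-based variant of attempt 2 could look like the sketch below. It pairs the access token with the oauth2accesstoken username that gcr.io accepts for tokens. The helper name and the secret name gcr-access-token are hypothetical, and because access tokens are short-lived (roughly an hour) this is only useful as a quick diagnostic, not as a lasting fix:

```shell
# Hypothetical helper: create an image pull secret from a short-lived access token.
# Arguments: secret name, namespace.
create_token_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server=gcr.io \
    --docker-username=oauth2accesstoken \
    --docker-password="$(gcloud auth print-access-token)" \
    -n "$2"
}

# Usage:
#   create_token_pull_secret gcr-access-token coditect-staging
```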

Root Cause Analysis

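The root cause is not yet confirmed. Both errors occur while fetching an OAuth token, which points to Container Registry API enablement, bucket-level permissions, or the credentials available on the nodes. Possible causes: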

  1. Container Registry API not fully initialized for coditect-cloud-infra project
  2. GCS bucket permissions not properly configured for GCR storage
  3. Service account scopes on GKE nodes may not include storage-ro or cloud-platform
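
Cause 3 can be checked mechanically. The sketch below tests a scope list for either of the two scopes named above; storage-ro is the gcloud alias for the devstorage.read_only scope URL. The helper name has_pull_scope is hypothetical, and the match is a plain substring test:

```shell
# Hypothetical helper: succeed if the given scope list contains a scope that
# permits reading images (cloud-platform, or devstorage.read_only a.k.a. storage-ro).
has_pull_scope() {
  case "$1" in
    *auth/cloud-platform*|*auth/devstorage.read_only*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against the cluster in this report (requires gcloud credentials):
#   scopes="$(gcloud container clusters describe coditect-cluster \
#     --region=us-central1 --format='value(nodeConfig.oauthScopes)')"
#   has_pull_scope "$scopes" && echo "scopes OK" || echo "pull scope missing"
```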

Option 1: Enable Container Registry API

# Enable Container Registry API
gcloud services enable containerregistry.googleapis.com --project=coditect-cloud-infra

# Verify API is enabled
gcloud services list --enabled --project=coditect-cloud-infra | grep containerregistry

# Wait 1-2 minutes for API propagation

# Delete existing pods to force retry
kubectl delete pods -n coditect-staging -l app=coditect-backend
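
Instead of waiting a fixed 1-2 minutes, the enablement check can be polled. A minimal sketch, assuming the gcloud filter field config.name identifies the service; the helper name wait_for_api and its default attempt count and delay are hypothetical:

```shell
# Hypothetical helper: poll until a service is listed as enabled, or give up.
# Arguments: service name, project, max attempts (default 12), delay in seconds (default 10).
wait_for_api() (
  service="$1"; project="$2"; attempts="${3:-12}"; delay="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    enabled="$(gcloud services list --enabled --project="$project" \
      --filter="config.name=$service" --format='value(config.name)')"
    [ "$enabled" = "$service" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)

# Usage:
#   wait_for_api containerregistry.googleapis.com coditect-cloud-infra \
#     && kubectl delete pods -n coditect-staging -l app=coditect-backend
```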

Option 2: Use Artifact Registry Instead of GCR

Artifact Registry is the successor to Container Registry, which Google has deprecated, and is the recommended choice:

# Enable Artifact Registry API
gcloud services enable artifactregistry.googleapis.com --project=coditect-cloud-infra

# Create Artifact Registry repository
gcloud artifacts repositories create coditect-backend \
--repository-format=docker \
--location=us-central1 \
--project=coditect-cloud-infra

# Retag and push image to Artifact Registry
docker tag coditect-cloud-backend:test-v1.0.0 \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

docker push us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

# Update deployment manifest
# Change image from:
# gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging
# To:
# us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
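
The Artifact Registry path above follows the pattern LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG. A small sketch that composes it, so the tag, push, and manifest steps use one value; the helper name ar_image_ref is hypothetical:

```shell
# Hypothetical helper: compose an Artifact Registry image reference.
# Arguments: location, project, repository, image, tag.
ar_image_ref() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage:
#   IMAGE="$(ar_image_ref us-central1 coditect-cloud-infra coditect-backend \
#     coditect-cloud-backend v1.0.0-staging)"
#
# Before pushing, Docker needs a credential helper for the new host:
#   gcloud auth configure-docker us-central1-docker.pkg.dev
# The node service account also needs read access to the repository,
# for example roles/artifactregistry.reader.
```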

Option 3: Replace the Node Pool with Correct Scopes

# Check current node scopes
gcloud container clusters describe coditect-cluster \
--region=us-central1 \
--format="value(nodeConfig.oauthScopes)"

# If scopes don't include cloud-platform or storage-ro:
# Create new node pool with correct scopes
gcloud container node-pools create coditect-pool-v2 \
--cluster=coditect-cluster \
--region=us-central1 \
--num-nodes=3 \
--machine-type=n1-standard-2 \
--scopes="https://www.googleapis.com/auth/cloud-platform"

# Cordon and drain old nodes
kubectl cordon -l cloud.google.com/gke-nodepool=default-pool
kubectl drain -l cloud.google.com/gke-nodepool=default-pool --force --ignore-daemonsets

# Delete old node pool
gcloud container node-pools delete default-pool \
--cluster=coditect-cluster \
--region=us-central1

Option 4: Use Service Account Key (Quick Fix for Testing)

# Create service account key
gcloud iam service-accounts keys create ~/gcr-key.json \
--iam-account=374018874256-compute@developer.gserviceaccount.com

# Create Kubernetes secret from key
kubectl create secret docker-registry gcr-sa-key \
--docker-server=gcr.io \
--docker-username=_json_key \
--docker-password="$(cat ~/gcr-key.json)" \
--docker-email=1@az1.ai \
-n coditect-staging

# Update deployment to use gcr-sa-key instead of gcr-json-key
kubectl patch deployment coditect-backend -n coditect-staging \
-p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"gcr-sa-key"}]}}}}'

# Clean up key file
rm ~/gcr-key.json
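
After the patch, it is worth confirming that the deployment actually references the new secret. A minimal sketch; the helper name uses_pull_secret is hypothetical:

```shell
# Hypothetical helper: succeed if a deployment's pod template lists the given
# image pull secret. Arguments: deployment, namespace, secret name.
uses_pull_secret() (
  names="$(kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.imagePullSecrets[*].name}')"
  # jsonpath prints the names separated by spaces; pad so whole names match.
  case " $names " in
    *" $3 "*) return 0 ;;
    *) return 1 ;;
  esac
)

# Usage:
#   uses_pull_secret coditect-backend coditect-staging gcr-sa-key && echo "patched"
```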

📊 Current Deployment State

Namespace: coditect-staging

  • Status: Created and active
  • Service Account: coditect-cloud-backend (created)
  • Secrets: backend-secrets (created with random Django key and DB password)

Deployment: coditect-backend

  • Status: Deployment created, but pods are failing to start
  • Replicas: 0/2 ready
  • Reason: ImagePullBackOff

Pods

NAME                                READY   STATUS             RESTARTS   AGE
coditect-backend-5c7689fcbc-mmj7x   0/1     ImagePullBackOff   0          5m
coditect-backend-69b6c8c6d6-92cmb   0/1     ImagePullBackOff   0          30m
coditect-backend-69b6c8c6d6-zr4md   0/1     ImagePullBackOff   0          30m
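
To track this state without reading the table by eye, the pod listing can be filtered by its STATUS column. A minimal sketch; the helper name pods_with_status is hypothetical:

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the names of pods whose STATUS column (third field) equals the argument.
pods_with_status() {
  awk -v want="$1" '$3 == want { print $1 }'
}

# Usage:
#   kubectl get pods -n coditect-staging --no-headers | pods_with_status ImagePullBackOff
```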

Services

NAME                          TYPE           EXTERNAL-IP   PORT(S)
coditect-backend              LoadBalancer   Pending       80:xxxxx/TCP, 443:xxxxx/TCP
coditect-backend-internal     ClusterIP      10.x.x.x      8000/TCP

⏭️ Immediate Next Steps

  1. Run Option 1 (Enable Container Registry API) - Fastest fix
  2. Wait 2 minutes for API propagation
  3. Delete pods to force image pull retry
  4. Verify pods start successfully
  5. Get LoadBalancer IP and run smoke tests
  6. Update todos and mark deployment complete
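
Steps 3-5 can be sketched as one helper, assuming the deployment, label, and service names listed in this report. The helper name retry_and_get_ip and the 180-second rollout timeout are hypothetical choices. The printed IP is empty while the LoadBalancer is still pending:

```shell
# Hypothetical helper: restart the pods, wait for the rollout, print the external IP.
# Progress output goes to stderr so stdout carries only the IP.
retry_and_get_ip() (
  ns="${1:-coditect-staging}"
  kubectl delete pods -n "$ns" -l app=coditect-backend >&2 || return 1
  kubectl rollout status deployment/coditect-backend -n "$ns" --timeout=180s >&2 || return 1
  kubectl get service coditect-backend -n "$ns" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
)

# Usage:
#   ip="$(retry_and_get_ip coditect-staging)" && echo "LoadBalancer IP: ${ip:-<pending>}"
```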

Expected Timeline

  • Container Registry API enablement: 2-3 minutes
  • Pod startup: 1-2 minutes
  • LoadBalancer IP assignment: 1-2 minutes
  • Total: ~5-7 minutes to working deployment

📝 Lessons Learned

  1. Always enable Container Registry API before first docker push
  2. Verify GKE node scopes include cloud-platform or storage-ro
  3. Test image pull locally before deploying to Kubernetes
  4. Use Artifact Registry for new projects (GCR is legacy)

🆘 If All Else Fails

Use Artifact Registry (modern, recommended):

# Full migration script
gcloud services enable artifactregistry.googleapis.com --project=coditect-cloud-infra
gcloud artifacts repositories create coditect-backend \
--repository-format=docker \
--location=us-central1 \
--project=coditect-cloud-infra

docker tag coditect-cloud-backend:test-v1.0.0 \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

docker push us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

# Update deployment
kubectl set image deployment/coditect-backend \
backend=us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
-n coditect-staging

Status: Ready for resolution via Option 1 (recommended)
Estimated Time to Resolution: 5-7 minutes
Last Updated: November 30, 2025