
ADR-014: Cloud SQL IAM Authentication

| Field | Value |
|---|---|
| Status | Proposed |
| Date | 2025-12-29 |
| Decision Makers | Hal Casteel (CEO/CTO), Security Team |
| Supersedes | N/A |
| Related ADRs | ADR-011 (Authentication Strategy), ADR-012 (Data Isolation) |

Executive Summary

We will use GCP Workload Identity + Cloud SQL IAM Authentication as the database authentication mechanism for the CODITECT Document Management System, providing:

  • Zero Passwords: No database credentials to store, rotate, or leak
  • Automatic Rotation: GCP manages all credential lifecycle
  • IAM Integration: Fine-grained access control via GCP IAM
  • Audit Trail: Complete authentication logging via Cloud Audit Logs
  • Compliance: SOC 2, ISO 27001, HIPAA, FedRAMP compliant

Decision: Workload Identity + Cloud SQL IAM Auth over password-based authentication, Cloud SQL Proxy with service account keys, or application-level password management.


Context and Problem Statement

CODITECT DMS requires secure database authentication that:

  1. Eliminates credential exposure - No passwords in environment variables, secrets, or configuration files
  2. Provides automatic rotation - No manual credential rotation required
  3. Integrates with GCP IAM - Leverage existing identity infrastructure
  4. Supports audit compliance - Full authentication audit trail
  5. Works with async Python - Compatible with asyncpg for SQLAlchemy async operations

Current Problem

Password-based authentication presents several security challenges:

  1. URL Encoding Issues: Special characters in passwords (e.g., [, (, $, <) cause URL parsing failures
  2. Credential Storage: Passwords stored in Kubernetes Secrets (base64, not encrypted by default)
  3. Rotation Complexity: Manual password rotation across multiple environments
  4. Exposure Risk: Environment variables visible in pod descriptions and logs
  5. Compliance Burden: Additional documentation required for password management
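The URL-encoding issue in point 1 is easy to reproduce: special characters must be percent-encoded before they are embedded in a DSN, and a rotated password that introduces a new special character silently breaks an unencoded connection string. A minimal illustration in Python (the password and host shown are made up):

```python
from urllib.parse import quote_plus

# A hypothetical generated password containing DSN-hostile characters
raw_password = "p@ss[w0rd$<1>"

# Embedding it raw corrupts the URL; it must be percent-encoded first
encoded = quote_plus(raw_password)
print(encoded)  # -> p%40ss%5Bw0rd%24%3C1%3E

dsn = f"postgresql+asyncpg://dms_user:{encoded}@10.0.0.5:5432/coditect_dms"
```

IAM authentication sidesteps this entire class of failure because no password ever appears in a connection string.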

Requirements

Functional:

  • Authenticate GKE workloads to Cloud SQL without passwords
  • Support PostgreSQL with asyncpg driver
  • Enable per-service-account access control
  • Provide connection pooling compatibility

Non-Functional:

  • Zero credential exposure
  • < 100ms authentication latency
  • 99.99% availability (GCP SLA)
  • SOC 2 Type II compliant
  • Automatic credential rotation

Security:

  • No secrets in environment variables
  • No secrets in Kubernetes manifests
  • Full audit logging
  • Principle of least privilege

Decision Drivers

  1. Security First: Eliminate all stored credentials
  2. GCP Native: Leverage platform identity infrastructure
  3. Operational Simplicity: Zero credential management overhead
  4. Compliance: Meet enterprise security requirements
  5. Developer Experience: Transparent authentication for applications

Considered Options

Option 1: Workload Identity + Cloud SQL IAM Authentication (CHOSEN)

Architecture:

GKE Pod → Workload Identity → GCP Service Account → IAM Auth → Cloud SQL

Pros:

  • Zero Passwords: No credentials to store, rotate, or leak
  • GCP Native: Direct integration with IAM, Cloud SQL, GKE
  • Automatic Rotation: GCP manages token lifecycle (1-hour tokens)
  • Fine-Grained Access: Per-service-account database permissions
  • Audit Trail: Cloud Audit Logs capture all authentication events
  • Compliance: Inherits GCP certifications (SOC 2, ISO 27001, HIPAA, FedRAMP)
  • Transparent: Application code unchanged (uses standard connection string)

Cons:

  • GKE Requirement: Only works with GKE Workload Identity
  • Setup Complexity: Initial configuration more complex than passwords
  • asyncpg Compatibility: Requires Cloud SQL Python Connector
  • Latency: Initial connection ~50-100ms (token acquisition)

Cost:

  • No additional cost (included with Cloud SQL and GKE)

Option 2: Cloud SQL Proxy with Service Account Key

Architecture:

GKE Pod → Cloud SQL Proxy Sidecar → Unix Socket → Cloud SQL

Pros:

  • Established Pattern: Well-documented, widely used
  • No Password in Connection String: Uses Unix socket
  • Works with any asyncpg setup: Standard PostgreSQL connection

Cons:

  • Service Account Key: JSON key file is a credential that can leak
  • Key Rotation: Manual key rotation required
  • Sidecar Overhead: Additional container per pod
  • Resource Usage: Proxy uses CPU/memory

Cost:

  • No additional cost, but operational overhead for key management

Option 3: Password-Based Authentication (Current)

Architecture:

GKE Pod → Environment Variable → Direct TCP Connection → Cloud SQL

Pros:

  • Simple Setup: Standard PostgreSQL connection string
  • Driver Agnostic: Works with any PostgreSQL driver
  • No GCP Dependencies: Portable to other clouds

Cons:

  • Password Storage: Credentials in Kubernetes Secrets
  • URL Encoding Issues: Special characters cause parsing failures
  • Rotation Burden: Manual password rotation
  • Exposure Risk: Environment variables visible in pod descriptions
  • Compliance Overhead: Additional documentation and controls required

Cost:

  • Low direct cost, high operational/compliance cost

Option 4: HashiCorp Vault Dynamic Secrets

Architecture:

GKE Pod → Vault Agent → Dynamic PostgreSQL Credentials → Cloud SQL

Pros:

  • Dynamic Credentials: Short-lived credentials generated on demand
  • Automatic Rotation: Vault handles credential lifecycle
  • Multi-Cloud: Works with any cloud provider
  • Centralized: Single secrets management platform

Cons:

  • Additional Infrastructure: Vault cluster required
  • Complexity: Significant operational overhead
  • Cost: Vault Enterprise licensing is a significant recurring expense (see cost estimate below)
  • Latency: Additional network hop for credential retrieval

Cost:

  • Vault Enterprise: ~$30,000/year
  • Infrastructure: ~$500/month (HA cluster)

Decision Outcome

Chosen Option: Workload Identity + Cloud SQL IAM Authentication

Rationale:

  1. Maximum Security: Eliminates all stored credentials - nothing to leak
  2. Zero Operational Overhead: GCP manages entire credential lifecycle
  3. Native Integration: Seamless with existing GKE and Cloud SQL infrastructure
  4. Compliance Ready: Inherits GCP's enterprise certifications
  5. Cost Effective: No additional licensing or infrastructure costs
  6. Future Proof: Google's recommended authentication pattern

Trade-offs Accepted:

  • Vendor lock-in to GCP (acceptable given existing GCP commitment)
  • Initial setup complexity (one-time cost)
  • Requires Cloud SQL Python Connector (minimal code change)

Architecture

Authentication Flow

Component Architecture

Security Model


Implementation Details

Step 1: Enable Required APIs

# Enable required GCP APIs
gcloud services enable \
  sqladmin.googleapis.com \
  iam.googleapis.com \
  iamcredentials.googleapis.com \
  --project=coditect-cloud-infra

Step 2: Configure Workload Identity

# Variables
PROJECT_ID="coditect-cloud-infra"
CLUSTER_NAME="coditect-dms-dev-cluster"
REGION="us-central1"
K8S_NAMESPACE="coditect-dms"
K8S_SA="coditect-dms-api"
GCP_SA="coditect-dms-api-sa"

# Verify Workload Identity is enabled on cluster
# Verify Workload Identity is enabled on cluster
gcloud container clusters describe $CLUSTER_NAME \
  --region=$REGION \
  --project=$PROJECT_ID \
  --format="value(workloadIdentityConfig.workloadPool)"

# Create GCP Service Account (if not exists)
gcloud iam service-accounts create $GCP_SA \
  --display-name="CODITECT DMS API Service Account" \
  --project=$PROJECT_ID

# Grant Cloud SQL Client role
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Allow K8s SA to impersonate GCP SA (Workload Identity binding)
gcloud iam service-accounts add-iam-policy-binding \
  ${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${K8S_NAMESPACE}/${K8S_SA}]" \
  --project=$PROJECT_ID

# Annotate Kubernetes Service Account
kubectl annotate serviceaccount $K8S_SA \
  --namespace=$K8S_NAMESPACE \
  iam.gke.io/gcp-service-account=${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com

Step 3: Enable IAM Authentication on Cloud SQL

# Enable IAM authentication on Cloud SQL instance
gcloud sql instances patch coditect-dms-dev-db-0c7d1bf6 \
  --database-flags=cloudsql.iam_authentication=on \
  --project=$PROJECT_ID

# Create IAM database user
gcloud sql users create ${GCP_SA}@${PROJECT_ID}.iam \
  --instance=coditect-dms-dev-db-0c7d1bf6 \
  --type=CLOUD_IAM_SERVICE_ACCOUNT \
  --project=$PROJECT_ID
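Note the username format in the command above: for a service account, Cloud SQL derives the IAM database username by stripping the `.gserviceaccount.com` suffix from the service-account email. A small helper (hypothetical, for illustration only) makes the mapping explicit:

```python
def iam_db_username(sa_email: str) -> str:
    """Map a GCP service-account email to its Cloud SQL IAM database
    username: Cloud SQL drops the '.gserviceaccount.com' suffix."""
    suffix = ".gserviceaccount.com"
    if not sa_email.endswith(suffix):
        raise ValueError(f"not a service-account email: {sa_email}")
    return sa_email[: -len(suffix)]


print(iam_db_username(
    "coditect-dms-api-sa@coditect-cloud-infra.iam.gserviceaccount.com"
))
# -> coditect-dms-api-sa@coditect-cloud-infra.iam
```

This is the same username that appears later in the `GRANT` statement and the `IAM_DATABASE_USER` environment variable, so keeping the mapping in one place avoids drift.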

Step 4: Grant PostgreSQL Permissions

-- Connect to Cloud SQL instance as postgres admin
-- Grant permissions to IAM user

-- Create application role (if not exists)
CREATE ROLE dms_app;
GRANT CONNECT ON DATABASE coditect_dms TO dms_app;
GRANT USAGE ON SCHEMA public TO dms_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO dms_app;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO dms_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO dms_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT ON SEQUENCES TO dms_app;

-- Grant role to IAM user
GRANT dms_app TO "coditect-dms-api-sa@coditect-cloud-infra.iam";

Step 5: Update Application Code

# database.py - Updated for IAM Authentication
import os

from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

# Configuration
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "coditect-cloud-infra")
REGION = os.environ.get("CLOUD_SQL_REGION", "us-central1")
INSTANCE_NAME = os.environ.get("CLOUD_SQL_INSTANCE", "coditect-dms-dev-db-0c7d1bf6")
DATABASE_NAME = os.environ.get("DATABASE_NAME", "coditect_dms")
IAM_USER = os.environ.get("IAM_DATABASE_USER", "coditect-dms-api-sa@coditect-cloud-infra.iam")

# Instance connection name for Cloud SQL with IAM auth
INSTANCE_CONNECTION_NAME = f"{PROJECT_ID}:{REGION}:{INSTANCE_NAME}"

# Connector and engine are module-level singletons so that one connection
# pool is shared across requests instead of being rebuilt on every call
_connector = None
_engine = None


async def get_iam_connection():
    """Return an async connection factory using IAM authentication.

    Uses the Google Cloud SQL Python Connector for automatic IAM token
    management. No passwords required - authentication via Workload Identity.
    """
    global _connector
    if _connector is None:
        _connector = Connector()

    async def getconn():
        return await _connector.connect_async(
            INSTANCE_CONNECTION_NAME,
            "asyncpg",
            user=IAM_USER,
            db=DATABASE_NAME,
            enable_iam_auth=True,
            ip_type=IPTypes.PRIVATE,  # Use private IP within VPC
        )

    return getconn


async def create_engine_with_iam():
    """Create (or reuse) the SQLAlchemy async engine with IAM authentication."""
    global _engine
    if _engine is None:
        getconn = await get_iam_connection()
        _engine = create_async_engine(
            "postgresql+asyncpg://",
            async_creator=getconn,
            pool_size=5,
            max_overflow=10,
            pool_timeout=30,
            pool_recycle=1800,  # Recycle connections every 30 minutes
            pool_pre_ping=True,  # Verify connections before use
        )
    return _engine


# Session factory
async def get_session_factory():
    """Get async session factory for database operations."""
    engine = await create_engine_with_iam()
    return sessionmaker(
        engine,
        class_=AsyncSession,
        expire_on_commit=False,
    )


# FastAPI dependency
async def get_db():
    """FastAPI dependency for database sessions."""
    session_factory = await get_session_factory()
    async with session_factory() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise

Step 6: Update Kubernetes Deployment

# deployment.yaml - Updated for Workload Identity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-dms-api
  namespace: coditect-dms
spec:
  template:
    spec:
      serviceAccountName: coditect-dms-api  # K8s SA with Workload Identity
      containers:
        - name: api
          image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-dms/api:dev-v3
          env:
            # No database password required!
            - name: GCP_PROJECT_ID
              value: "coditect-cloud-infra"
            - name: CLOUD_SQL_REGION
              value: "us-central1"
            - name: CLOUD_SQL_INSTANCE
              value: "coditect-dms-dev-db-0c7d1bf6"
            - name: DATABASE_NAME
              value: "coditect_dms"
            - name: IAM_DATABASE_USER
              value: "coditect-dms-api-sa@coditect-cloud-infra.iam"
            # Other non-sensitive configuration
            - name: API_ENVIRONMENT
              value: "development"
            - name: API_HOST
              value: "0.0.0.0"
            - name: API_PORT
              value: "8000"

Step 7: Update requirements.txt

# requirements.txt - Add Cloud SQL Connector
cloud-sql-python-connector[asyncpg]>=1.0.0
google-auth>=2.0.0

Security Benefits

Credential Exposure Eliminated

| Attack Vector | Password Auth | IAM Auth |
|---|---|---|
| Secret in K8s manifest | Exposed | N/A |
| Env variable in pod | Exposed | N/A |
| Pod description leak | Exposed | N/A |
| Log file exposure | Possible | N/A |
| Memory dump | Possible | Token only (1hr) |
| Network sniffing | Possible (if not TLS) | Token only (1hr) |

Compliance Improvements

| Requirement | Password Auth | IAM Auth |
|---|---|---|
| SOC 2 Credential Management | Manual controls | Automated |
| ISO 27001 Access Control | Manual rotation | Automatic |
| HIPAA Access Logging | Partial | Complete |
| PCI DSS Key Rotation | 90-day manual | Continuous |

Audit Trail

IAM authentication provides complete audit logging via Cloud Audit Logs:

{
  "logName": "projects/coditect-cloud-infra/logs/cloudaudit.googleapis.com%2Fdata_access",
  "resource": {
    "type": "cloudsql_database",
    "labels": {
      "database_id": "coditect-cloud-infra:coditect-dms-dev-db",
      "region": "us-central1"
    }
  },
  "protoPayload": {
    "methodName": "cloudsql.instances.connect",
    "authenticationInfo": {
      "principalEmail": "coditect-dms-api-sa@coditect-cloud-infra.iam.gserviceaccount.com"
    },
    "requestMetadata": {
      "callerIp": "10.1.6.26",
      "callerSuppliedUserAgent": "cloud-sql-python-connector/1.0.0"
    }
  }
}
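When reviewing these logs programmatically (for example in a scheduled compliance check), the fields of interest are methodName and principalEmail. A sketch, assuming entries have been exported as JSON lines; the function name is ours, not part of any GCP library:

```python
import json


def connect_principals(log_lines):
    """Yield the principal email of every Cloud SQL connect event
    found in an iterable of JSON-encoded audit log entries."""
    for line in log_lines:
        entry = json.loads(line)
        payload = entry.get("protoPayload", {})
        if payload.get("methodName") == "cloudsql.instances.connect":
            yield payload.get("authenticationInfo", {}).get("principalEmail")


# Minimal smoke test against a stripped-down sample entry
sample = json.dumps({
    "protoPayload": {
        "methodName": "cloudsql.instances.connect",
        "authenticationInfo": {
            "principalEmail": "svc@example.iam.gserviceaccount.com"
        },
    }
})
print(list(connect_principals([sample])))
# -> ['svc@example.iam.gserviceaccount.com']
```

In production the same filter would typically be expressed as a Cloud Logging query rather than client-side Python, but the shape of the payload is identical.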

Rollback Plan

If IAM authentication encounters issues, rollback to password-based authentication:

  1. Revert database.py to use direct connection string
  2. Recreate Kubernetes secret with URL-encoded password
  3. Update deployment to use password from secret
  4. Restart pods

# Emergency rollback script
kubectl --context=gke_coditect-cloud-infra_us-central1_coditect-dms-dev-cluster \
  create secret generic coditect-dms-secrets \
  --namespace=coditect-dms \
  --from-literal=database-url="postgresql+asyncpg://..." \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment -n coditect-dms

Monitoring and Alerts

Key Metrics

# Prometheus alerts for IAM authentication
groups:
  - name: cloudsql-iam-auth
    rules:
      - alert: CloudSQLIAMAuthFailure
        expr: rate(cloudsql_database_auth_failures_total{auth_type="iam"}[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloud SQL IAM authentication failures"
          description: "{{ $value }} IAM auth failures in last 5 minutes"

      - alert: WorkloadIdentityTokenExpiry
        expr: workload_identity_token_expiry_seconds < 300
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Workload Identity token expiring soon"
          description: "Token expires in {{ $value }} seconds"

Dashboard

Create Cloud Monitoring dashboard for:

  • IAM authentication success/failure rate
  • Token refresh frequency
  • Connection pool utilization
  • Query latency by service account

Migration Checklist

Phase 1: Infrastructure Setup (Day 1)

  • Verify Workload Identity enabled on GKE cluster
  • Create GCP service account for DMS API
  • Grant Cloud SQL Client role
  • Configure Workload Identity binding
  • Enable IAM authentication on Cloud SQL instance
  • Create IAM database user

Phase 2: Application Update (Day 2)

  • Add cloud-sql-python-connector to requirements.txt
  • Update database.py with IAM authentication code
  • Remove password from environment variables
  • Update Kubernetes deployment manifest
  • Build and push new Docker image

Phase 3: Deployment (Day 3)

  • Deploy updated application
  • Verify database connectivity
  • Run health checks
  • Monitor for authentication errors
  • Update documentation

Phase 4: Cleanup (Day 4)

  • Remove password from GCP Secret Manager
  • Delete the database-url key from the coditect-dms-secrets Kubernetes Secret
  • Update security documentation
  • Archive old credentials securely

Open Questions

  • Connection Pooling: Should we use Cloud SQL connection pooling in addition to SQLAlchemy pooling? (Performance testing: Week 1)
  • Multi-Region: How does IAM auth work with Cloud SQL multi-region replicas? (Architecture review: Week 2)
  • Local Development: Best approach for local development without GKE? (Developer experience: Week 1)
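For the local-development question, one candidate approach (not yet validated) is to keep using the Cloud SQL Python Connector, which authenticates with Application Default Credentials outside GKE (after `gcloud auth application-default login`), and switch only the IP type by environment. The helper below is hypothetical; the `CLOUD_SQL_IP_TYPE` variable name and the detection convention are our own assumptions:

```python
import os


def resolve_ip_type(environ=os.environ) -> str:
    """Return which Cloud SQL IP type to request: 'PRIVATE' inside the
    cluster (pods always have KUBERNETES_SERVICE_HOST set), otherwise
    'PUBLIC' for local development over the instance's public IP.
    The returned string corresponds to IPTypes.PRIVATE / IPTypes.PUBLIC
    on the connector."""
    if "KUBERNETES_SERVICE_HOST" in environ:
        return "PRIVATE"
    return environ.get("CLOUD_SQL_IP_TYPE", "PUBLIC")


print(resolve_ip_type({}))  # -> PUBLIC (local development default)
```

Whether this is acceptable depends on the instance's network configuration (a public IP may be disabled), which is part of what the Week 1 review should confirm.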

References

  1. Cloud SQL IAM Database Authentication
  2. Workload Identity for GKE
  3. Cloud SQL Python Connector
  4. GKE Workload Identity Best Practices
  5. Cloud SQL Security Best Practices

Decision: Proposed for approval
Next Steps: Implement Phase 1 infrastructure setup
Review Date: January 5, 2026