ADR-014: Cloud SQL IAM Authentication
| Field | Value |
|---|---|
| Status | Proposed |
| Date | 2025-12-29 |
| Decision Makers | Hal Casteel (CEO/CTO), Security Team |
| Supersedes | N/A |
| Related ADRs | ADR-011 (Authentication Strategy), ADR-012 (Data Isolation) |
Executive Summary
We will use GCP Workload Identity + Cloud SQL IAM Authentication as the database authentication mechanism for CODITECT Document Management System, providing:
- Zero Passwords: No database credentials to store, rotate, or leak
- Automatic Rotation: GCP manages all credential lifecycle
- IAM Integration: Fine-grained access control via GCP IAM
- Audit Trail: Complete authentication logging via Cloud Audit Logs
- Compliance: Inherits GCP's SOC 2, ISO 27001, HIPAA, and FedRAMP certifications
Decision: Workload Identity + Cloud SQL IAM Auth over password-based authentication, Cloud SQL Proxy with service account keys, or application-level password management.
Context and Problem Statement
CODITECT DMS requires secure database authentication that:
- Eliminates credential exposure - No passwords in environment variables, secrets, or configuration files
- Provides automatic rotation - No manual credential rotation required
- Integrates with GCP IAM - Leverage existing identity infrastructure
- Supports audit compliance - Full authentication audit trail
- Works with async Python - Compatible with asyncpg for SQLAlchemy async operations
Current Problem
Password-based authentication presents several security challenges:
- URL Encoding Issues: Special characters in passwords (e.g., `[`, `(`, `$`, `<`) cause URL parsing failures
- Credential Storage: Passwords stored in Kubernetes Secrets (base64-encoded, not encrypted by default)
- Rotation Complexity: Manual password rotation across multiple environments
- Exposure Risk: Environment variables visible in pod descriptions and logs
- Compliance Burden: Additional documentation required for password management
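The URL-encoding failure mode is easy to reproduce with Python's standard library. A minimal sketch (the password value is made up for illustration): an unencoded `[` in the userinfo portion makes `urlsplit` treat the netloc as a malformed IPv6 literal, while `quote_plus` produces a DSN that parses cleanly.

```python
from urllib.parse import quote_plus, urlsplit

raw_password = "p@ss[w<rd"  # hypothetical password with special characters

# Unencoded, parsing fails outright: a bare "[" in the netloc is read as
# the start of an IPv6 address literal.
try:
    urlsplit(f"postgresql://dms:{raw_password}@db.internal:5432/coditect_dms")
except ValueError as exc:
    print(f"parse failure: {exc}")

# Percent-encoding the password yields a DSN that parses correctly.
encoded = quote_plus(raw_password)
print(encoded)  # p%40ss%5Bw%3Crd
url = f"postgresql://dms:{encoded}@db.internal:5432/coditect_dms"
print(urlsplit(url).hostname)  # db.internal
```

Encoding works around the parsing failure, but the password still has to be stored, rotated, and kept out of logs, which is the deeper problem this ADR addresses.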
Requirements
Functional:
- Authenticate GKE workloads to Cloud SQL without passwords
- Support PostgreSQL with asyncpg driver
- Enable per-service-account access control
- Provide connection pooling compatibility
Non-Functional:
- Zero credential exposure
- < 100ms authentication latency
- 99.99% availability (GCP SLA)
- SOC 2 Type II compliant
- Automatic credential rotation
Security:
- No secrets in environment variables
- No secrets in Kubernetes manifests
- Full audit logging
- Principle of least privilege
Decision Drivers
- Security First: Eliminate all stored credentials
- GCP Native: Leverage platform identity infrastructure
- Operational Simplicity: Zero credential management overhead
- Compliance: Meet enterprise security requirements
- Developer Experience: Transparent authentication for applications
Considered Options
Option 1: Workload Identity + Cloud SQL IAM Authentication (CHOSEN)
Architecture:
GKE Pod → Workload Identity → GCP Service Account → IAM Auth → Cloud SQL
Pros:
- Zero Passwords: No credentials to store, rotate, or leak
- GCP Native: Direct integration with IAM, Cloud SQL, GKE
- Automatic Rotation: GCP manages token lifecycle (1-hour tokens)
- Fine-Grained Access: Per-service-account database permissions
- Audit Trail: Cloud Audit Logs capture all authentication events
- Compliance: Inherits GCP certifications (SOC 2, ISO 27001, HIPAA, FedRAMP)
- Transparent: Application code unchanged (uses standard connection string)
Cons:
- GKE Requirement: Only works with GKE Workload Identity
- Setup Complexity: Initial configuration more complex than passwords
- asyncpg Compatibility: Requires Cloud SQL Python Connector
- Latency: Initial connection ~50-100ms (token acquisition)
Cost:
- No additional cost (included with Cloud SQL and GKE)
Option 2: Cloud SQL Proxy with Service Account Key
Architecture:
GKE Pod → Cloud SQL Proxy Sidecar → Unix Socket → Cloud SQL
Pros:
- Established Pattern: Well-documented, widely used
- No Password in Connection String: Uses Unix socket
- Works with any asyncpg setup: Standard PostgreSQL connection
Cons:
- Service Account Key: JSON key file is a credential that can leak
- Key Rotation: Manual key rotation required
- Sidecar Overhead: Additional container per pod
- Resource Usage: Proxy uses CPU/memory
Cost:
- No additional cost, but operational overhead for key management
Option 3: Password-Based Authentication (Current)
Architecture:
GKE Pod → Environment Variable → Direct TCP Connection → Cloud SQL
Pros:
- Simple Setup: Standard PostgreSQL connection string
- Driver Agnostic: Works with any PostgreSQL driver
- No GCP Dependencies: Portable to other clouds
Cons:
- Password Storage: Credentials in Kubernetes Secrets
- URL Encoding Issues: Special characters cause parsing failures
- Rotation Burden: Manual password rotation
- Exposure Risk: Environment variables visible in pod descriptions
- Compliance Overhead: Additional documentation and controls required
Cost:
- Low direct cost, high operational/compliance cost
Option 4: HashiCorp Vault Dynamic Secrets
Architecture:
GKE Pod → Vault Agent → Dynamic PostgreSQL Credentials → Cloud SQL
Pros:
- Dynamic Credentials: Short-lived credentials generated on demand
- Automatic Rotation: Vault handles credential lifecycle
- Multi-Cloud: Works with any cloud provider
- Centralized: Single secrets management platform
Cons:
- Additional Infrastructure: Vault cluster required
- Complexity: Significant operational overhead
- Cost: Vault Enterprise licensing ($$$)
- Latency: Additional network hop for credential retrieval
Cost:
- Vault Enterprise: ~$30,000/year
- Infrastructure: ~$500/month (HA cluster)
Decision Outcome
Chosen Option: Workload Identity + Cloud SQL IAM Authentication
Rationale:
- Maximum Security: Eliminates all stored credentials - nothing to leak
- Zero Operational Overhead: GCP manages entire credential lifecycle
- Native Integration: Seamless with existing GKE and Cloud SQL infrastructure
- Compliance Ready: Inherits GCP's enterprise certifications
- Cost Effective: No additional licensing or infrastructure costs
- Future Proof: Google's recommended authentication pattern
Trade-offs Accepted:
- Vendor lock-in to GCP (acceptable given existing GCP commitment)
- Initial setup complexity (one-time cost)
- Requires Cloud SQL Python Connector (minimal code change)
Architecture
Authentication Flow
Component Architecture
Security Model
Implementation Details
Step 1: Enable Required APIs
```bash
# Enable required GCP APIs
gcloud services enable \
  sqladmin.googleapis.com \
  iam.googleapis.com \
  iamcredentials.googleapis.com \
  --project=coditect-cloud-infra
```
Step 2: Configure Workload Identity
```bash
# Variables
PROJECT_ID="coditect-cloud-infra"
CLUSTER_NAME="coditect-dms-dev-cluster"
REGION="us-central1"
K8S_NAMESPACE="coditect-dms"
K8S_SA="coditect-dms-api"
GCP_SA="coditect-dms-api-sa"

# Verify Workload Identity is enabled on the cluster
gcloud container clusters describe $CLUSTER_NAME \
  --region=$REGION \
  --project=$PROJECT_ID \
  --format="value(workloadIdentityConfig.workloadPool)"

# Create the GCP service account (if it does not exist)
gcloud iam service-accounts create $GCP_SA \
  --display-name="CODITECT DMS API Service Account" \
  --project=$PROJECT_ID

# Grant the Cloud SQL Client role
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Allow the K8s SA to impersonate the GCP SA (Workload Identity binding)
gcloud iam service-accounts add-iam-policy-binding \
  ${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${K8S_NAMESPACE}/${K8S_SA}]" \
  --project=$PROJECT_ID

# Annotate the Kubernetes service account
kubectl annotate serviceaccount $K8S_SA \
  --namespace=$K8S_NAMESPACE \
  iam.gke.io/gcp-service-account=${GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com
```
Step 3: Enable IAM Authentication on Cloud SQL
```bash
# Enable IAM authentication on the Cloud SQL instance
gcloud sql instances patch coditect-dms-dev-db-0c7d1bf6 \
  --database-flags=cloudsql.iam_authentication=on \
  --project=$PROJECT_ID

# Create the IAM database user
gcloud sql users create ${GCP_SA}@${PROJECT_ID}.iam \
  --instance=coditect-dms-dev-db-0c7d1bf6 \
  --type=CLOUD_IAM_SERVICE_ACCOUNT \
  --project=$PROJECT_ID
```
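Note the username in `gcloud sql users create`: for a service account, the Cloud SQL IAM database username is the service-account email with the `.gserviceaccount.com` suffix removed. A small helper (illustrative only, not part of the codebase) makes the mapping explicit:

```python
def iam_db_username(service_account_email: str) -> str:
    """Map a GCP service-account email to its Cloud SQL IAM database username.

    Cloud SQL drops the ".gserviceaccount.com" suffix for
    CLOUD_IAM_SERVICE_ACCOUNT users.
    """
    suffix = ".gserviceaccount.com"
    if service_account_email.endswith(suffix):
        return service_account_email[: -len(suffix)]
    return service_account_email


print(iam_db_username(
    "coditect-dms-api-sa@coditect-cloud-infra.iam.gserviceaccount.com"
))  # coditect-dms-api-sa@coditect-cloud-infra.iam
```

The same truncated form is what the application passes as `IAM_DATABASE_USER` and what the PostgreSQL `GRANT` below targets.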
Step 4: Grant PostgreSQL Permissions
```sql
-- Connect to the Cloud SQL instance as the postgres admin,
-- then grant permissions to the IAM user.

-- Create the application role (if it does not exist)
CREATE ROLE dms_app;
GRANT CONNECT ON DATABASE coditect_dms TO dms_app;
GRANT USAGE ON SCHEMA public TO dms_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO dms_app;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO dms_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO dms_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT USAGE, SELECT ON SEQUENCES TO dms_app;

-- Grant the role to the IAM user
GRANT dms_app TO "coditect-dms-api-sa@coditect-cloud-infra.iam";
```
Step 5: Update Application Code
```python
# database.py - Updated for IAM Authentication
import os

from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

# Configuration
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "coditect-cloud-infra")
REGION = os.environ.get("CLOUD_SQL_REGION", "us-central1")
INSTANCE_NAME = os.environ.get("CLOUD_SQL_INSTANCE", "coditect-dms-dev-db-0c7d1bf6")
DATABASE_NAME = os.environ.get("DATABASE_NAME", "coditect_dms")
IAM_USER = os.environ.get(
    "IAM_DATABASE_USER", "coditect-dms-api-sa@coditect-cloud-infra.iam"
)

# Instance connection name for Cloud SQL ("project:region:instance")
INSTANCE_CONNECTION_NAME = f"{PROJECT_ID}:{REGION}:{INSTANCE_NAME}"


async def get_iam_connection():
    """Build an asyncpg connection factory using IAM authentication.

    Uses the Google Cloud SQL Python Connector for automatic IAM token
    management. No passwords required - authentication via Workload Identity.
    """
    connector = Connector()

    async def getconn():
        conn = await connector.connect_async(
            INSTANCE_CONNECTION_NAME,
            "asyncpg",
            user=IAM_USER,
            db=DATABASE_NAME,
            enable_iam_auth=True,
            ip_type=IPTypes.PRIVATE,  # Use private IP within the VPC
        )
        return conn

    return getconn


async def create_engine_with_iam():
    """Create a SQLAlchemy async engine with IAM authentication."""
    getconn = await get_iam_connection()
    engine = create_async_engine(
        "postgresql+asyncpg://",
        async_creator=getconn,
        pool_size=5,
        max_overflow=10,
        pool_timeout=30,
        pool_recycle=1800,   # Recycle connections every 30 minutes
        pool_pre_ping=True,  # Verify connections before use
    )
    return engine


async def get_session_factory():
    """Get an async session factory for database operations."""
    engine = await create_engine_with_iam()
    return sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)


async def get_db():
    """FastAPI dependency yielding a database session."""
    session_factory = await get_session_factory()
    async with session_factory() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
```
Step 6: Update Kubernetes Deployment
```yaml
# deployment.yaml - Updated for Workload Identity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-dms-api
  namespace: coditect-dms
spec:
  template:
    spec:
      serviceAccountName: coditect-dms-api  # K8s SA with Workload Identity
      containers:
        - name: api
          image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-dms/api:dev-v3
          env:
            # No database password required!
            - name: GCP_PROJECT_ID
              value: "coditect-cloud-infra"
            - name: CLOUD_SQL_REGION
              value: "us-central1"
            - name: CLOUD_SQL_INSTANCE
              value: "coditect-dms-dev-db-0c7d1bf6"
            - name: DATABASE_NAME
              value: "coditect_dms"
            - name: IAM_DATABASE_USER
              value: "coditect-dms-api-sa@coditect-cloud-infra.iam"
            # Other non-sensitive configuration
            - name: API_ENVIRONMENT
              value: "development"
            - name: API_HOST
              value: "0.0.0.0"
            - name: API_PORT
              value: "8000"
```
Step 7: Update requirements.txt
```text
# requirements.txt - Add Cloud SQL Connector
cloud-sql-python-connector[asyncpg]>=1.0.0
google-auth>=2.0.0
```
Security Benefits
Credential Exposure Eliminated
| Attack Vector | Password Auth | IAM Auth |
|---|---|---|
| Secret in K8s manifest | Exposed | Eliminated (no secret exists) |
| Env variable in pod | Exposed | Eliminated (no credential env var) |
| Pod description leak | Exposed | Eliminated |
| Log file exposure | Possible | Eliminated |
| Memory dump | Possible | Short-lived token only (1 hr) |
| Network sniffing | Possible (if not TLS) | Short-lived token only (1 hr) |
Compliance Improvements
| Requirement | Password Auth | IAM Auth |
|---|---|---|
| SOC 2 Credential Management | Manual controls | Automated |
| ISO 27001 Access Control | Manual rotation | Automatic |
| HIPAA Access Logging | Partial | Complete |
| PCI DSS Key Rotation | 90-day manual | Continuous |
Audit Trail
IAM authentication provides complete audit logging via Cloud Audit Logs:
```json
{
  "logName": "projects/coditect-cloud-infra/logs/cloudaudit.googleapis.com%2Fdata_access",
  "resource": {
    "type": "cloudsql_database",
    "labels": {
      "database_id": "coditect-cloud-infra:coditect-dms-dev-db",
      "region": "us-central1"
    }
  },
  "protoPayload": {
    "methodName": "cloudsql.instances.connect",
    "authenticationInfo": {
      "principalEmail": "coditect-dms-api-sa@coditect-cloud-infra.iam.gserviceaccount.com"
    },
    "requestMetadata": {
      "callerIp": "10.1.6.26",
      "callerSuppliedUserAgent": "cloud-sql-python-connector/1.0.0"
    }
  }
}
```
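For downstream processing (e.g., exporting connect events to a SIEM), entries like the one above can be filtered on `protoPayload.methodName`. A small sketch over already-parsed log entries; the entry shape follows the example above, and the helper name is an assumption:

```python
def iam_connect_principals(entries):
    """Return the principal emails behind cloudsql.instances.connect events."""
    principals = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        if payload.get("methodName") == "cloudsql.instances.connect":
            email = payload.get("authenticationInfo", {}).get("principalEmail")
            if email:
                principals.append(email)
    return principals


entries = [{
    "protoPayload": {
        "methodName": "cloudsql.instances.connect",
        "authenticationInfo": {
            "principalEmail": "coditect-dms-api-sa@coditect-cloud-infra.iam.gserviceaccount.com"
        },
    }
}]
print(iam_connect_principals(entries))
```

The same filter expression (`protoPayload.methodName="cloudsql.instances.connect"`) can be used directly in a Cloud Logging sink to route these events.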
Rollback Plan
If IAM authentication encounters issues, rollback to password-based authentication:
- Revert `database.py` to use a direct connection string
- Recreate the Kubernetes secret with a URL-encoded password
- Update the deployment to read the password from the secret
- Restart pods
```bash
# Emergency rollback script
kubectl --context=gke_coditect-cloud-infra_us-central1_coditect-dms-dev-cluster \
  create secret generic coditect-dms-secrets \
  --namespace=coditect-dms \
  --from-literal=database-url="postgresql+asyncpg://..." \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment -n coditect-dms
```
Monitoring and Alerts
Key Metrics
```yaml
# Prometheus alerts for IAM authentication
groups:
  - name: cloudsql-iam-auth
    rules:
      - alert: CloudSQLIAMAuthFailure
        expr: rate(cloudsql_database_auth_failures_total{auth_type="iam"}[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloud SQL IAM authentication failures"
          description: "{{ $value }} IAM auth failures in the last 5 minutes"
      - alert: WorkloadIdentityTokenExpiry
        expr: workload_identity_token_expiry_seconds < 300
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Workload Identity token expiring soon"
          description: "Token expires in {{ $value }} seconds"
```
Dashboard
Create Cloud Monitoring dashboard for:
- IAM authentication success/failure rate
- Token refresh frequency
- Connection pool utilization
- Query latency by service account
Migration Checklist
Phase 1: Infrastructure Setup (Day 1)
- Verify Workload Identity enabled on GKE cluster
- Create GCP service account for DMS API
- Grant Cloud SQL Client role
- Configure Workload Identity binding
- Enable IAM authentication on Cloud SQL instance
- Create IAM database user
Phase 2: Application Update (Day 2)
- Add cloud-sql-python-connector to requirements.txt
- Update database.py with IAM authentication code
- Remove password from environment variables
- Update Kubernetes deployment manifest
- Build and push new Docker image
Phase 3: Deployment (Day 3)
- Deploy updated application
- Verify database connectivity
- Run health checks
- Monitor for authentication errors
- Update documentation
Phase 4: Cleanup (Day 4)
- Remove password from GCP Secret Manager
- Delete Kubernetes secret (coditect-dms-secrets database-url)
- Update security documentation
- Archive old credentials securely
Open Questions
- Connection Pooling: Should we use Cloud SQL connection pooling in addition to SQLAlchemy pooling? (Performance testing: Week 1)
- Multi-Region: How does IAM auth work with Cloud SQL multi-region replicas? (Architecture review: Week 2)
- Local Development: Best approach for local development without GKE? (Developer experience: Week 1)
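For the local-development question, one low-risk pattern is an environment switch: IAM auth inside GKE, a conventional (URL-encoded) password DSN on developer machines. A sketch under assumed configuration names (`USE_IAM_AUTH`, `DATABASE_PASSWORD`, etc. are hypothetical, not part of the current configuration):

```python
from urllib.parse import quote_plus


def build_local_database_url(env: dict) -> "str | None":
    """Return a password-based DSN for local development, or None when the
    IAM-auth path (Cloud SQL Python Connector) should be used instead.

    All variable names here are hypothetical placeholders.
    """
    if env.get("USE_IAM_AUTH", "true").lower() == "true":
        return None  # caller falls through to create_engine_with_iam()
    password = quote_plus(env["DATABASE_PASSWORD"])  # avoid URL-encoding bugs
    user = env.get("DATABASE_USER", "dms")
    host = env.get("DATABASE_HOST", "localhost")
    name = env.get("DATABASE_NAME", "coditect_dms")
    return f"postgresql+asyncpg://{user}:{password}@{host}:5432/{name}"
```

Because the default is IAM auth, a misconfigured environment fails toward the passwordless path rather than toward a stored credential.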
References
- Cloud SQL IAM Database Authentication
- Workload Identity for GKE
- Cloud SQL Python Connector
- GKE Workload Identity Best Practices
- Cloud SQL Security Best Practices
Decision: Proposed for approval
Next Steps: Implement Phase 1 infrastructure setup
Review Date: January 5, 2026