Agent Skills Framework Extension

Cloud Infrastructure Patterns Skill

When to Use This Skill

Use this skill when implementing cloud infrastructure patterns in your codebase.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Multi-cloud infrastructure design with Infrastructure as Code, high availability, and cost optimization.

Core Capabilities

  1. Infrastructure as Code - Terraform, Pulumi, CloudFormation
  2. Multi-Cloud Design - GCP, AWS, Azure architecture patterns
  3. High Availability - Regional redundancy, failover, load balancing
  4. Cost Optimization - Resource right-sizing, commitment planning
  5. Security & Compliance - IAM, encryption, audit logging

Cloud Provider Selection Matrix

| Scenario | GCP | AWS | Azure | Recommendation |
|---|---|---|---|---|
| Kubernetes-first | ✅ GKE (best managed K8s) | EKS | AKS | GCP |
| Data analytics/ML | ✅ BigQuery, Vertex AI | Redshift, SageMaker | Synapse, Azure ML | GCP (data), AWS (ML breadth) |
| Enterprise/Microsoft stack | Cloud SQL | RDS | ✅ Azure SQL, AD integration | Azure |
| Serverless functions | Cloud Functions | ✅ Lambda (most mature) | Azure Functions | AWS |
| Global edge/CDN | Cloud CDN | ✅ CloudFront | Azure CDN | AWS |
| Cost sensitivity | Sustained use discounts | ✅ Spot instances (cheapest) | Reserved instances | AWS (spot), GCP (sustained) |
| Compliance (HIPAA/FedRAMP) | ✅ Strong | ✅ Strongest (GovCloud) | ✅ Strong | AWS (GovCloud) |
| Startup credits | $100K+ for startups | $100K for startups | $150K BizSpark | Azure (BizSpark) |

Quick Decision:

What's your primary workload?
├── Containers/Kubernetes → GCP (GKE)
├── Serverless/Event-driven → AWS (Lambda)
├── Microsoft/.NET/Enterprise → Azure
├── Data warehouse/Analytics → GCP (BigQuery)
├── Multi-region global app → AWS (most regions)
└── Cost-constrained startup → Compare credits + spot pricing
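
The decision tree above can be captured as a plain lookup table. This is an illustrative sketch only; the workload keys and picks mirror the tree, they are not an official mapping:

```python
# Sketch: the quick-decision tree above as a lookup table
WORKLOAD_RECOMMENDATION = {
    "containers/kubernetes": "GCP (GKE)",
    "serverless/event-driven": "AWS (Lambda)",
    "microsoft/.net/enterprise": "Azure",
    "data warehouse/analytics": "GCP (BigQuery)",
    "multi-region global app": "AWS (most regions)",
    "cost-constrained startup": "Compare credits + spot pricing",
}

def recommend(workload: str) -> str:
    # Fall through to a neutral answer for workloads the tree does not cover
    return WORKLOAD_RECOMMENDATION.get(
        workload.lower(), "No single winner; compare per-workload"
    )

print(recommend("Containers/Kubernetes"))  # GCP (GKE)
```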

Terraform Multi-Region GCP Infrastructure

# infrastructure/terraform/main.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }

  backend "gcs" {
    bucket = "terraform-state-prod"
    prefix = "infrastructure/state"
  }
}

# Variables
variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "regions" {
  description = "Deployment regions"
  type        = list(string)
  default     = ["us-central1", "us-east1", "europe-west1"]
}

variable "environment" {
  description = "Environment name"
  type        = string
}

# VPC Network with Multi-Region
resource "google_compute_network" "vpc" {
  name                    = "${var.environment}-vpc"
  auto_create_subnetworks = false
  project                 = var.project_id
}

# Subnet in each region
resource "google_compute_subnetwork" "subnets" {
  for_each = toset(var.regions)

  name          = "${var.environment}-subnet-${each.key}"
  network       = google_compute_network.vpc.id
  region        = each.key
  ip_cidr_range = cidrsubnet("10.0.0.0/16", 8, index(var.regions, each.key))
  project       = var.project_id

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = cidrsubnet("10.1.0.0/16", 8, index(var.regions, each.key))
  }

  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = cidrsubnet("10.2.0.0/16", 8, index(var.regions, each.key))
  }

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

# Cloud NAT for private instances
resource "google_compute_router" "router" {
  for_each = toset(var.regions)

  name    = "${var.environment}-router-${each.key}"
  network = google_compute_network.vpc.id
  region  = each.key
  project = var.project_id
}

resource "google_compute_router_nat" "nat" {
  for_each = toset(var.regions)

  name                               = "${var.environment}-nat-${each.key}"
  router                             = google_compute_router.router[each.key].name
  region                             = each.key
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
  project                            = var.project_id

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}

# GKE Cluster per region
resource "google_container_cluster" "primary" {
  for_each = toset(var.regions)

  name     = "${var.environment}-gke-${each.key}"
  location = each.key
  project  = var.project_id

  # Use VPC-native cluster
  network    = google_compute_network.vpc.id
  subnetwork = google_compute_subnetwork.subnets[each.key].id

  # Remove default node pool
  remove_default_node_pool = true
  initial_node_count       = 1

  # Workload Identity for pod-level IAM
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Network policy (a concrete provider is required when enabled)
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  # IP allocation policy for VPC-native
  ip_allocation_policy {
    cluster_secondary_range_name  = "gke-pods"
    services_secondary_range_name = "gke-services"
  }

  # Master authorized networks
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "0.0.0.0/0"
      display_name = "All networks (production should restrict)"
    }
  }

  # Private cluster configuration (control plane requires a /28 block)
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = cidrsubnet("172.16.0.0/16", 12, index(var.regions, each.key))
  }

  # Maintenance window
  maintenance_policy {
    daily_maintenance_window {
      start_time = "03:00"
    }
  }

  # Binary authorization
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  # Logging and monitoring
  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}

# GKE Node Pool with autoscaling
resource "google_container_node_pool" "primary_nodes" {
  for_each = toset(var.regions)

  name               = "${var.environment}-node-pool-${each.key}"
  location           = each.key
  cluster            = google_container_cluster.primary[each.key].name
  initial_node_count = 1 # node_count conflicts with autoscaling
  project            = var.project_id

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    # Spot VMs for non-production; spot is a plain argument (it supersedes
    # the legacy preemptible flag -- set only one of the two)
    spot         = var.environment != "production"
    machine_type = "n2-standard-4"

    # Workload Identity
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    labels = {
      environment = var.environment
      region      = each.key
    }

    tags = ["gke-node", var.environment]

    # Shielded instance
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

# Cloud SQL PostgreSQL with HA
resource "google_sql_database_instance" "postgres" {
  name             = "${var.environment}-postgres"
  database_version = "POSTGRES_15"
  region           = var.regions[0]
  project          = var.project_id

  settings {
    tier              = "db-custom-4-16384"
    availability_type = "REGIONAL" # HA with failover replica

    backup_configuration {
      enabled                        = true
      start_time                     = "03:00"
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 30
      }
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
      require_ssl     = true
    }

    database_flags {
      name  = "max_connections"
      value = "200"
    }

    database_flags {
      name  = "log_checkpoints"
      value = "on"
    }

    insights_config {
      query_insights_enabled  = true
      query_string_length     = 1024
      record_application_tags = true
    }

    maintenance_window {
      day          = 7 # Sunday
      hour         = 3
      update_track = "stable"
    }
  }

  deletion_protection = var.environment == "production"
}

# Cloud Storage bucket with versioning
resource "google_storage_bucket" "artifacts" {
  name          = "${var.project_id}-${var.environment}-artifacts"
  location      = "US"
  project       = var.project_id
  force_destroy = var.environment != "production"

  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }

  lifecycle_rule {
    condition {
      num_newer_versions = 3
    }
    action {
      type = "Delete"
    }
  }

  encryption {
    default_kms_key_name = google_kms_crypto_key.storage.id
  }
}

# Cloud KMS for encryption
resource "google_kms_key_ring" "main" {
  name     = "${var.environment}-keyring"
  location = var.regions[0]
  project  = var.project_id
}

resource "google_kms_crypto_key" "storage" {
  name            = "storage-encryption-key"
  key_ring        = google_kms_key_ring.main.id
  rotation_period = "7776000s" # 90 days

  lifecycle {
    prevent_destroy = true
  }
}

# Load Balancer with CDN
resource "google_compute_global_address" "lb_ip" {
  name    = "${var.environment}-lb-ip"
  project = var.project_id
}

resource "google_compute_backend_service" "backend" {
  name        = "${var.environment}-backend"
  protocol    = "HTTP"
  port_name   = "http"
  timeout_sec = 30
  enable_cdn  = true
  project     = var.project_id

  cdn_policy {
    cache_mode       = "CACHE_ALL_STATIC"
    default_ttl      = 3600
    max_ttl          = 86400
    negative_caching = true
  }

  health_checks = [google_compute_health_check.http.id]

  log_config {
    enable      = true
    sample_rate = 1.0
  }
}

resource "google_compute_health_check" "http" {
  name               = "${var.environment}-health-check"
  check_interval_sec = 10
  timeout_sec        = 5
  project            = var.project_id

  http_health_check {
    port         = 8080
    request_path = "/health"
  }
}

# Outputs
output "vpc_id" {
  value = google_compute_network.vpc.id
}

output "gke_clusters" {
  value = {
    for region, cluster in google_container_cluster.primary :
    region => cluster.endpoint
  }
  sensitive = true
}

output "sql_connection" {
  value     = google_sql_database_instance.postgres.connection_name
  sensitive = true
}

output "lb_ip" {
  value = google_compute_global_address.lb_ip.address
}
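
The cidrsubnet() carving used for the per-region subnet ranges can be sanity-checked outside Terraform. A quick Python sketch (the helper below is illustrative and mirrors Terraform's semantics for whole-subnet numbering, it is not part of the module):

```python
import ipaddress

def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    # Mirror Terraform's cidrsubnet(): take subnet number `netnum` out of
    # `prefix` after extending the mask by `newbits` bits
    net = ipaddress.ip_network(prefix)
    return str(list(net.subnets(prefixlen_diff=newbits))[netnum])

regions = ["us-central1", "us-east1", "europe-west1"]
for i, region in enumerate(regions):
    print(region, cidrsubnet("10.0.0.0/16", 8, i))
# us-central1 10.0.0.0/24
# us-east1 10.0.1.0/24
# europe-west1 10.0.2.0/24
```

The same check applies to the secondary pod/service ranges (10.1.0.0/16 and 10.2.0.0/16), which is how the configuration guarantees non-overlapping CIDRs across regions.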

Cost Optimization Module

# infrastructure/cost_optimization.py
"""
Cloud cost optimization and resource right-sizing
"""
from dataclasses import dataclass
from typing import Any, Dict, List

import pandas as pd
from google.cloud import billing_v1, monitoring_v3


@dataclass
class ResourceRecommendation:
    """Cost optimization recommendation"""
    resource_type: str
    resource_name: str
    current_cost_monthly: float
    recommended_cost_monthly: float
    savings_monthly: float
    savings_annual: float
    action: str
    confidence: float
    details: Dict[str, Any]


class CostOptimizer:
    """Analyze and optimize cloud infrastructure costs"""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self.billing_client = billing_v1.CloudBillingClient()
        self.monitoring_client = monitoring_v3.QueryServiceClient()

    def analyze_compute_utilization(self) -> List[ResourceRecommendation]:
        """Analyze VM and GKE node utilization"""
        recommendations = []

        # Query CPU utilization for past 30 days
        query = """
        fetch gce_instance
        | metric 'compute.googleapis.com/instance/cpu/utilization'
        | group_by 1d, [value_utilization_mean: mean(value.utilization)]
        | within 30d
        """

        results = self._execute_mql_query(query)

        for instance in results:
            avg_cpu = instance['cpu_utilization']
            current_machine_type = instance['machine_type']
            current_cost = self._get_machine_type_cost(current_machine_type)

            # Recommend downsize if CPU < 30%
            if avg_cpu < 0.30:
                recommended_type = self._downsize_machine_type(current_machine_type)
                recommended_cost = self._get_machine_type_cost(recommended_type)

                savings_monthly = current_cost - recommended_cost

                recommendations.append(ResourceRecommendation(
                    resource_type="compute_instance",
                    resource_name=instance['name'],
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action=f"Downsize from {current_machine_type} to {recommended_type}",
                    confidence=0.9 if avg_cpu < 0.20 else 0.7,
                    details={
                        'avg_cpu_utilization': avg_cpu,
                        'current_machine_type': current_machine_type,
                        'recommended_machine_type': recommended_type
                    }
                ))

        return recommendations

    def analyze_storage_usage(self) -> List[ResourceRecommendation]:
        """Analyze storage class optimization opportunities"""
        recommendations = []

        # Check for buckets with infrequently accessed data
        query = """
        fetch gcs_bucket
        | metric 'storage.googleapis.com/storage/total_bytes'
        | group_by 7d, [value_total_bytes_mean: mean(value.total_bytes)]
        | within 90d
        """

        # Also query access patterns
        access_query = """
        fetch gcs_bucket
        | metric 'storage.googleapis.com/api/request_count'
        | filter metric.method == 'ReadObject'
        | group_by 30d, [value_request_count_sum: sum(value.request_count)]
        | within 90d
        """

        storage_results = self._execute_mql_query(query)
        access_results = self._execute_mql_query(access_query)

        # Index the access results by bucket name; the query returns a list,
        # so a direct .get() on it would fail
        access_by_bucket = {r['bucket_name']: r for r in access_results}

        for bucket in storage_results:
            bucket_name = bucket['bucket_name']
            size_gb = bucket['total_bytes'] / (1024 ** 3)
            access_count = access_by_bucket.get(bucket_name, {}).get('request_count', 0)

            # Recommend Nearline/Coldline for infrequently accessed data
            if access_count < 10 and size_gb > 100:  # < 10 accesses/month
                current_cost = size_gb * 0.020      # Standard storage $0.020/GB
                recommended_cost = size_gb * 0.010  # Nearline $0.010/GB

                recommendations.append(ResourceRecommendation(
                    resource_type="gcs_bucket",
                    resource_name=bucket_name,
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=current_cost - recommended_cost,
                    savings_annual=(current_cost - recommended_cost) * 12,
                    action="Migrate to Nearline storage class",
                    confidence=0.85,
                    details={
                        'size_gb': size_gb,
                        'access_count_30d': access_count,
                        'current_class': 'STANDARD',
                        'recommended_class': 'NEARLINE'
                    }
                ))

        return recommendations

    def analyze_commitment_opportunities(self) -> List[ResourceRecommendation]:
        """Identify resources eligible for committed use discounts"""
        recommendations = []

        # Find stable workloads running for >30 days
        query = """
        fetch gce_instance
        | filter resource.instance_id != ''
        | within 90d
        """

        results = self._execute_mql_query(query)

        for instance in results:
            uptime_days = instance['uptime_days']

            if uptime_days > 30:  # Stable workload
                machine_type = instance['machine_type']
                on_demand_cost = self._get_machine_type_cost(machine_type)
                committed_cost = on_demand_cost * 0.70  # 30% discount for 1-year commitment

                savings_monthly = on_demand_cost - committed_cost

                recommendations.append(ResourceRecommendation(
                    resource_type="commitment",
                    resource_name=instance['name'],
                    current_cost_monthly=on_demand_cost,
                    recommended_cost_monthly=committed_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action="Purchase 1-year committed use discount",
                    confidence=0.95,
                    details={
                        'machine_type': machine_type,
                        'uptime_days': uptime_days,
                        'discount_percentage': 30
                    }
                ))

        return recommendations

    def generate_report(self) -> pd.DataFrame:
        """Generate comprehensive cost optimization report"""
        all_recommendations = []

        all_recommendations.extend(self.analyze_compute_utilization())
        all_recommendations.extend(self.analyze_storage_usage())
        all_recommendations.extend(self.analyze_commitment_opportunities())

        # Sort by annual savings descending BEFORE formatting as currency
        # strings; sorting "$1,234.56" strings would be lexicographic
        all_recommendations.sort(key=lambda r: r.savings_annual, reverse=True)

        # Convert to DataFrame
        df = pd.DataFrame([
            {
                'Resource Type': r.resource_type,
                'Resource Name': r.resource_name,
                'Current Cost/Month': f"${r.current_cost_monthly:,.2f}",
                'Recommended Cost/Month': f"${r.recommended_cost_monthly:,.2f}",
                'Savings/Month': f"${r.savings_monthly:,.2f}",
                'Savings/Year': f"${r.savings_annual:,.2f}",
                'Action': r.action,
                'Confidence': f"{r.confidence * 100:.0f}%"
            }
            for r in all_recommendations
        ])

        return df

    def _execute_mql_query(self, query: str) -> List[Dict]:
        """Execute Monitoring Query Language query"""
        # Simplified - actual implementation would use the monitoring API
        return []

    def _get_machine_type_cost(self, machine_type: str) -> float:
        """Get monthly cost for machine type"""
        # Simplified pricing - actual would use the Billing API
        pricing = {
            'n1-standard-1': 24.27,
            'n1-standard-2': 48.55,
            'n2-standard-2': 56.50,
            'n2-standard-4': 113.00
        }
        return pricing.get(machine_type, 0)

    def _downsize_machine_type(self, current: str) -> str:
        """Recommend smaller machine type"""
        downsize_map = {
            'n2-standard-4': 'n2-standard-2',
            'n2-standard-2': 'n1-standard-1'
        }
        return downsize_map.get(current, current)
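
A minimal check of the savings arithmetic the recommendations rely on. Hedged: the module's query methods are stubs until wired to the real Monitoring/Billing APIs, so this uses the sample n2 prices from the simplified pricing table above:

```python
# Verify the downsize savings arithmetic with the sample pricing above:
# n2-standard-4 ($113.00/mo) -> n2-standard-2 ($56.50/mo)
current, recommended = 113.00, 56.50
savings_monthly = current - recommended
savings_annual = savings_monthly * 12
print(f"${savings_monthly:,.2f}/month -> ${savings_annual:,.2f}/year")
# $56.50/month -> $678.00/year
```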

Usage Examples

Multi-Region Infrastructure

Apply cloud-infrastructure-patterns skill to design GCP multi-region architecture with Terraform

Cost Optimization

Apply cloud-infrastructure-patterns skill to analyze GCP costs and generate optimization recommendations

High Availability Setup

Apply cloud-infrastructure-patterns skill to implement regional HA with automatic failover

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: cloud-infrastructure-patterns

Completed:
- [x] Multi-region VPC with subnets in 3 regions configured
- [x] GKE clusters deployed across all regions with autoscaling
- [x] Cloud SQL PostgreSQL with regional HA and PITR enabled
- [x] Cloud Storage bucket with lifecycle policies and KMS encryption
- [x] Load balancer with CDN and health checks operational
- [x] Cloud NAT for private instance egress configured

Outputs:
- infrastructure/terraform/main.tf (Multi-region infrastructure as code)
- infrastructure/terraform/variables.tf (Configurable parameters)
- infrastructure/terraform/outputs.tf (Resource identifiers and endpoints)
- infrastructure/cost_optimization.py (Cost analysis and recommendations)
- docs/architecture-diagram.png (C4 context and container diagrams)
- docs/runbook.md (Operational procedures)

Infrastructure Metrics:
- High availability: 99.95% uptime SLA (regional failover tested)
- Cost optimization: $12,450/month saved via commitment discounts
- Security: All resources in private subnets, RLS enabled, KMS encrypted
- Scalability: Autoscaling handles 10x traffic spikes

Completion Checklist

Before marking this skill as complete, verify:

  • Terraform configuration validates with terraform validate
  • VPC network created with auto_create_subnetworks = false
  • Subnets provisioned in all target regions with secondary ranges for GKE
  • Cloud NAT configured for private instance internet access
  • GKE clusters use VPC-native networking (IP allocation policy)
  • GKE workload identity enabled for pod-level IAM
  • Node pools configured with autoscaling (min/max nodes)
  • Cloud SQL configured with REGIONAL availability (HA replica)
  • Point-in-time recovery enabled on Cloud SQL (7-day transaction logs)
  • Cloud Storage bucket uses KMS encryption with 90-day key rotation
  • Lifecycle policies delete old versions and aged objects
  • Load balancer health checks validate backend availability
  • CDN enabled for static content with cache policies
  • Cost optimization script identifies underutilized resources
  • All outputs exist at expected locations and pass validation

Failure Indicators

This skill has FAILED if:

  • ❌ Terraform apply fails with validation or dependency errors
  • ❌ VPC subnets have overlapping CIDR ranges
  • ❌ GKE clusters cannot reach internet (Cloud NAT misconfigured)
  • ❌ GKE pods cannot pull images (workload identity issue)
  • ❌ Node pools do not autoscale under load
  • ❌ Cloud SQL configured as ZONAL (no HA failover)
  • ❌ Point-in-time recovery disabled (data loss risk)
  • ❌ Cloud Storage bucket uses default encryption (not KMS)
  • ❌ Lifecycle policies missing (unbounded storage growth)
  • ❌ Load balancer health checks always succeed (not validating backend)
  • ❌ CDN disabled or misconfigured (high latency for static content)
  • ❌ Cost optimization script finds no savings (incomplete analysis)

When NOT to Use

Do NOT use this skill when:

  • Single-region deployment sufficient (no multi-region HA required)
  • Using serverless platforms (Cloud Run, Cloud Functions) instead of GKE
  • Infrastructure already provisioned and managed externally
  • Development/staging environments (use simpler, cheaper configurations)
  • Prototyping or proof-of-concept (overhead not justified)
  • Using managed Kubernetes services from other providers (EKS, AKS)
  • Infrastructure managed via UI console (not IaC)

Alternative approaches:

  • Single region: Remove region loop, deploy to one region only
  • Serverless: Use Cloud Run with auto-scaling and managed load balancing
  • Existing infra: Import resources into Terraform state
  • Dev/staging: Use preemptible VMs, smaller machine types, disable HA
  • Other providers: Adapt patterns to AWS (EKS, RDS) or Azure (AKS, PostgreSQL)

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Manual infrastructure changes | Configuration drift, not reproducible | Always update Terraform, never modify via console |
| No state file locking | Concurrent applies corrupt state | Use GCS backend with state locking |
| Hardcoded values in .tf files | Not reusable across environments | Use variables.tf and terraform.tfvars |
| No cost monitoring | Runaway spend undetected | Enable billing alerts, run cost optimization script |
| Public subnets for all resources | Security risk | Use private subnets + Cloud NAT for internet egress |
| No HA for databases | Single point of failure | Configure REGIONAL availability with replicas |
| Unencrypted storage | Compliance violation | Use KMS encryption for all storage |
| No lifecycle policies | Storage costs grow unbounded | Implement retention and deletion policies |
| Missing health checks | Load balancer routes to unhealthy backends | Configure HTTP health checks on /health endpoint |
| No autoscaling | Over-provision or under-provision | Enable cluster and node pool autoscaling |

Principles

This skill embodies:

  • #1 Infrastructure as Code - All resources defined in version-controlled Terraform
  • #2 High Availability First - Regional redundancy, automatic failover, HA databases
  • #3 Security by Default - Private subnets, KMS encryption, workload identity, RLS
  • #4 Cost Optimization - Automated analysis identifies savings (commitments, right-sizing)
  • #5 Eliminate Ambiguity - Explicit resource configuration; no default assumptions
  • #6 Clear, Understandable, Explainable - Well-documented Terraform with output values
  • #8 No Assumptions - Validate all infrastructure with health checks and monitoring
  • #10 Automation First - Terraform provisions everything; no manual console clicks
  • #11 Observability - Logging, monitoring, and alerting configured for all resources

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Integration Points

  • k8s-statefulset-patterns - Kubernetes deployments
  • deployment-strategy-patterns - Deployment automation
  • cicd-automation-patterns - Infrastructure CI/CD