Agent Skills Framework Extension
Cloud Infrastructure Patterns Skill
When to Use This Skill
Use this skill when implementing cloud infrastructure patterns in your codebase.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Multi-cloud infrastructure design with Infrastructure as Code, high availability, and cost optimization.
Core Capabilities
- Infrastructure as Code - Terraform, Pulumi, CloudFormation
- Multi-Cloud Design - GCP, AWS, Azure architecture patterns
- High Availability - Regional redundancy, failover, load balancing
- Cost Optimization - Resource right-sizing, commitment planning
- Security & Compliance - IAM, encryption, audit logging
Cloud Provider Selection Matrix
| Scenario | GCP | AWS | Azure | Recommendation |
|---|---|---|---|---|
| Kubernetes-first | ✅ GKE (best managed K8s) | EKS | AKS | GCP |
| Data analytics/ML | ✅ BigQuery, Vertex AI | Redshift, SageMaker | Synapse, Azure ML | GCP (data), AWS (ML breadth) |
| Enterprise/Microsoft stack | Cloud SQL | RDS | ✅ Azure SQL, AD integration | Azure |
| Serverless functions | Cloud Functions | ✅ Lambda (most mature) | Azure Functions | AWS |
| Global edge/CDN | Cloud CDN | ✅ CloudFront | Azure CDN | AWS |
| Cost sensitivity | Sustained use discounts | ✅ Spot instances (cheapest) | Reserved instances | AWS (spot), GCP (sustained) |
| Compliance (HIPAA/FedRAMP) | ✅ Strong | ✅ Strongest (GovCloud) | ✅ Strong | AWS (GovCloud) |
| Startup credits | Up to ~$100K (Google for Startups) | Up to ~$100K (AWS Activate) | Up to ~$150K (Microsoft for Startups) | Compare current program terms |
Quick Decision:
What's your primary workload?
├── Containers/Kubernetes → GCP (GKE)
├── Serverless/Event-driven → AWS (Lambda)
├── Microsoft/.NET/Enterprise → Azure
├── Data warehouse/Analytics → GCP (BigQuery)
├── Multi-region global app → AWS (most regions)
└── Cost-constrained startup → Compare credits + spot pricing
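The decision tree above can be encoded as a simple lookup. This is purely an illustration of the matrix, not an authoritative ruling; provider strengths shift over time, so treat the workload keys and the mapping itself as assumptions to revisit:

```python
# Hypothetical encoding of the decision tree; keys and recommendations
# mirror the matrix above and should be revisited as offerings change.
RECOMMENDED_PROVIDER = {
    "kubernetes": "GCP",             # GKE
    "serverless": "AWS",             # Lambda
    "microsoft_enterprise": "Azure", # Azure SQL, AD integration
    "data_analytics": "GCP",         # BigQuery
    "global_multi_region": "AWS",    # most regions
}

def recommend_provider(workload: str) -> str:
    """Return the matrix recommendation, or the fallback advice."""
    return RECOMMENDED_PROVIDER.get(workload, "compare credits + spot pricing")

print(recommend_provider("kubernetes"))  # GCP
```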
Terraform Multi-Region GCP Infrastructure
# infrastructure/terraform/main.tf
terraform {
required_version = ">= 1.5"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
backend "gcs" {
bucket = "terraform-state-prod"
prefix = "infrastructure/state"
}
}
# Variables
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "regions" {
description = "Deployment regions"
type = list(string)
default = ["us-central1", "us-east1", "europe-west1"]
}
variable "environment" {
description = "Environment name"
type = string
}
# VPC Network with Multi-Region
resource "google_compute_network" "vpc" {
name = "${var.environment}-vpc"
auto_create_subnetworks = false
project = var.project_id
}
# Subnet in each region
resource "google_compute_subnetwork" "subnets" {
for_each = toset(var.regions)
name = "${var.environment}-subnet-${each.key}"
network = google_compute_network.vpc.id
region = each.key
ip_cidr_range = cidrsubnet("10.0.0.0/16", 8, index(var.regions, each.key))
project = var.project_id
secondary_ip_range {
range_name = "gke-pods"
ip_cidr_range = cidrsubnet("10.1.0.0/16", 8, index(var.regions, each.key))
}
secondary_ip_range {
range_name = "gke-services"
ip_cidr_range = cidrsubnet("10.2.0.0/16", 8, index(var.regions, each.key))
}
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
# Cloud NAT for private instances
resource "google_compute_router" "router" {
for_each = toset(var.regions)
name = "${var.environment}-router-${each.key}"
network = google_compute_network.vpc.id
region = each.key
project = var.project_id
}
resource "google_compute_router_nat" "nat" {
for_each = toset(var.regions)
name = "${var.environment}-nat-${each.key}"
router = google_compute_router.router[each.key].name
region = each.key
nat_ip_allocate_option = "AUTO_ONLY"
source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
project = var.project_id
log_config {
enable = true
filter = "ERRORS_ONLY"
}
}
# GKE Cluster per region
resource "google_container_cluster" "primary" {
for_each = toset(var.regions)
name = "${var.environment}-gke-${each.key}"
location = each.key
project = var.project_id
# Use VPC-native cluster
network = google_compute_network.vpc.id
subnetwork = google_compute_subnetwork.subnets[each.key].id
# Remove default node pool
remove_default_node_pool = true
initial_node_count = 1
# Workload Identity for pod-level IAM
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Network policy
network_policy {
enabled = true
provider = "CALICO" # name a concrete provider when enforcement is enabled; PROVIDER_UNSPECIFIED leaves policy unenforced
}
# IP allocation policy for VPC-native
ip_allocation_policy {
cluster_secondary_range_name = "gke-pods"
services_secondary_range_name = "gke-services"
}
# Master authorized networks
master_authorized_networks_config {
cidr_blocks {
cidr_block = "0.0.0.0/0"
display_name = "All networks (production should restrict)"
}
}
# Private cluster configuration
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = cidrsubnet("172.16.0.0/16", 12, index(var.regions, each.key)) # GKE requires a /28 here
}
# Maintenance window
maintenance_policy {
daily_maintenance_window {
start_time = "03:00"
}
}
# Binary authorization
binary_authorization {
evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
}
# Logging and monitoring
logging_config {
enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
}
monitoring_config {
enable_components = ["SYSTEM_COMPONENTS"]
managed_prometheus {
enabled = true
}
}
}
# GKE Node Pool with autoscaling
resource "google_container_node_pool" "primary_nodes" {
for_each = toset(var.regions)
name = "${var.environment}-node-pool-${each.key}"
location = each.key
cluster = google_container_cluster.primary[each.key].name
node_count = 1
project = var.project_id
autoscaling {
min_node_count = 1
max_node_count = 10
}
node_config {
preemptible = var.environment != "production"
machine_type = "n2-standard-4"
# Workload Identity
workload_metadata_config {
mode = "GKE_METADATA"
}
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
labels = {
environment = var.environment
region = each.key
}
tags = ["gke-node", var.environment]
# Shielded instance
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
# Spot capacity: "spot" is a node_config argument, not a nested block,
# and it conflicts with preemptible above. On provider >= 4.x prefer
# spot = var.environment != "production" in place of preemptible.
}
management {
auto_repair = true
auto_upgrade = true
}
}
# Cloud SQL PostgreSQL with HA
resource "google_sql_database_instance" "postgres" {
name = "${var.environment}-postgres"
database_version = "POSTGRES_15"
region = var.regions[0]
project = var.project_id
settings {
tier = "db-custom-4-16384"
availability_type = "REGIONAL" # HA with failover replica
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
transaction_log_retention_days = 7
backup_retention_settings {
retained_backups = 30
}
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.vpc.id
require_ssl = true
}
database_flags {
name = "max_connections"
value = "200"
}
database_flags {
name = "log_checkpoints"
value = "on"
}
insights_config {
query_insights_enabled = true
query_string_length = 1024
record_application_tags = true
}
maintenance_window {
day = 7 # Sunday
hour = 3
update_track = "stable"
}
}
deletion_protection = var.environment == "production"
}
# Cloud Storage bucket with versioning
resource "google_storage_bucket" "artifacts" {
name = "${var.project_id}-${var.environment}-artifacts"
location = "US"
project = var.project_id
force_destroy = var.environment != "production"
uniform_bucket_level_access = true
versioning {
enabled = true
}
lifecycle_rule {
condition {
age = 90
}
action {
type = "Delete"
}
}
lifecycle_rule {
condition {
num_newer_versions = 3
}
action {
type = "Delete"
}
}
encryption {
default_kms_key_name = google_kms_crypto_key.storage.id
}
}
# Cloud KMS for encryption
resource "google_kms_key_ring" "main" {
name = "${var.environment}-keyring"
location = var.regions[0]
project = var.project_id
}
resource "google_kms_crypto_key" "storage" {
name = "storage-encryption-key"
key_ring = google_kms_key_ring.main.id
rotation_period = "7776000s" # 90 days
lifecycle {
prevent_destroy = true
}
}
# Load Balancer with CDN
resource "google_compute_global_address" "lb_ip" {
name = "${var.environment}-lb-ip"
project = var.project_id
}
resource "google_compute_backend_service" "backend" {
name = "${var.environment}-backend"
protocol = "HTTP"
port_name = "http"
timeout_sec = 30
enable_cdn = true
project = var.project_id
cdn_policy {
cache_mode = "CACHE_ALL_STATIC"
default_ttl = 3600
max_ttl = 86400
negative_caching = true
}
health_checks = [google_compute_health_check.http.id]
log_config {
enable = true
sample_rate = 1.0
}
}
resource "google_compute_health_check" "http" {
name = "${var.environment}-health-check"
check_interval_sec = 10
timeout_sec = 5
project = var.project_id
http_health_check {
port = 8080
request_path = "/health"
}
}
# Outputs
output "vpc_id" {
value = google_compute_network.vpc.id
}
output "gke_clusters" {
value = {
for region, cluster in google_container_cluster.primary :
region => cluster.endpoint
}
sensitive = true
}
output "sql_connection" {
value = google_sql_database_instance.postgres.connection_name
sensitive = true
}
output "lb_ip" {
value = google_compute_global_address.lb_ip.address
}
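Terraform's cidrsubnet(prefix, newbits, netnum) adds newbits to the prefix length and selects the netnum-th resulting block, so the /16 ranges above yield one /24 per region for nodes, pods, and services. A quick sketch with Python's ipaddress module (region list assumed to match the variable default) confirms the allocations never collide:

```python
import ipaddress

def cidrsubnet(prefix: str, newbits: int, netnum: int) -> ipaddress.IPv4Network:
    """Python equivalent of Terraform's cidrsubnet() for IPv4."""
    net = ipaddress.ip_network(prefix)
    return list(net.subnets(prefixlen_diff=newbits))[netnum]

regions = ["us-central1", "us-east1", "europe-west1"]  # variable default
allocations = []
for i, region in enumerate(regions):
    allocations += [
        cidrsubnet("10.0.0.0/16", 8, i),  # node subnet, e.g. 10.0.0.0/24
        cidrsubnet("10.1.0.0/16", 8, i),  # gke-pods secondary range
        cidrsubnet("10.2.0.0/16", 8, i),  # gke-services secondary range
    ]

# No two allocated ranges may overlap
for a in allocations:
    for b in allocations:
        assert a is b or not a.overlaps(b)
print(allocations[0])  # 10.0.0.0/24
```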
Cost Optimization Module
# infrastructure/cost_optimization.py
"""
Cloud cost optimization and resource right-sizing
"""
from dataclasses import dataclass
from typing import Any, Dict, List

import pandas as pd
from google.cloud import billing_v1, monitoring_v3


@dataclass
class ResourceRecommendation:
    """Cost optimization recommendation"""
    resource_type: str
    resource_name: str
    current_cost_monthly: float
    recommended_cost_monthly: float
    savings_monthly: float
    savings_annual: float
    action: str
    confidence: float
    details: Dict[str, Any]


class CostOptimizer:
    """Analyze and optimize cloud infrastructure costs"""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self.billing_client = billing_v1.CloudBillingClient()
        self.monitoring_client = monitoring_v3.QueryServiceClient()

    def analyze_compute_utilization(self) -> List[ResourceRecommendation]:
        """Analyze VM and GKE node utilization"""
        recommendations = []
        # Query CPU utilization for the past 30 days
        query = """
            fetch gce_instance
            | metric 'compute.googleapis.com/instance/cpu/utilization'
            | group_by 1d, [value_utilization_mean: mean(value.utilization)]
            | within 30d
        """
        results = self._execute_mql_query(query)
        for instance in results:
            avg_cpu = instance['cpu_utilization']
            current_machine_type = instance['machine_type']
            current_cost = self._get_machine_type_cost(current_machine_type)
            # Recommend downsizing if average CPU < 30%
            if avg_cpu < 0.30:
                recommended_type = self._downsize_machine_type(current_machine_type)
                recommended_cost = self._get_machine_type_cost(recommended_type)
                savings_monthly = current_cost - recommended_cost
                recommendations.append(ResourceRecommendation(
                    resource_type="compute_instance",
                    resource_name=instance['name'],
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action=f"Downsize from {current_machine_type} to {recommended_type}",
                    confidence=0.9 if avg_cpu < 0.20 else 0.7,
                    details={
                        'avg_cpu_utilization': avg_cpu,
                        'current_machine_type': current_machine_type,
                        'recommended_machine_type': recommended_type,
                    },
                ))
        return recommendations

    def analyze_storage_usage(self) -> List[ResourceRecommendation]:
        """Analyze storage class optimization opportunities"""
        recommendations = []
        # Check for buckets with infrequently accessed data
        query = """
            fetch gcs_bucket
            | metric 'storage.googleapis.com/storage/total_bytes'
            | group_by 7d, [value_total_bytes_mean: mean(value.total_bytes)]
            | within 90d
        """
        # Also query access patterns
        access_query = """
            fetch gcs_bucket
            | metric 'storage.googleapis.com/api/request_count'
            | filter metric.method == 'ReadObject'
            | group_by 30d, [value_request_count_sum: sum(value.request_count)]
            | within 90d
        """
        storage_results = self._execute_mql_query(query)
        access_results = self._execute_mql_query(access_query)
        # Key access counts by bucket name for O(1) lookup below
        access_by_bucket = {
            row['bucket_name']: row.get('request_count', 0)
            for row in access_results
        }
        for bucket in storage_results:
            bucket_name = bucket['bucket_name']
            size_gb = bucket['total_bytes'] / (1024 ** 3)
            access_count = access_by_bucket.get(bucket_name, 0)
            # Recommend Nearline/Coldline for infrequently accessed data
            if access_count < 10 and size_gb > 100:  # < 10 reads in 30 days
                current_cost = size_gb * 0.020      # Standard storage $0.020/GB
                recommended_cost = size_gb * 0.010  # Nearline $0.010/GB
                recommendations.append(ResourceRecommendation(
                    resource_type="gcs_bucket",
                    resource_name=bucket_name,
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=current_cost - recommended_cost,
                    savings_annual=(current_cost - recommended_cost) * 12,
                    action="Migrate to Nearline storage class",
                    confidence=0.85,
                    details={
                        'size_gb': size_gb,
                        'access_count_30d': access_count,
                        'current_class': 'STANDARD',
                        'recommended_class': 'NEARLINE',
                    },
                ))
        return recommendations

    def analyze_commitment_opportunities(self) -> List[ResourceRecommendation]:
        """Identify resources eligible for committed use discounts"""
        recommendations = []
        # Find stable workloads running for > 30 days
        query = """
            fetch gce_instance
            | filter resource.instance_id != ''
            | within 90d
        """
        results = self._execute_mql_query(query)
        for instance in results:
            uptime_days = instance['uptime_days']
            if uptime_days > 30:  # Stable workload
                machine_type = instance['machine_type']
                on_demand_cost = self._get_machine_type_cost(machine_type)
                committed_cost = on_demand_cost * 0.70  # ~30% off, 1-year commitment
                savings_monthly = on_demand_cost - committed_cost
                recommendations.append(ResourceRecommendation(
                    resource_type="commitment",
                    resource_name=instance['name'],
                    current_cost_monthly=on_demand_cost,
                    recommended_cost_monthly=committed_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action="Purchase 1-year committed use discount",
                    confidence=0.95,
                    details={
                        'machine_type': machine_type,
                        'uptime_days': uptime_days,
                        'discount_percentage': 30,
                    },
                ))
        return recommendations

    def generate_report(self) -> pd.DataFrame:
        """Generate comprehensive cost optimization report"""
        all_recommendations = []
        all_recommendations.extend(self.analyze_compute_utilization())
        all_recommendations.extend(self.analyze_storage_usage())
        all_recommendations.extend(self.analyze_commitment_opportunities())
        # Sort numerically before formatting; sorting the "$1,234.56"
        # strings would be lexicographic and give the wrong order
        all_recommendations.sort(key=lambda r: r.savings_annual, reverse=True)
        return pd.DataFrame([
            {
                'Resource Type': r.resource_type,
                'Resource Name': r.resource_name,
                'Current Cost/Month': f"${r.current_cost_monthly:,.2f}",
                'Recommended Cost/Month': f"${r.recommended_cost_monthly:,.2f}",
                'Savings/Month': f"${r.savings_monthly:,.2f}",
                'Savings/Year': f"${r.savings_annual:,.2f}",
                'Action': r.action,
                'Confidence': f"{r.confidence * 100:.0f}%",
            }
            for r in all_recommendations
        ])

    def _execute_mql_query(self, query: str) -> List[Dict]:
        """Execute a Monitoring Query Language query"""
        # Simplified - a real implementation would call the Monitoring API
        return []

    def _get_machine_type_cost(self, machine_type: str) -> float:
        """Get approximate monthly cost for a machine type"""
        # Simplified pricing - a real implementation would use the Billing API
        pricing = {
            'n1-standard-1': 24.27,
            'n1-standard-2': 48.55,
            'n2-standard-2': 56.50,
            'n2-standard-4': 113.00,
        }
        return pricing.get(machine_type, 0.0)

    def _downsize_machine_type(self, current: str) -> str:
        """Recommend the next smaller machine type"""
        downsize_map = {
            'n2-standard-4': 'n2-standard-2',
            'n2-standard-2': 'n1-standard-1',
        }
        return downsize_map.get(current, current)
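Since _execute_mql_query is a stub, the reporting pipeline can still be exercised with a hand-built recommendation. A minimal usage sketch; the instance name and cost figures are placeholders, and a trimmed copy of the dataclass is included so the snippet runs standalone:

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ResourceRecommendation:  # trimmed standalone copy of the class above
    resource_type: str
    resource_name: str
    current_cost_monthly: float
    recommended_cost_monthly: float
    savings_monthly: float
    savings_annual: float
    action: str
    confidence: float
    details: Dict[str, Any]

rec = ResourceRecommendation(
    resource_type="compute_instance",
    resource_name="example-vm",      # placeholder name
    current_cost_monthly=113.00,     # assumed n2-standard-4 price
    recommended_cost_monthly=56.50,  # assumed n2-standard-2 price
    savings_monthly=56.50,
    savings_annual=56.50 * 12,
    action="Downsize from n2-standard-4 to n2-standard-2",
    confidence=0.9,
    details={"avg_cpu_utilization": 0.18},
)
print(f"{rec.resource_name}: save ${rec.savings_annual:,.2f}/yr "
      f"(confidence {rec.confidence:.0%})")
# example-vm: save $678.00/yr (confidence 90%)
```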
Usage Examples
Multi-Region Infrastructure
Apply cloud-infrastructure-patterns skill to design GCP multi-region architecture with Terraform
Cost Optimization
Apply cloud-infrastructure-patterns skill to analyze GCP costs and generate optimization recommendations
High Availability Setup
Apply cloud-infrastructure-patterns skill to implement regional HA with automatic failover
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: cloud-infrastructure-patterns
Completed:
- [x] Multi-region VPC with subnets in 3 regions configured
- [x] GKE clusters deployed across all regions with autoscaling
- [x] Cloud SQL PostgreSQL with regional HA and PITR enabled
- [x] Cloud Storage bucket with lifecycle policies and KMS encryption
- [x] Load balancer with CDN and health checks operational
- [x] Cloud NAT for private instance egress configured
Outputs:
- infrastructure/terraform/main.tf (Multi-region infrastructure as code)
- infrastructure/terraform/variables.tf (Configurable parameters)
- infrastructure/terraform/outputs.tf (Resource identifiers and endpoints)
- infrastructure/cost_optimization.py (Cost analysis and recommendations)
- docs/architecture-diagram.png (C4 context and container diagrams)
- docs/runbook.md (Operational procedures)
Infrastructure Metrics:
- High availability: 99.95% uptime SLA (regional failover tested)
- Cost optimization: $12,450/month saved via commitment discounts
- Security: All resources in private subnets, RLS enabled, KMS encrypted
- Scalability: Autoscaling handles 10x traffic spikes
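The savings and scalability figures above are illustrative, but the node-pool autoscaling bounds (min 1, max 10 nodes per region, three regions) do put a hard envelope on compute spend. A back-of-the-envelope sketch, assuming the ~$113/month n2-standard-4 list price used in the pricing table of the cost module:

```python
# Assumed list price for n2-standard-4; use the Billing API or the GCP
# pricing calculator for authoritative figures.
N2_STANDARD_4_MONTHLY = 113.00
REGIONS = 3
MIN_NODES, MAX_NODES = 1, 10  # per-region autoscaling bounds

# Note: for regional GKE clusters node counts apply per zone, so the
# real ceiling can be roughly 3x this per-region estimate.
floor = REGIONS * MIN_NODES * N2_STANDARD_4_MONTHLY
ceiling = REGIONS * MAX_NODES * N2_STANDARD_4_MONTHLY
print(f"compute envelope: ${floor:,.2f} to ${ceiling:,.2f} per month")
# compute envelope: $339.00 to $3,390.00 per month
```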
Completion Checklist
Before marking this skill as complete, verify:
- Terraform configuration validates with terraform validate
- VPC network created with auto_create_subnetworks = false
- Subnets provisioned in all target regions with secondary ranges for GKE
- Cloud NAT configured for private instance internet access
- GKE clusters use VPC-native networking (IP allocation policy)
- GKE workload identity enabled for pod-level IAM
- Node pools configured with autoscaling (min/max nodes)
- Cloud SQL configured with REGIONAL availability (HA replica)
- Point-in-time recovery enabled on Cloud SQL (7-day transaction logs)
- Cloud Storage bucket uses KMS encryption with 90-day key rotation
- Lifecycle policies delete old versions and aged objects
- Load balancer health checks validate backend availability
- CDN enabled for static content with cache policies
- Cost optimization script identifies underutilized resources
- All outputs exist at expected locations and pass validation
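The validation step in the checklist (and plan/apply generally) can be scripted. A minimal sketch; the directory path and variable names assume the layout shown in this skill, and commands are printed rather than executed so the snippet is safe to run:

```python
TF_DIR = "infrastructure/terraform"  # assumed layout from the Outputs list

def terraform_cmd(action: str, **tf_vars: str) -> list:
    """Build a terraform invocation; init/validate take no -var flags."""
    cmd = ["terraform", f"-chdir={TF_DIR}", action]
    if action in ("plan", "apply", "destroy"):
        cmd += [f"-var={k}={v}" for k, v in tf_vars.items()]
    return cmd

for step in ("init", "validate", "plan"):
    print(" ".join(terraform_cmd(step, project_id="my-project",
                                 environment="staging")))
    # pass the list to subprocess.run(cmd, check=True) to actually execute
```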
Failure Indicators
This skill has FAILED if:
- ❌ Terraform apply fails with validation or dependency errors
- ❌ VPC subnets have overlapping CIDR ranges
- ❌ GKE clusters cannot reach internet (Cloud NAT misconfigured)
- ❌ GKE pods cannot pull images (workload identity issue)
- ❌ Node pools do not autoscale under load
- ❌ Cloud SQL configured as ZONAL (no HA failover)
- ❌ Point-in-time recovery disabled (data loss risk)
- ❌ Cloud Storage bucket uses default encryption (not KMS)
- ❌ Lifecycle policies missing (unbounded storage growth)
- ❌ Load balancer health checks always succeed (not validating backend)
- ❌ CDN disabled or misconfigured (high latency for static content)
- ❌ Cost optimization script finds no savings (incomplete analysis)
When NOT to Use
Do NOT use this skill when:
- Single-region deployment sufficient (no multi-region HA required)
- Using serverless platforms (Cloud Run, Cloud Functions) instead of GKE
- Infrastructure already provisioned and managed externally
- Development/staging environments (use simpler, cheaper configurations)
- Prototyping or proof-of-concept (overhead not justified)
- Using managed Kubernetes services from other providers (EKS, AKS)
- Infrastructure managed via UI console (not IaC)
Alternative approaches:
- Single region: Remove region loop, deploy to one region only
- Serverless: Use Cloud Run with auto-scaling and managed load balancing
- Existing infra: Import resources into Terraform state
- Dev/staging: Use preemptible VMs, smaller machine types, disable HA
- Other providers: Adapt patterns to AWS (EKS, RDS) or Azure (AKS, PostgreSQL)
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Manual infrastructure changes | Configuration drift, not reproducible | Always update Terraform, never modify via console |
| No state file locking | Concurrent applies corrupt state | Use GCS backend with state locking |
| Hardcoded values in .tf files | Not reusable across environments | Use variables.tf and terraform.tfvars |
| No cost monitoring | Runaway spend undetected | Enable billing alerts, run cost optimization script |
| Public subnets for all resources | Security risk | Use private subnets + Cloud NAT for internet egress |
| No HA for databases | Single point of failure | Configure REGIONAL availability with replicas |
| Unencrypted storage | Compliance violation | Use KMS encryption for all storage |
| No lifecycle policies | Storage costs grow unbounded | Implement retention and deletion policies |
| Missing health checks | Load balancer routes to unhealthy backends | Configure HTTP health checks on /health endpoint |
| No autoscaling | Over-provision or under-provision | Enable cluster and node pool autoscaling |
Principles
This skill embodies:
- #1 Infrastructure as Code - All resources defined in version-controlled Terraform
- #2 High Availability First - Regional redundancy, automatic failover, HA databases
- #3 Security by Default - Private subnets, KMS encryption, workload identity, RLS
- #4 Cost Optimization - Automated analysis identifies savings (commitments, right-sizing)
- #5 Eliminate Ambiguity - Explicit resource configuration; no default assumptions
- #6 Clear, Understandable, Explainable - Well-documented Terraform with output values
- #8 No Assumptions - Validate all infrastructure with health checks and monitoring
- #10 Automation First - Terraform provisions everything; no manual console clicks
- #11 Observability - Logging, monitoring, and alerting configured for all resources
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Integration Points
- k8s-statefulset-patterns - Kubernetes deployments
- deployment-strategy-patterns - Deployment automation
- cicd-automation-patterns - Infrastructure CI/CD