Agent Skills Framework Extension
Cloud Infrastructure Patterns Skill
When to Use This Skill
Use this skill when implementing cloud infrastructure patterns in your codebase.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Multi-cloud infrastructure design with Infrastructure as Code, high availability, and cost optimization.
Core Capabilities
- Infrastructure as Code - Terraform, Pulumi, CloudFormation
- Multi-Cloud Design - GCP, AWS, Azure architecture patterns
- High Availability - Regional redundancy, failover, load balancing
- Cost Optimization - Resource right-sizing, commitment planning
- Security & Compliance - IAM, encryption, audit logging
Cloud Provider Selection Matrix
| Scenario | GCP | AWS | Azure | Recommendation |
|---|---|---|---|---|
| Kubernetes-first | ✅ GKE (best managed K8s) | EKS | AKS | GCP |
| Data analytics/ML | ✅ BigQuery, Vertex AI | Redshift, SageMaker | Synapse, Azure ML | GCP (data), AWS (ML breadth) |
| Enterprise/Microsoft stack | Cloud SQL | RDS | ✅ Azure SQL, AD integration | Azure |
| Serverless functions | Cloud Functions | ✅ Lambda (most mature) | Azure Functions | AWS |
| Global edge/CDN | Cloud CDN | ✅ CloudFront | Azure CDN | AWS |
| Cost sensitivity | Sustained use discounts | ✅ Spot instances (cheapest) | Reserved instances | AWS (spot), GCP (sustained) |
| Compliance (HIPAA/FedRAMP) | ✅ Strong | ✅ Strongest (GovCloud) | ✅ Strong | AWS (GovCloud) |
| Startup credits | Up to ~$100K (Google for Startups) | Up to ~$100K (AWS Activate) | Up to ~$150K (Microsoft for Startups) | Compare current program terms |
Quick Decision:
What's your primary workload?
├── Containers/Kubernetes → GCP (GKE)
├── Serverless/Event-driven → AWS (Lambda)
├── Microsoft/.NET/Enterprise → Azure
├── Data warehouse/Analytics → GCP (BigQuery)
├── Multi-region global app → AWS (most regions)
└── Cost-constrained startup → Compare credits + spot pricing
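The decision tree above can be encoded as a simple lookup. This is purely an illustration of the matrix, not an authoritative ruling; provider strengths shift over time, so treat the workload keys and the mapping itself as assumptions to revisit:

```python
# Hypothetical encoding of the decision tree; keys and recommendations
# mirror the matrix above and should be revisited as offerings change.
RECOMMENDED_PROVIDER = {
    "kubernetes": "GCP",             # GKE
    "serverless": "AWS",             # Lambda
    "microsoft_enterprise": "Azure", # Azure SQL, AD integration
    "data_analytics": "GCP",         # BigQuery
    "global_multi_region": "AWS",    # most regions
}

def recommend_provider(workload: str) -> str:
    """Return the matrix recommendation, or the fallback advice."""
    return RECOMMENDED_PROVIDER.get(workload, "compare credits + spot pricing")

print(recommend_provider("kubernetes"))  # GCP
```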
Terraform Multi-Region GCP Infrastructure
# infrastructure/terraform/main.tf
terraform {
required_version = ">= 1.5"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
backend "gcs" {
bucket = "terraform-state-prod"
prefix = "infrastructure/state"
}
}
# Variables
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "regions" {
description = "Deployment regions"
type = list(string)
default = ["us-central1", "us-east1", "europe-west1"]
}
variable "environment" {
description = "Environment name"
type = string
}
# VPC Network with Multi-Region
resource "google_compute_network" "vpc" {
name = "${var.environment}-vpc"
auto_create_subnetworks = false
project = var.project_id
}
# Subnet in each region
resource "google_compute_subnetwork" "subnets" {
for_each = toset(var.regions)
name = "${var.environment}-subnet-${each.key}"
network = google_compute_network.vpc.id
region = each.key
ip_cidr_range = cidrsubnet("10.0.0.0/16", 8, index(var.regions, each.key))
project = var.project_id
secondary_ip_range {
range_name = "gke-pods"
ip_cidr_range = cidrsubnet("10.1.0.0/16", 8, index(var.regions, each.key))
}
secondary_ip_range {
range_name = "gke-services"
ip_cidr_range = cidrsubnet("10.2.0.0/16", 8, index(var.regions, each.key))
}
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
# Cloud NAT for private instances
resource "google_compute_router" "router" {
for_each = toset(var.regions)
name = "${var.environment}-router-${each.key}"
network = google_compute_network.vpc.id
region = each.key
project = var.project_id
}
resource "google_compute_router_nat" "nat" {
for_each = toset(var.regions)
name = "${var.environment}-nat-${each.key}"
router = google_compute_router.router[each.key].name
region = each.key
nat_ip_allocate_option = "AUTO_ONLY"
source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
project = var.project_id
log_config {
enable = true
filter = "ERRORS_ONLY"
}
}
# GKE Cluster per region
resource "google_container_cluster" "primary" {
for_each = toset(var.regions)
name = "${var.environment}-gke-${each.key}"
location = each.key
project = var.project_id
# Use VPC-native cluster
network = google_compute_network.vpc.id
subnetwork = google_compute_subnetwork.subnets[each.key].id
# Remove default node pool
remove_default_node_pool = true
initial_node_count = 1
# Workload Identity for pod-level IAM
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Network policy
network_policy {
enabled = true
provider = "CALICO" # name a concrete provider when enforcement is enabled; PROVIDER_UNSPECIFIED leaves policy unenforced
}
# IP allocation policy for VPC-native
ip_allocation_policy {
cluster_secondary_range_name = "gke-pods"
services_secondary_range_name = "gke-services"
}
# Master authorized networks
master_authorized_networks_config {
cidr_blocks {
cidr_block = "0.0.0.0/0"
display_name = "All networks (production should restrict)"
}
}
# Private cluster configuration
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = cidrsubnet("172.16.0.0/16", 12, index(var.regions, each.key)) # GKE requires a /28 here
}
# Maintenance window
maintenance_policy {
daily_maintenance_window {
start_time = "03:00"
}
}
# Binary authorization
binary_authorization {
evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
}
# Logging and monitoring
logging_config {
enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
}
monitoring_config {
enable_components = ["SYSTEM_COMPONENTS"]
managed_prometheus {
enabled = true
}
}
}
# GKE Node Pool with autoscaling
resource "google_container_node_pool" "primary_nodes" {
for_each = toset(var.regions)
name = "${var.environment}-node-pool-${each.key}"
location = each.key
cluster = google_container_cluster.primary[each.key].name
node_count = 1
project = var.project_id
autoscaling {
min_node_count = 1
max_node_count = 10
}
node_config {
preemptible = var.environment != "production"
machine_type = "n2-standard-4"
# Workload Identity
workload_metadata_config {
mode = "GKE_METADATA"
}
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
labels = {
environment = var.environment
region = each.key
}
tags = ["gke-node", var.environment]
# Shielded instance
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
# Spot capacity: "spot" is a node_config argument, not a nested block,
# and it conflicts with preemptible above. On provider >= 4.x prefer
# spot = var.environment != "production" in place of preemptible.
}
management {
auto_repair = true
auto_upgrade = true
}
}
# Cloud SQL PostgreSQL with HA
resource "google_sql_database_instance" "postgres" {
name = "${var.environment}-postgres"
database_version = "POSTGRES_15"
region = var.regions[0]
project = var.project_id
settings {
tier = "db-custom-4-16384"
availability_type = "REGIONAL" # HA with failover replica
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
transaction_log_retention_days = 7
backup_retention_settings {
retained_backups = 30
}
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.vpc.id
require_ssl = true
}
database_flags {
name = "max_connections"
value = "200"
}
database_flags {
name = "log_checkpoints"
value = "on"
}
insights_config {
query_insights_enabled = true
query_string_length = 1024
record_application_tags = true
}
maintenance_window {
day = 7 # Sunday
hour = 3
update_track = "stable"
}
}
deletion_protection = var.environment == "production"
}
# Cloud Storage bucket with versioning
resource "google_storage_bucket" "artifacts" {
name = "${var.project_id}-${var.environment}-artifacts"
location = "US"
project = var.project_id
force_destroy = var.environment != "production"
uniform_bucket_level_access = true
versioning {
enabled = true
}
lifecycle_rule {
condition {
age = 90
}
action {
type = "Delete"
}
}
lifecycle_rule {
condition {
num_newer_versions = 3
}
action {
type = "Delete"
}
}
encryption {
default_kms_key_name = google_kms_crypto_key.storage.id
}
}
# Cloud KMS for encryption
resource "google_kms_key_ring" "main" {
name = "${var.environment}-keyring"
location = var.regions[0]
project = var.project_id
}
resource "google_kms_crypto_key" "storage" {
name = "storage-encryption-key"
key_ring = google_kms_key_ring.main.id
rotation_period = "7776000s" # 90 days
lifecycle {
prevent_destroy = true
}
}
# Load Balancer with CDN
resource "google_compute_global_address" "lb_ip" {
name = "${var.environment}-lb-ip"
project = var.project_id
}
resource "google_compute_backend_service" "backend" {
name = "${var.environment}-backend"
protocol = "HTTP"
port_name = "http"
timeout_sec = 30
enable_cdn = true
project = var.project_id
cdn_policy {
cache_mode = "CACHE_ALL_STATIC"
default_ttl = 3600
max_ttl = 86400
negative_caching = true
}
health_checks = [google_compute_health_check.http.id]
log_config {
enable = true
sample_rate = 1.0
}
}
resource "google_compute_health_check" "http" {
name = "${var.environment}-health-check"
check_interval_sec = 10
timeout_sec = 5
project = var.project_id
http_health_check {
port = 8080
request_path = "/health"
}
}
# Outputs
output "vpc_id" {
value = google_compute_network.vpc.id
}
output "gke_clusters" {
value = {
for region, cluster in google_container_cluster.primary :
region => cluster.endpoint
}
sensitive = true
}
output "sql_connection" {
value = google_sql_database_instance.postgres.connection_name
sensitive = true
}
output "lb_ip" {
value = google_compute_global_address.lb_ip.address
}
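Terraform's cidrsubnet(prefix, newbits, netnum) adds newbits to the prefix length and selects the netnum-th resulting block, so the /16 ranges above yield one /24 per region for nodes, pods, and services. A quick sketch with Python's ipaddress module (region list assumed to match the variable default) confirms the allocations never collide:

```python
import ipaddress

def cidrsubnet(prefix: str, newbits: int, netnum: int) -> ipaddress.IPv4Network:
    """Python equivalent of Terraform's cidrsubnet() for IPv4."""
    net = ipaddress.ip_network(prefix)
    return list(net.subnets(prefixlen_diff=newbits))[netnum]

regions = ["us-central1", "us-east1", "europe-west1"]  # variable default
allocations = []
for i, region in enumerate(regions):
    allocations += [
        cidrsubnet("10.0.0.0/16", 8, i),  # node subnet, e.g. 10.0.0.0/24
        cidrsubnet("10.1.0.0/16", 8, i),  # gke-pods secondary range
        cidrsubnet("10.2.0.0/16", 8, i),  # gke-services secondary range
    ]

# No two allocated ranges may overlap
for a in allocations:
    for b in allocations:
        assert a is b or not a.overlaps(b)
print(allocations[0])  # 10.0.0.0/24
```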
Cost Optimization Module
# infrastructure/cost_optimization.py
"""
Cloud cost optimization and resource right-sizing
"""
from dataclasses import dataclass
from typing import Any, Dict, List

import pandas as pd
from google.cloud import billing_v1, monitoring_v3


@dataclass
class ResourceRecommendation:
    """Cost optimization recommendation"""
    resource_type: str
    resource_name: str
    current_cost_monthly: float
    recommended_cost_monthly: float
    savings_monthly: float
    savings_annual: float
    action: str
    confidence: float
    details: Dict[str, Any]


class CostOptimizer:
    """Analyze and optimize cloud infrastructure costs"""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self.billing_client = billing_v1.CloudBillingClient()
        self.monitoring_client = monitoring_v3.QueryServiceClient()

    def analyze_compute_utilization(self) -> List[ResourceRecommendation]:
        """Analyze VM and GKE node utilization"""
        recommendations = []
        # Query CPU utilization for the past 30 days
        query = """
            fetch gce_instance
            | metric 'compute.googleapis.com/instance/cpu/utilization'
            | group_by 1d, [value_utilization_mean: mean(value.utilization)]
            | within 30d
        """
        results = self._execute_mql_query(query)
        for instance in results:
            avg_cpu = instance['cpu_utilization']
            current_machine_type = instance['machine_type']
            current_cost = self._get_machine_type_cost(current_machine_type)
            # Recommend downsizing if average CPU < 30%
            if avg_cpu < 0.30:
                recommended_type = self._downsize_machine_type(current_machine_type)
                recommended_cost = self._get_machine_type_cost(recommended_type)
                savings_monthly = current_cost - recommended_cost
                recommendations.append(ResourceRecommendation(
                    resource_type="compute_instance",
                    resource_name=instance['name'],
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action=f"Downsize from {current_machine_type} to {recommended_type}",
                    confidence=0.9 if avg_cpu < 0.20 else 0.7,
                    details={
                        'avg_cpu_utilization': avg_cpu,
                        'current_machine_type': current_machine_type,
                        'recommended_machine_type': recommended_type,
                    },
                ))
        return recommendations

    def analyze_storage_usage(self) -> List[ResourceRecommendation]:
        """Analyze storage class optimization opportunities"""
        recommendations = []
        # Check for buckets with infrequently accessed data
        query = """
            fetch gcs_bucket
            | metric 'storage.googleapis.com/storage/total_bytes'
            | group_by 7d, [value_total_bytes_mean: mean(value.total_bytes)]
            | within 90d
        """
        # Also query access patterns
        access_query = """
            fetch gcs_bucket
            | metric 'storage.googleapis.com/api/request_count'
            | filter metric.method == 'ReadObject'
            | group_by 30d, [value_request_count_sum: sum(value.request_count)]
            | within 90d
        """
        storage_results = self._execute_mql_query(query)
        access_results = self._execute_mql_query(access_query)
        # Key access counts by bucket name for O(1) lookup below
        access_by_bucket = {
            row['bucket_name']: row.get('request_count', 0)
            for row in access_results
        }
        for bucket in storage_results:
            bucket_name = bucket['bucket_name']
            size_gb = bucket['total_bytes'] / (1024 ** 3)
            access_count = access_by_bucket.get(bucket_name, 0)
            # Recommend Nearline/Coldline for infrequently accessed data
            if access_count < 10 and size_gb > 100:  # < 10 reads in 30 days
                current_cost = size_gb * 0.020      # Standard storage $0.020/GB
                recommended_cost = size_gb * 0.010  # Nearline $0.010/GB
                recommendations.append(ResourceRecommendation(
                    resource_type="gcs_bucket",
                    resource_name=bucket_name,
                    current_cost_monthly=current_cost,
                    recommended_cost_monthly=recommended_cost,
                    savings_monthly=current_cost - recommended_cost,
                    savings_annual=(current_cost - recommended_cost) * 12,
                    action="Migrate to Nearline storage class",
                    confidence=0.85,
                    details={
                        'size_gb': size_gb,
                        'access_count_30d': access_count,
                        'current_class': 'STANDARD',
                        'recommended_class': 'NEARLINE',
                    },
                ))
        return recommendations

    def analyze_commitment_opportunities(self) -> List[ResourceRecommendation]:
        """Identify resources eligible for committed use discounts"""
        recommendations = []
        # Find stable workloads running for > 30 days
        query = """
            fetch gce_instance
            | filter resource.instance_id != ''
            | within 90d
        """
        results = self._execute_mql_query(query)
        for instance in results:
            uptime_days = instance['uptime_days']
            if uptime_days > 30:  # Stable workload
                machine_type = instance['machine_type']
                on_demand_cost = self._get_machine_type_cost(machine_type)
                committed_cost = on_demand_cost * 0.70  # ~30% off, 1-year commitment
                savings_monthly = on_demand_cost - committed_cost
                recommendations.append(ResourceRecommendation(
                    resource_type="commitment",
                    resource_name=instance['name'],
                    current_cost_monthly=on_demand_cost,
                    recommended_cost_monthly=committed_cost,
                    savings_monthly=savings_monthly,
                    savings_annual=savings_monthly * 12,
                    action="Purchase 1-year committed use discount",
                    confidence=0.95,
                    details={
                        'machine_type': machine_type,
                        'uptime_days': uptime_days,
                        'discount_percentage': 30,
                    },
                ))
        return recommendations

    def generate_report(self) -> pd.DataFrame:
        """Generate comprehensive cost optimization report"""
        all_recommendations = []
        all_recommendations.extend(self.analyze_compute_utilization())
        all_recommendations.extend(self.analyze_storage_usage())
        all_recommendations.extend(self.analyze_commitment_opportunities())
        # Sort numerically before formatting; sorting the "$1,234.56"
        # strings would be lexicographic and give the wrong order
        all_recommendations.sort(key=lambda r: r.savings_annual, reverse=True)
        return pd.DataFrame([
            {
                'Resource Type': r.resource_type,
                'Resource Name': r.resource_name,
                'Current Cost/Month': f"${r.current_cost_monthly:,.2f}",
                'Recommended Cost/Month': f"${r.recommended_cost_monthly:,.2f}",
                'Savings/Month': f"${r.savings_monthly:,.2f}",
                'Savings/Year': f"${r.savings_annual:,.2f}",
                'Action': r.action,
                'Confidence': f"{r.confidence * 100:.0f}%",
            }
            for r in all_recommendations
        ])

    def _execute_mql_query(self, query: str) -> List[Dict]:
        """Execute a Monitoring Query Language query"""
        # Simplified - a real implementation would call the Monitoring API
        return []

    def _get_machine_type_cost(self, machine_type: str) -> float:
        """Get approximate monthly cost for a machine type"""
        # Simplified pricing - a real implementation would use the Billing API
        pricing = {
            'n1-standard-1': 24.27,
            'n1-standard-2': 48.55,
            'n2-standard-2': 56.50,
            'n2-standard-4': 113.00,
        }
        return pricing.get(machine_type, 0.0)

    def _downsize_machine_type(self, current: str) -> str:
        """Recommend the next smaller machine type"""
        downsize_map = {
            'n2-standard-4': 'n2-standard-2',
            'n2-standard-2': 'n1-standard-1',
        }
        return downsize_map.get(current, current)
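Since _execute_mql_query is a stub, the reporting pipeline can still be exercised with a hand-built recommendation. A minimal usage sketch; the instance name and cost figures are placeholders, and a trimmed copy of the dataclass is included so the snippet runs standalone:

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ResourceRecommendation:  # trimmed standalone copy of the class above
    resource_type: str
    resource_name: str
    current_cost_monthly: float
    recommended_cost_monthly: float
    savings_monthly: float
    savings_annual: float
    action: str
    confidence: float
    details: Dict[str, Any]

rec = ResourceRecommendation(
    resource_type="compute_instance",
    resource_name="example-vm",      # placeholder name
    current_cost_monthly=113.00,     # assumed n2-standard-4 price
    recommended_cost_monthly=56.50,  # assumed n2-standard-2 price
    savings_monthly=56.50,
    savings_annual=56.50 * 12,
    action="Downsize from n2-standard-4 to n2-standard-2",
    confidence=0.9,
    details={"avg_cpu_utilization": 0.18},
)
print(f"{rec.resource_name}: save ${rec.savings_annual:,.2f}/yr "
      f"(confidence {rec.confidence:.0%})")
# example-vm: save $678.00/yr (confidence 90%)
```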
Usage Examples
Multi-Region Infrastructure
Apply cloud-infrastructure-patterns skill to design GCP multi-region architecture with Terraform
Cost Optimization
Apply cloud-infrastructure-patterns skill to analyze GCP costs and generate optimization recommendations
High Availability Setup
Apply cloud-infrastructure-patterns skill to implement regional HA with automatic failover
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: cloud-infrastructure-patterns
Completed:
- [x] Multi-region VPC with subnets in 3 regions configured
- [x] GKE clusters deployed across all regions with autoscaling
- [x] Cloud SQL PostgreSQL with regional HA and PITR enabled
- [x] Cloud Storage bucket with lifecycle policies and KMS encryption
- [x] Load balancer with CDN and health checks operational
- [x] Cloud NAT for private instance egress configured
Outputs:
- infrastructure/terraform/main.tf (Multi-region infrastructure as code)
- infrastructure/terraform/variables.tf (Configurable parameters)
- infrastructure/terraform/outputs.tf (Resource identifiers and endpoints)
- infrastructure/cost_optimization.py (Cost analysis and recommendations)
- docs/architecture-diagram.png (C4 context and container diagrams)
- docs/runbook.md (Operational procedures)
Infrastructure Metrics:
- High availability: 99.95% uptime SLA (regional failover tested)
- Cost optimization: $12,450/month saved via commitment discounts
- Security: All resources in private subnets, RLS enabled, KMS encrypted
- Scalability: Autoscaling handles 10x traffic spikes
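The savings and scalability figures above are illustrative, but the node-pool autoscaling bounds (min 1, max 10 nodes per region, three regions) do put a hard envelope on compute spend. A back-of-the-envelope sketch, assuming the ~$113/month n2-standard-4 list price used in the pricing table of the cost module:

```python
# Assumed list price for n2-standard-4; use the Billing API or the GCP
# pricing calculator for authoritative figures.
N2_STANDARD_4_MONTHLY = 113.00
REGIONS = 3
MIN_NODES, MAX_NODES = 1, 10  # per-region autoscaling bounds

# Note: for regional GKE clusters node counts apply per zone, so the
# real ceiling can be roughly 3x this per-region estimate.
floor = REGIONS * MIN_NODES * N2_STANDARD_4_MONTHLY
ceiling = REGIONS * MAX_NODES * N2_STANDARD_4_MONTHLY
print(f"compute envelope: ${floor:,.2f} to ${ceiling:,.2f} per month")
# compute envelope: $339.00 to $3,390.00 per month
```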
Completion Checklist
Before marking this skill as complete, verify:
- Terraform configuration validates with terraform validate
- VPC network created with auto_create_subnetworks = false
- Subnets provisioned in all target regions with secondary ranges for GKE
- Cloud NAT configured for private instance internet access
- GKE clusters use VPC-native networking (IP allocation policy)
- GKE workload identity enabled for pod-level IAM
- Node pools configured with autoscaling (min/max nodes)
- Cloud SQL configured with REGIONAL availability (HA replica)
- Point-in-time recovery enabled on Cloud SQL (7-day transaction logs)
- Cloud Storage bucket uses KMS encryption with 90-day key rotation
- Lifecycle policies delete old versions and aged objects
- Load balancer health checks validate backend availability
- CDN enabled for static content with cache policies
- Cost optimization script identifies underutilized resources
- All outputs exist at expected locations and pass validation
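The validation step in the checklist (and plan/apply generally) can be scripted. A minimal sketch; the directory path and variable names assume the layout shown in this skill, and commands are printed rather than executed so the snippet is safe to run:

```python
TF_DIR = "infrastructure/terraform"  # assumed layout from the Outputs list

def terraform_cmd(action: str, **tf_vars: str) -> list:
    """Build a terraform invocation; init/validate take no -var flags."""
    cmd = ["terraform", f"-chdir={TF_DIR}", action]
    if action in ("plan", "apply", "destroy"):
        cmd += [f"-var={k}={v}" for k, v in tf_vars.items()]
    return cmd

for step in ("init", "validate", "plan"):
    print(" ".join(terraform_cmd(step, project_id="my-project",
                                 environment="staging")))
    # pass the list to subprocess.run(cmd, check=True) to actually execute
```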
Failure Indicators
This skill has FAILED if:
- ❌ Terraform apply fails with validation or dependency errors
- ❌ VPC subnets have overlapping CIDR ranges
- ❌ GKE clusters cannot reach internet (Cloud NAT misconfigured)
- ❌ GKE pods cannot pull images (workload identity issue)
- ❌ Node pools do not autoscale under load
- ❌ Cloud SQL configured as ZONAL (no HA failover)
- ❌ Point-in-time recovery disabled (data loss risk)
- ❌ Cloud Storage bucket uses default encryption (not KMS)
- ❌ Lifecycle policies missing (unbounded storage growth)
- ❌ Load balancer health checks always succeed (not validating backend)
- ❌ CDN disabled or misconfigured (high latency for static content)
- ❌ Cost optimization script finds no savings (incomplete analysis)
When NOT to Use
Do NOT use this skill when:
- Single-region deployment sufficient (no multi-region HA required)
- Using serverless platforms (Cloud Run, Cloud Functions) instead of GKE
- Infrastructure already provisioned and managed externally
- Development/staging environments (use simpler, cheaper configurations)
- Prototyping or proof-of-concept (overhead not justified)
- Using managed Kubernetes services from other providers (EKS, AKS)
- Infrastructure managed via UI console (not IaC)
Alternative approaches:
- Single region: Remove region loop, deploy to one region only
- Serverless: Use Cloud Run with auto-scaling and managed load balancing
- Existing infra: Import resources into Terraform state
- Dev/staging: Use preemptible VMs, smaller machine types, disable HA
- Other providers: Adapt patterns to AWS (EKS, RDS) or Azure (AKS, PostgreSQL)
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Manual infrastructure changes | Configuration drift, not reproducible | Always update Terraform, never modify via console |
| No state file locking | Concurrent applies corrupt state | Use GCS backend with state locking |
| Hardcoded values in .tf files | Not reusable across environments | Use variables.tf and terraform.tfvars |
| No cost monitoring | Runaway spend undetected | Enable billing alerts, run cost optimization script |
| Public subnets for all resources | Security risk | Use private subnets + Cloud NAT for internet egress |
| No HA for databases | Single point of failure | Configure REGIONAL availability with replicas |
| Unencrypted storage | Compliance violation | Use KMS encryption for all storage |
| No lifecycle policies | Storage costs grow unbounded | Implement retention and deletion policies |
| Missing health checks | Load balancer routes to unhealthy backends | Configure HTTP health checks on /health endpoint |
| No autoscaling | Over-provision or under-provision | Enable cluster and node pool autoscaling |
Principles
This skill embodies:
- #1 Infrastructure as Code - All resources defined in version-controlled Terraform
- #2 High Availability First - Regional redundancy, automatic failover, HA databases
- #3 Security by Default - Private subnets, KMS encryption, workload identity, RLS
- #4 Cost Optimization - Automated analysis identifies savings (commitments, right-sizing)
- #5 Eliminate Ambiguity - Explicit resource configuration; no default assumptions
- #6 Clear, Understandable, Explainable - Well-documented Terraform with output values
- #8 No Assumptions - Validate all infrastructure with health checks and monitoring
- #10 Automation First - Terraform provisions everything; no manual console clicks
- #11 Observability - Logging, monitoring, and alerting configured for all resources
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Integration Points
- k8s-statefulset-patterns - Kubernetes deployments
- deployment-strategy-patterns - Deployment automation
- cicd-automation-patterns - Infrastructure CI/CD