GCP Resource Cleanup Skill

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Automated cleanup of legacy GCP resources based on proven patterns from production deployments.

When to Use This Skill

Use this skill when:

  • You are deploying a new API or service version and need to clean up the old one
  • A sprint has ended and legacy resources need cleanup
  • A cost optimization review has identified orphaned resources
  • Failed deployments have left zombie resources behind
  • You want to cut cleanup time from about 30 minutes to 2 minutes per operation (28 minutes saved)
  • You want to reduce idle spend: this pattern has saved $50-100/month in Cloud Run costs

Don't use this skill when:

  • Resources are less than 7 days old (safety check prevents deletion)
  • Active ingress still references the resource (prevents breaking traffic)
  • Production services have no backup manifests
  • Cost savings are unclear or minimal (< $10/month)

What It Automates

Before: (30+ minutes, 15+ commands)

kubectl get deployments -n coditect-app
kubectl delete deployment OLD-API -n coditect-app
kubectl delete service OLD-API -n coditect-app
gcloud run services list
gcloud run services delete SERVICE-1 --region=us-central1 --quiet
gcloud run services delete SERVICE-2 --region=us-central1 --quiet
# ... repeat 8 times
gcloud run services list # verify
kubectl get deployments -n coditect-app # verify

After: (2 minutes, 1 command)

./core/cleanup.sh --target=legacy-v2 --namespace=coditect-app --dry-run
./core/cleanup.sh --target=legacy-v2 --namespace=coditect-app # execute

Usage

Cleanup Legacy API Version (GKE)

cd .claude/skills/gcp-resource-cleanup
./core/cleanup.sh --target=gke-api --name=coditect-api-v2 --namespace=coditect-app

Cleanup Orphaned Cloud Run Services

./core/cleanup.sh --target=cloud-run-orphans --region=us-central1

Cleanup Old Artifact Registry Images

./core/cleanup.sh --target=images --age-days=30 --keep-count=5

Dry Run (Safe Preview)

./core/cleanup.sh --target=cloud-run-orphans --region=us-central1 --dry-run

Safety Checks

Automatic validations:

  1. ✅ Resource age > 7 days (prevents accidental deletion of new deployments)
  2. ✅ No active ingress references (prevents breaking live traffic)
  3. ✅ No dependent services (checks configmaps, secrets, PVCs)
  4. ✅ Backup manifest creation (enables rollback)
  5. ✅ Dry-run mode (preview before execution)
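As an illustration of check 1, here is a minimal Python sketch of the age rule. It is not taken from core/cleanup.sh; the function name `is_old_enough` is hypothetical, and it assumes the creation time arrives as an RFC 3339 string such as Kubernetes' `metadata.creationTimestamp`.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE_DAYS = 7  # threshold from safety check 1


def is_old_enough(created_at: str, now: datetime, min_age_days: int = MIN_AGE_DAYS) -> bool:
    """Return True only if the resource is strictly older than min_age_days.

    created_at is a UTC timestamp such as "2025-10-01T12:00:00Z".
    """
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return now - created > timedelta(days=min_age_days)


now = datetime(2025, 10, 19, 12, 0, 0, tzinfo=timezone.utc)
print(is_old_enough("2025-10-01T12:00:00Z", now))  # 18 days old -> True
print(is_old_enough("2025-10-15T12:00:00Z", now))  # 4 days old -> False
```

The comparison is strict, so a resource that is exactly 7 days old is still protected, matching the "> 7 days" wording.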

Cost Tracking

Automatic cost calculation:

  • Cloud Run: $0.40 per million requests + idle charges
  • GKE pods: Resource requests × duration × pricing
  • Artifact Registry: Storage costs per GB/month

Example output:

Found 8 Cloud Run services to delete:
- coditect-api-v2 (idle 30d) → ~$5/month
- coditect-frontend (idle 20d) → ~$8/month
...
Total estimated savings: $52/month

Proceed with deletion? [y/N]
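The report above could be produced along the following lines. This is a sketch under assumptions, not the script's actual logic: `request_cost` and `savings_report` are hypothetical names, only the quoted request rate is modeled (idle charges are left out), and the per-service figures are the two visible in the example output.

```python
REQUEST_PRICE_PER_MILLION = 0.40  # USD, the Cloud Run request rate quoted above


def request_cost(requests_per_month: int) -> float:
    """Request charges only; idle charges are not modeled here."""
    return requests_per_month / 1_000_000 * REQUEST_PRICE_PER_MILLION


def savings_report(services: list[tuple[str, int, float]]) -> str:
    """Render (name, idle_days, monthly_cost) tuples in the report layout shown above."""
    lines = [f"Found {len(services)} Cloud Run services to delete:"]
    for name, idle_days, monthly_cost in services:
        lines.append(f"- {name} (idle {idle_days}d) → ~${monthly_cost:.0f}/month")
    total = sum(cost for _, _, cost in services)
    lines.append(f"Total estimated savings: ${total:.0f}/month")
    return "\n".join(lines)


# Only the two services visible in the example output; the others are elided there.
print(savings_report([("coditect-api-v2", 30, 5.0), ("coditect-frontend", 20, 8.0)]))
```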

Implementation

See: core/cleanup.sh for complete implementation

Key functions:

  • cleanup_gke_deployment() - Delete deployment + service + configmap
  • cleanup_cloud_run_orphans() - Detect and delete orphaned Cloud Run services
  • cleanup_old_images() - Remove old/untagged Artifact Registry images
  • verify_safe_to_delete() - Safety checks before deletion
  • create_backup_manifest() - Export resource YAML for rollback
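To make the backup step concrete, the sketch below builds the `kubectl get ... -o yaml` export command for one resource without running it. It is an illustration only: `backup_plan` and the `{kind}-{name}.yaml` file naming are assumptions, not the behavior of create_backup_manifest() in core/cleanup.sh.

```python
from pathlib import Path


def backup_plan(kind: str, name: str, namespace: str, backup_dir: str) -> tuple[list[str], Path]:
    """Return the kubectl export command and the file its output should be saved to."""
    command = ["kubectl", "get", kind, name, "-n", namespace, "-o", "yaml"]
    target = Path(backup_dir) / f"{kind}-{name}.yaml"  # hypothetical naming scheme
    return command, target


command, target = backup_plan(
    "deployment", "coditect-api-v2", "coditect-app", ".coditect/backups/cleanup-20251129"
)
print(" ".join(command), ">", target)
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting, should the caller choose to execute it.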

Validation Checklist

  • Test 1: Dry-run mode shows correct resources
  • Test 2: Age filter works (>7 days only)
  • Test 3: Ingress check prevents breaking live traffic
  • Test 4: Backup manifests created before deletion
  • Test 5: Cost calculation accurate

Metrics

Usage Statistics:

  • Times used: 1 (Oct 19, 2025)
  • Time saved: 28 minutes (30 min → 2 min)
  • Errors prevented: 2 (almost deleted active service)
  • Cost savings: $50-100/month

Success criteria:

  • ✅ Zero accidental deletions of active resources
  • ✅ 90%+ time savings vs manual cleanup
  • ✅ Audit trail created for all deletions

Real-World Example (Oct 19, 2025)

Cleanup legacy V2 API:

# Detected and deleted:
GKE:
- coditect-api-v2 deployment (freed 3 pods)
- coditect-api-v2 service

Cloud Run (8 services):
- coditect-api-v2
- coditect-v5-api (mistaken deployment)
- coditect-frontend
- coditect-frontend-gke
- day2-user-tenant-api
- websocket-gateway
- websocket-gateway-memory-test
- websocket-proxy

Result: Cloud Run empty (0 services), GKE clean
Cost savings: ~$50-100/month

Troubleshooting

Error: "Resource has active ingress"

  • Check: kubectl get ingress --all-namespaces -o yaml | grep RESOURCE_NAME
  • Fix: Update ingress to point to new service, then delete old
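The grep above matches any line containing the name, including longer names that merely start with it. A stricter check can parse the output of `kubectl get ingress --all-namespaces -o json` instead. The sketch below assumes the networking.k8s.io/v1 Ingress layout; the function name and the sample ingress names (`main`, `canary`) are hypothetical.

```python
def ingresses_referencing(ingress_list: dict, service_name: str) -> list[str]:
    """Return "namespace/name" for every Ingress with a backend routing to service_name."""
    matches = []
    for item in ingress_list.get("items", []):
        spec = item.get("spec", {})
        backends = [spec.get("defaultBackend") or {}]
        for rule in spec.get("rules", []):
            for path in (rule.get("http") or {}).get("paths", []):
                backends.append(path.get("backend") or {})
        # Exact name comparison, so "coditect-api-v2-canary" does not match "coditect-api-v2".
        if any((b.get("service") or {}).get("name") == service_name for b in backends):
            meta = item["metadata"]
            matches.append(f"{meta['namespace']}/{meta['name']}")
    return matches


sample = {
    "items": [
        {
            "metadata": {"namespace": "coditect-app", "name": "main"},
            "spec": {"rules": [{"http": {"paths": [
                {"path": "/api", "backend": {"service": {"name": "coditect-api-v2", "port": {"number": 80}}}}
            ]}}]},
        },
        {
            "metadata": {"namespace": "coditect-app", "name": "canary"},
            "spec": {"defaultBackend": {"service": {"name": "coditect-api-v2-canary", "port": {"number": 80}}}},
        },
    ]
}
print(ingresses_referencing(sample, "coditect-api-v2"))  # ['coditect-app/main']
```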

Error: "Resource too recent (< 7 days)"

  • Override: --force-age-check (use with caution!)
  • Reason: Prevents accidental deletion of recent deployments

Error: "Dependent resources found"

  • Check: ConfigMaps, Secrets, PVCs referencing the resource
  • Fix: Delete or update dependents first

See Also

  • deployment-archeology - Find previous successful deployments
  • build-deploy-workflow - Automated build and deployment
  • Cost optimization guide: docs/11-analysis/GCP-COST-OPTIMIZATION.md (to be created)

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: gcp-resource-cleanup

Completed:
- [x] Resource scan complete
- [x] Safety checks passed
- [x] Backup manifests created
- [x] Resources deleted successfully
- [x] Cost savings calculated

Outputs:
- Backup manifests: .coditect/backups/cleanup-{date}/
- Cleanup report: .coditect/reports/cleanup-{date}.md
- Cost savings: ${amount}/month

Resources Deleted:
- GKE deployments: {count}
- GKE services: {count}
- Cloud Run services: {count}
- Artifact images: {count}

Completion Checklist

Before marking this skill as complete, verify:

  • Dry-run mode executed and reviewed
  • All safety checks passed (age > 7 days, no active ingress)
  • Backup manifests created for all resources
  • Resources deleted without errors
  • No dependent services broken
  • Cost savings documented
  • Cleanup report generated
  • All validation steps completed

Failure Indicators

This skill has FAILED if:

  • ❌ Safety check failed (resource too recent, active ingress found)
  • ❌ Backup manifest creation failed
  • ❌ Deletion command returned errors
  • ❌ Dependent resources broke after deletion
  • ❌ Cost calculation incorrect or missing
  • ❌ Resources still exist after deletion
  • ❌ Rollback not possible due to missing backups

When NOT to Use

Do NOT use this skill when:

  • Resources are less than 7 days old (override with --force-age-check only if certain)
  • Active ingress routes reference the resources (will break live traffic)
  • No backup strategy exists (production services without manifests)
  • Cost savings are unclear or minimal (< $10/month - manual cleanup faster)
  • Cleanup is not authorized (require approval for production deletions)
  • You need to clean up a single resource (use kubectl/gcloud directly)
  • Resources are not in GCP (use aws-resource-cleanup or azure-resource-cleanup instead)

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skipping dry-run | Deletes wrong resources | Always run --dry-run first |
| Ignoring safety checks | Breaks live traffic | Review all safety check failures |
| No backup manifests | Cannot rollback | Always create backups before deletion |
| Running in production without approval | Unauthorized deletions | Require approval for prod cleanups |
| Not verifying cost savings | Deleting wrong resources | Calculate and review cost estimates |
| Force-overriding age check | Deletes new deployments | Only override with explicit confirmation |
| Not checking dependent services | Breaks ConfigMaps, Secrets, PVCs | Run dependency check before deletion |
| Deleting all Cloud Run services | Removes active services | Filter by age and usage metrics |

Principles

This skill embodies:

  • #1 Full Automation - Manual 30-minute cleanup → 2-minute automated script
  • #2 Self-Provisioning - Creates backup manifests, generates cost reports automatically
  • #5 Eliminate Ambiguity - Clear safety checks prevent accidental deletions
  • #6 Clear, Understandable, Explainable - Detailed dry-run preview and audit trail
  • #8 No Assumptions - Explicit safety checks for age, ingress, dependencies
  • #10 Fail Closed - Abort on any safety check failure, require explicit override
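The fail-closed principle (#10) can be sketched as follows: a check that returns False, returns anything other than True, or cannot run at all blocks deletion. This is an assumed illustration, not the logic of verify_safe_to_delete(); the names `all_checks_pass` and `ingress_check` are hypothetical.

```python
from typing import Callable


def all_checks_pass(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run every check; a non-True result or an exception both count as failure."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception as exc:  # a check that cannot run must not allow deletion
            ok = False
            name = f"{name} ({exc})"
        if ok is not True:
            failures.append(name)
    return (not failures, failures)


def ingress_check() -> bool:
    # Simulates a check that cannot reach the cluster.
    raise RuntimeError("kubectl unreachable")


ok, failures = all_checks_pass({"age": lambda: True, "ingress": ingress_check})
print(ok, failures)  # False ['ingress (kubectl unreachable)']
```

All checks run even after the first failure, so the operator sees every blocking reason in one pass.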

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Multi-Context Window Support

This skill supports long-running cleanup operations across multiple context windows using Claude 4.5's enhanced state management capabilities.

State Tracking

Checkpoint State (JSON):

{
  "cleanup_id": "cleanup_20251129_150000",
  "cleanup_scope": "legacy_v2_api",
  "phase": "scan_complete",
  "resources_identified": {
    "gke_deployments": 2,
    "gke_services": 2,
    "cloud_run_services": 8,
    "artifact_images": 0
  },
  "resources_deleted": {
    "gke_deployments": 0,
    "gke_services": 0,
    "cloud_run_services": 0
  },
  "estimated_savings_monthly": 52,
  "safety_checks_passed": true,
  "backup_manifests_created": false,
  "token_usage": 4800,
  "created_at": "2025-11-29T15:00:00Z"
}

Progress Notes (Markdown):

# GCP Resource Cleanup Progress - 2025-11-29

## Completed
- ✅ Scanned GKE namespace (coditect-app)
- ✅ Identified legacy V2 resources
  - 2 deployments (coditect-api-v2, old-frontend)
  - 2 services
  - 8 Cloud Run services
- ✅ Estimated savings: $52/month

## In Progress
- Safety checks pending
- Backup manifest creation

## Next Actions
- Create backup manifests for all resources
- Dry-run deletion to verify safety
- Execute deletion if approved
- Monitor for broken dependencies

Session Recovery

When starting a fresh context window after cleanup work:

  1. Load Checkpoint State: Read .coditect/checkpoints/gcp-cleanup-latest.json
  2. Review Progress Notes: Check cleanup-progress.md for scan results
  3. Verify Resources Identified: Re-list resources to confirm still present
  4. Resume Deletion: Continue with pending deletions
  5. Validate Cleanup: Confirm resources deleted and no issues

Recovery Commands:

# 1. Check latest checkpoint
jq '.resources_identified' .coditect/checkpoints/gcp-cleanup-latest.json

# 2. Review progress
tail -25 cleanup-progress.md

# 3. Verify resources still exist
kubectl get deployments -n coditect-app | grep -E "(v2|old)"
gcloud run services list --region=us-central1

# 4. Check backup manifests
ls -la .coditect/backups/cleanup-20251129/

# 5. Check deletion status
jq '.resources_deleted' .coditect/checkpoints/gcp-cleanup-latest.json
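When resuming, the remaining work is the difference between what was identified and what was already deleted. A minimal Python sketch of that comparison follows, using the counts from the checkpoint state shown earlier; `pending_deletions` is a hypothetical helper, not part of the skill.

```python
import json


def pending_deletions(checkpoint: dict) -> dict[str, int]:
    """Identified minus deleted per resource type; a type absent from resources_deleted counts as 0."""
    identified = checkpoint.get("resources_identified", {})
    deleted = checkpoint.get("resources_deleted", {})
    return {kind: count - deleted.get(kind, 0) for kind, count in identified.items()}


checkpoint = json.loads("""
{
  "resources_identified": {"gke_deployments": 2, "gke_services": 2,
                           "cloud_run_services": 8, "artifact_images": 0},
  "resources_deleted": {"gke_deployments": 0, "gke_services": 0, "cloud_run_services": 0}
}
""")
print(pending_deletions(checkpoint))
```

In practice the dictionary would come from reading the checkpoint file rather than an inline string.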

State Management Best Practices

Checkpoint Files (JSON Schema):

  • Store in .coditect/checkpoints/gcp-cleanup-{scope}.json
  • Track resources identified vs deleted separately
  • Record estimated cost savings for reporting
  • Include safety check results

Progress Tracking (Markdown Narrative):

  • Maintain cleanup-progress.md with scan and deletion results
  • Document resources deleted with timestamps
  • Note any errors or warnings during deletion
  • List cost savings achieved

Git Integration:

  • Save backup manifests to .coditect/backups/cleanup-{date}/
  • Create cleanup report in .coditect/reports/cleanup-{date}.md
  • Tag cleanup operations: git tag cleanup-{scope}-{date}

Progress Checkpoints

Natural Breaking Points:

  1. After resource scan complete
  2. After safety checks passed
  3. After backup manifests created
  4. After each resource type deleted (GKE, Cloud Run, Images)
  5. After cleanup validated

Checkpoint Creation Pattern:

# Automatic checkpoint creation after each deletion batch
if deletion_batch_complete or resources_deleted_count % 5 == 0:
    create_checkpoint({
        "cleanup_scope": scope,
        "phase": current_phase,
        "resources_identified": identified_counts,
        "resources_deleted": deleted_counts,
        "tokens": current_token_usage
    })
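A runnable version of this pattern might look like the sketch below. It rests on several assumptions: `should_checkpoint` is a hypothetical helper, the `create_checkpoint` body is invented for illustration, the file name follows the gcp-cleanup-{scope}.json convention described above, and the zero-deletion case is skipped here by choice (the pattern above would also trigger at a count of 0).

```python
import json
import tempfile
from pathlib import Path


def should_checkpoint(batch_complete: bool, deleted_count: int, every: int = 5) -> bool:
    """Checkpoint at the end of a batch, or after every `every` deletions (never at zero)."""
    return batch_complete or (deleted_count > 0 and deleted_count % every == 0)


def create_checkpoint(state: dict, directory: Path) -> Path:
    """Write the state to gcp-cleanup-{scope}.json via a temp file, then rename it into place."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"gcp-cleanup-{state['cleanup_scope']}.json"
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # a reader never sees a half-written checkpoint
    return path


with tempfile.TemporaryDirectory() as root:
    saved = create_checkpoint({"cleanup_scope": "legacy_v2_api", "phase": "scan_complete"}, Path(root))
    print(saved.name, should_checkpoint(False, 5), should_checkpoint(False, 0))
```

Writing to a temporary file first means an interrupted session leaves the previous checkpoint intact.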

Example: Multi-Context Cleanup

Context Window 1: Scan + Safety Checks

{
  "checkpoint_id": "ckpt_cleanup_part1",
  "phase": "safety_checks_complete",
  "resources_identified": {
    "gke": 4,
    "cloud_run": 8,
    "images": 15
  },
  "safety_checks_passed": true,
  "backup_manifests_created": true,
  "next_action": "Execute deletion",
  "token_usage": 4800
}

Context Window 2: Execute Deletion + Validation

# Resume from checkpoint
cat .coditect/checkpoints/ckpt_cleanup_part1.json

# Continue with deletion
# (Context restored in 2 minutes vs 10 minutes from scratch)

{
  "checkpoint_id": "ckpt_cleanup_complete",
  "phase": "cleanup_validated",
  "resources_deleted": {
    "gke": 4,
    "cloud_run": 8,
    "images": 15
  },
  "estimated_savings_realized": 52,
  "no_issues_detected": true,
  "token_usage": 3200
}

Token Savings: 4800 (first context) + 3200 (second context) = 8000 total vs. 14000 without checkpoint = 43% reduction

Reference: See docs/CLAUDE-4.5-BEST-PRACTICES.md for complete multi-context window workflow guidance.