GCP Resource Cleanup Skill
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Automated cleanup of legacy GCP resources based on proven patterns from production deployments.
When to Use This Skill
✅ Use this skill when:
- Deploying new API/service version and need to clean up old version
- Sprint ends and legacy resources need cleanup
- Cost optimization review identifies orphaned resources
- After failed deployments leave zombie resources
- Manual cleanup takes too long: the script saves about 28 minutes per operation (30 min → 2 min)
- You want a proven pattern: it has saved $50-100/month in Cloud Run costs
❌ Don't use this skill when:
- Resources are less than 7 days old (safety check prevents deletion)
- Active ingress still references the resource (prevents breaking traffic)
- Production services without backup manifests
- Cost savings unclear or minimal (< $10/month)
What It Automates
Before: (30+ minutes, 15+ commands)
kubectl get deployments -n coditect-app
kubectl delete deployment OLD-API -n coditect-app
kubectl delete service OLD-API -n coditect-app
gcloud run services list
gcloud run services delete SERVICE-1 --region=us-central1 --quiet
gcloud run services delete SERVICE-2 --region=us-central1 --quiet
# ... repeat 8 times
gcloud run services list # verify
kubectl get deployments -n coditect-app # verify
After: (2 minutes, 1 command)
./core/cleanup.sh --target=legacy-v2 --namespace=coditect-app --dry-run
./core/cleanup.sh --target=legacy-v2 --namespace=coditect-app # execute
Usage
Cleanup Legacy API Version (GKE)
cd .claude/skills/gcp-resource-cleanup
./core/cleanup.sh --target=gke-api --name=coditect-api-v2 --namespace=coditect-app
Cleanup Orphaned Cloud Run Services
./core/cleanup.sh --target=cloud-run-orphans --region=us-central1
Cleanup Old Artifact Registry Images
./core/cleanup.sh --target=images --age-days=30 --keep-count=5
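The interaction between --age-days and --keep-count is not spelled out above. A minimal sketch of one plausible policy, assuming the newest keep_count images are always kept and only the remainder older than age_days are removed (the function name and the tuple layout are hypothetical, not taken from core/cleanup.sh):

```python
from datetime import datetime, timedelta, timezone


def select_images_to_delete(images, age_days=30, keep_count=5, now=None):
    """Return digests eligible for deletion, newest first.

    `images` is a list of (digest, created_at) tuples with timezone-aware
    datetimes. Assumed policy: the `keep_count` newest images are always
    kept; of the rest, only those strictly older than `age_days` are chosen.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=age_days)
    newest_first = sorted(images, key=lambda item: item[1], reverse=True)
    candidates = newest_first[keep_count:]  # everything past the kept set
    return [digest for digest, created_at in candidates if created_at < cutoff]
```

Under this policy a repository with fewer than keep_count images is never touched, regardless of image age.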
Dry Run (Safe Preview)
./core/cleanup.sh --target=cloud-run-orphans --region=us-central1 --dry-run
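As an illustration of how a dry-run flag can gate destructive commands, the sketch below builds the same gcloud run services delete invocation shown earlier and only executes it when dry_run is False. The function name is hypothetical and this is not the actual logic of core/cleanup.sh:

```python
import shlex
import subprocess


def delete_cloud_run_services(names, region, dry_run=True):
    """Build one delete command per service; run them only if dry_run is False."""
    commands = [
        ["gcloud", "run", "services", "delete", name, f"--region={region}", "--quiet"]
        for name in names
    ]
    for command in commands:
        if dry_run:
            # Preview only: nothing is sent to GCP.
            print("[dry-run] would run:", shlex.join(command))
        else:
            subprocess.run(command, check=True)  # raises on a non-zero exit
    return commands
```

Defaulting dry_run to True means a caller has to opt in to deletion explicitly.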
Safety Checks
Automatic validations:
- ✅ Resource age > 7 days (prevents accidental deletion of new deployments)
- ✅ No active ingress references (prevents breaking live traffic)
- ✅ No dependent services (checks configmaps, secrets, PVCs)
- ✅ Backup manifest creation (enables rollback)
- ✅ Dry-run mode (preview before execution)
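The first three validations above can be sketched as one pure function that collects every reason a resource must not be deleted. The function name and parameters are assumptions for illustration; the failure messages reuse the error strings from the Troubleshooting section:

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(days=7)


def safety_failures(created_at, ingress_backends, dependents, name, now=None):
    """Return a list of reasons why `name` must not be deleted (empty if safe).

    created_at       timezone-aware creation time of the resource
    ingress_backends set of Service names referenced by any Ingress
    dependents       ConfigMaps, Secrets or PVCs that reference the resource
    """
    now = now or datetime.now(timezone.utc)
    failures = []
    if now - created_at <= MIN_AGE:  # the check requires age > 7 days
        failures.append("Resource too recent (< 7 days)")
    if name in ingress_backends:
        failures.append("Resource has active ingress")
    if dependents:
        failures.append("Dependent resources found")
    return failures
```

Returning all failures rather than stopping at the first lets a dry run report every blocker at once; a caller that fails closed would abort whenever the list is non-empty.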
Cost Tracking
Automatic cost calculation:
- Cloud Run: $0.40 per million requests + idle charges
- GKE pods: Resource requests × duration × pricing
- Artifact Registry: Storage costs per GB/month
Example output:
Found 8 Cloud Run services to delete:
- coditect-api-v2 (idle 30d) → ~$5/month
- coditect-frontend (idle 20d) → ~$8/month
...
Total estimated savings: $52/month
Proceed with deletion? [y/N]
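For the GKE line of the cost model (resource requests × duration × pricing), a rough sketch follows. The function name is hypothetical, and the per-hour rates are deliberately left as required arguments because current prices have to come from GCP pricing, not from this document:

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def gke_monthly_cost(cpu_request, memory_gib, replicas, cpu_rate, memory_rate):
    """Estimate the monthly cost of a deployment from its resource requests.

    cpu_rate is the price per vCPU-hour and memory_rate the price per
    GiB-hour; both are supplied by the caller.
    """
    hourly_per_pod = cpu_request * cpu_rate + memory_gib * memory_rate
    return round(hourly_per_pod * HOURS_PER_MONTH * replicas, 2)
```

This only covers requested CPU and memory; it ignores disks, load balancers and any committed-use discounts, so it should be read as an order-of-magnitude estimate.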
Implementation
See: core/cleanup.sh for complete implementation
Key functions:
- cleanup_gke_deployment() - Delete deployment + service + configmap
- cleanup_cloud_run_orphans() - Detect and delete orphaned Cloud Run services
- cleanup_old_images() - Remove old/untagged Artifact Registry images
- verify_safe_to_delete() - Safety checks before deletion
- create_backup_manifest() - Export resource YAML for rollback
Validation Checklist
- Test 1: Dry-run mode shows correct resources
- Test 2: Age filter works (>7 days only)
- Test 3: Ingress check prevents breaking live traffic
- Test 4: Backup manifests created before deletion
- Test 5: Cost calculation accurate
Metrics
Usage Statistics:
- Times used: 1 (Oct 19, 2025)
- Time saved: 28 minutes (30 min → 2 min)
- Errors prevented: 2 (almost deleted active service)
- Cost savings: $50-100/month
Success criteria:
- ✅ Zero accidental deletions of active resources
- ✅ 90%+ time savings vs manual cleanup
- ✅ Audit trail created for all deletions
Real-World Example (Oct 19, 2025)
Cleanup legacy V2 API:
# Detected and deleted:
GKE:
- coditect-api-v2 deployment (freed 3 pods)
- coditect-api-v2 service
Cloud Run (8 services):
- coditect-api-v2
- coditect-v5-api (mistaken deployment)
- coditect-frontend
- coditect-frontend-gke
- day2-user-tenant-api
- websocket-gateway
- websocket-gateway-memory-test
- websocket-proxy
Result: Cloud Run empty (0 services), GKE clean
Cost savings: ~$50-100/month
Troubleshooting
Error: "Resource has active ingress"
- Check: kubectl get ingress --all-namespaces -o yaml | grep RESOURCE_NAME
- Fix: Update ingress to point to new service, then delete old
Error: "Resource too recent (< 7 days)"
- Override: --force-age-check (use with caution!)
- Reason: Prevents accidental deletion of recent deployments
Error: "Dependent resources found"
- Check: ConfigMaps, Secrets, PVCs referencing the resource
- Fix: Delete or update dependents first
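When diagnosing the "active ingress" error, a substring grep can also match unrelated names that merely contain the resource name. A minimal sketch of an exact-match alternative, assuming the input is the parsed output of kubectl get ingress --all-namespaces -o json with networking.k8s.io/v1 Ingress objects (the function name is hypothetical):

```python
def ingress_backend_services(ingress_list):
    """Collect every Service name referenced as a backend by the Ingress
    objects in a parsed `kubectl get ingress -o json` document."""
    names = set()
    for ingress in ingress_list.get("items", []):
        spec = ingress.get("spec", {})
        # Default backend, used when no rule matches.
        default = spec.get("defaultBackend", {}).get("service", {}).get("name")
        if default:
            names.add(default)
        # Per-path backends under each host rule.
        for rule in spec.get("rules", []):
            for path in rule.get("http", {}).get("paths", []):
                name = path.get("backend", {}).get("service", {}).get("name")
                if name:
                    names.add(name)
    return names
```

A resource is then considered referenced only if its exact name is in the returned set.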
See Also
- deployment-archeology - Find previous successful deployments
- build-deploy-workflow - Automated build and deployment
- Cost optimization guide: docs/11-analysis/GCP-COST-OPTIMIZATION.md (to be created)
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: gcp-resource-cleanup
Completed:
- [x] Resource scan complete
- [x] Safety checks passed
- [x] Backup manifests created
- [x] Resources deleted successfully
- [x] Cost savings calculated
Outputs:
- Backup manifests: .coditect/backups/cleanup-{date}/
- Cleanup report: .coditect/reports/cleanup-{date}.md
- Cost savings: ${amount}/month
Resources Deleted:
- GKE deployments: {count}
- GKE services: {count}
- Cloud Run services: {count}
- Artifact images: {count}
Completion Checklist
Before marking this skill as complete, verify:
- Dry-run mode executed and reviewed
- All safety checks passed (age > 7 days, no active ingress)
- Backup manifests created for all resources
- Resources deleted without errors
- No dependent services broken
- Cost savings documented
- Cleanup report generated
- All validation steps completed
Failure Indicators
This skill has FAILED if:
- ❌ Safety check failed (resource too recent, active ingress found)
- ❌ Backup manifest creation failed
- ❌ Deletion command returned errors
- ❌ Dependent resources broke after deletion
- ❌ Cost calculation incorrect or missing
- ❌ Resources still exist after deletion
- ❌ Rollback not possible due to missing backups
When NOT to Use
Do NOT use this skill when:
- Resources are less than 7 days old (override with --force-age-check only if certain)
- Active ingress routes reference the resources (will break live traffic)
- No backup strategy exists (production services without manifests)
- Cost savings are unclear or minimal (< $10/month - manual cleanup faster)
- Cleanup is not authorized (require approval for production deletions)
- You need to clean up a single resource (use kubectl/gcloud directly)
- Resources are not in GCP (use aws-resource-cleanup or azure-resource-cleanup instead)
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skipping dry-run | Deletes wrong resources | Always run --dry-run first |
| Ignoring safety checks | Breaks live traffic | Review all safety check failures |
| No backup manifests | Cannot rollback | Always create backups before deletion |
| Running in production without approval | Unauthorized deletions | Require approval for prod cleanups |
| Not verifying cost savings | Deleting wrong resources | Calculate and review cost estimates |
| Force-overriding age check | Deletes new deployments | Only override with explicit confirmation |
| Not checking dependent services | Breaks ConfigMaps, Secrets, PVCs | Run dependency check before deletion |
| Deleting all Cloud Run services | Removes active services | Filter by age and usage metrics |
Principles
This skill embodies:
- #1 Full Automation - Manual 30-minute cleanup → 2-minute automated script
- #2 Self-Provisioning - Creates backup manifests, generates cost reports automatically
- #5 Eliminate Ambiguity - Clear safety checks prevent accidental deletions
- #6 Clear, Understandable, Explainable - Detailed dry-run preview and audit trail
- #8 No Assumptions - Explicit safety checks for age, ingress, dependencies
- #10 Fail Closed - Abort on any safety check failure, require explicit override
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Multi-Context Window Support
This skill supports long-running cleanup operations across multiple context windows using Claude 4.5's enhanced state management capabilities.
State Tracking
Checkpoint State (JSON):
{
"cleanup_id": "cleanup_20251129_150000",
"cleanup_scope": "legacy_v2_api",
"phase": "scan_complete",
"resources_identified": {
"gke_deployments": 2,
"gke_services": 2,
"cloud_run_services": 8,
"artifact_images": 0
},
"resources_deleted": {
"gke_deployments": 0,
"gke_services": 0,
"cloud_run_services": 0
},
"estimated_savings_monthly": 52,
"safety_checks_passed": true,
"backup_manifests_created": false,
"token_usage": 4800,
"created_at": "2025-11-29T15:00:00Z"
}
Progress Notes (Markdown):
# GCP Resource Cleanup Progress - 2025-11-29
## Completed
- ✅ Scanned GKE namespace (coditect-app)
- ✅ Identified legacy V2 resources
- 2 deployments (coditect-api-v2, old-frontend)
- 2 services
- 8 Cloud Run services
- ✅ Estimated savings: $52/month
## In Progress
- Safety checks pending
- Backup manifest creation
## Next Actions
- Create backup manifests for all resources
- Dry-run deletion to verify safety
- Execute deletion if approved
- Monitor for broken dependencies
Session Recovery
When starting a fresh context window after cleanup work:
- Load Checkpoint State: Read .coditect/checkpoints/gcp-cleanup-latest.json
- Review Progress Notes: Check cleanup-progress.md for scan results
- Verify Resources Identified: Re-list resources to confirm they are still present
- Resume Deletion: Continue with pending deletions
- Validate Cleanup: Confirm resources are deleted and no issues remain
Recovery Commands:
# 1. Check latest checkpoint
cat .coditect/checkpoints/gcp-cleanup-latest.json | jq '.resources_identified'
# 2. Review progress
tail -25 cleanup-progress.md
# 3. Verify resources still exist
kubectl get deployments -n coditect-app | grep -E "(v2|old)"
gcloud run services list --region=us-central1
# 4. Check backup manifests
ls -la .coditect/backups/cleanup-20251129/
# 5. Check deletion status
cat .coditect/checkpoints/gcp-cleanup-latest.json | jq '.resources_deleted'
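To decide what is left to do after a restart, the two counters in the checkpoint can be compared directly. A small sketch, assuming the checkpoint follows the JSON layout shown under State Tracking (the function name is hypothetical):

```python
import json
from pathlib import Path


def pending_deletions(checkpoint_path):
    """Return {resource_type: remaining_count} for types not fully deleted."""
    state = json.loads(Path(checkpoint_path).read_text(encoding="utf-8"))
    identified = state.get("resources_identified", {})
    deleted = state.get("resources_deleted", {})
    remaining = {}
    for kind, count in identified.items():
        left = count - deleted.get(kind, 0)  # missing key means none deleted yet
        if left > 0:
            remaining[kind] = left
    return remaining
```

The counts only say how many resources remain; the live cluster state should still be re-listed with the commands above before resuming, since resources may have changed between sessions.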
State Management Best Practices
Checkpoint Files (JSON Schema):
- Store in .coditect/checkpoints/gcp-cleanup-{scope}.json
- Track resources identified vs deleted separately
- Record estimated cost savings for reporting
- Include safety check results
Progress Tracking (Markdown Narrative):
- Maintain cleanup-progress.md with scan and deletion results
- Document resources deleted with timestamps
- Note any errors or warnings during deletion
- List cost savings achieved
Git Integration:
- Save backup manifests to .coditect/backups/cleanup-{date}/
- Create cleanup report in .coditect/reports/cleanup-{date}.md
- Tag cleanup operations: git tag cleanup-{scope}-{date}
Progress Checkpoints
Natural Breaking Points:
- After resource scan complete
- After safety checks passed
- After backup manifests created
- After each resource type deleted (GKE, Cloud Run, Images)
- After cleanup validated
Checkpoint Creation Pattern:
# Automatic checkpoint creation after each deletion batch
if deletion_batch_complete or resources_deleted_count % 5 == 0:
    create_checkpoint({
        "cleanup_scope": scope,
        "phase": current_phase,
        "resources_identified": identified_counts,
        "resources_deleted": deleted_counts,
        "token_usage": current_token_usage  # same key as the checkpoint JSON
    })
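The pattern above leaves create_checkpoint undefined. One possible shape, shown as a sketch and not as the skill's actual helper: it takes an explicit path argument (an assumption added here) and writes through a temporary file so an interrupted session never leaves a half-written checkpoint.

```python
import json
import os
from datetime import datetime, timezone


def create_checkpoint(state, path):
    """Write `state` to `path` atomically, stamped with the current UTC time."""
    record = dict(state, created_at=datetime.now(timezone.utc).isoformat())
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return record
```

A call such as create_checkpoint(state, ".coditect/checkpoints/gcp-cleanup-latest.json") would match the path used in Session Recovery.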
Example: Multi-Context Cleanup
Context Window 1: Scan + Safety Checks
{
"checkpoint_id": "ckpt_cleanup_part1",
"phase": "safety_checks_complete",
"resources_identified": {
"gke": 4,
"cloud_run": 8,
"images": 15
},
"safety_checks_passed": true,
"backup_manifests_created": true,
"next_action": "Execute deletion",
"token_usage": 4800
}
Context Window 2: Execute Deletion + Validation
# Resume from checkpoint
cat .coditect/checkpoints/ckpt_cleanup_part1.json
# Continue with deletion
# (Context restored in 2 minutes vs 10 minutes from scratch)
{
"checkpoint_id": "ckpt_cleanup_complete",
"phase": "cleanup_validated",
"resources_deleted": {
"gke": 4,
"cloud_run": 8,
"images": 15
},
"estimated_savings_realized": 52,
"no_issues_detected": true,
"token_usage": 3200
}
Token Savings: 4,800 (first context) + 3,200 (second context) = 8,000 total vs. 14,000 without checkpoints, a saving of 6,000 tokens (about 43%)
Reference: See docs/CLAUDE-4.5-BEST-PRACTICES.md for complete multi-context window workflow guidance.