Kubernetes Troubleshooting Skill
When to Use This Skill
Use this skill when diagnosing and resolving issues in Kubernetes clusters and workloads.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Comprehensive debugging and troubleshooting patterns for Kubernetes clusters including pod failures, networking issues, resource constraints, and cluster health.
Quick Diagnostics
Cluster Health Check
# Overall cluster status
kubectl cluster-info
kubectl get nodes -o wide
kubectl get componentstatuses  # deprecated since v1.19; output may be empty or unreliable on newer clusters
# Node conditions
kubectl describe nodes | grep -A5 "Conditions:"
# Recent cluster events (most recent last)
kubectl get events --sort-by='.lastTimestamp' -A | tail -50
Namespace Overview
# Pod status summary
kubectl get pods -n <namespace> -o wide
# Resource utilization
kubectl top pods -n <namespace>
kubectl top nodes
# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Pod Troubleshooting
Pod Not Starting
ImagePullBackOff
# Check image pull errors
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events:"
# Verify image exists
docker pull <image>
# Check imagePullSecrets
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
# Verify secret exists
kubectl get secret <secret> -n <namespace>
CrashLoopBackOff
# Check container logs
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -n <namespace> -c <container>
# Check exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State:"
# Common exit codes:
# 0 - Success (restarted because restartPolicy is Always; also check the liveness probe)
# 1 - Application error
# 137 - OOMKilled (128 + 9)
# 143 - SIGTERM received (128 + 15)
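The 128 + signal arithmetic above can be captured in a small helper. This is a hypothetical convenience function, not part of kubectl:

```shell
#!/bin/bash
# Decode a container exit code into a human-readable cause.
# Codes above 128 mean the process was killed by a signal (code - 128).
decode_exit_code() {
  local code=$1
  if [ "$code" -eq 0 ]; then
    echo "Success (check restartPolicy / liveness probe if it restarted)"
  elif [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case $sig in
      9)  echo "SIGKILL (likely OOMKilled)" ;;
      15) echo "SIGTERM (graceful shutdown requested)" ;;
      *)  echo "Killed by signal $sig" ;;
    esac
  else
    echo "Application error (exit code $code)"
  fi
}

decode_exit_code 137   # SIGKILL (likely OOMKilled)
decode_exit_code 143   # SIGTERM (graceful shutdown requested)
```

Feed it the exit code from `Last State:` in the pod description to get a first hypothesis about the failure.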
Pending State
# Check scheduling issues
kubectl describe pod <pod> -n <namespace> | grep -A20 "Events:"
# Common causes:
# - Insufficient resources
kubectl describe nodes | grep -A5 "Allocated resources:"
# - Node selector mismatch
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels
# - Taints and tolerations
kubectl describe nodes | grep -A3 "Taints:"
# - PVC not bound
kubectl get pvc -n <namespace>
Container Debugging
Exec Into Container
# Interactive shell
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl exec -it <pod> -n <namespace> -c <container> -- /bin/bash
# Run specific command
kubectl exec <pod> -n <namespace> -- cat /etc/config/app.yaml
Debug Container (Ephemeral)
# Add debug container to running pod
kubectl debug -it <pod> -n <namespace> --image=busybox --target=<container>
# Debug with full toolkit
kubectl debug -it <pod> -n <namespace> --image=nicolaka/netshoot --target=<container>
Debug Node
# Create privileged debug pod on node
kubectl debug node/<node> -it --image=busybox
# Access node filesystem
chroot /host
Log Analysis
Pod Logs
# Current logs
kubectl logs <pod> -n <namespace>
# Previous container logs
kubectl logs <pod> -n <namespace> --previous
# Follow logs
kubectl logs <pod> -n <namespace> -f
# Last N lines
kubectl logs <pod> -n <namespace> --tail=100
# Since timestamp
kubectl logs <pod> -n <namespace> --since=1h
kubectl logs <pod> -n <namespace> --since-time='2024-01-01T00:00:00Z'
Multi-Pod Logs with Stern
# Install stern
brew install stern
# Logs from all pods matching pattern
stern <pod-pattern> -n <namespace>
# Logs from specific container
stern <pod-pattern> -n <namespace> -c <container>
# With timestamps
stern <pod-pattern> -n <namespace> -t
# Exclude patterns
stern <pod-pattern> -n <namespace> --exclude "health|ready"
Aggregated Logging
# All pods in deployment
kubectl logs -l app=<app-name> -n <namespace> --all-containers
# Export to file
kubectl logs <pod> -n <namespace> > pod-logs.txt
Networking Troubleshooting
DNS Issues
# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test service DNS
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local
Service Connectivity
# Check service endpoints
kubectl get endpoints <service> -n <namespace>
# Verify service selector matches pods
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> -l <label-selector>
# Test service from within cluster
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- curl http://<service>.<namespace>:<port>
Network Policies
# List network policies
kubectl get networkpolicies -n <namespace>
# Describe policy
kubectl describe networkpolicy <policy> -n <namespace>
# Test connectivity with netshoot
kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- bash
# Inside: curl, ping, nc -zv <host> <port>
Ingress Debugging
# Check ingress status
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress> -n <namespace>
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Verify backend service
kubectl get svc <backend-service> -n <namespace>
Resource Constraints
OOMKilled
# Check memory limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"
# Check actual usage
kubectl top pod <pod> -n <namespace>
# View OOM-related events (the exact reason string varies by platform, so grep is more reliable than a field selector)
kubectl get events -A | grep -i oom
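The usual fix is to set explicit memory requests and limits on the container. A minimal sketch with illustrative values (size them from observed `kubectl top` usage, not from these numbers):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this triggers an OOM kill (exit 137)
    cpu: "500m"
```

A limit that is too low relative to real usage simply turns a slow leak into a restart loop, so verify actual usage first.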
CPU Throttling
# Check CPU limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"
# Prometheus query for throttling
# container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# Solution: Increase CPU limits or optimize application
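The same throttling ratio that the Prometheus counters above express can be read from cgroup v2 inside a running container (`/sys/fs/cgroup/cpu.stat`). A sketch; the parsing function is the portable, testable part, and the file path assumes cgroup v2:

```shell
#!/bin/bash
# Compute the percentage of CFS periods in which the container was throttled,
# given the contents of a cgroup v2 cpu.stat file.
throttle_ratio() {
  local stat="$1"   # e.g. "$(cat /sys/fs/cgroup/cpu.stat)" inside the container
  local periods throttled
  periods=$(echo "$stat" | awk '/^nr_periods/ {print $2}')
  throttled=$(echo "$stat" | awk '/^nr_throttled/ {print $2}')
  if [ -z "$periods" ] || [ "$periods" -eq 0 ]; then
    echo "0"
    return
  fi
  # Integer percentage is enough for a quick diagnosis
  echo $((100 * throttled / periods))
}

# Example with sample cpu.stat content:
sample='usage_usec 1000000
nr_periods 400
nr_throttled 100
throttled_usec 250000'
throttle_ratio "$sample"   # prints 25
```

Anything consistently above roughly 25% is worth investigating: either raise the CPU limit or reduce the application's burstiness.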
Resource Requests Optimization
# View actual resource usage
kubectl top pods -n <namespace> --containers
# Compare with requests/limits
kubectl get pods -n <namespace> -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory"
Storage Troubleshooting
PVC Issues
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>
# Check PV
kubectl get pv
kubectl describe pv <pv>
# Check storage class
kubectl get storageclass
# Common issues:
# - StorageClass not found
# - Provisioner not available
# - Insufficient capacity
Volume Mount Issues
# Check mount status in pod
kubectl exec <pod> -n <namespace> -- df -h
kubectl exec <pod> -n <namespace> -- mount | grep <volume-path>
# Check volume configuration
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.volumes}'
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].volumeMounts}'
Permission Issues
# Check container user
kubectl exec <pod> -n <namespace> -- id
# Check volume permissions
kubectl exec <pod> -n <namespace> -- ls -la <volume-path>
# Fix with securityContext (illustrative UID/GID values):
```yaml
spec:
  securityContext:
    fsGroup: 1000        # volume files are group-owned by this GID
  containers:
  - securityContext:
      runAsUser: 1000
      runAsGroup: 1000
```
Probe Failures
Liveness Probe Failures
# Check probe configuration
kubectl describe pod <pod> -n <namespace> | grep -A10 "Liveness:"
# Test probe endpoint manually
kubectl exec <pod> -n <namespace> -- curl -v localhost:8080/health
# Common issues:
# - Endpoint not responding
# - Initial delay too short
# - Timeout too aggressive
Readiness Probe Failures
# Check readiness
kubectl describe pod <pod> -n <namespace> | grep -A10 "Readiness:"
# Pod not receiving traffic if readiness fails
kubectl get endpoints <service> -n <namespace>
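When readiness fails, the pod's IP is removed from the service's endpoints. A hypothetical helper that counts ready addresses from a jsonpath dump (the jsonpath expression emits ready pod IPs space-separated):

```shell
#!/bin/bash
# Count ready endpoint addresses for a service.
# Pass the output of:
#   kubectl get endpoints <service> -n <namespace> \
#     -o jsonpath='{.subsets[*].addresses[*].ip}'
ready_endpoints() {
  local ips="$1"
  echo "$ips" | wc -w | tr -d ' '   # tr strips BSD wc padding
}

ready_endpoints "10.0.1.5 10.0.2.7"   # prints 2
ready_endpoints ""                    # prints 0
```

Zero ready endpoints with running pods almost always means a failing readiness probe or a selector/label mismatch.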
Probe Configuration Example
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
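A quick sanity check on those numbers: the worst-case time from container start to a liveness-triggered restart is roughly initialDelaySeconds plus one full period per allowed failure. A trivial sketch of that arithmetic:

```shell
#!/bin/bash
# Approximate worst-case seconds before a liveness-failing container
# is restarted: initial delay plus failureThreshold full periods.
probe_restart_seconds() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  echo $((initial_delay + failure_threshold * period))
}

# For the liveness settings above (30s delay, 10s period, 3 failures):
probe_restart_seconds 30 10 3   # prints 60
```

If the application legitimately needs longer than this window to warm up, raise initialDelaySeconds (or use a startup probe) rather than loosening the liveness thresholds.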
Deployment Issues
Rolling Update Stuck
# Check rollout status
kubectl rollout status deployment/<deployment> -n <namespace>
# Check deployment conditions
kubectl describe deployment <deployment> -n <namespace> | grep -A10 "Conditions:"
# Check replicaset
kubectl get rs -n <namespace> -l app=<app>
# Force rollback
kubectl rollout undo deployment/<deployment> -n <namespace>
Scaling Issues
# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa> -n <namespace>
# Check metrics server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
# Manual scale for testing
kubectl scale deployment <deployment> -n <namespace> --replicas=3
Useful Tools
k9s - Terminal UI
# Install
brew install k9s
# Launch
k9s
# Keyboard shortcuts:
# :pod - switch to pods view
# :svc - switch to services view
# :deploy - switch to deployments view
# /pattern - filter
# l - logs
# d - describe
# e - edit
# ctrl+k - kill
kubectx/kubens
# Install
brew install kubectx
# Switch context
kubectx <context>
# Switch namespace
kubens <namespace>
kube-capacity
# Install
kubectl krew install resource-capacity
# View cluster capacity
kubectl resource-capacity
# With utilization
kubectl resource-capacity --util
Diagnostic Scripts
Full Pod Diagnostic
```bash
#!/bin/bash
POD=$1
NS=${2:-default}
[ -z "$POD" ] && { echo "Usage: $0 <pod> [namespace]"; exit 1; }
echo "=== Pod Description ==="
kubectl describe pod "$POD" -n "$NS"
echo "=== Container Logs ==="
kubectl logs "$POD" -n "$NS" --all-containers
echo "=== Previous Logs ==="
kubectl logs "$POD" -n "$NS" --previous 2>/dev/null || echo "No previous logs"
echo "=== Events ==="
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD"
```
Cluster Health Report
```bash
#!/bin/bash
echo "=== Node Status ==="
kubectl get nodes -o wide
echo "=== Problem Pods by Namespace ==="
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded
echo "=== Recent Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
echo "=== Resource Utilization ==="
kubectl top nodes
```
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: kubernetes-troubleshooting
Completed:
- [x] Root cause identified
- [x] Diagnostic commands executed
- [x] Logs analyzed and key errors extracted
- [x] Fix applied and verified
- [x] Monitoring configured to prevent recurrence
- [x] Documentation updated with solution
Outputs:
- Root cause analysis report
- Diagnostic command output
- Fix commands/manifests applied
- Monitoring/alerting configuration
- Runbook entry for future reference
Verification:
- Pod/deployment in healthy state
- All containers running without restarts
- Service endpoints healthy
- Application responding correctly
- Resource utilization normal
Completion Checklist
Before marking this skill as complete, verify:
- Problem clearly identified (pod crash, network, resource, etc.)
- Diagnostic commands executed and output captured
- Logs analyzed for error patterns
- Root cause determined (not just symptoms)
- Fix applied and tested
- Application functionality verified
- Resource metrics normal (CPU, memory, network)
- No error events in last 15 minutes
- Monitoring/alerting configured for early detection
- Runbook documentation updated
Failure Indicators
This skill has FAILED if:
- ❌ Root cause not identified, only symptoms addressed
- ❌ Pod still in CrashLoopBackOff or Pending state
- ❌ Network connectivity still failing
- ❌ Resource constraints not resolved
- ❌ Logs show continuing errors
- ❌ Fix causes new issues in other components
- ❌ No monitoring to prevent recurrence
- ❌ Issue returns immediately after "fix"
- ❌ Destructive action taken without backup
When NOT to Use
Do NOT use this skill when:
- Kubernetes cluster not yet deployed (use setup guides first)
- Application code bugs (use application debugging instead)
- Infrastructure provisioning issues (use cloud-infrastructure-patterns)
- Non-Kubernetes container issues (use docker-troubleshooting)
- Security incidents (use security-incident-response)
- Managed control-plane issues (use your cloud vendor's support channel)
Use alternative approaches for:
- Initial cluster setup → kubernetes-setup skill
- Application profiling → application-profiling skill
- Security audits → kubernetes-security-audit skill
- Capacity planning → kubernetes-capacity-planning skill
- CI/CD pipeline issues → cicd-troubleshooting skill
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| kubectl delete pod to "fix" crash | Doesn't address root cause | Diagnose with logs/describe first |
| Increasing resources without profiling | Wastes money, masks real issue | Profile actual usage, optimize code |
| No log collection before restart | Lose diagnostic information | Always check logs BEFORE restarting |
| Editing running pods directly | Changes lost on restart | Edit deployment/statefulset YAML |
| Force deleting pods | Can cause data loss | Use graceful termination |
| Ignoring node conditions | Node issues affect all pods | Check node health first |
| No resource limits | OOMKilled or noisy neighbors | Set requests and limits |
| Skipping RBAC checks | Permission issues hard to debug | Verify service account permissions |
| No backup before changes | Cannot rollback | Always snapshot before risky operations |
| Troubleshooting in production first | User impact | Reproduce in staging when possible |
Principles
This skill embodies these CODITECT principles:
- #8 No Assumptions - Verify with diagnostic commands, don't guess
- #6 Clear, Understandable, Explainable - Document root cause and solution
- #10 Iterative Refinement - Narrow down issue systematically
- #5 Eliminate Ambiguity - Precise error identification, not vague symptoms
- Safety First - No destructive actions without backups
- Prevention - Configure monitoring to catch issues early
Usage Examples
Debug Crashing Pod
Apply kubernetes-troubleshooting skill to diagnose CrashLoopBackOff in api-server pod with exit code 137
Network Connectivity Issue
Apply kubernetes-troubleshooting skill to troubleshoot service-to-service communication failure between frontend and backend services
Cluster Performance Analysis
Apply kubernetes-troubleshooting skill to analyze cluster resource utilization and identify bottlenecks