
Kubernetes Troubleshooting Skill

When to Use This Skill

Use this skill when diagnosing and resolving issues in a Kubernetes cluster.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Comprehensive debugging and troubleshooting patterns for Kubernetes clusters including pod failures, networking issues, resource constraints, and cluster health.

Quick Diagnostics

Cluster Health Check

# Overall cluster status
kubectl cluster-info
kubectl get nodes -o wide
kubectl get componentstatuses   # deprecated since v1.19; may return limited data on newer clusters

# Node conditions
kubectl describe nodes | grep -A5 "Conditions:"

# Cluster events (last hour)
kubectl get events --sort-by='.lastTimestamp' -A | tail -50

Namespace Overview

# Pod status summary
kubectl get pods -n <namespace> -o wide

# Resource utilization
kubectl top pods -n <namespace>
kubectl top nodes

# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Pod Troubleshooting

Pod Not Starting

ImagePullBackOff

# Check image pull errors
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events:"

# Verify image exists
docker pull <image>   # verifies the image exists; note this uses your local credentials, not the node's

# Check imagePullSecrets
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'

# Verify secret exists
kubectl get secret <secret> -n <namespace>
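
If the image lives in a private registry, the pod spec must reference a pull secret. A minimal sketch, assuming a secret named regcred (the secret name, registry, and image are placeholders):

```yaml
# Create the secret first (values are placeholders):
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> \
#     --docker-password=<password> -n <namespace>
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  imagePullSecrets:
  - name: regcred              # must exist in the same namespace as the pod
  containers:
  - name: app
    image: <registry>/<image>:<tag>
```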

CrashLoopBackOff

# Check container logs
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -n <namespace> -c <container>

# Check exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State:"

# Common exit codes:
# 0 - Success (but restarted anyway - check liveness probe)
# 1 - Application error
# 137 - OOMKilled (128 + 9)
# 143 - SIGTERM received (128 + 15)
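
The signal arithmetic above can be wrapped in a small helper. A sketch (the wording of each message is illustrative, not kubectl output):

```shell
# Decode a container exit code into a likely cause.
# Exit codes above 128 mean the process was killed by signal (code - 128).
decode_exit_code() {
  local code=$1
  if [ "$code" -eq 0 ]; then
    echo "success (if restarted anyway, check the liveness probe)"
  elif [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (signal 9) - often OOMKilled; confirm with 'Last State'" ;;
      15) echo "SIGTERM (signal 15) - graceful shutdown requested" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit code $code)"
  fi
}

decode_exit_code 137   # SIGKILL (signal 9) - often OOMKilled; confirm with 'Last State'
decode_exit_code 143   # SIGTERM (signal 15) - graceful shutdown requested
```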

Pending State

# Check scheduling issues
kubectl describe pod <pod> -n <namespace> | grep -A20 "Events:"

# Common causes:
# - Insufficient resources
kubectl describe nodes | grep -A5 "Allocated resources:"

# - Node selector mismatch
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels

# - Taints and tolerations
kubectl describe nodes | grep -A3 "Taints:"

# - PVC not bound
kubectl get pvc -n <namespace>
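
The node-selector and taint causes above are fixed in the pod spec. A hedged sketch showing a nodeSelector plus a matching toleration (the label key disktype and taint key dedicated are placeholders; use the values your nodes actually carry):

```yaml
spec:
  nodeSelector:
    disktype: ssd              # must match a label on at least one schedulable node
  tolerations:
  - key: "dedicated"           # must match the node's taint key
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```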

Container Debugging

Exec Into Container

# Interactive shell
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl exec -it <pod> -n <namespace> -c <container> -- /bin/bash

# Run specific command
kubectl exec <pod> -n <namespace> -- cat /etc/config/app.yaml

Debug Container (Ephemeral)

# Add debug container to running pod
kubectl debug -it <pod> -n <namespace> --image=busybox --target=<container>

# Debug with full toolkit
kubectl debug -it <pod> -n <namespace> --image=nicolaka/netshoot --target=<container>

Debug Node

# Create privileged debug pod on node
kubectl debug node/<node> -it --image=busybox

# Access node filesystem (run inside the debug pod; the node root is mounted at /host)
chroot /host

Log Analysis

Pod Logs

# Current logs
kubectl logs <pod> -n <namespace>

# Previous container logs
kubectl logs <pod> -n <namespace> --previous

# Follow logs
kubectl logs <pod> -n <namespace> -f

# Last N lines
kubectl logs <pod> -n <namespace> --tail=100

# Since timestamp
kubectl logs <pod> -n <namespace> --since=1h
kubectl logs <pod> -n <namespace> --since-time='2024-01-01T00:00:00Z'

Multi-Pod Logs with Stern

# Install stern
brew install stern

# Logs from all pods matching pattern
stern <pod-pattern> -n <namespace>

# Logs from specific container
stern <pod-pattern> -n <namespace> -c <container>

# With timestamps
stern <pod-pattern> -n <namespace> -t

# Exclude patterns
stern <pod-pattern> -n <namespace> --exclude "health|ready"

Aggregated Logging

# All pods in deployment
kubectl logs -l app=<app-name> -n <namespace> --all-containers

# Export to file
kubectl logs <pod> -n <namespace> > pod-logs.txt

Networking Troubleshooting

DNS Issues

# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test service DNS
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local
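
The FQDN used above follows a fixed pattern. A small helper sketch, assuming the default cluster domain cluster.local (yours may differ):

```shell
# Build the fully-qualified DNS name for a Service:
# <service>.<namespace>.svc.<cluster-domain>
svc_fqdn() {
  local service=$1 namespace=${2:-default} domain=${3:-cluster.local}
  echo "${service}.${namespace}.svc.${domain}"
}

svc_fqdn backend production   # backend.production.svc.cluster.local
```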

Service Connectivity

# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Verify service selector matches pods
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> -l <label-selector>

# Test service from within cluster
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- curl http://<service>.<namespace>:<port>

Network Policies

# List network policies
kubectl get networkpolicies -n <namespace>

# Describe policy
kubectl describe networkpolicy <policy> -n <namespace>

# Test connectivity with netshoot
kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- bash
# Inside: curl, ping, nc -zv <host> <port>
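
If a policy is blocking traffic, compare it against what the pods actually need. A minimal sketch allowing ingress to backend pods from frontend pods on one port (the app labels and port 8080 are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend             # policy applies to these pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend        # only these pods may connect
    ports:
    - protocol: TCP
      port: 8080
```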

Ingress Debugging

# Check ingress status
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress> -n <namespace>

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc <backend-service> -n <namespace>

Resource Constraints

OOMKilled

# Check memory limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"

# Check actual usage
kubectl top pod <pod> -n <namespace>

# View OOM kill events (the kubelet records these on the Node with reason OOMKilling)
kubectl get events -A --field-selector reason=OOMKilling

CPU Throttling

# Check CPU limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"

# Prometheus query for throttling
# container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total

# Solution: Increase CPU limits or optimize application
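
The ratio of the two counters above gives the throttling percentage. A sketch of the arithmetic (the sample counter values are made up):

```shell
# Compute CPU throttling percentage from the two Prometheus counters:
# throttled periods / total periods * 100. awk handles the floating point.
throttle_pct() {
  local throttled=$1 total=$2
  awk -v t="$throttled" -v n="$total" 'BEGIN { printf "%.1f\n", (n ? t / n * 100 : 0) }'
}

throttle_pct 250 1000   # 25.0 - a quarter of CFS periods were throttled
```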

Resource Requests Optimization

# View actual resource usage
kubectl top pods -n <namespace> --containers

# Compare with requests/limits
kubectl get pods -n <namespace> -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory"
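
Comparing `kubectl top` output against requests requires a common unit. A sketch converting Kubernetes binary memory quantities to bytes (decimal suffixes like M/G are not handled here):

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi) to bytes.
mem_to_bytes() {
  local qty=$1
  case "$qty" in
    *Ki) echo $(( ${qty%Ki} * 1024 )) ;;
    *Mi) echo $(( ${qty%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${qty%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "${qty}" ;;    # assume plain bytes
  esac
}

mem_to_bytes 512Mi   # 536870912
```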

Storage Troubleshooting

PVC Issues

# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>

# Check PV
kubectl get pv
kubectl describe pv <pv>

# Check storage class
kubectl get storageclass

# Common issues:
# - StorageClass not found
# - Provisioner not available
# - Insufficient capacity

Volume Mount Issues

# Check mount status in pod
kubectl exec <pod> -n <namespace> -- df -h
kubectl exec <pod> -n <namespace> -- mount | grep <volume-path>

# Check volume configuration
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.volumes}'
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].volumeMounts}'

Permission Issues

# Check container user
kubectl exec <pod> -n <namespace> -- id

# Check volume permissions
kubectl exec <pod> -n <namespace> -- ls -la <volume-path>

# Fix with securityContext:
# spec:
#   securityContext:
#     fsGroup: 1000
#   containers:
#   - securityContext:
#       runAsUser: 1000
#       runAsGroup: 1000

Probe Failures

Liveness Probe Failures

# Check probe configuration
kubectl describe pod <pod> -n <namespace> | grep -A10 "Liveness:"

# Test probe endpoint manually
kubectl exec <pod> -n <namespace> -- curl -v localhost:8080/health

# Common issues:
# - Endpoint not responding
# - Initial delay too short
# - Timeout too aggressive

Readiness Probe Failures

# Check readiness
kubectl describe pod <pod> -n <namespace> | grep -A10 "Readiness:"

# A pod with a failing readiness probe is removed from service endpoints and receives no traffic
kubectl get endpoints <service> -n <namespace>

Probe Configuration Example

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Deployment Issues

Rolling Update Stuck

# Check rollout status
kubectl rollout status deployment/<deployment> -n <namespace>

# Check deployment conditions
kubectl describe deployment <deployment> -n <namespace> | grep -A10 "Conditions:"

# Check replicaset
kubectl get rs -n <namespace> -l app=<app>

# Force rollback
kubectl rollout undo deployment/<deployment> -n <namespace>

Scaling Issues

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa> -n <namespace>

# Check metrics server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# Manual scale for testing
kubectl scale deployment <deployment> -n <namespace> --replicas=3
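
When describe shows the HPA cannot fetch metrics, verify the target reference and metric spec. A minimal autoscaling/v2 sketch (the deployment name and thresholds are placeholders; CPU-utilization scaling also requires CPU requests on the target's containers):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment>
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out above 70% of requested CPU
```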

Useful Tools

k9s - Terminal UI

# Install
brew install k9s

# Launch
k9s

# Keyboard shortcuts:
# :pod - switch to pods view
# :svc - switch to services view
# :deploy - switch to deployments view
# /pattern - filter
# l - logs
# d - describe
# e - edit
# ctrl+k - kill

kubectx/kubens

# Install
brew install kubectx

# Switch context
kubectx <context>

# Switch namespace
kubens <namespace>

kube-capacity

# Install
kubectl krew install resource-capacity

# View cluster capacity
kubectl resource-capacity

# With utilization
kubectl resource-capacity --util

Diagnostic Scripts

Full Pod Diagnostic

#!/bin/bash
# Diagnose a single pod: description, logs, previous logs, and events.
# Usage: <script> <pod> [namespace]
POD="$1"
NS="${2:-default}"

echo "=== Pod Description ==="
kubectl describe pod "$POD" -n "$NS"

echo "=== Container Logs ==="
kubectl logs "$POD" -n "$NS" --all-containers

echo "=== Previous Logs ==="
kubectl logs "$POD" -n "$NS" --previous 2>/dev/null || echo "No previous logs"

echo "=== Events ==="
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD"

Cluster Health Report

#!/bin/bash
echo "=== Node Status ==="
kubectl get nodes -o wide

echo "=== Pod Status by Namespace ==="
kubectl get pods -A --field-selector status.phase!=Running

echo "=== Recent Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

echo "=== Resource Utilization ==="
kubectl top nodes

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: kubernetes-troubleshooting

Completed:
- [x] Root cause identified
- [x] Diagnostic commands executed
- [x] Logs analyzed and key errors extracted
- [x] Fix applied and verified
- [x] Monitoring configured to prevent recurrence
- [x] Documentation updated with solution

Outputs:
- Root cause analysis report
- Diagnostic command output
- Fix commands/manifests applied
- Monitoring/alerting configuration
- Runbook entry for future reference

Verification:
- Pod/deployment in healthy state
- All containers running without restarts
- Service endpoints healthy
- Application responding correctly
- Resource utilization normal

Completion Checklist

Before marking this skill as complete, verify:

  • Problem clearly identified (pod crash, network, resource, etc.)
  • Diagnostic commands executed and output captured
  • Logs analyzed for error patterns
  • Root cause determined (not just symptoms)
  • Fix applied and tested
  • Application functionality verified
  • Resource metrics normal (CPU, memory, network)
  • No error events in last 15 minutes
  • Monitoring/alerting configured for early detection
  • Runbook documentation updated

Failure Indicators

This skill has FAILED if:

  • ❌ Root cause not identified, only symptoms addressed
  • ❌ Pod still in CrashLoopBackOff or Pending state
  • ❌ Network connectivity still failing
  • ❌ Resource constraints not resolved
  • ❌ Logs show continuing errors
  • ❌ Fix causes new issues in other components
  • ❌ No monitoring to prevent recurrence
  • ❌ Issue returns immediately after "fix"
  • ❌ Destructive action taken without backup

When NOT to Use

Do NOT use this skill when:

  • Kubernetes cluster not yet deployed (use setup guides first)
  • Application code bugs (use application debugging instead)
  • Infrastructure provisioning issues (use cloud-infrastructure-patterns)
  • Non-Kubernetes container issues (use docker-troubleshooting)
  • Security incidents (use security-incident-response)
  • Managed control plane faults (use vendor support for managed Kubernetes issues)

Use alternative approaches for:

  • Initial cluster setup → kubernetes-setup skill
  • Application profiling → application-profiling skill
  • Security audits → kubernetes-security-audit skill
  • Capacity planning → kubernetes-capacity-planning skill
  • CI/CD pipeline issues → cicd-troubleshooting skill

Anti-Patterns (Avoid)

Each anti-pattern below is listed with its problem and the preferred alternative:

  • Deleting a pod to "fix" a crash: doesn't address the root cause. Diagnose with logs/describe first.
  • Increasing resources without profiling: wastes money and masks the real issue. Profile actual usage and optimize the code.
  • No log collection before restart: diagnostic information is lost. Always check logs BEFORE restarting.
  • Editing running pods directly: changes are lost on restart. Edit the deployment/statefulset YAML.
  • Force deleting pods: can cause data loss. Use graceful termination.
  • Ignoring node conditions: node issues affect every pod on the node. Check node health first.
  • No resource limits: leads to OOMKilled pods or noisy neighbors. Set requests and limits.
  • Skipping RBAC checks: permission issues are hard to debug. Verify service account permissions.
  • No backup before changes: cannot roll back. Always snapshot before risky operations.
  • Troubleshooting in production first: causes user impact. Reproduce in staging when possible.

Principles

This skill embodies these CODITECT principles:

  • #8 No Assumptions - Verify with diagnostic commands, don't guess
  • #6 Clear, Understandable, Explainable - Document root cause and solution
  • #10 Iterative Refinement - Narrow down issue systematically
  • #5 Eliminate Ambiguity - Precise error identification, not vague symptoms
  • Safety First - No destructive actions without backups
  • Prevention - Configure monitoring to catch issues early


Usage Examples

Debug Crashing Pod

Apply kubernetes-troubleshooting skill to diagnose CrashLoopBackOff in api-server pod with exit code 137

Network Connectivity Issue

Apply kubernetes-troubleshooting skill to troubleshoot service-to-service communication failure between frontend and backend services

Cluster Performance Analysis

Apply kubernetes-troubleshooting skill to analyze cluster resource utilization and identify bottlenecks