Kubernetes Troubleshooting Skill
When to Use This Skill
Use this skill when diagnosing and resolving issues in Kubernetes clusters and workloads.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Comprehensive debugging and troubleshooting patterns for Kubernetes clusters including pod failures, networking issues, resource constraints, and cluster health.
Quick Diagnostics
Cluster Health Check
# Overall cluster status
kubectl cluster-info
kubectl get nodes -o wide
kubectl get componentstatuses  # deprecated since v1.19; output may be empty or unreliable on newer clusters
# Node conditions
kubectl describe nodes | grep -A5 "Conditions:"
# Recent cluster events (most recent last)
kubectl get events --sort-by='.lastTimestamp' -A | tail -50
Namespace Overview
# Pod status summary
kubectl get pods -n <namespace> -o wide
# Resource utilization
kubectl top pods -n <namespace>
kubectl top nodes
# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Pod Troubleshooting
Pod Not Starting
ImagePullBackOff
# Check image pull errors
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events:"
# Verify image exists
docker pull <image>
# Check imagePullSecrets
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
# Verify secret exists
kubectl get secret <secret> -n <namespace>
CrashLoopBackOff
# Check container logs
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -n <namespace> -c <container>
# Check exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State:"
# Common exit codes:
# 0 - Success (restarted because restartPolicy is Always; also check the liveness probe)
# 1 - Application error
# 137 - OOMKilled (128 + 9)
# 143 - SIGTERM received (128 + 15)
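The 128 + signal arithmetic above can be captured in a small helper. This is a hypothetical convenience function, not part of kubectl:

```shell
#!/bin/bash
# Decode a container exit code into a human-readable cause.
# Codes above 128 mean the process was killed by a signal (code - 128).
decode_exit_code() {
  local code=$1
  if [ "$code" -eq 0 ]; then
    echo "Success (check restartPolicy / liveness probe if it restarted)"
  elif [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case $sig in
      9)  echo "SIGKILL (likely OOMKilled)" ;;
      15) echo "SIGTERM (graceful shutdown requested)" ;;
      *)  echo "Killed by signal $sig" ;;
    esac
  else
    echo "Application error (exit code $code)"
  fi
}

decode_exit_code 137   # SIGKILL (likely OOMKilled)
decode_exit_code 143   # SIGTERM (graceful shutdown requested)
```

Feed it the exit code from `Last State:` in the pod description to get a first hypothesis about the failure.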
Pending State
# Check scheduling issues
kubectl describe pod <pod> -n <namespace> | grep -A20 "Events:"
# Common causes:
# - Insufficient resources
kubectl describe nodes | grep -A5 "Allocated resources:"
# - Node selector mismatch
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels
# - Taints and tolerations
kubectl describe nodes | grep -A3 "Taints:"
# - PVC not bound
kubectl get pvc -n <namespace>
Container Debugging
Exec Into Container
# Interactive shell
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl exec -it <pod> -n <namespace> -c <container> -- /bin/bash
# Run specific command
kubectl exec <pod> -n <namespace> -- cat /etc/config/app.yaml
Debug Container (Ephemeral)
# Add debug container to running pod
kubectl debug -it <pod> -n <namespace> --image=busybox --target=<container>
# Debug with full toolkit
kubectl debug -it <pod> -n <namespace> --image=nicolaka/netshoot --target=<container>
Debug Node
# Create privileged debug pod on node
kubectl debug node/<node> -it --image=busybox
# Access node filesystem
chroot /host
Log Analysis
Pod Logs
# Current logs
kubectl logs <pod> -n <namespace>
# Previous container logs
kubectl logs <pod> -n <namespace> --previous
# Follow logs
kubectl logs <pod> -n <namespace> -f
# Last N lines
kubectl logs <pod> -n <namespace> --tail=100
# Since timestamp
kubectl logs <pod> -n <namespace> --since=1h
kubectl logs <pod> -n <namespace> --since-time='2024-01-01T00:00:00Z'
Multi-Pod Logs with Stern
# Install stern
brew install stern
# Logs from all pods matching pattern
stern <pod-pattern> -n <namespace>
# Logs from specific container
stern <pod-pattern> -n <namespace> -c <container>
# With timestamps
stern <pod-pattern> -n <namespace> -t
# Exclude patterns
stern <pod-pattern> -n <namespace> --exclude "health|ready"
Aggregated Logging
# All pods in deployment
kubectl logs -l app=<app-name> -n <namespace> --all-containers
# Export to file
kubectl logs <pod> -n <namespace> > pod-logs.txt
Networking Troubleshooting
DNS Issues
# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test service DNS
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local
Service Connectivity
# Check service endpoints
kubectl get endpoints <service> -n <namespace>
# Verify service selector matches pods
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> -l <label-selector>
# Test service from within cluster
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- curl http://<service>.<namespace>:<port>
Network Policies
# List network policies
kubectl get networkpolicies -n <namespace>
# Describe policy
kubectl describe networkpolicy <policy> -n <namespace>
# Test connectivity with netshoot
kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- bash
# Inside: curl, ping, nc -zv <host> <port>
Ingress Debugging
# Check ingress status
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress> -n <namespace>
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Verify backend service
kubectl get svc <backend-service> -n <namespace>
Resource Constraints
OOMKilled
# Check memory limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"
# Check actual usage
kubectl top pod <pod> -n <namespace>
# View OOM-related events (the exact reason string varies by platform, so grep is more reliable than a field selector)
kubectl get events -A | grep -i oom
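The usual fix is to set explicit memory requests and limits on the container. A minimal sketch with illustrative values (size them from observed `kubectl top` usage, not from these numbers):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this triggers an OOM kill (exit 137)
    cpu: "500m"
```

A limit that is too low relative to real usage simply turns a slow leak into a restart loop, so verify actual usage first.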
CPU Throttling
# Check CPU limits
kubectl describe pod <pod> -n <namespace> | grep -A5 "Limits:"
# Prometheus query for throttling
# container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# Solution: Increase CPU limits or optimize application
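The same throttling ratio that the Prometheus counters above express can be read from cgroup v2 inside a running container (`/sys/fs/cgroup/cpu.stat`). A sketch; the parsing function is the portable, testable part, and the file path assumes cgroup v2:

```shell
#!/bin/bash
# Compute the percentage of CFS periods in which the container was throttled,
# given the contents of a cgroup v2 cpu.stat file.
throttle_ratio() {
  local stat="$1"   # e.g. "$(cat /sys/fs/cgroup/cpu.stat)" inside the container
  local periods throttled
  periods=$(echo "$stat" | awk '/^nr_periods/ {print $2}')
  throttled=$(echo "$stat" | awk '/^nr_throttled/ {print $2}')
  if [ -z "$periods" ] || [ "$periods" -eq 0 ]; then
    echo "0"
    return
  fi
  # Integer percentage is enough for a quick diagnosis
  echo $((100 * throttled / periods))
}

# Example with sample cpu.stat content:
sample='usage_usec 1000000
nr_periods 400
nr_throttled 100
throttled_usec 250000'
throttle_ratio "$sample"   # prints 25
```

Anything consistently above roughly 25% is worth investigating: either raise the CPU limit or reduce the application's burstiness.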
Resource Requests Optimization
# View actual resource usage
kubectl top pods -n <namespace> --containers
# Compare with requests/limits
kubectl get pods -n <namespace> -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory"
Storage Troubleshooting
PVC Issues
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>
# Check PV
kubectl get pv
kubectl describe pv <pv>
# Check storage class
kubectl get storageclass
# Common issues:
# - StorageClass not found
# - Provisioner not available
# - Insufficient capacity
Volume Mount Issues
# Check mount status in pod
kubectl exec <pod> -n <namespace> -- df -h
kubectl exec <pod> -n <namespace> -- mount | grep <volume-path>
# Check volume configuration
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.volumes}'
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].volumeMounts}'
Permission Issues
# Check container user
kubectl exec <pod> -n <namespace> -- id
# Check volume permissions
kubectl exec <pod> -n <namespace> -- ls -la <volume-path>
# Fix with securityContext (illustrative UID/GID values):
```yaml
spec:
  securityContext:
    fsGroup: 1000        # volume files are group-owned by this GID
  containers:
  - securityContext:
      runAsUser: 1000
      runAsGroup: 1000
```
Probe Failures
Liveness Probe Failures
# Check probe configuration
kubectl describe pod <pod> -n <namespace> | grep -A10 "Liveness:"
# Test probe endpoint manually
kubectl exec <pod> -n <namespace> -- curl -v localhost:8080/health
# Common issues:
# - Endpoint not responding
# - Initial delay too short
# - Timeout too aggressive
Readiness Probe Failures
# Check readiness
kubectl describe pod <pod> -n <namespace> | grep -A10 "Readiness:"
# Pod not receiving traffic if readiness fails
kubectl get endpoints <service> -n <namespace>
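When readiness fails, the pod's IP is removed from the service's endpoints. A hypothetical helper that counts ready addresses from a jsonpath dump (the jsonpath expression emits ready pod IPs space-separated):

```shell
#!/bin/bash
# Count ready endpoint addresses for a service.
# Pass the output of:
#   kubectl get endpoints <service> -n <namespace> \
#     -o jsonpath='{.subsets[*].addresses[*].ip}'
ready_endpoints() {
  local ips="$1"
  echo "$ips" | wc -w | tr -d ' '   # tr strips BSD wc padding
}

ready_endpoints "10.0.1.5 10.0.2.7"   # prints 2
ready_endpoints ""                    # prints 0
```

Zero ready endpoints with running pods almost always means a failing readiness probe or a selector/label mismatch.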
Probe Configuration Example
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
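A quick sanity check on those numbers: the worst-case time from container start to a liveness-triggered restart is roughly initialDelaySeconds plus one full period per allowed failure. A trivial sketch of that arithmetic:

```shell
#!/bin/bash
# Approximate worst-case seconds before a liveness-failing container
# is restarted: initial delay plus failureThreshold full periods.
probe_restart_seconds() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  echo $((initial_delay + failure_threshold * period))
}

# For the liveness settings above (30s delay, 10s period, 3 failures):
probe_restart_seconds 30 10 3   # prints 60
```

If the application legitimately needs longer than this window to warm up, raise initialDelaySeconds (or use a startup probe) rather than loosening the liveness thresholds.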
Deployment Issues
Rolling Update Stuck
# Check rollout status
kubectl rollout status deployment/<deployment> -n <namespace>
# Check deployment conditions
kubectl describe deployment <deployment> -n <namespace> | grep -A10 "Conditions:"
# Check replicaset
kubectl get rs -n <namespace> -l app=<app>
# Force rollback
kubectl rollout undo deployment/<deployment> -n <namespace>
Scaling Issues
# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa> -n <namespace>
# Check metrics server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
# Manual scale for testing
kubectl scale deployment <deployment> -n <namespace> --replicas=3
Useful Tools
k9s - Terminal UI
# Install
brew install k9s
# Launch
k9s
# Keyboard shortcuts:
# :pod - switch to pods view
# :svc - switch to services view
# :deploy - switch to deployments view
# /pattern - filter
# l - logs
# d - describe
# e - edit
# ctrl+k - kill
kubectx/kubens
# Install
brew install kubectx
# Switch context
kubectx <context>
# Switch namespace
kubens <namespace>
kube-capacity
# Install
kubectl krew install resource-capacity
# View cluster capacity
kubectl resource-capacity
# With utilization
kubectl resource-capacity --util
Diagnostic Scripts
Full Pod Diagnostic
```bash
#!/bin/bash
POD=$1
NS=${2:-default}
[ -z "$POD" ] && { echo "Usage: $0 <pod> [namespace]"; exit 1; }
echo "=== Pod Description ==="
kubectl describe pod "$POD" -n "$NS"
echo "=== Container Logs ==="
kubectl logs "$POD" -n "$NS" --all-containers
echo "=== Previous Logs ==="
kubectl logs "$POD" -n "$NS" --previous 2>/dev/null || echo "No previous logs"
echo "=== Events ==="
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD"
```
Cluster Health Report
```bash
#!/bin/bash
echo "=== Node Status ==="
kubectl get nodes -o wide
echo "=== Problem Pods by Namespace ==="
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded
echo "=== Recent Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
echo "=== Resource Utilization ==="
kubectl top nodes
```
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: kubernetes-troubleshooting
Completed:
- [x] Root cause identified
- [x] Diagnostic commands executed
- [x] Logs analyzed and key errors extracted
- [x] Fix applied and verified
- [x] Monitoring configured to prevent recurrence
- [x] Documentation updated with solution
Outputs:
- Root cause analysis report
- Diagnostic command output
- Fix commands/manifests applied
- Monitoring/alerting configuration
- Runbook entry for future reference
Verification:
- Pod/deployment in healthy state
- All containers running without restarts
- Service endpoints healthy
- Application responding correctly
- Resource utilization normal
Completion Checklist
Before marking this skill as complete, verify:
- Problem clearly identified (pod crash, network, resource, etc.)
- Diagnostic commands executed and output captured
- Logs analyzed for error patterns
- Root cause determined (not just symptoms)
- Fix applied and tested
- Application functionality verified
- Resource metrics normal (CPU, memory, network)
- No error events in last 15 minutes
- Monitoring/alerting configured for early detection
- Runbook documentation updated
Failure Indicators
This skill has FAILED if:
- ❌ Root cause not identified, only symptoms addressed
- ❌ Pod still in CrashLoopBackOff or Pending state
- ❌ Network connectivity still failing
- ❌ Resource constraints not resolved
- ❌ Logs show continuing errors
- ❌ Fix causes new issues in other components
- ❌ No monitoring to prevent recurrence
- ❌ Issue returns immediately after "fix"
- ❌ Destructive action taken without backup
When NOT to Use
Do NOT use this skill when:
- Kubernetes cluster not yet deployed (use setup guides first)
- Application code bugs (use application debugging instead)
- Infrastructure provisioning issues (use cloud-infrastructure-patterns)
- Non-Kubernetes container issues (use docker-troubleshooting)
- Security incidents (use security-incident-response)
- Managed control-plane issues (use your cloud vendor's support channel)
Use alternative approaches for:
- Initial cluster setup → kubernetes-setup skill
- Application profiling → application-profiling skill
- Security audits → kubernetes-security-audit skill
- Capacity planning → kubernetes-capacity-planning skill
- CI/CD pipeline issues → cicd-troubleshooting skill
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| kubectl delete pod to "fix" crash | Doesn't address root cause | Diagnose with logs/describe first |
| Increasing resources without profiling | Wastes money, masks real issue | Profile actual usage, optimize code |
| No log collection before restart | Lose diagnostic information | Always check logs BEFORE restarting |
| Editing running pods directly | Changes lost on restart | Edit deployment/statefulset YAML |
| Force deleting pods | Can cause data loss | Use graceful termination |
| Ignoring node conditions | Node issues affect all pods | Check node health first |
| No resource limits | OOMKilled or noisy neighbors | Set requests and limits |
| Skipping RBAC checks | Permission issues hard to debug | Verify service account permissions |
| No backup before changes | Cannot rollback | Always snapshot before risky operations |
| Troubleshooting in production first | User impact | Reproduce in staging when possible |
Principles
This skill embodies these CODITECT principles:
- #8 No Assumptions - Verify with diagnostic commands, don't guess
- #6 Clear, Understandable, Explainable - Document root cause and solution
- #10 Iterative Refinement - Narrow down issue systematically
- #5 Eliminate Ambiguity - Precise error identification, not vague symptoms
- Safety First - No destructive actions without backups
- Prevention - Configure monitoring to catch issues early
Usage Examples
Debug Crashing Pod
Apply kubernetes-troubleshooting skill to diagnose CrashLoopBackOff in api-server pod with exit code 137
Network Connectivity Issue
Apply kubernetes-troubleshooting skill to troubleshoot service-to-service communication failure between frontend and backend services
Cluster Performance Analysis
Apply kubernetes-troubleshooting skill to analyze cluster resource utilization and identify bottlenecks