Incident Response
Common failure scenarios and how to diagnose and recover from them.
What this page covers
- TCP hang on a dead IP: diagnosis and fix
- Swap pressure: symptoms and mitigation
- Pod eviction: causes and recovery
- Service unreachable: checklist
- Certificate expiry: emergency renewal
TCP hang on dead IP
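When a server IP becomes unreachable (e.g., a node goes down), existing TCP connections to it hang until the kernel gives up on them, which can take minutes to hours with default settings. On Linux, for example, the first keepalive probe on an idle connection is sent only after 7200 seconds (2 hours) by default, and only on sockets that have keepalive enabled.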
Symptoms:
- kubectl commands hang
- SSH sessions to a node freeze
- Service health checks time out
Fix:
- Set net.ipv4.tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes to aggressive values for faster detection.
- Ensure application-level keepalives are configured.
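As a rough illustration of the kernel settings, the sketch below computes how long an idle connection to a dead peer can linger (time + intvl * probes) and prints a matching sysctl fragment. The values 60/10/6 and the file name in the comments are assumptions chosen for the example, not recommendations from this page, and these settings only affect sockets that enable keepalive.

```shell
#!/bin/sh
# Hypothetical example values; tune them for your environment.
KEEPALIVE_TIME=60     # seconds idle before the first probe
KEEPALIVE_INTVL=10    # seconds between probes
KEEPALIVE_PROBES=6    # unanswered probes before the connection is dropped

# Worst-case detection time for an idle connection: time + intvl * probes.
detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Emit a sysctl.d-style fragment for the given time, interval, and probe count.
keepalive_conf() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "$2"
    printf 'net.ipv4.tcp_keepalive_probes = %s\n' "$3"
}

echo "Dead peer detected after about $(detect_seconds "$KEEPALIVE_TIME" "$KEEPALIVE_INTVL" "$KEEPALIVE_PROBES") seconds"
keepalive_conf "$KEEPALIVE_TIME" "$KEEPALIVE_INTVL" "$KEEPALIVE_PROBES"

# To apply (as root), save the fragment to a file such as
# /etc/sysctl.d/99-tcp-keepalive.conf (file name is an assumption) and run:
#   sysctl --system
```

With the Linux defaults (7200, 75, 9) the same formula gives 7875 seconds, which is why lowering all three values matters.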
See TCP Hang on Dead IP for the full reference.
Swap pressure
If a node runs low on available memory, the kernel begins swapping. Heavy swap use severely degrades performance.
Symptoms:
- High kswapd CPU usage
- Pods becoming unresponsive
- OOMKill events in dmesg or kubectl describe pod
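One way to confirm these symptoms on a node is to check how much swap is in use. The helper below is a minimal sketch: swap_used_pct is a hypothetical name, and it assumes the standard SwapTotal and SwapFree fields of /proc/meminfo.

```shell
#!/bin/sh
# Print the percentage of swap in use, reading a meminfo-format file
# (defaults to /proc/meminfo). Prints 0 when no swap is configured.
swap_used_pct() {
    awk '/^SwapTotal:/ {total=$2} /^SwapFree:/ {free=$2}
         END { if (total > 0) printf "%d\n", (total - free) * 100 / total; else print 0 }' \
        "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Swap in use: $(swap_used_pct)%"
fi

# Related checks to run manually on the affected node:
#   vmstat 1 5                          # si/so columns show swap-in/swap-out activity
#   dmesg | grep -i 'out of memory'     # kernel OOM killer messages
#   kubectl describe pod <pod> -n <ns>  # look for "OOMKilled" as the last state reason
```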
Mitigation:
- Add swap if not present (see Using Swap in Linux)
- Set appropriate memory limits on pods
- Use Goldilocks to right-size resource requests
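For the memory-limit step, kubectl set resources can patch a Deployment in place. The sketch below only prints the command so it can be reviewed first. The namespace, Deployment name, and sizes are hypothetical placeholders, and mem_limit_cmd is a helper name invented for this example.

```shell
#!/bin/sh
# Build a "kubectl set resources" command for a Deployment.
# Arguments: namespace, deployment name, memory request, memory limit.
mem_limit_cmd() {
    printf 'kubectl -n %s set resources deployment %s --requests=memory=%s --limits=memory=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Hypothetical namespace, Deployment, and sizes; replace with your own.
mem_limit_cmd default my-app 256Mi 512Mi
# Review the printed command, then run it (for example by piping it to sh).
```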
Pod eviction
K3s evicts pods when a node runs low on disk space or memory. Evicted pods are not recreated automatically unless a controller such as a Deployment manages them.
# Find evicted pods
kubectl get pods -A | grep Evicted
# Delete evicted pods
kubectl get pods -A | grep Evicted | awk '{print "kubectl delete pod " $2 " -n " $1}' | sh
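A slightly more cautious variant of the delete pipeline, sketched below, matches on the STATUS column instead of grepping the whole line, so a pod whose name happens to contain "Evicted" is not caught. It prints the delete commands for review before anything runs. The function name and the sample rows are made up for illustration, and the sketch assumes the default column layout of kubectl get pods -A.

```shell
#!/bin/sh
# Read "kubectl get pods -A --no-headers" output on stdin and print one
# delete command per pod whose STATUS column (field 4) is exactly "Evicted".
evicted_delete_cmds() {
    awk '$4 == "Evicted" {print "kubectl delete pod " $2 " -n " $1}'
}

# Demo with made-up sample rows (namespace, name, ready, status, restarts, age).
printf '%s\n' \
    'default      web-7d9f   0/1  Evicted  0  5m' \
    'kube-system  coredns-1  1/1  Running  0  2d' |
    evicted_delete_cmds

# Against a real cluster:
#   kubectl get pods -A --no-headers | evicted_delete_cmds        # preview
#   kubectl get pods -A --no-headers | evicted_delete_cmds | sh   # delete
```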