Incident Response

Common failure scenarios and how to diagnose and recover from them.

What this page covers

  • TCP hang on a dead IP: diagnosis and fix
  • Swap pressure: symptoms and mitigation
  • Pod eviction: causes and recovery
  • Service unreachable: checklist
  • Certificate expiry: emergency renewal

TCP hang on dead IP

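When a server IP becomes unreachable (e.g., a node goes down), existing TCP connections to it do not fail immediately. They hang until the kernel gives up. With unacknowledged data in flight, retransmissions continue until net.ipv4.tcp_retries2 is exhausted (default 15, which is roughly 15 minutes at minimum). An idle connection is only probed if keepalive is enabled on the socket, and the default net.ipv4.tcp_keepalive_time is 7200 seconds (2 hours).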

Symptoms:

  • kubectl commands hang
  • SSH sessions to a node freeze
  • Service health checks time out
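
To tell a dead peer from a slow one, a bounded connection attempt can help. The sketch below assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are installed; tcp_probe is a hypothetical helper name, not part of K3s or kubectl.

```shell
# tcp_probe HOST PORT [SECONDS]
# Exit status: 0 = connected, 124 = no answer within the limit
# (typical for a dead IP), other non-zero = refused or unreachable.
tcp_probe() {
  host="$1"
  port="$2"
  limit="${3:-3}"
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
  # timeout kills the attempt if the SYN is never answered.
  timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, tcp_probe 10.0.0.12 6443 3; echo $? run against a node that is down would be expected to print 124 after three seconds (the address and port here are placeholders).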

Fix:

  • Lower net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and net.ipv4.tcp_keepalive_probes so dead peers are detected faster. These settings only affect sockets that enable SO_KEEPALIVE.
  • Ensure application-level keepalives are configured.
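
As a rough illustration of the keepalive tuning above, the sketch below prints a sysctl drop-in and estimates how long an idle dead connection takes to be detected (idle time plus interval times probe count). The values 60/10/6 and the helper names are illustrative assumptions, not recommendations from this page; choose values that suit your network.

```shell
# keepalive_conf [IDLE] [INTERVAL] [PROBES]
# Prints sysctl settings; the defaults here are example values only.
keepalive_conf() {
  printf 'net.ipv4.tcp_keepalive_time = %s\n' "${1:-60}"
  printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "${2:-10}"
  printf 'net.ipv4.tcp_keepalive_probes = %s\n' "${3:-6}"
}

# keepalive_detect_seconds IDLE INTERVAL PROBES
# Worst-case seconds before the kernel declares an idle peer dead.
keepalive_detect_seconds() {
  echo $(( $1 + $2 * $3 ))
}
```

With the kernel defaults (7200, 75, 9), keepalive_detect_seconds gives 7875 seconds; with the example values it gives 120. To apply the settings, the output could be written to a file under /etc/sysctl.d/ (for instance keepalive_conf | sudo tee /etc/sysctl.d/90-tcp-keepalive.conf, where the file name is an assumption) and loaded with sudo sysctl --system.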

See TCP Hang on Dead IP for the full reference.

Swap pressure

If a node runs low on available memory, the kernel begins swapping. Heavy swap use severely degrades performance.

Symptoms:

  • High kswapd CPU usage
  • Pods becoming unresponsive
  • OOM kills reported in dmesg, or an OOMKilled status in kubectl describe pod
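
To put a number on swap use when these symptoms appear, the SwapTotal and SwapFree fields of /proc/meminfo can be compared. This is a minimal sketch; swap_used_pct is a hypothetical helper name, and the optional file argument exists only so the function can be tried against a saved copy of meminfo.

```shell
# swap_used_pct [MEMINFO_FILE]
# Prints the percentage of swap in use, or "no swap" if none is configured.
swap_used_pct() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END {
         if (total > 0) printf "%d\n", (total - free) * 100 / total
         else print "no swap"
       }' "${1:-/proc/meminfo}"
}
```

Running vmstat 1 alongside it shows whether the node is actively swapping: sustained non-zero values in the si and so columns indicate pages moving in and out of swap.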

Mitigation:

  • Add swap if not present (see Using Swap in Linux)
  • Set appropriate memory limits on pods
  • Use Goldilocks to right-size resource requests

Pod eviction

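K3s (through the kubelet) evicts pods when a node runs low on disk or memory. An evicted pod is not restarted; a replacement is created only if the pod is managed by a controller such as a Deployment. Evicted pods stay listed with status Evicted until they are deleted.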

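# Find evicted pods (STATUS is the 4th column of this output)
kubectl get pods -A --no-headers | awk '$4 == "Evicted"'

# Delete evicted pods, one kubectl call per pod
kubectl get pods -A --no-headers | awk '$4 == "Evicted" {print $1, $2}' |
  while read -r ns pod; do kubectl delete pod "$pod" -n "$ns"; done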
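
Since evictions follow from disk or memory shortage on a node, it can be useful to check which nodes currently report pressure. The sketch below is one possible approach: node_conditions and pressured_nodes are hypothetical helper names, and the tab-separated line format is an assumption of this example rather than a standard kubectl output.

```shell
# Print one line per node: NAME<TAB>Type=Status;Type=Status;...
node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status};{end}{"\n"}{end}'
}

# Read node_conditions output on stdin and print the names of nodes
# whose MemoryPressure or DiskPressure condition is True.
pressured_nodes() {
  grep -E '(Memory|Disk)Pressure=True' | cut -f1
}
```

On a live cluster, node_conditions | pressured_nodes lists the affected nodes; empty output means no node is reporting memory or disk pressure at that moment.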