Incident Response

Common failure scenarios and how to diagnose and recover from them.

What this page covers

TCP hang on a dead IP: diagnosis and fix
Swap pressure: symptoms and mitigation
Pod eviction: causes and recovery
Service unreachable: checklist
Certificate expiry: emergency renewal

TCP hang on dead IP

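When a server IP becomes unreachable (e.g., a node goes down), existing TCP connections to it do not fail immediately. They hang until a kernel timeout expires: roughly 15 minutes for connections with unacknowledged data (net.ipv4.tcp_retries2, default 15), and about 2 hours for idle connections that have keepalive enabled (net.ipv4.tcp_keepalive_time, default 7200 seconds). Idle connections without keepalive can hang indefinitely.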

Symptoms:

kubectl commands hang
SSH sessions to a node freeze
Service health checks time out

Fix:

Lower net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and net.ipv4.tcp_keepalive_probes so that dead peers are detected sooner. These settings only affect sockets that enable SO_KEEPALIVE.
Ensure application-level keepalives are configured.
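
As a rough sketch of what "aggressive" means here, the worst-case detection time for an idle connection is the idle time before the first probe plus the probe interval multiplied by the number of unanswered probes. The helper name keepalive_detection_seconds and the values 60/10/6 are illustrative assumptions, not recommendations from this page; the sysctl lines are left commented out because they require root.

```shell
# Worst-case seconds before the kernel declares an idle peer dead:
# tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
# (hypothetical helper name, for illustration only)
keepalive_detection_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
keepalive_detection_seconds 7200 75 9

# Illustrative aggressive values (assumed; tune for your environment).
keepalive_detection_seconds 60 10 6

# To apply the aggressive values at runtime (requires root):
# sysctl -w net.ipv4.tcp_keepalive_time=60
# sysctl -w net.ipv4.tcp_keepalive_intvl=10
# sysctl -w net.ipv4.tcp_keepalive_probes=6
```

With the defaults, detection takes over two hours; with the illustrative values it drops to two minutes. Settings applied with sysctl -w do not survive a reboot unless they are also written to a file under /etc/sysctl.d/.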

See TCP Hang on Dead IP for the full reference.

Swap pressure

If a node runs low on available memory, the kernel begins swapping. Heavy swap use severely degrades performance.

Symptoms:

High kswapd CPU usage
Pods becoming unresponsive
OOM kills reported in dmesg, or an OOMKilled status in kubectl describe pod

Mitigation:

Add swap if not present (see Using Swap in Linux)
Set appropriate memory limits on pods
Use Goldilocks to right-size resource requests
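
To confirm that a node is actually under swap pressure before applying the mitigations above, one option is to read the swap counters from /proc/meminfo. This is a minimal sketch; the helper name swap_used_percent is an assumption introduced for illustration, and what counts as "too high" depends on the workload.

```shell
# Print the percentage of swap in use, reading /proc/meminfo-style
# text from stdin. Prints 0 when no swap is configured.
# (hypothetical helper name, for illustration only)
swap_used_percent() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END {
         if (total > 0) printf "%d\n", (total - free) * 100 / total
         else print 0
       }'
}

# Report current swap usage on this node, if /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
  swap_used_percent < /proc/meminfo
fi
```

A persistently high percentage combined with high kswapd CPU usage suggests the node is overcommitted rather than seeing a short spike.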

Pod eviction

K3s evicts pods when a node runs low on disk or memory. An evicted pod is not restarted in place; a replacement is created only if the pod is managed by a controller such as a Deployment.

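# Find evicted pods
kubectl get pods -A | grep Evicted

# Delete evicted pods (with -A: column 1 = namespace, column 2 = pod name, column 4 = status)
kubectl get pods -A --no-headers | awk '$4 == "Evicted" {print $1, $2}' | while read -r ns pod; do kubectl delete pod "$pod" -n "$ns"; done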
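
Before deleting evicted pods, it can help to record why each one was evicted, since the pod's status message usually names the resource the node was short on. The following is a sketch under the assumption that kubectl is configured for the cluster; the function names list_evicted and explain_evictions are hypothetical.

```shell
# Print "namespace pod" for every row whose STATUS column is Evicted.
# Expects `kubectl get pods -A --no-headers` output on stdin.
# (hypothetical helper name, for illustration only)
list_evicted() {
  awk '$4 == "Evicted" { print $1, $2 }'
}

# Print the recorded eviction message for each evicted pod.
# Requires a working kubectl context; not run automatically here.
explain_evictions() {
  kubectl get pods -A --no-headers | list_evicted | while read -r ns pod; do
    kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{.metadata.namespace}/{.metadata.name}: {.status.message}{"\n"}'
  done
}
```

Run explain_evictions first and the delete command afterwards; once an evicted pod is deleted, its status message is gone.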
