2026-05-14
Patching a kernel privilege-escalation CVE across 47 servers with zero downtime
A privilege-escalation CVE landed in the Linux kernel and our security team flagged it as fleet-wide, exploitable, and not something we could sit on. The fleet in question: 47 servers across staging and production, running a mix of bare workloads and Kubernetes nodes.
The constraint that shaped everything: zero downtime. No maintenance window, no user-facing blip.
The plan
- Triage first. Not every box was equally exposed — we split the fleet into tiers by exposure and blast radius.
- Canary the patch. One low-traffic node first, soak for a few hours, watch metrics.
- Roll node-by-node, draining Kubernetes workloads before reboot and letting the scheduler redistribute pods.
- Reboot in waves, never more than one node per availability zone at a time.
What actually mattered
The scripting was the easy part. The two things that made this boring instead of dramatic:
- Cordoning and draining before reboot —
kubectl cordon+kubectl drainwith a sane--grace-period, verified pod rescheduling before touching the kernel. - Health checks between every wave — Prometheus alerts on pod restarts, node readiness, and request error rate gated whether the rollout continued.
If a wave didn’t come back green within its soak window, the rollout paused automatically. No one had to be watching a terminal at 2am for it to be safe.
Result
All 47 servers patched, zero unplanned outages, and a documented runbook for the next kernel CVE — because there will be a next one.