Patching a kernel privilege-escalation CVE across 47 servers with zero downtime

A privilege-escalation CVE landed in the Linux kernel and our security team flagged it as fleet-wide, exploitable, and not something we could sit on. The fleet in question: 47 servers across staging and production, running a mix of bare workloads and Kubernetes nodes.

The constraint that shaped everything: zero downtime. No maintenance window, no user-facing blip.

The plan

Triage first. Not every box was equally exposed — we split the fleet into tiers by exposure and blast radius.
Canary the patch. One low-traffic node first, soak for a few hours, watch metrics.
Roll node-by-node, draining Kubernetes workloads before reboot and letting the scheduler redistribute pods.
Reboot in waves, never more than one node per availability zone at a time.

What actually mattered

The scripting was the easy part. The two things that made this boring instead of dramatic:

Cordoning and draining before reboot — kubectl cordon + kubectl drain with a sane --grace-period, verified pod rescheduling before touching the kernel.
Health checks between every wave — Prometheus alerts on pod restarts, node readiness, and request error rate gated whether the rollout continued.

If a wave didn’t come back green within its soak window, the rollout paused automatically. No one had to be watching a terminal at 2am for it to be safe.

Result

All 47 servers patched, zero unplanned outages, and a documented runbook for the next kernel CVE — because there will be a next one.