Mohanakrishnan
← Back to writing

2026-05-14

Patching a kernel privilege-escalation CVE across 47 servers with zero downtime

linuxkubernetescveincident-response

A privilege-escalation CVE landed in the Linux kernel and our security team flagged it as fleet-wide, exploitable, and not something we could sit on. The fleet in question: 47 servers across staging and production, running a mix of bare workloads and Kubernetes nodes.

The constraint that shaped everything: zero downtime. No maintenance window, no user-facing blip.

The plan

  1. Triage first. Not every box was equally exposed — we split the fleet into tiers by exposure and blast radius.
  2. Canary the patch. One low-traffic node first, soak for a few hours, watch metrics.
  3. Roll node-by-node, draining Kubernetes workloads before reboot and letting the scheduler redistribute pods.
  4. Reboot in waves, never more than one node per availability zone at a time.

What actually mattered

The scripting was the easy part. The two things that made this boring instead of dramatic:

  • Cordoning and draining before rebootkubectl cordon + kubectl drain with a sane --grace-period, verified pod rescheduling before touching the kernel.
  • Health checks between every wave — Prometheus alerts on pod restarts, node readiness, and request error rate gated whether the rollout continued.

If a wave didn’t come back green within its soak window, the rollout paused automatically. No one had to be watching a terminal at 2am for it to be safe.

Result

All 47 servers patched, zero unplanned outages, and a documented runbook for the next kernel CVE — because there will be a next one.