Midnight 'Swap Storm': An SRE's Memory Nightmare

For an SRE (Site Reliability Engineer), waking up in the middle of the night to a meaningless beep on your phone isn’t exactly a new experience. But there are alarms that pull you out of deep sleep and have you bolting upright in bed. These are the calls that tell you systems are at a standstill, users are locked out, and business continuity is on the line. One of those nights came for me, and I found myself dropped right into the middle of a “Swap Storm” nightmare.

In this post I want to tell the story of that Swap Storm, the technical details of what makes it so terrifying, why it turns into a nightmare, and how I worked through it. This isn’t just a technical guide — I want to focus on what these incidents do to an SRE’s life, and what we end up taking away from them.

What Is a “Swap Storm” and Why Is It a Nightmare?

The applications and services running on our systems all need memory (RAM), and there’s only so much of it. When physical memory runs out, the operating system writes the less-active memory pages to a special disk region called the “swap area.” This keeps applications running when RAM fills up — but it has a dark side, and that dark side is the Swap Storm.

A Swap Storm is what happens when the system is constantly shuttling memory pages between disk and RAM, that is, swapping aggressively. Disk is orders of magnitude slower than RAM, so that nonstop data movement saturates disk I/O, leaves the CPU stalled waiting on it, and grinds the system’s overall performance into the ground. That’s why a Swap Storm is a real nightmare for an SRE.

For users, this looks like slow response times, timeouts, and outright outages. For SREs, it looks like sleepless nights, stressful diagnostics, and pressure to fix it fast. Memory management is a complicated area, and finding the actual root cause of a Swap Storm can feel like searching for a needle in a haystack.

The Midnight Call: How It Started

It was around 3:00 AM. I was deep asleep when my phone went off with that uniquely uncomfortable alarm tone. PagerDuty, critical alert: “Service X — High Latency & Error Rate.” I rubbed my eyes, opened the laptop on reflex, and brought up the VPN. Initially I was just trying to figure out what we were even looking at — high CPU, full disk, network issues, something else?

When I pulled up the dashboards, every panel was red. Services weren’t responding, database connections were dropping, and worst of all, overall system performance had basically frozen. When you walk into something like this, your adrenaline shoots through the roof and you go straight into focus mode.

The Diagnostic Process: Hunting Down the Nightmare

Instead of panicking, I started working through the steps methodically. The first step was to SSH into one of the affected hosts. The moment I ran top, the output pointed me straight at a Swap Storm: the wa (wait for I/O) percentage was through the roof, free memory was practically zero, and swap usage was maxed out.

The free -h output was rough:

               total        used        free      shared  buff/cache   available
Mem:            32Gi        31Gi       100Mi         1Gi       500Mi       100Mi
Swap:           16Gi        16Gi          0B

That meant the system had burned through all of physical memory and consumed all of swap as well. Constant disk reads and writes had slowed everything to a crawl, and that’s why services couldn’t respond. This state is also known as “thrashing” — the system is busy managing memory pages instead of doing any actual work.

Understanding Where the Memory Was Going

To understand why memory consumption was this high, I needed to dig deeper. Tools like top and htop show which processes are using the most memory. Metrics like RES (Resident Set Size) and VIRT (Virtual Memory Size) tell you how much physical memory a process is occupying and how much virtual memory it’s using overall.
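When top or htop is too slow to use on a thrashing host, the same per-process numbers can be read from /proc directly. The sketch below is a minimal illustration, not the tooling I used that night: it reads the VmRSS and VmSwap fields from each /proc/<pid>/status file and ranks processes by resident plus swapped memory. The function names parse_status and top_consumers are hypothetical.

```python
import os
import re

# Matches lines such as "VmRSS:     123456 kB" in /proc/<pid>/status.
_FIELD = re.compile(r"^(Name|VmRSS|VmSwap):\s+(.+)$", re.MULTILINE)


def parse_status(text):
    """Extract the name, resident size and swapped size (KiB) from a status file."""
    fields = dict(_FIELD.findall(text))

    def kib(key):
        # Kernel threads have no VmRSS/VmSwap lines; treat them as zero.
        return int(fields.get(key, "0 kB").split()[0])

    return {
        "name": fields.get("Name", "?").strip(),
        "rss_kib": kib("VmRSS"),
        "swap_kib": kib("VmSwap"),
    }


def top_consumers(proc_root="/proc", limit=5):
    """Return the `limit` processes with the largest resident + swapped memory."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                row = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        row["pid"] = int(entry)
        rows.append(row)
    rows.sort(key=lambda r: r["rss_kib"] + r["swap_kib"], reverse=True)
    return rows[:limit]
```

Calling top_consumers() on a Linux host returns a list of dictionaries, largest first. Sorting on RSS plus swap matters during a storm: a process whose pages have mostly been swapped out can look small by RES alone.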

But sometimes the problem isn’t a single big process — it’s the cumulative draw of many smaller ones, or memory consumed by the kernel itself (kernel slab cache, for example). In those cases, vmstat, sar, or reading /proc/meminfo and /proc/slabinfo directly will give you a much sharper picture.
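To see whether memory is going to processes, page cache, or the kernel itself, the fields in /proc/meminfo can be grouped into a small breakdown. This is a rough sketch under the assumption that the standard field names (MemAvailable, AnonPages, Cached, Buffers, SReclaimable, SUnreclaim, SwapTotal, SwapFree) are present; the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers.

    Most values are in KiB; a few (for example HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_breakdown(info):
    """Group the fields that matter when diagnosing swap pressure (all KiB)."""
    return {
        "available_kib": info.get("MemAvailable", 0),
        # Anonymous pages: process heaps and stacks, the usual swap candidates.
        "anon_kib": info.get("AnonPages", 0),
        "page_cache_kib": info.get("Cached", 0) + info.get("Buffers", 0),
        # Kernel slab memory, split by whether it can be reclaimed under pressure.
        "slab_reclaimable_kib": info.get("SReclaimable", 0),
        "slab_unreclaimable_kib": info.get("SUnreclaim", 0),
        "swap_used_kib": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }
```

On a live host this would be fed with the contents of /proc/meminfo, for example memory_breakdown(parse_meminfo(open("/proc/meminfo").read())). A large slab_unreclaimable_kib with modest anon_kib points at the kernel rather than at any one application.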

The vmstat output shows a snapshot of memory and swap activity:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  16Gi   100Mi  200Mi  300Mi   1k   1k   500  1000  100  200 10 10 70 10  0

The si (swap in) and so (swap out) columns show how much memory is being read back in from swap and written out to swap per second, respectively. During a Swap Storm both numbers stay very high. The high wa (I/O wait) value confirmed the CPU was sitting around waiting for I/O to complete.
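The same swap-in and swap-out rates can be derived from the kernel's pswpin and pswpout counters in /proc/vmstat, which count pages swapped since boot. The sketch below takes two snapshots and converts the difference to KiB per second. It assumes a 4096-byte page size unless told otherwise, and the function names are hypothetical.

```python
def parse_vmstat(text):
    """Return /proc/vmstat counters as a dict of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, interval_s, page_size=4096):
    """Return (swap_in, swap_out) in KiB/s between two counter snapshots.

    `before` and `after` are dicts from parse_vmstat taken `interval_s`
    seconds apart. pswpin/pswpout are counted in pages, so they are scaled
    by the page size (assumed 4096 bytes by default).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    kib_per_page = page_size / 1024
    swap_in = (after["pswpin"] - before["pswpin"]) * kib_per_page / interval_s
    swap_out = (after["pswpout"] - before["pswpout"]) * kib_per_page / interval_s
    return swap_in, swap_out
```

In practice the two snapshots would come from reading /proc/vmstat twice with a sleep in between, and os.sysconf("SC_PAGE_SIZE") can supply the real page size. Sustained nonzero values on both sides at once are the signature of thrashing; a one-off burst of swap-out alone is usually harmless.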

Resolution Paths and First Response

Finding the root cause takes time, so the immediate priority was rescuing the system and getting services back up. The first response is usually the most obvious and fastest option:

Stop or restart the heaviest memory consumers: Kill or restart the misbehaving processes you spotted in top or htop. This buys breathing room, but it’s a stop-gap.
Stop non-critical services: If multiple services are running on the host and some are less critical, stopping them temporarily can free memory for the ones that matter.
Add more resources (in cloud environments): If you’re on cloud infrastructure, scaling the host’s RAM up can be a fast fix — though it’s basically a “wider pipe” approach and doesn’t actually solve the underlying problem.

In my case, I noticed one specific application service had suddenly increased its memory usage. After a recent deploy, the application had started keeping large data sets pulled from the database in memory. It wasn’t quite a “memory leak,” but it was an inefficiency that caused way more memory consumption than expected.

As a temporary fix, I restarted that service. Memory was freed up and swap usage dropped back to normal. But that was just a bandage. A real fix needed more.

Permanent Fixes and What Comes Next

After an incident like a Swap Storm, you don’t just fix the immediate crisis — you put permanent solutions in place so it doesn’t happen again. That’s one of the most important parts of the SRE job: learn from the incident and harden the system.

Permanent fixes generally include:

Code optimization: Review the application code for memory leaks, inefficient data structures, or wasteful memory use, and fix them. In my case, the logic that handled large data sets in the application got optimized.
Resource limits and isolation:
- cgroups (Control Groups): On Linux, cgroups let you limit and isolate process resource usage (CPU, memory, I/O). This stops a single process from being able to take down the whole system.
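- Kubernetes resource limits: For containerized applications, defining CPU and memory limits caps what any single pod can consume. A container that exceeds its memory limit is OOM-killed and one that hits its CPU limit is throttled, rather than starving everything else on the node.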
Proactive monitoring and alerting: Set up tighter monitors that warn you when memory usage crosses a threshold (say, 80% RAM, or any rise in swap usage). That gives you a chance to step in before a Swap Storm gets going.
Performance testing: Before rolling out new features or major changes, run stress and load tests to make sure memory and overall performance behave as expected.
The swappiness setting: The Linux kernel parameter vm.swappiness controls how aggressively the system uses swap. Values traditionally run from 0 to 100 (the upper bound was raised to 200 in Linux 5.8), and the default is 60. A low value (e.g., 10) means the system tries to use swap as little as possible. But changing this isn’t a casual move: tune it with care, based on actual system behavior.
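The alerting rule mentioned above (warn at 80% RAM, or on any rise in swap usage) can be expressed as a small check. This is a sketch under stated assumptions rather than a real monitoring integration: the function name and parameters are hypothetical, "RAM usage" is taken to mean total minus MemAvailable, and the inputs are KiB values such as those found in /proc/meminfo.

```python
def evaluate_memory_alerts(mem_total_kib, mem_available_kib,
                           swap_used_kib, prev_swap_used_kib,
                           ram_threshold=0.80):
    """Return a list of alert messages for one sampling interval.

    RAM usage is measured as the share of memory that is not available,
    and any growth in swap usage since the previous sample raises an alert.
    """
    if mem_total_kib <= 0:
        raise ValueError("mem_total_kib must be positive")
    alerts = []
    used_fraction = 1 - mem_available_kib / mem_total_kib
    if used_fraction >= ram_threshold:
        alerts.append(f"RAM usage {used_fraction:.0%} >= {ram_threshold:.0%}")
    if swap_used_kib > prev_swap_used_kib:
        grown = swap_used_kib - prev_swap_used_kib
        alerts.append(f"swap usage grew by {grown} KiB")
    return alerts
```

An empty list means nothing to report. Alerting on the change in swap usage rather than its absolute level is deliberate: a host can sit with some long-idle pages in swap for weeks without trouble, while steady growth is the early warning.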

Lessons From an SRE’s Life

That midnight experience taught me — and the team — some important things:

The value of proactive monitoring: Watching only service status isn’t enough. You have to monitor the underlying infrastructure metrics — CPU, memory, disk I/O, network — at depth, so you can catch trouble early.
Understand the system’s internals: Solving a complex issue like a Swap Storm requires more than reading symptoms. You have to understand how the operating system manages memory under the hood. vmstat, sar, and the /proc filesystem are non-negotiable here.
Stress management and staying calm: A critical alert at midnight can rattle anyone. One of the most important traits of an SRE is the ability to stay calm under pressure and work through problems methodically.
Documentation and runbooks: Good runbooks (step-by-step troubleshooting guides) and solid documentation are lifesavers when an incident is in progress.
Automation and post-mortems: To stop similar incidents from repeating, you build automation and run a thorough post-mortem after each incident, capturing what was learned. That’s the engine of continuous improvement.

Closing Thoughts

A midnight Swap Storm call is one of the most stressful moments any SRE will hit in their career. These incidents test more than just technical skill — they test problem-solving, stress management, teamwork, all the “life” muscles. But just like every nightmare, these crises end. And every one of them turns us into better, more resilient, more knowledgeable SREs.

Life as an SRE is a continuous loop of learning and adaptation. Every Swap Storm or critical incident makes us understand our systems better, build sturdier solutions, and — most importantly — be ready for whatever the next unknown looks like. The point isn’t avoiding falls; it’s pulling a lesson from each one and getting back up.
