Mustafa Erbay
Technology · Written by human · Production Diaries · 9 min read

First OOM: kcompactd at 92% CPU, sshd Reset, Hard Reboot

RAM ran out on my VPS, swap filled up, sshd dropped the connection. When the Astro build triggered an OOM, I decided to put together a layered pipeline defense.

It started with a “the site won’t load” message

Around 8:00 AM, a message landed on my phone: “are the sites back up, friend? not loading.”

I open mustafaerbay.com.tr — 8-second timeout. The other apps on the same VPS — same thing. Not the first time, but the deepest so far.

Diagnostic flow: SSH works but it’s hanging

First SSH:

$ ssh vps
[5 second wait]
$ uptime
 05:27:24 up 9 days,  7:51,  3 users,
 load average: 52.51, 76.02, 70.66

Load 52, 76, 70. When the system is healthy, that number should be around 4. This is hell.

free -h:

               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       7.5Gi       122Mi       147Mi       330Mi        76Mi
Swap:          4.0Gi       3.9Gi       106Mi

7.5 of 7.6 GiB of RAM is used, with just 76 MiB available. Swap is nearly full too: 106 MiB free out of 4 GiB. The system is running 4-5 times slower than normal because every memory allocation ends up hitting swap.
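The same two numbers can be pulled straight from /proc/meminfo, which is handy when free itself takes seconds to start. A minimal sketch, assuming a Linux box; mem_headroom is a hypothetical helper name, not a standard tool:

```shell
#!/usr/bin/env bash
# Summarize memory headroom from a meminfo-style file (values are in kB).
# mem_headroom is a hypothetical helper; it defaults to /proc/meminfo.
mem_headroom() {
  local file="${1:-/proc/meminfo}"
  awk '
    /^MemTotal:/     { total  = $2 }
    /^MemAvailable:/ { avail  = $2 }
    /^SwapTotal:/    { stotal = $2 }
    /^SwapFree:/     { sfree  = $2 }
    END {
      # Percent of RAM still available and percent of swap still free,
      # truncated to whole numbers.
      mem_pct  = (total  > 0) ? int(avail * 100 / total)  : 0
      swap_pct = (stotal > 0) ? int(sfree * 100 / stotal) : 0
      printf "mem_available_pct=%d swap_free_pct=%d\n", mem_pct, swap_pct
    }
  ' "$file"
}
```

Calling mem_headroom with no argument reads the live /proc/meminfo. When both percentages are in the low single digits, the box is in the state described here.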

Which processes are doing this damage? ps aux --sort=-%cpu:

USER         PID %CPU %MEM    COMMAND
root          54 92.9  0.0    kcompactd0  ← HERE
ubuntu    382248 24.4 32.3    node ... astro build --out-dir dist-new
root      383691 26.4  7.0    node ... next build
github-+  379827  4.4  0.9    Runner.Worker spawnclient

kcompactd0 at 92% CPU. That's a new acquaintance for me. It's the memory compaction daemon of the Linux memory subsystem: when RAM is so fragmented that the kernel can't find a contiguous block of free pages, kcompactd migrates pages around to coalesce the small free chunks into larger ones. Seeing it at 92% CPU means nearly a full core is being burned just hunting for usable memory.
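If you want a second opinion on whether compaction is really struggling, the kernel keeps counters for it in /proc/vmstat. A small sketch, assuming a kernel built with compaction support; compaction_stats is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the compaction counters from a vmstat-style file.
# compaction_stats is a hypothetical helper; it defaults to /proc/vmstat.
compaction_stats() {
  local file="${1:-/proc/vmstat}"
  # compact_stall:   times an allocation had to wait for compaction
  # compact_fail:    compaction runs that did not free a large enough block
  # compact_success: compaction runs that did
  awk '$1 == "compact_stall" || $1 == "compact_fail" || $1 == "compact_success" {
         print $1 "=" $2
       }' "$file"
}
```

The counters are cumulative since boot, so the useful signal is how fast they grow between two readings, not their absolute value.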

And I can see why: simultaneously, an Astro build (2.5 GB) and a Next.js build (615 MB) are running. About 3 GB just from those two. The remaining 4 GB is system services + containers + sshd. Total demand > 7.6 GB → swap → swap full → kcompactd panics.
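To get the combined footprint of the builds without doing the math by hand, the resident set sizes from ps can be summed. A rough sketch; rss_mib is a hypothetical helper, and the pattern is only an example:

```shell
#!/usr/bin/env bash
# Sum the resident memory (RSS, reported by ps in KiB) of every process whose
# command line matches a pattern, and print the total in whole MiB.
# Reads "rss args" lines on stdin. rss_mib is a hypothetical helper.
rss_mib() {
  awk -v pat="$1" '$0 ~ pat { sum += $1 } END { printf "%d\n", sum / 1024 }'
}

# Example (not run here):
#   ps -eo rss=,args= | rss_mib 'astro build|next build'
```

RSS counts shared pages once per process, so the total is an upper bound rather than an exact figure.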

Why are the sites timing out?

I'd seen with free -h that only 76 MB of RAM was available. When sshd accepts a connection, it does a fork(2) to spawn a child process to handle it, and fork needs memory. With the system thrashing on swap, the fork stalls, the handshake times out, and the client sees connection reset by peer.

curl dies a similar half-death:

$ curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 8 https://mustafaerbay.com.tr
000 8.006s

Nginx accepts the connection (port 443 is open) and tries to proxy_pass to the Node app (127.0.0.1:3040), but the Node app can't respond (no RAM). Timeout.

Who triggered this Astro build?

$ ps -p 382248 -o pid,user,cmd
    PID USER     CMD
 382248 ubuntu   node /opt/mustafaerbay/node_modules/.bin/astro build --out-dir dist-new

ubuntu user, output to /opt/mustafaerbay/dist-new. That belongs to my update.sh deploy script. So I triggered it — I did a git push, the VPS deploy timer pulled, the build started.

How long has it been running? 18+ minutes. A normal Astro build is 4-5 minutes. Stuck. Under memory pressure every step slows down, and the actual work can’t finish.

No fix, hard reset

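I tried gh run cancel, but it never got through. My kill 382248 over ssh mostly didn't take either: a signal is only acted on once the target process actually gets scheduled, and under this much thrashing that barely happens. The OOM killer wasn't stepping in either. As far as I can tell, the kernel was still managing to reclaim a page here and there through swap, so it kept grinding along slowly instead of declaring out-of-memory and killing something.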

Only one option left: log into Hostinger and hit hard reset. 90 seconds later it’s back:

$ ssh vps 'uptime; free -h | head -2'
 05:38:07 up 0 min,  2 users,  load average: 2.74, 0.58, 0.19
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.4Gi       4.9Gi        83Mi       1.6Gi       6.1Gi

6.1 GB available. All containers auto-started thanks to restart: unless-stopped. The Postgres instances recovered through their WAL. Sites returning 200.

Now: how to keep this from happening again

I started working on it that first night. The single reflex of “make the build use less” isn’t enough — that’s defensive. I have to be preventive. I need a few layers.

1. Pre-flight resource guard (workflow)

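- name: Pre-flight resource check
  id: preflight
  run: |
    # Free disk space on / in whole GB
    AVAIL_GB=$(df -BG / | tail -1 | awk '{print $4}' | tr -d 'G')
    # 1-minute load average, truncated to an integer for the comparison below
    LOAD=$(awk '{print $1}' /proc/loadavg)
    LOAD_INT=${LOAD%.*}
    # Memory the kernel estimates can be used without swapping, in MB
    MEM_AVAIL_MB=$(awk '/MemAvailable/{print int($2/1024)}' /proc/meminfo)

    # Skip (without failing) when disk < 5 GB, load > 8, or memory < 1500 MB
    if [ "$AVAIL_GB" -lt 5 ] || [ "$LOAD_INT" -gt 8 ] || [ "$MEM_AVAIL_MB" -lt 1500 ]; then
      echo "skip=1" >> "$GITHUB_OUTPUT"
      echo "::warning::skipping - insufficient resources"
    fi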

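Before the real work starts, check whether the VPS can breathe. If it can't, the step writes skip=1 to its outputs and still exits successfully, so the remaining steps can check that output and do nothing. Graceful skip → success exit, no email spam.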
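The threshold logic in the step above is easier to test when it lives in a function that takes the three readings as arguments. A minimal sketch using the same limits (5 GB disk, load 8, 1500 MB memory); preflight_ok is a hypothetical name, not part of the actual workflow:

```shell
#!/usr/bin/env bash
# preflight_ok DISK_AVAIL_GB LOAD_1MIN MEM_AVAIL_MB
# Returns 0 when all three readings clear the thresholds, non-zero otherwise.
# preflight_ok is a hypothetical helper mirroring the workflow step's limits.
preflight_ok() {
  local disk_gb="$1" load="$2" mem_mb="$3"
  # Drop the fractional part so the shell's integer test can compare it.
  local load_int="${load%.*}"
  [ "$disk_gb" -ge 5 ] && [ "$load_int" -le 8 ] && [ "$mem_mb" -ge 1500 ]
}

# Example (not run here):
#   preflight_ok 40 52.51 76 || echo "skip=1" >> "$GITHUB_OUTPUT"
```

With the readings from the incident (load 52.51, 76 MB available), the check fails on both load and memory, so a cron firing in the middle of it would have skipped.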

2. URL polling instead of sleep 360

I used to have sleep 360 (6 minutes) in the workflow, just to wait for the deploy to finish. sleep barely uses any RAM, but it holds a runner slot for the full six minutes. And if an OOM hits during the build, the sleep step gets SIGKILL'd → workflow fail → email.

New version:

for i in $(seq 1 108); do
  if curl -fsS -o /dev/null --max-time 5 "$URL"; then
    echo "Deploy detected on attempt ${i}"
    exit 0
  fi
  sleep 5
done

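URL polling: move on as soon as the site goes live. With 108 attempts and a 5-second pause between them, the deploy gets at least 9 minutes (up to 18 if every request runs into the 5-second curl timeout). Short intervals instead of one long sleep hold up better when processes are getting OOM-killed.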
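The loop generalizes into a small retry helper. This is a sketch, not the workflow's actual code: wait_for is a hypothetical name, and unlike the loop above it returns non-zero when every attempt fails, so the caller can decide whether a timeout should fail the job.

```shell
#!/usr/bin/env bash
# wait_for ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# wait_for is a hypothetical helper name.
wait_for() {
  local attempts="$1" interval="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "Deploy detected on attempt ${i}"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "No deploy after ${attempts} attempts" >&2
  return 1
}

# Example (not run here):
#   wait_for 108 5 curl -fsS -o /dev/null --max-time 5 "$URL"
```

Passing the probe as a command keeps curl out of the helper, which makes it easy to exercise with a stub.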

3. AI quirk auto-fixer

The day before, three different AI quirks had failed cron jobs. I gathered them all into a single normalizer using a “fix instead of reject” strategy. I’d written about this earlier, but it’s relevant: fail-soft mentality everywhere.

4. Pipeline-health monitor

This might be the most important one. The file /var/lib/mustafaerbay/health-state holds the latest status (healthy / degraded). Every 4 hours a cron checks the most recent Bluesky post. If nothing has been posted for more than 4 hours, the state flips to degraded and the first DEGRADED email goes out. The check keeps running every 4 hours, and once posting resumes a RECOVERED email is sent. Emails fire only on a state change: even if 100 crons run in the same state, no email gets sent.

In this morning’s incident, 16 cron jobs failed back to back, and only 1 email came through. In the classic “send an email for every workflow run” world, 16 emails would have arrived.
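The state-change idea fits in a few lines. A minimal sketch under two assumptions of mine: a missing state file counts as healthy, and the email is stood in for by an echo. record_state is a hypothetical name, not the monitor's real code.

```shell
#!/usr/bin/env bash
# record_state STATE_FILE NEW_STATE
# Prints an alert line only when NEW_STATE differs from the stored state.
# record_state is a hypothetical helper; echo stands in for sending email.
record_state() {
  local file="$1" new="$2" old="healthy"   # assumption: no file means healthy
  [ -s "$file" ] && old="$(cat "$file")"
  if [ "$new" = "$old" ]; then
    return 0   # same state as the last run: stay silent
  fi
  printf '%s\n' "$new" > "$file"
  if [ "$new" = "degraded" ]; then
    echo "ALERT: DEGRADED"
  else
    echo "ALERT: RECOVERED"
  fi
}

# Example (not run here):
#   record_state /var/lib/mustafaerbay/health-state degraded
```

Sixteen consecutive failures call record_state with degraded sixteen times, but only the first call prints anything.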

A deeper lesson

If I hadn’t lived through this OOM, I wouldn’t have built the layered defense. You have to live it once — you need that concrete experience that says this can happen. Now I have pre-flight at the start of every cron, polling instead of build wait, an AI output normalizer, and state-change alerts.

None of these prevents the OOM on its own. They prevent it together. Classic Swiss cheese model:

[OOM happens] →
  layer 1: pre-flight skips (resources already seen as insufficient)
  → layer 2: polling-wait is OOM-resilient
  → layer 3: auto-fixer doesn't trip on AI quirks
  → layer 4: pipeline-health monitor only alerts on state-change
  → result: 1 cron skip per half hour, no email, healthy system

If there were just one defense, the failure would eventually find a hole in it. Line up four slices of cheese and the holes don't align → the ball doesn't get through.

Conclusion

I got caught off guard by this OOM. But when you think of the post-mortem as a chance to build a four-layer pipeline reliability system, you have to say it was a good thing. A bad day can produce value — if you write it up.

The next cron will run in 30 minutes. Pre-flight will be checking. If available memory is still under 1.5 GB, it'll skip. I'm no longer worried, because I've seen this system work: the pipeline keeps going on its own.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working across system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.
