Mustafa Erbay
Technology · Written by human · Production Diaries · 9 min read

Docker Ate 56 GB of Disk in a Day: Building a Cleanup Automation

Disk hit 100% on my VPS and my blog couldn't publish for 5 hours. Docker build cache 33 GB, unused images 23 GB. Pruning + a systemd timer is the permanent fix.

“no posts for hours” — the message I got

I noticed it in the evening — my hourly content-generate cron hadn’t completed a single successful run since morning. The pipeline-health monitor hadn’t fired its state-change email yet (the 4-hour threshold hadn’t been hit), but the GitHub Actions panel was bright red.

The last successful run finished at 2026-05-04 12:11 UTC. More than 5 hours had passed. Zero new content. The single most common reason this blog goes down is resource starvation — disk or RAM. I quickly figured out which one.

A line that jumped out at me from the run log:

##[error] System.IO.IOException: No space left on device
  : '/home/github-runner/runner-mustafaerbay/_diag/pages/...log'

The runner couldn’t write its own log file — no space on disk. At that point it hadn’t even reached the validate step; the runner’s own _diag layer was dead. Each cron tick retried and blew up at the exact same place.

SSH into the VPS and see

$ ssh vps 'df -h /'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        72G   72G   11M 100% /

72 GB disk with 11 MB free. You can’t even write a single log line.

The second query was more interesting:

$ ssh vps 'sudo du -hx --max-depth=2 / 2>/dev/null | sort -hr | head -10'
72G  /
54G  /var
39G  /var/lib
15G  /var/www
7.2G /home
5.3G /home/github-runner
4.0G /usr
2.9G /opt
2.6G /usr/lib
2.3G /opt/mustafaerbay

/var/lib was 39 GB. /var/www was 15 GB. This isn’t a personal blog VPS — it has 6 different projects on it. My eye went straight to /var/lib because that’s where Docker lives.

$ ssh vps 'sudo du -hx --max-depth=1 /var/lib | sort -hr | head'
39G  /var/lib
38G  /var/lib/docker  <- HERE
169M /var/lib/dkms
164M /var/lib/Acronis
140M /var/lib/apt
6.3M /var/lib/mustafaerbay

Docker on its own: 38 GB.

Crack open Docker’s internals

$ ssh vps 'sudo docker system df'
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          33        9         27.5GB    23.27GB (84%)
Containers      13        13        1.192MB   0B (0%)
Local Volumes   8         8         387MB     0B (0%)
Build Cache     388       0         7.695GB   7.695GB

This table answers the question directly:

  • 33 images exist, only 9 are active. 24 are “not in use but not deleted.” 23.27 GB reclaimable.
  • 388 build cache layers, 0 active. The whole 7.7 GB is up for deletion.
  • Containers and volumes are normal — I don’t want to wipe those (postgres data, etc., lives there).

Total reclaimable: ~31 GB. Just freeing that would open up enough space.
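
The same total can be computed instead of read off by eye. Here is a minimal sketch that sums the RECLAIMABLE column. It assumes the input comes from docker system df --format '{{.Type}}\t{{.Reclaimable}}' (tab-separated) and that sizes use Docker's decimal units (B, kB, MB, GB, TB). The function name sum_reclaimable_gb is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the RECLAIMABLE column of `docker system df`, in GB.
# Expects tab-separated "Type<TAB>Reclaimable" lines on stdin, e.g.
#   docker system df --format '{{.Type}}\t{{.Reclaimable}}' | sum_reclaimable_gb
sum_reclaimable_gb() {
  awk -F'\t' '
    {
      split($2, parts, " ")       # "23.27GB (84%)" -> parts[1] = "23.27GB"
      value = parts[1]; unit = parts[1]
      gsub(/[^0-9.]/, "", value)  # keep the number: "23.27"
      gsub(/[0-9.]/, "", unit)    # keep the unit:   "GB"
      if      (unit == "TB") total += value * 1000
      else if (unit == "GB") total += value
      else if (unit == "MB") total += value / 1000
      else if (unit == "kB") total += value / 1000000
      else                   total += value / 1000000000   # plain bytes
    }
    END { printf "%.1f\n", total }
  '
}
```

Fed the four rows from the table above, this prints 31.0, which matches the ~31 GB estimate.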

See the running containers first, then cut

Don’t rush. docker system prune -a removes every stopped container, unused network, unused image, and all build cache in one go, so you have to know the line between images that are in use and images that are not. I checked the docker ps output: half a dozen different projects had containers on this VPS, a few of my own side products and some client work. 13 healthy containers in total: postgres, redis, Next.js apps, an Astro SSR service, and a few workers. The only things that can be reclaimed are unanchored images: older image versions not referenced by any container.
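
One way to make that line concrete is to compare the image IDs that containers reference against every image ID on the host. This is a minimal sketch, assuming both lists have been saved to files with one ID per line. The name unused_image_ids and the file names are hypothetical, and ID formats may differ depending on the image store in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print image IDs that no container references.
#   $1 = file listing image IDs used by containers (one per line)
#   $2 = file listing every image ID on the host   (one per line)
unused_image_ids() {
  # -v invert, -x whole-line match, -F fixed strings, -f patterns from file
  sort -u "$2" | grep -vxFf "$1" || true   # grep exits 1 when nothing is left
}

# Assumed way to produce the two lists on the Docker host (not run here):
#   docker ps -aq | xargs -r docker inspect --format '{{.Image}}' > in_use.txt
#   docker images -q --no-trunc > all.txt
#   unused_image_ids in_use.txt all.txt
```

Anything this prints is a candidate for pruning. Anything it does not print is still anchored to a container.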

I lined up two safe commands:

# 1. Build cache (old layers, nothing uses them)
sudo docker builder prune -af

# 2. Unused images (the ones not anchored to a running container)
sudo docker image prune -af

The -a flag removes every image that no container uses, not just dangling (untagged) ones. Risky? I don’t think so: Docker will not remove an image that any container, running or stopped, still references. Only the “once built, used, then a newer version came along” old images go. The trade-off is that rolling back to one of those versions means rebuilding or re-pulling it.

The result:

=== Docker build cache ===
Total reclaimed space: 33.48GB

=== Docker unused images ===
Total reclaimed space: 22.62GB

=== After ===
/dev/sda1   72G   40G   33G  56% /

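Docker reported about 56 GB reclaimed (33.48 GB + 22.62 GB). Disk usage went from 100% to 56%, with 33 GB free, and all 13 containers kept running. The net change on disk was closer to 32 GB (72G used down to 40G), so the two reported totals most likely count some shared layers more than once.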

Now: let’s automate this

This wasn’t even the third time it happened — let it be a lesson:

“The first two times, fix it by hand. The third time, automate it.”

The disk-cleanup.sh script I wrote is simple but careful. A few principles:

#!/usr/bin/env bash
set -euo pipefail

echo "=== disk-cleanup starting ==="
echo "before: $(df -h / | tail -1)"

# 1) Docker build cache > 72h (newer cache survives)
echo "-- docker builder prune (>72h)"
docker builder prune -af --filter "until=72h"

# 2) Dangling docker images (no -a — tagged-but-unused IS PRESERVED)
echo "-- docker image prune (dangling only)"
docker image prune -f

# 3) journal > 7d
journalctl --vacuum-time=7d

# 4) APT cache (regenerable)
apt-get clean

# 5) mustafaerbay dist-old (deploy backup, regenerated each deploy)
[ -d /opt/mustafaerbay/dist-old ] && rm -rf /opt/mustafaerbay/dist-old

# 6) GitHub runner _diag log files > 14d (files only, LEAVE the directories alone)
find /home/github-runner -path '*/_diag/*' -type f -name '*.log' -mtime +14 -delete

echo "after:  $(df -h / | tail -1)"

Hooked it up to a daily timer running at 03:30 UTC:

[Timer]
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=10m
Persistent=true

RandomizedDelaySec=10m — so it doesn’t collide with any other 03:30 cron jobs that might be on the system. Persistent=true — if the VPS rebooted, the run that got skipped will still happen.
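
The [Timer] section above is only part of a timer unit. Here is a minimal sketch of the full pair of unit files. It assumes the script is installed at /usr/local/sbin/disk-cleanup.sh (the post does not say where it lives) and that the service is a plain oneshot. write_cleanup_units is a hypothetical helper name, and the Description lines are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a oneshot service plus its timer into directory $1.
# The default script path is an assumption; pass the real one as $2.
write_cleanup_units() {
  local dir="$1"
  local script="${2:-/usr/local/sbin/disk-cleanup.sh}"

  cat > "$dir/disk-cleanup.service" <<EOF
[Unit]
Description=Prune Docker cache, old logs and package caches

[Service]
Type=oneshot
ExecStart=$script
EOF

  cat > "$dir/disk-cleanup.timer" <<EOF
[Unit]
Description=Daily disk cleanup

[Timer]
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=10m
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Assumed installation steps (need root, not run here):
#   write_cleanup_units /etc/systemd/system
#   systemctl daemon-reload
#   systemctl enable --now disk-cleanup.timer
#   systemctl list-timers disk-cleanup.timer
```

Because the timer and the service share the name disk-cleanup, the timer activates the service without an explicit Unit= line.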

Conclusion: the one-line why

My disk filled up because nothing was cleaning up after Docker. Every docker-compose build creates a new image; the old one is no longer referenced, but it isn’t deleted either. A few months in, a year in, your disk explodes and you go “wow, did the AI grow this much?”

Nobody’s growing it, actually. Docker is a hoarder. If you’re not active about it, the disk fire teaches you that.

A two-hour manual recovery + a one-hour systemd timer setup = the guarantee I won’t go through this again. That’s the real lesson: turn each incident into the safeguard against the next one.

Tomorrow disk-cleanup.timer will run for the first time. I’m watching.
