One Night a Storage System Died and Changed How I Think About Software
One night a storage system died and I realized the problem was never the disks; it was the assumption that nothing would fail. On assumptions, trust, and safety.
14 posts found.
My blog automation collided with another project's build. RAM ran out, sshd reset. Hard reboot + flock for a global build mutex.
RAM ran out on my VPS, swap filled up, sshd dropped the connection. When the Astro build triggered an OOM, I decided to put together a layered pipeline defense.
My disk-cleanup.timer wiped the runner's _work/_temp directories. For 16 hours every cron exploded with 'Missing file: set_output_*'. A confession of…
Disk hit 100% on my VPS and my blog couldn't publish for 5 hours. Docker build cache 33 GB, unused images 23 GB. Pruning + a systemd timer is the permanent fix.
How a decision log, a steady handover rhythm, and a clean handoff flow keep context from getting lost when teams swap during long-running outages.
Collecting core dumps in production: limits, retention, encryption, access and a practical runbook for safe analysis during an incident.
In big outages the largest risk isn't technical, it's coordination. How I drive MTTR down with the IC role, a steady comms cadence, and a practical runbook…
A walkthrough of kdump installation, validation, and a sustainable production dump retention flow so you can capture a vmcore and triage quickly when the kernel panics.
Practical tcpdump techniques for collecting minimal-yet-sufficient packet evidence during incidents: filters, snaplen, ring buffer, privacy, and handover…
Living through the failure in your head before going to production: pre-mortem cadence, a template, decision points, and operational leadership in practice.
A postmortem isn't enough: an operational framework for a focused 7-day sprint that closes alert, runbook, risk, and communication debt.
An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.
A minimum template, thresholds, and practical examples for turning the runbook from a documentation pile into a tool that produces decisions during an incident.