One Night a Storage System Died and Changed How I Think About Software
One night a storage system died and I realized the problem was never the disks — it was assuming nothing would fail. On assumptions, trust, and safety.
9 posts found.
One night a storage system died and I realized the problem was never the disks — it was assuming nothing would fail. On assumptions, trust, and safety.
How I rode out the OOM (Out of Memory) crisis while running 13 containers on a 1 GB RAM VPS, how kcompactd0 captured the CPU, and the fixes I shipped...
My blog automation collided with another project's build. RAM ran out, sshd reset. Hard reboot + flock for a global build mutex.
RAM ran out on my VPS, swap filled up, sshd dropped the connection. When the Astro build triggered an OOM, I decided to put together a layered pipeline defense.
My disk-cleanup.timer wiped the runner's _work/_temp directories. For 16 hours every cron exploded with 'Missing file: set_output_*'. A confession of…
Explore the causes and consequences of cross-team tension during a critical incident, and the steps needed to manage it. Effective leadership…
Learning from mistakes is a hard road. Look at the personal price tag behind post-mortem culture, the shift from blame to learning, and the individual…
My AI content pipeline blew up with three different format quirks: a slashed tag, a quoted date, a dotted-i character. Solved with a single normalizer.
A post-mortem after a major outage isn't just a technical review. Understanding and managing the psychological, invisible burden engineers carry through it…