#incident

Career ✍️ Hand-written Jun 16, 2026

One Night a Storage System Died and Changed How I Think About Software

One night a storage system died and I realized the problem was never the disks — it was assuming nothing would fail. On assumptions, trust, and safety.

#incident #reliability #post-mortem

5 min

Technology ✍️ Hand-written May 7, 2026

3rd OOM on the VPS: Parallel Builds and a flock Mutex Story

My blog automation collided with another project's build. RAM ran out and sshd reset the connection. The fix: a hard reboot, then flock as a global build mutex.

#vps #oom #incident

9 min

Technology ✍️ Hand-written May 4, 2026

First OOM: kcompactd at 92% CPU, sshd Reset, Hard Reboot

RAM ran out on my VPS, swap filled up, sshd dropped the connection. When the Astro build triggered an OOM, I decided to put together a layered pipeline defense.

#oom #swap #incident

9 min

Career ✍️ Hand-written May 3, 2026

My Cleanup Script Killed the GitHub Runner: A Self-Inflicted Incident

My disk-cleanup.timer wiped the runner's _work/_temp directories. For 16 hours every cron exploded with 'Missing file: set_output_*'. A confession of…

#incident #github actions #cleanup

7 min

Technology ✍️ Hand-written Apr 28, 2026

Docker Ate 56 GB of Disk in a Day: Building a Cleanup Automation

Disk hit 100% on my VPS and my blog couldn't publish for 5 hours. The Docker build cache took 33 GB and unused images another 23 GB. Pruning plus a systemd timer is the permanent fix.

#docker #disk #incident

9 min

Career Apr 23, 2026

The Decision Log and Handoff Discipline During Incident Rotation

How a decision log, a steady handover rhythm, and a clean handoff flow keep context from getting lost when teams swap during long-running outages.

#incident #leadership #operations

9 min

Tutorials Apr 20, 2026

Core Dump Management and Privacy Runbook with systemd-coredump

Collecting core dumps in production: limits, retention, encryption, access and a practical runbook for safe analysis during an incident.

#linux #systemd #debugging

10 min

Career Apr 17, 2026

Major Incident Management: Incident Commander and Runbook Practices

In big outages the largest risk isn't technical, it's coordination. How I drive MTTR down with the IC role, a steady comms cadence, and a practical runbook…

#operations #incident #on-call

12 min

Tutorials Apr 17, 2026

Linux kdump: Kernel Panic Crash Dump and Triage Runbook

A walkthrough of kdump installation, validation, and a sustainable production dump retention flow, so you can capture a vmcore and triage quickly when the kernel panics.

#linux #kdump #operations

13 min

Tutorials Apr 17, 2026

Packet Capture in Production with tcpdump: A Runbook

Practical tcpdump techniques for collecting minimal-yet-sufficient packet evidence during incidents: filters, snaplen, ring buffer, privacy, and handover…

#linux #network #tcpdump

9 min

Career Apr 16, 2026

Mapping Risk with Pre-mortems Before a Change

Living through the failure in your head before going to production: pre-mortem cadence, a template, decision points, and operational leadership in practice.

#leadership #operations #change-management

7 min

Career Apr 14, 2026

Stabilization Sprint After Major Incidents (7 Days)

A postmortem isn't enough: an operational framework for a focused 7-day sprint that closes alert, runbook, risk, and communication debt.

#leadership #operations #incident

10 min

Career Apr 13, 2026

Evidence Collection Kit and Roles During an Incident

An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.

#operations #security #incident

6 min

Career Apr 13, 2026

Minimum Viable Runbook Template and Incident Decision Points

A minimum template, thresholds, and practical examples for turning the runbook from a documentation pile into a tool that produces decisions during an incident.

#operations #incident #leadership

6 min
