#reliability

Career ✍️ Hand-written Jun 16, 2026

One Night a Storage System Died and Changed How I Think About Software

One night a storage system died and I realized the problem was never the disks — it was assuming nothing would fail. On assumptions, trust, and safety.

#incident #reliability #post-mortem

5 min

Technology Jun 2, 2026

Push Notification Reliability: 3 Core Misconceptions

We examine 3 common misconceptions in push notification delivery and the issues they cause in real-world systems. Improving reliability...

#technology #push notification #reliability

11 min

Life May 19, 2026

Retries in Distributed Systems: My Observations

Why are retries in distributed systems inevitable? Practical approaches and life lessons learned from twenty years of experience.

#life #distributed systems #resilience

9 min

Career ✍️ Hand-written May 9, 2026

System Architecture Is a Bit About Paranoia

From OOM scenarios on my own VPS to Docker disk fires, why system architecture is a discipline that requires constant vigilance…

#architecture #paranoia #operations

8 min

Technology May 6, 2026

The Silent Dead End of Distributed Lock Mechanisms: An Operational War

We dig deep into the complex operational challenges, hidden dangers and potential dead ends of distributed lock mechanisms.

#technology #distributed systems #concurrency

8 min

Technology May 4, 2026

The Eventual Consistency Trap: The Mystery of the Lost Orders

A deep look at the risks the eventual consistency model brings to distributed systems, and how to prevent critical data loss like missing orders.

#technology #distributed systems #consistency

10 min

Career May 1, 2026

The Failover Paradox: Bringing Down a System While Trying to Save It

Learn how you can unintentionally take your systems down while trying to save them, and how to avoid the Failover Paradox.

#career #systems #reliability

1 min

Tutorials Apr 30, 2026

Circuit Breaker Crisis in Production: The Fragility of Microservices

Misapplying or skipping the circuit breaker pattern in microservice architectures can cause serious crises in production environments. In this post…

#tutorials #microservices #circuit breaker

10 min

Technology Apr 25, 2026

The 'Thundering Herd' Problem in Distributed Systems: Anatomy of a…

Take a deep look at the 'Thundering Herd' problem that threatens performance and stability in distributed systems. Understand this destructive effect and…

#distributed-systems #thundering-herd #system-design

9 min

Career Apr 23, 2026

The Human Side of SRE: From Pager Fatigue to Proactive Trust

Discover that SRE is not just about technology, but also about human health and team well-being. A roadmap for moving from pager fatigue to a proactive…

#career #SRE #pager fatigue

12 min

Technology Apr 23, 2026

Feature Flags and Configuration Governance: Parameter Store and Audit

Treating configuration like a product: feature flags, parameter store, schema, approval flow, audit log, and rollback discipline.

#architecture #security #operations

10 min

Technology Apr 22, 2026

Retry Storms: Timeout Budget and Latency Amplification

In distributed systems, badly designed retries make outages worse. An approach to limiting damage with timeout budgets, retry budgets, and backpressure.

#architecture #reliability #performance

9 min

Technology Apr 21, 2026

Isolating Bad Nodes with Envoy Outlier Detection

Threshold, signal and rollback discipline for Envoy outlier detection — shrinking the blast radius of broken nodes in distributed systems.

#envoy #service-mesh #reliability

10 min

Technology Apr 20, 2026

Hunting Silent Packet Loss During MLAG Failover

A signal set, failover testing playbook, and operational decision tree for tracking down silent packet loss in MLAG and LACP topologies.

#network #mlag #lacp

10 min

Tutorials Apr 19, 2026

PostgreSQL WAL Archiving and a Point-in-Time Recovery Drill

A guide to building PostgreSQL PITR practice with production discipline: WAL archiving, recovery time targets and safe restoration steps.

#postgresql #backup #disaster-recovery

11 min

Technology Apr 18, 2026

Multi-Region Traffic Steering and Failover Discipline with GSLB

Traffic steering discipline for multi-region services using GSLB, built around health signals, hold-down, and controlled failback.

#dns #gslb #availability

12 min

Tutorials Apr 18, 2026

Service Discovery with Consul: Health Checks and the DNS Interface

A guide to building an operable service discovery layer with Consul through health-driven service registration and the DNS interface.

#service-discovery #dns #consul

13 min

Career Apr 17, 2026

Major Incident Management: Incident Commander and Runbook Practices

In big outages the largest risk isn't technical, it's coordination. How I drive MTTR down with the IC role, a steady comms cadence, and a practical runbook…

#operations #incident #on-call

12 min

Technology Apr 17, 2026

Edge Service Design with BGP Anycast: DNS and DDoS Resilience

A practical edge design guide that treats routing, health signals, capacity, and attack scenarios together, so that Anycast's real benefits actually materialize.

#network #bgp #anycast

12 min

Technology Apr 17, 2026

Preventing Edge Outages with BGP Max-Prefix Limits

Designing, monitoring, and writing an incident runbook for the max-prefix guardrail that protects edge routers during route leaks and bad-prefix waves.

#bgp #network #reliability

10 min

Technology Apr 17, 2026

DDoS Scrubbing Center Design: GRE, BGP, and Failover

GRE tunnels, BGP signaling, capacity, and an operational runbook to keep the service up by diverting traffic to scrubbing during an attack.

#security #ddos #network

12 min

Technology Apr 17, 2026

Load Balancer, Keepalive, and Retry Budgets for gRPC/HTTP2 Traffic

A practical architecture and operations guide for handling long-lived HTTP/2 connections, idle timeouts, and retry storms without losing your SLO.

#grpc #http2 #load-balancing

12 min

Technology Apr 17, 2026

BGP Traffic Engineering Runbook for the Enterprise Edge

A practical runbook for steering traffic with localpref, community, prepend, and MED in multi-ISP and multi-POP environments — measurable and reversible.

#network #bgp #edge

12 min

Technology Apr 17, 2026

Online Schema Migration: Expand/Contract, Backfill, and Dual Write

An expand/contract approach for schema changes without downtime, plus backfill strategy, dual-write risks, and a rollback plan.

#database #schema-migration #reliability

13 min

Technology Apr 17, 2026

Sticky Sessions and Load Balancer Decisions for Stateful Traffic

When are sticky sessions essential and when are they technical debt for WebSocket, long TCP sessions and stateful applications? A decision matrix grounded…

#architecture #load-balancing #reliability

11 min

Tutorials Apr 17, 2026

Linux kdump: Kernel Panic Crash Dump and Triage Runbook

Walks through kdump installation, validation and a sustainable production dump retention flow so you can capture vmcore and triage quickly when a kernel panics.

#linux #kdump #operations

13 min

Tutorials ✍️ Hand-written Apr 17, 2026

Self-Healing Services with systemd Watchdog

Reduce 'stuck but not dead' failures with systemd WatchdogSec + notify: unit configuration, restart policy, and alarm integration.

#linux #systemd #reliability

8 min

Technology Apr 16, 2026

Object Storage with Ceph: Failure Domain and Recovery Design

Beyond installing Ceph: an architectural approach to failure domain, capacity, and recovery behavior so the cluster can actually heal during a fault.

#storage #ceph #infrastructure

12 min

Technology Apr 16, 2026

Health Check Blindness in L4 Pools: Failover and Blackholes

When pool members appear 'UP' but traffic vanishes: combining active checks with passive signals to design failover that reflects reality.

#network #load-balancing #reliability

11 min

Technology Apr 15, 2026

Cache Stampede (Thundering Herd) and Operational Defenses

A guide to taming the stampede (thundering herd) risk that can crush a backend after TTL expiry or a cache flush — using jitter, singleflight, and stale…

#architecture #performance #cache

12 min