One Night a Storage System Died and Changed How I Think About Software
One night a storage system died and I realized the problem was never the disks — it was assuming nothing would fail. On assumptions, trust, and safety.
We examine 3 common misconceptions in push notification delivery and the issues they cause in real-world systems. Improving reliability...
Why are retries in distributed systems inevitable? Practical approaches and life lessons learned from twenty years of experience.
From OOM scenarios on my own VPS to Docker disk fires, why system architecture is a discipline that requires constant vigilance…
We dig deep into the complex operational challenges, hidden dangers and potential dead ends of distributed lock mechanisms.
A deep look at the risks the eventual consistency model brings to distributed systems, and how to prevent critical data loss like missing orders.
Learn how you can unintentionally take your systems down while trying to save them, and how to avoid the Failover Paradox.
Misapplying or skipping the circuit breaker pattern in microservice architectures can cause serious crises in production environments. In this post…
Take a deep look at the 'Thundering Herd' problem that threatens performance and stability in distributed systems. Understand this destructive effect and…
Discover that SRE is not just about technology, but also about human health and team well-being. A roadmap for moving from pager fatigue to a proactive…
Treating configuration like a product: feature flags, parameter store, schema, approval flow, audit log, and rollback discipline.
In distributed systems, badly designed retries make outages worse. An approach to limiting damage with timeout budgets, retry budgets, and backpressure.
Threshold, signal and rollback discipline for Envoy outlier detection — shrinking the blast radius of broken nodes in distributed systems.
A signal set, failover testing playbook, and operational decision tree for tracking down silent packet loss in MLAG and LACP topologies.
A guide to building PostgreSQL PITR practice with production discipline: WAL archiving, recovery time targets and safe restoration steps.
Traffic steering discipline for multi-region services using GSLB, built around health signals, hold-down, and controlled failback.
A guide to building an operable service discovery layer with Consul through health-driven service registration and the DNS interface.
In big outages the largest risk isn't technical, it's coordination. How I drive MTTR down with the IC role, a steady comms cadence, and a practical runbook…
A practical edge design guide that addresses routing, health signals, capacity, and attack scenarios together to see Anycast's real benefits.
Designing, monitoring, and writing an incident runbook for the max-prefix guardrail that protects edge routers during route leaks and bad-prefix waves.
GRE tunnels, BGP signaling, capacity, and an operational runbook to keep the service up by diverting traffic to scrubbing during an attack.
A practical architecture and operations guide for handling long-lived HTTP/2 connections, idle timeouts, and retry storms without losing your SLO.
A practical runbook for steering traffic with localpref, community, prepend, and MED in multi-ISP and multi-POP environments — measurable and reversible.
An expand/contract approach for schema changes without downtime, plus backfill strategy, dual-write risks, and a rollback plan.
When are sticky sessions essential and when are they technical debt for WebSocket, long TCP sessions and stateful applications? A decision matrix grounded…
A walkthrough of kdump installation, validation, and a sustainable production dump retention flow so you can capture a vmcore and triage quickly when the kernel panics.
Reduce 'stuck but not dead' failures with systemd WatchdogSec + notify: unit configuration, restart policy, and alarm integration.
Beyond installing Ceph: an architectural approach to failure domain, capacity, and recovery behavior so the cluster can actually heal during a fault.
When pool members appear 'UP' but traffic vanishes: combining active checks with passive signals to design failover that actually reflects reality.
A guide to taming the stampede (thundering herd) risk that can crush a backend after TTL expiry or a cache flush — using jitter, singleflight, and stale…
A field runbook for rapidly triaging hung deploys caused by Validating/Mutating webhook latency and applying a risk-controlled mitigation.
A runbook for quickly diagnosing etcd quorum issues during API 5xx/timeout storms and walking through safe recovery steps via snapshot restore.
A postmortem isn't enough: an operational framework for a focused 7-day sprint that closes alert, runbook, risk, and communication debt.
Hypotheses, blast radius and automatic rollback guardrails so resilience tests don't turn into blind risks in production.
Producing controlled loss instead of a random collapse when a system is under pressure: rate limits, queues, feature flags and prioritization.
A practical API Priority and Fairness (APF) setup that prioritizes critical traffic and fairly queues noisy callers, lowering the risk of API server overload.
Bringing reliable processing guarantees to message-based architectures with outbox, dedup keys, DLQ, and a replay runbook.
A retry corridor that prevents repeated calls from producing data inconsistencies and improves resilience in ERP integrations.