Multi-Region Traffic Steering and Failover Discipline with GSLB
Traffic steering discipline for multi-region services using GSLB, built around health signals, hold-down, and controlled failback.
23 posts found.
Traffic steering discipline for multi-region services using GSLB, built around health signals, hold-down, and controlled failback.
Assuming the release is done is how you summon an incident. A practical framework for turning post-change verification into a cadence: fast smoke checks…
A practical runbook for steering traffic with localpref, community, prepend, and MED in multi-ISP and multi-POP environments — measurable and reversible.
When API Server access suddenly breaks with x509 errors; certificate renewal and safe recovery steps for kubeadm-based clusters.
Quick triage, measurement and safe tuning steps (ring, queue, IRQ, RPS) under packet drops, high softirq load and ksoftirqd pressure.
Walks through quorum, replication lag, switchover/failover testing and recovery steps when running PostgreSQL high availability with Patroni, in runbook form.
A runbook for shrinking deploy impact by separating connection acceptance into a socket unit, so the listening port never drops during service restarts.
Manage the ESXi host patch process with ring-based maintenance waves, control capacity/HA risk, and establish safe remediation and rollback discipline.
Turning go-live from 'ship and pray' into something with clear risk, ownership, and rollback reflex: a practical ORR gate and checklist.
A practical RBAC framework for role design, identity integration, and time-boxed emergency access (break-glass) without depending on cluster-admin.
A runbook that turns firmware upgrade work into a repeatable maintenance rhythm with inventory, ring/wave approach, validation metrics, and a rollback…
Field runbook to rapidly triage hung deploys caused by Validating/Mutating webhook latency and apply a risk-controlled mitigation.
A runbook for quickly diagnosing ETCD quorum during API 5xx/timeout storms and walking through safe recovery steps via snapshot restore.
An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.
A minimum template, thresholds, and practical examples for turning the runbook from a documentation pile into a tool that produces decisions during an incident.
Realistic on-call, escalation, and runbook design that reduces pager fatigue, speeds up decision-making, and clarifies incident communication.
A controlled approach to reducing DDoS impact during operations using an RTBH/FlowSpec decision tree, verification steps, and a rollback plan.
A runbook to triage the 401 wave (kid mismatch/JWKS cache) that occurs during JWT key rotation, and to set up safe overlap/caching strategy.
A practical guide for generating signals before the nf_conntrack table fills up, applying safe sysctl tuning, and recovering in a controlled way during an…
A runbook to triage the connect timeout crisis when the SYN backlog/accept queue fills up, apply rapid mitigation, and design lasting resilience.
A field-ready runbook for operationally managing quorum, failover, and split-brain risk in a Redis Sentinel-based HA setup.
PSI, systemd-oomd policy, testing, and recovery steps to catch a node OOM crisis early and evict workloads in a controlled way.
A technical leadership approach to runbook debt management that moves operational memory off individuals and onto the system.