#runbook

Technology Apr 18, 2026

Multi-Region Traffic Steering and Failover Discipline with GSLB

Traffic steering discipline for multi-region services using GSLB, built around health signals, hold-down, and controlled failback.

#dns #gslb #availability

12 min

Career Apr 17, 2026

Post-Change Verification Cadence: Smoke, SLO, and Rollback

Assuming the release is done is how you summon an incident. A practical framework for turning post-change verification into a cadence: fast smoke checks…

#leadership #operations #release

8 min

Technology Apr 17, 2026

BGP Traffic Engineering Runbook for the Enterprise Edge

A practical runbook for steering traffic measurably and reversibly with localpref, community, prepend, and MED in multi-ISP and multi-POP environments.

#network #bgp #edge

12 min

Tutorials Apr 17, 2026

Kubernetes Control Plane Certificate Expiry: A Runbook

Certificate renewal and safe recovery steps for kubeadm-based clusters when API server access suddenly breaks with x509 errors.

#kubernetes #security #operations

13 min

Tutorials Apr 17, 2026

Linux SoftIRQ Saturation and IRQ Affinity Runbook

Quick triage, measurement and safe tuning steps (ring, queue, IRQ, RPS) under packet drops, high softirq load and ksoftirqd pressure.

#linux #network #performance

14 min

Tutorials Apr 17, 2026

PostgreSQL HA: Failover Runbook with Patroni

Walks through quorum, replication lag, switchover/failover testing and recovery steps when running PostgreSQL high availability with Patroni, in runbook form.

#database #postgresql #patroni

13 min

Tutorials Apr 17, 2026

Zero-Downtime Restart with systemd Socket Activation

A runbook for shrinking deploy impact by separating connection acceptance into a socket unit, so the listening port never drops during service restarts.

#linux #systemd #operations

10 min

Tutorials Apr 17, 2026

vSphere/ESXi Host Patch: Maintenance Wave and Rollback Runbook

Manage the ESXi host patch process with ring-based maintenance waves, control capacity/HA risk, and establish safe remediation and rollback discipline.

#infrastructure #vmware #vsphere

13 min

Career Apr 16, 2026

Operational Readiness Review (ORR) Before Go-Live

Turning go-live from 'ship and pray' into a decision with clear risks, clear ownership, and a rollback reflex: a practical ORR gate and checklist.

#operations #leadership #risk

9 min

Tutorials Apr 16, 2026

Kubernetes RBAC: Least Privilege + Break-Glass Model

A practical RBAC framework for role design, identity integration, and time-boxed emergency access (break-glass) without depending on cluster-admin.

#kubernetes #rbac #security

12 min

Tutorials Apr 16, 2026

A Maintenance-Wave Runbook for Firmware Upgrades on Enterprise…

A runbook that turns firmware upgrade work into a repeatable maintenance rhythm with inventory, ring/wave approach, validation metrics, and a rollback…

#network #infrastructure #maintenance

11 min

Tutorials Apr 15, 2026

Kubernetes Admission Webhook Timeouts: A Runbook for Frozen Deploys

Field runbook to rapidly triage hung deploys caused by Validating/Mutating webhook latency and apply a risk-controlled mitigation.

#kubernetes #admission #operations

12 min

Tutorials Apr 15, 2026

Kubernetes ETCD Quorum Loss: Triage and Recovery Runbook

A runbook for quickly diagnosing etcd quorum loss during API 5xx/timeout storms and walking through safe recovery steps via snapshot restore.

#kubernetes #etcd #operations

9 min

Career Apr 13, 2026

Evidence Collection Kit and Roles During an Incident

An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.

#operations #security #incident

6 min

Career Apr 13, 2026

Minimum Viable Runbook Template and Incident Decision Points

A minimum template, thresholds, and practical examples for turning the runbook from a documentation pile into a tool that produces decisions during an incident.

#operations #incident #leadership

6 min

Career Apr 13, 2026

On-Call Rotation and Escalation Design: Operational Calm

Realistic on-call, escalation, and runbook design that reduces pager fatigue, speeds up decision-making, and clarifies incident communication.

#on-call #incident-management #operations

3 min

Technology Apr 13, 2026

DDoS Response Runbook with BGP RTBH and FlowSpec

A controlled approach to reducing DDoS impact during operations using an RTBH/FlowSpec decision tree, verification steps, and a rollback plan.

#bgp #ddos #network

4 min

Tutorials Apr 13, 2026

Operational Runbook for JWKS Key Rotation

A runbook to triage the 401 wave (kid mismatch/JWKS cache) that occurs during JWT key rotation, and to set up a safe overlap and caching strategy.

#security #identity #jwt

7 min

Tutorials Apr 13, 2026

Linux Conntrack Capacity Planning and Alerting Runbook

A practical guide for generating signals before the nf_conntrack table fills up, applying safe sysctl tuning, and recovering in a controlled way during an…

#linux #network #conntrack

8 min