#resilience

Life May 19, 2026

Retries in Distributed Systems: My Observations

Why are retries in distributed systems inevitable? Practical approaches and life lessons learned from twenty years of experience.

#life #distributed systems #resilience

9 min

Life Apr 30, 2026

The Unexpected Chaos Engineering Test of Distributed Systems in…

Discover how distributed systems handle unexpected failures and how chaos engineering principles save lives in real-world scenarios.

#life #chaos-engineering #distributed-systems

10 min

Technology Apr 27, 2026

Broadcast Storms in Virtual Networks: The Hidden Killer of…

Examine the causes and impact of broadcast storms that can erupt inside virtual networks of microservice architectures, and learn how to prevent this…

#broadcast storm #microservices #virtual networks

11 min

Technology Apr 16, 2026

Kernel Live Patching and a Maintenance Model on Enterprise Linux

Managing kernel security patches without the pressure to reboot: a live-patch approach, its risks, a ring strategy, and operational discipline.

#linux #security #operations

8 min

Technology Apr 13, 2026

Reducing Outage Impact in Planned Maintenance with BGP Graceful…

Graceful restart logic, risks, verification steps, and a rollback standard for performing BGP maintenance without "dropping routes".

#bgp #network #operations

6 min

Technology Apr 9, 2026

An Active-Active Integration Corridor for ERP Infrastructures

An architectural approach focused on resilience and consistency that runs the integration layer active-active without straining the ERP core.

#erp #integration #architecture