Mustafa Erbay
Technology · 10 min read

Hunting Silent Packet Loss During MLAG Failover

A signal set, failover testing playbook, and operational decision tree for tracking down silent packet loss in MLAG and LACP topologies.


MLAG designs often produce a comforting feeling of “high availability, sorted.” In practice, though, one of the most exhausting cases I run into in the field is silent packet loss that happens without any link going down. Ports look up, LACP looks fine, and the CPU looks calm, yet application latency climbs and certain flows fall apart.

You can’t solve this class of problem just by reading configuration. You have to answer three questions together: which signal tells you what during a failover, where exactly is the data being lost, and which action is actually safe to take.

1) Why is silent packet loss so dangerous?

Because the classic alarm set usually stays quiet:

  • Interfaces are up
  • BGP/OSPF adjacencies are still established
  • The LACP bundle is still formed

But real user impact is happening:

  • Long-lived TCP sessions get reset
  • Retransmits go up
  • Only certain racks or groups of nodes are affected

That’s why the “if it’s not down, it can’t be the network” attitude is so costly when you’re dealing with MLAG issues.

2) Where does it actually break?

The causes I see most often:

  • Asymmetric load or buffer pressure on the peer-link
  • Hash behavior that funnels specific flows to the same problematic member
  • Lag in STP / ARP / MAC state synchronization
  • Vendor bugs or half-state conditions after a software upgrade

Even if these issues only last a few seconds during failover, they can trigger queueing and a retry storm at the upper layers.
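
To make the hash point concrete, here is a minimal Python sketch of how a deterministic hash pins each flow to one bundle member. It uses CRC32 over the 5-tuple purely as a stand-in: real switches use vendor-specific hash inputs and algorithms. The function names `pick_member` and `member_distribution`, the member names, and the addresses are all hypothetical.

```python
import zlib
from collections import Counter


def pick_member(flow, members):
    """Map a flow 5-tuple to one bundle member with a deterministic hash.

    CRC32 is an illustrative stand-in, not any vendor's real algorithm.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


def member_distribution(flows, members):
    """Count how many flows land on each member, including idle members."""
    counts = Counter({member: 0 for member in members})
    for flow in flows:
        counts[pick_member(flow, members)] += 1
    return counts


if __name__ == "__main__":
    members = ["eth1", "eth2"]  # hypothetical member ports
    # (src_ip, dst_ip, protocol, src_port, dst_port)
    flows = [("10.0.0.1", f"10.0.1.{i}", 6, 40000 + i, 443) for i in range(8)]
    for flow in flows:
        print(flow, "->", pick_member(flow, members))
    print(member_distribution(flows, members))
```

Because the mapping is deterministic, a flow that hashes onto a degraded member stays there for its whole lifetime. That is consistent with the pattern where only certain flows or nodes suffer while everything else looks healthy.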

3) Monitoring: which signals actually pull their weight?

I prefer to watch these five together:

  • Peer-link throughput and drop counts
  • Per-member-port queue and drop statistics
  • ARP / MAC move events
  • TCP retransmit and connection reset rates on the application side
  • Synthetic probe loss at the moment of failover

These signals are far more valuable than asking “is a cable cut?” because silent loss rarely shows up as a single counter spiking. It becomes visible only when several small signals line up in the same time window.
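
One way to act on "several small signals lining up" is to compare each signal against its own baseline and raise suspicion only when enough of them are elevated at once. The sketch below assumes you already collect one baseline and one current value per signal for the same time window. The `Signal` type, the 2x ratio, the floor of 1.0, and the threshold of three signals are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    baseline: float  # typical value for this window (assumed already known)
    current: float   # value observed in the window under investigation


def elevated(signal, ratio=2.0, floor=1.0):
    """True when the current value is at least `ratio` times the baseline.

    `floor` keeps a zero baseline from flagging on trivially small values.
    """
    return signal.current >= max(signal.baseline * ratio, floor)


def silent_loss_suspected(signals, min_elevated=3):
    """Return (suspected, names of elevated signals) for one time window."""
    hits = [s.name for s in signals if elevated(s)]
    return len(hits) >= min_elevated, hits


if __name__ == "__main__":
    window = [
        Signal("peer_link_drops", baseline=10, current=25),
        Signal("member_queue_drops", baseline=0, current=5),
        Signal("mac_moves", baseline=2, current=3),
        Signal("tcp_retransmits", baseline=100, current=250),
        Signal("probe_loss", baseline=0, current=0),
    ]
    print(silent_loss_suspected(window))
```

None of the individual values above would necessarily page anyone on its own. The combination is what carries the information.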

4) Testing: are you actually rehearsing failover?

In an MLAG design, confidence only comes from controlled testing. The drill I recommend:

  1. Start synthetic north-south and east-west traffic
  2. Cut a single uplink in a controlled way
  3. Trigger a peer-switch role transition
  4. Measure: loss, jitter, time to reconverge

The success criterion here is not “traffic eventually came back.” What actually matters:

  • How many packets were lost?
  • Which flows were affected?
  • Did the upper layer kick off a retry wave?
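
The measurements in the drill can be reduced to a few numbers per test run. The sketch below assumes a probe is sent at a fixed interval and that each result is recorded as a round-trip time in milliseconds, or `None` when no reply arrived. `summarize_probes` and its field names are hypothetical. Jitter is simplified to the mean absolute difference between consecutive answered probes, and the longest outage is only as precise as the probe interval.

```python
from statistics import mean


def summarize_probes(rtts_ms, interval_s):
    """Summarize one failover test run.

    rtts_ms: one entry per probe in send order; None means no reply.
    interval_s: time between probes, in seconds.
    """
    lost = sum(1 for rtt in rtts_ms if rtt is None)

    # Longest run of consecutive lost probes approximates reconvergence time.
    longest = run = 0
    for rtt in rtts_ms:
        run = run + 1 if rtt is None else 0
        longest = max(longest, run)

    answered = [rtt for rtt in rtts_ms if rtt is not None]
    deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]

    return {
        "sent": len(rtts_ms),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0,
        "longest_outage_s": longest * interval_s,
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical run: three probes lost in a row around the failover.
    run = [1.0, 1.0, None, None, None, 3.0, 1.0]
    print(summarize_probes(run, interval_s=0.1))
```

Running the same summary per flow or per source rack, rather than once for the whole test, is what answers the "which flows were affected?" question.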

5) Runbook: how I move during an incident

  1. Scope it
    • Is the entire service impacted, or just specific nodes / racks?
  2. Separate the signal
    • If no interface is down, head straight to queue/drop and retransmit data
  3. Validate the path
    • Which flow is going through which member?
  4. Temporary mitigation
    • Pull the suspect member out of the bundle
    • Drain a specific uplink if you have to
  5. Permanent action
    • Hash / policy revision
    • Software version and vendor advisory check

The moment you break this order and say “let’s just reboot first,” you erase the most valuable evidence you had.
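
The ordering rule can be encoded as a small gate in whatever tooling drives the incident. This is a sketch under assumptions: the step names mirror the runbook above, while the `DISRUPTIVE` action names and both function names are hypothetical and would need to match your own tooling.

```python
# Runbook steps in the order they should be completed.
RUNBOOK = [
    "scope",
    "separate_signal",
    "validate_path",
    "temporary_mitigation",
    "permanent_action",
]

# Steps that gather evidence; they must finish before anything disruptive.
EVIDENCE_STEPS = RUNBOOK[:3]

# Hypothetical action names that change or destroy device state.
DISRUPTIVE = {"reboot", "remove_member", "drain_uplink"}


def next_step(done):
    """Return the first runbook step not yet completed, or None when finished."""
    for step in RUNBOOK:
        if step not in done:
            return step
    return None


def action_allowed(action, done):
    """Allow disruptive actions only after all evidence steps are done."""
    if action in DISRUPTIVE:
        return all(step in done for step in EVIDENCE_STEPS)
    return True


if __name__ == "__main__":
    done = {"scope"}
    print(next_step(done))                 # which step comes next
    print(action_allowed("reboot", done))  # reboot is blocked this early
```

The point of the gate is not automation for its own sake. It makes the "reboot first" shortcut an explicit override instead of a default.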

6) Design choices: how do you shrink the blast radius?

  • For each bundle, make sure the members really are spread across different failure domains
  • Size the peer-link for worst-case redirection, not just for steady state
  • Tie top-of-rack drills into your change process
  • Align timeout and retry budgets with the application teams
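
As a rough illustration of the peer-link sizing point, the sketch below models one specific worst case: a switch loses all of its uplinks, so its entire northbound load has to cross the peer-link to its peer. This is a deliberately simplified model, and real designs have other redirection cases. The function name, parameters, and all figures are hypothetical.

```python
def peer_link_headroom(peer_link_gbps, steady_state_gbps, uplink_load_gbps):
    """Estimate peer-link headroom if one switch loses all of its uplinks.

    peer_link_gbps: usable peer-link capacity.
    steady_state_gbps: traffic already crossing the peer-link in normal operation.
    uplink_load_gbps: mapping of switch name to its northbound load.
    """
    # Worst case in this model: the busier switch is the one that loses uplinks.
    worst_redirect = max(uplink_load_gbps.values())
    required = steady_state_gbps + worst_redirect
    return {
        "required_gbps": required,
        "headroom_gbps": peer_link_gbps - required,
        "sufficient": peer_link_gbps >= required,
    }


if __name__ == "__main__":
    # Hypothetical pair: 80G peer-link, 5G steady state, 30G and 40G northbound.
    print(peer_link_headroom(80, 5, {"leaf-a": 30, "leaf-b": 40}))
```

Feeding this with peak rather than average load is the safer choice, since failover rarely waits for a quiet hour.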

An MLAG decision is not purely a network decision. If the upper-layer clients retry too aggressively, even a few seconds of loss balloons into a service-wide event.
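
The amplification is easy to put numbers on. The sketch below assumes each attempt fails independently with the same probability and that clients retry immediately, with no backoff, up to a fixed retry limit. Real clients with backoff and jitter will behave better than this, and correlated failures may behave worse. The function names are hypothetical.

```python
def expected_attempts(p_fail, max_retries):
    """Expected attempts per request: 1 + p + p^2 + ... + p^max_retries.

    Attempt k+1 is only sent if the previous k attempts all failed.
    """
    return sum(p_fail ** k for k in range(max_retries + 1))


def offered_load(base_rps, p_fail, max_retries):
    """Requests per second hitting the network once retries are included."""
    return base_rps * expected_attempts(p_fail, max_retries)


if __name__ == "__main__":
    # During a total loss window every attempt fails (p_fail = 1.0).
    for retries in (0, 1, 3, 5):
        print(retries, "retries ->", offered_load(1000, 1.0, retries), "rps")
```

In a window where every attempt fails, the offered load is simply the base load multiplied by one plus the retry limit. This is why a retry budget agreed with the application teams matters as much as the failover time itself.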

Conclusion

In MLAG failover problems, the real cost is weak observability. When you hit an “everything is up but users are complaining” incident, you make silent packet loss visible by widening your verification layer, running synthetic tests on a regular cadence, and measuring peer-link behavior under real load. Network reliability isn’t measured by the number of redundant links you have. It’s measured by what you can observe the moment something breaks.


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked across systems architecture, networking, server infrastructure, large-scale deployments, software, and systems security. On this blog I share technical experience that has held up in the field.
