Hunting Silent Packet Loss During MLAG Failover

MLAG designs often produce a comforting feeling of “high availability, sorted.” In practice, though, one of the most exhausting cases I run into in the field is silent packet loss that happens without any link going down. Ports look up, LACP looks fine, the CPU looks calm — and yet application latency climbs and certain flows fall apart.

You can’t solve this class of problem just by reading configuration. You have to answer three questions together: which signal tells you what during a failover, where exactly the data is being lost, and which action is actually safe to take.

1) Why is silent packet loss so dangerous?

Because the classic alarm set usually stays quiet:

- Interfaces are up
- BGP/OSPF adjacencies are still established
- The LACP bundle is still formed

But real user impact is happening:

- Long-lived TCP sessions get reset
- Retransmits go up
- Only certain racks or groups of nodes are affected

That’s why the “if it’s not down, it can’t be the network” attitude is so costly when you’re dealing with MLAG issues.

2) Where does it actually break?

The causes I see most often:

- Asymmetric load or buffer pressure on the peer-link
- Hash behavior that funnels specific flows to the same problematic member
- Lag in STP / ARP / MAC state synchronization
- Vendor bugs or half-state conditions after a software upgrade

Even if these issues only last a few seconds during failover, they can trigger queueing and a retry storm at the upper layers.
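To make the hash point concrete, here is a minimal sketch of why a single bad member hurts only some flows. It assumes a plain CRC32 over the 5-tuple as a stand-in; real switches use vendor-specific hash functions and field selections, and the port names and addresses below are made up for illustration.

```python
import zlib
from collections import Counter

def member_for_flow(flow, members):
    """Pick a bundle member by hashing the flow's 5-tuple.

    CRC32 is an illustrative stand-in, not any vendor's actual algorithm.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical bundle members and synthetic flows:
# (src_ip, dst_ip, protocol, src_port, dst_port)
members = ["eth1", "eth2", "eth3", "eth4"]
flows = [("10.0.0.5", "10.0.1.9", 6, 40000 + i, 443) for i in range(1000)]

placement = Counter(member_for_flow(f, members) for f in flows)

# Assume eth3 is the member that is silently dropping.
bad_member = "eth3"
affected = [f for f in flows if member_for_flow(f, members) == bad_member]

print(placement)
print(f"{len(affected)} of {len(flows)} flows ride the suspect member")
```

The same flow always lands on the same member, which is why the symptom looks like “some sessions are broken, others are fine” rather than a uniform loss rate.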

3) Monitoring: which signals actually pull their weight?

I prefer to watch these five together:

- Peer-link throughput and drop counts
- Per-member-port queue and drop statistics
- ARP / MAC move events
- TCP retransmit and connection reset rates on the application side
- Synthetic probe loss at the moment of failover

These signals are far more valuable than asking “is a cable cut?” because silent loss rarely shows up as a single counter spiking — it becomes visible only when several small signals line up.
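One way to express “several small signals line up” is to count how many signals are elevated against their own baseline in the same time window. The sketch below is only an illustration: the signal names, the 2x ratio, the three-signal threshold, and all sample values are assumptions, not recommended settings.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    baseline: float  # typical value in a quiet window
    current: float   # value in the window under inspection

    def elevated(self, ratio: float = 2.0) -> bool:
        # The small floor means any non-zero value counts as elevated
        # when the baseline is zero.
        return self.current >= ratio * max(self.baseline, 1e-9)

def suspect_silent_loss(signals, min_elevated=3):
    """Return (suspicious, names) where names lists the elevated signals."""
    names = [s.name for s in signals if s.elevated()]
    return len(names) >= min_elevated, names

# Made-up sample window for illustration only.
window = [
    Signal("peer_link_drops_per_s", 0.0, 40.0),
    Signal("member_queue_drops_per_s", 1.0, 3.0),
    Signal("mac_moves_per_min", 2.0, 2.0),
    Signal("tcp_retransmit_pct", 0.2, 0.9),
    Signal("probe_loss_pct", 0.0, 0.0),
]

suspicious, names = suspect_silent_loss(window)
print(suspicious, names)
```

No single value here would necessarily page anyone, but three of the five moving together in one window is the pattern worth alerting on.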

4) Testing: are you actually rehearsing failover?

In an MLAG design, confidence only comes from controlled testing. The drill I recommend:

- Start synthetic north-south and east-west traffic
- Cut a single uplink in a controlled way
- Trigger a peer-switch role transition
- Measure: loss, jitter, time to reconverge

The success criterion here is not “traffic eventually came back.” What actually matters:

- How many packets were lost?
- Which flows were affected?
- Did the upper layer kick off a retry wave?
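To answer “how many packets were lost?” with a number, a drill can send sequence-numbered probes at a fixed interval and analyze what came back. The sketch below assumes you already have the set of received sequence numbers from your probe tool; the function name, the 10 ms interval, and the sample data are hypothetical. It treats the longest run of consecutive losses as a rough estimate of reconvergence time and does not measure jitter.

```python
def analyze_probe_run(sent_count, received_seqs, interval_s):
    """Summarize loss for probes numbered 0..sent_count-1 sent every interval_s seconds."""
    received = set(received_seqs)
    lost = 0
    longest = current = 0
    for seq in range(sent_count):
        if seq in received:
            current = 0
        else:
            lost += 1
            current += 1
            longest = max(longest, current)
    return {
        "lost": lost,
        "loss_pct": 100.0 * lost / sent_count if sent_count else 0.0,
        "longest_gap": longest,                # consecutive probes lost
        "est_outage_s": longest * interval_s,  # rough reconvergence estimate
    }

# Made-up run: 1000 probes at 10 ms; probes 400-449 and probe 700 never arrived.
received = [s for s in range(1000) if not (400 <= s < 450) and s != 700]
report = analyze_probe_run(1000, received, interval_s=0.01)
print(report)
```

Running the same analysis per probe flow (one per source/destination pair) gives a first answer to “which flows were affected?” as well.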

5) Runbook: how I move during an incident

- Scope it
  - Is the entire service impacted, or just specific nodes / racks?
- Separate the signal
  - If no interface is down, head straight to queue/drop and retransmit data
- Validate the path
  - Which flow is going through which member?
- Temporary mitigation
  - Pull the suspect member out of the bundle
  - Drain a specific uplink if you have to
- Permanent action
  - Hash / policy revision
  - Software version and vendor advisory check

The moment you break this order and say “let’s just reboot first,” you erase the most valuable evidence you had.
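For the “separate the signal” step, absolute counter values say little; what matters is how much each member’s drop counter moved during the incident window. Here is a minimal sketch that ranks members by drop ratio between two counter snapshots. The snapshot layout, port names, and numbers are assumptions for illustration; how you collect the counters (SNMP, streaming telemetry, CLI scraping) is up to your platform, and counter wrap is not handled.

```python
def drop_deltas(before, after):
    """Rank ports by drops per transmitted packet between two snapshots."""
    rows = []
    for port, now in after.items():
        prev = before.get(port)
        if prev is None:
            continue  # port missing from the earlier snapshot, nothing to compare
        d_drops = now["drops"] - prev["drops"]
        d_pkts = now["tx_packets"] - prev["tx_packets"]
        ratio = d_drops / d_pkts if d_pkts > 0 else 0.0
        rows.append((port, d_drops, ratio))
    # Worst drop ratio first.
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Made-up snapshots taken a few minutes apart.
before = {
    "eth1": {"tx_packets": 1_000_000, "drops": 10},
    "eth2": {"tx_packets": 1_000_000, "drops": 12},
}
after = {
    "eth1": {"tx_packets": 1_500_000, "drops": 11},
    "eth2": {"tx_packets": 1_480_000, "drops": 4_812},
}

ranked = drop_deltas(before, after)
for port, d_drops, ratio in ranked:
    print(f"{port}: +{d_drops} drops ({ratio:.4%} of packets)")
```

Taking the snapshots before any mitigation or reboot is the point: these deltas are exactly the evidence that a restart wipes out.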

6) Design choices: how do you shrink the blast radius?

- For each bundle, make sure the members really are spread across different failure domains
- Size the peer-link not just for steady state but for the worst-case redirection
- Tie top-of-rack drills into your change process
- Align timeout and retry budgets with the application teams
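The peer-link sizing point comes down to simple arithmetic: take the steady-state load and add the largest amount of traffic any single failure would push across the peer-link. The sketch below assumes you have estimated the redirected traffic per failure scenario yourself; the scenario names and all Gbps figures are invented for illustration.

```python
def peer_link_worst_case(steady_gbps, redirected_gbps_by_scenario, capacity_gbps):
    """Return (worst scenario, peak load in Gbps, utilization as a fraction of capacity)."""
    worst_name, worst_extra = max(
        redirected_gbps_by_scenario.items(), key=lambda item: item[1]
    )
    peak = steady_gbps + worst_extra
    return worst_name, peak, peak / capacity_gbps

# Hypothetical estimates of traffic redirected over the peer-link per failure.
scenarios = {
    "uplink-a down": 30.0,
    "host-facing member down": 12.0,
}

name, peak, utilization = peer_link_worst_case(5.0, scenarios, capacity_gbps=80.0)
print(f"worst case: {name}, peak {peak} Gbps, {utilization:.1%} of peer-link capacity")
```

If the worst-case utilization is close to or above 100%, the peer-link itself becomes the place where packets disappear during failover.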

An MLAG decision is not purely a network decision. If the upper-layer clients retry too aggressively, even a few seconds of loss balloons into a service-wide event.
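A quick way to see how retries balloon is to multiply them across layers. If every attempt fails during the loss window and each layer retries independently without backoff, the attempts reaching the bottom layer are the product of (1 + retries) at each layer. The retry counts below are made-up examples, not measurements from any real stack.

```python
def worst_case_attempts(retries_per_layer):
    """Attempts reaching the bottom layer per user request if every attempt fails."""
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total

# Hypothetical stack: client SDK retries 3x, API gateway 3x, service-to-service 2x.
amplification = worst_case_attempts([3, 3, 2])
print(f"one user request can become {amplification} attempts during the loss window")
```

This is the number to bring to the conversation with application teams: capping retries at a single layer, or giving them a shared budget, keeps a few seconds of loss from multiplying.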

Conclusion

In MLAG failover problems, the real cost is weak observability. When you hit an “everything is up but users are complaining” incident, you make silent packet loss visible by verifying more than link and protocol state, running synthetic tests on a regular cadence, and measuring peer-link behavior under real load. Network reliability isn’t measured by the number of redundant links you have; it’s measured by what you can observe the moment something breaks.
