MLAG designs often produce a comforting feeling of “high availability, sorted.” In practice, though, one of the most exhausting cases I run into in the field is silent packet loss that happens without any link going down. Ports look up, LACP looks fine, the CPU looks calm — and yet application latency climbs and certain flows fall apart.
You can’t solve this class of problem just by reading configuration. You have to answer three questions together: which signal tells you what during a failover, where exactly the data is being lost, and which action is actually safe to take.
1) Why is silent packet loss so dangerous?
Because the classic alarm set usually stays quiet:
- Interfaces are up
- BGP/OSPF adjacencies are still established
- The LACP bundle is still formed
But real user impact is happening:
- Long-lived TCP sessions get reset
- Retransmits go up
- Only certain racks or groups of nodes are affected
That’s why the “if it’s not down, it can’t be the network” attitude is so costly when you’re dealing with MLAG issues.
2) Where does it actually break?
The causes I see most often:
- Asymmetric load or buffer pressure on the peer-link
- Hash behavior that funnels specific flows to the same problematic member
- Lag in STP / ARP / MAC state synchronization
- Vendor bugs or half-state conditions after a software upgrade
Even if these issues only last a few seconds during failover, they can trigger queueing and a retry storm at the upper layers.
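To make the hash point concrete, here is a minimal sketch of how a deterministic hash pins a flow to one member. It assumes a CRC32 of the 5-tuple modulo the member count, which is only a stand-in: real switches use vendor-specific hash inputs and algorithms. The member names and the synthetic flows are hypothetical.

```python
import zlib
from collections import Counter

def pick_member(flow, members):
    """Map a flow 5-tuple to one bundle member.

    CRC32 modulo the member count is an assumption for illustration;
    real switches use vendor-specific hash inputs and algorithms.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical bundle and synthetic flows:
# (src_ip, dst_ip, protocol, src_port, dst_port)
members = ["member-1", "member-2", "member-3", "member-4"]
flows = [(f"10.0.0.{i % 50}", "10.0.1.10", 6, 40000 + i, 443) for i in range(200)]

# Count how many flows land on each member.
per_member = Counter(pick_member(flow, members) for flow in flows)
for member in members:
    print(member, per_member[member])

# Every flow that hashes to a degraded member is hit on every packet,
# while flows on the other members look perfectly healthy.
degraded = "member-3"
affected = [flow for flow in flows if pick_member(flow, members) == degraded]
print(f"{len(affected)} of {len(flows)} flows ride {degraded}")
```

The useful property is the determinism: the same flow always lands on the same member, which is why only a subset of sessions suffers while aggregate health checks stay green.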
3) Monitoring: which signals actually pull their weight?
I prefer to watch these five together:
- Peer-link throughput and drop counts
- Per-member-port queue and drop statistics
- ARP / MAC move events
- TCP retransmit and connection reset rates on the application side
- Synthetic probe loss at the moment of failover
These signals are far more valuable than asking “is a cable cut?” because silent loss rarely shows up as a single counter spiking — it becomes visible only when several small signals line up.
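One way to express "several small signals lining up" is a simple correlation rule over a single observation window. The sketch below is an assumption-heavy illustration: the field names, the thresholds, and the "at least three elevated signals" rule are all hypothetical and would need tuning against your own baseline.

```python
from dataclasses import dataclass, asdict

@dataclass
class Sample:
    """One observation window; all field names are hypothetical."""
    peer_link_drops: int
    member_queue_drops: int
    mac_moves: int
    tcp_retransmit_rate: float  # fraction of segments retransmitted
    probe_loss: float           # fraction of synthetic probes lost

# Illustrative thresholds only; tune them against your own baseline.
THRESHOLDS = {
    "peer_link_drops": 100,
    "member_queue_drops": 100,
    "mac_moves": 5,
    "tcp_retransmit_rate": 0.01,
    "probe_loss": 0.001,
}

def elevated_signals(sample, thresholds=THRESHOLDS):
    """Return the names of signals above their threshold, sorted."""
    values = asdict(sample)
    return sorted(name for name, limit in thresholds.items() if values[name] > limit)

def suspect_silent_loss(sample, min_signals=3):
    """Flag the window when enough independent signals are elevated together."""
    return len(elevated_signals(sample)) >= min_signals

window = Sample(peer_link_drops=40, member_queue_drops=350, mac_moves=12,
                tcp_retransmit_rate=0.03, probe_loss=0.0)
print(elevated_signals(window))     # ['mac_moves', 'member_queue_drops', 'tcp_retransmit_rate']
print(suspect_silent_loss(window))  # True
```

No single value here would page anyone on its own, but three of them moving in the same window is exactly the pattern worth alerting on.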
4) Testing: are you actually rehearsing failover?
In an MLAG design, confidence only comes from controlled testing. The drill I recommend:
- Start synthetic north-south and east-west traffic
- Cut a single uplink in a controlled way
- Trigger a peer-switch role transition
- Measure: loss, jitter, time to reconverge
The success criterion here is not “traffic eventually came back.” What actually matters:
- How many packets were lost?
- Which flows were affected?
- Did the upper layer kick off a retry wave?
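The first two questions can be answered straight from the synthetic probe results. The following is a minimal sketch, assuming each probe is recorded as a hypothetical `(flow_id, sequence_number, received)` tuple and that probes are sent at a fixed interval; the longest run of consecutive losses then approximates how long a flow was blackholed.

```python
def summarize_probes(probes, interval_s):
    """Summarize loss per flow from (flow_id, seq, received) tuples.

    Returns only the flows that lost at least one probe. The longest gap
    is the longest run of consecutive lost probes times the send interval.
    """
    per_flow = {}
    for flow_id, seq, received in sorted(probes):
        stats = per_flow.setdefault(
            flow_id, {"sent": 0, "lost": 0, "run": 0, "longest_run": 0}
        )
        stats["sent"] += 1
        if received:
            stats["run"] = 0
        else:
            stats["lost"] += 1
            stats["run"] += 1
            stats["longest_run"] = max(stats["longest_run"], stats["run"])
    return {
        flow_id: {
            "sent": s["sent"],
            "lost": s["lost"],
            "longest_gap_s": s["longest_run"] * interval_s,
        }
        for flow_id, s in per_flow.items()
        if s["lost"]
    }

# Hypothetical drill result: one flow loses probes 3 to 5, the other is clean.
probes = [("east-west", seq, seq not in (3, 4, 5)) for seq in range(10)]
probes += [("north-south", seq, True) for seq in range(10)]

report = summarize_probes(probes, interval_s=0.5)
print(report)  # {'east-west': {'sent': 10, 'lost': 3, 'longest_gap_s': 1.5}}
```

Reporting per flow rather than as one aggregate loss percentage is the point: a drill that loses 3 of 20 probes overall can still mean one flow was dark for 1.5 seconds.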
5) Runbook: how I move during an incident
- Scope it
  - Is the entire service impacted, or just specific nodes / racks?
- Separate the signal
  - If no interface is down, head straight to queue/drop and retransmit data
- Validate the path
  - Which flow is going through which member?
- Temporary mitigation
  - Pull the suspect member out of the bundle
  - Drain a specific uplink if you have to
- Permanent action
  - Hash / policy revision
  - Software version and vendor advisory check
The moment you break this order and say “let’s just reboot first,” you erase the most valuable evidence you had.
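For the "separate the signal" step, the evidence I want before any mitigation is a pair of counter snapshots. Here is a minimal sketch, assuming you can collect per-member drop counters into a dictionary by whatever means your platform offers; the member names and values are hypothetical.

```python
def drop_deltas(before, after):
    """Return (member, increase) pairs between two snapshots, largest first.

    Only members present in both snapshots are compared, and only
    positive increases are reported.
    """
    deltas = {
        member: after[member] - before[member]
        for member in after
        if member in before
    }
    return sorted(
        ((member, delta) for member, delta in deltas.items() if delta > 0),
        key=lambda item: item[1],
        reverse=True,
    )

# Hypothetical snapshots taken a minute apart during the incident.
before = {"member-1": 1200, "member-2": 980, "member-3": 15000}
after = {"member-1": 1200, "member-2": 985, "member-3": 19250}

print(drop_deltas(before, after))  # [('member-3', 4250), ('member-2', 5)]
```

Absolute counter values say little because they accumulate since the last clear; the delta over a known interval is what points at the suspect member. A reboot resets the counters, which is one concrete reason the snapshot has to come first.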
6) Design choices: how do you shrink the blast radius?
- For each bundle, make sure the members really are spread across different failure domains
- Size the peer-link for worst-case traffic redirection, not just for steady state
- Make top-of-rack failover drills part of your change process
- Align timeout and retry budgets with the application teams
An MLAG decision is not purely a network decision. If the upper-layer clients retry too aggressively, even a few seconds of loss balloons into a service-wide event.
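A rough way to see the amplification is to count how many attempts a single request makes while the outage lasts. This is a crude model under stated assumptions: the request starts exactly when the outage begins, every attempt during the outage fails immediately, and the client waits a delay that is multiplied by a backoff factor after each retry. The parameter names are hypothetical.

```python
def attempts_during_outage(outage_s, base_delay_s, max_retries, backoff=2.0):
    """Count attempts one request makes before the outage ends.

    The first attempt happens at t=0; each retry is counted only if it
    is issued before the outage is over.
    """
    attempts, elapsed, delay = 1, 0.0, base_delay_s
    for _ in range(max_retries):
        elapsed += delay
        if elapsed >= outage_s:
            break
        attempts += 1
        delay *= backoff
    return attempts

# A 3-second outage, 100 ms initial retry delay, up to 10 retries.
fixed = attempts_during_outage(3.0, 0.1, 10, backoff=1.0)
exponential = attempts_during_outage(3.0, 0.1, 10, backoff=2.0)
print(fixed)        # 11: every retry lands inside the outage
print(exponential)  # 5: backoff pushes later retries past the outage
```

Under these assumptions, a fixed 100 ms retry turns each affected request into 11 attempts, while exponential backoff holds it to 5. That difference is the kind of number worth agreeing on with the application teams before the next failover.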
Conclusion
In MLAG failover problems, the real cost comes from weak observability. When you hit an incident where everything is up but users are complaining, you make silent packet loss visible by widening your verification layer, running synthetic tests on a regular cadence, and measuring peer-link behavior under real load. Network reliability isn’t measured by the number of redundant links you have; it’s measured by what you can observe the moment something breaks.