In most places, the failover discussion ends with “we have a backup link.” In production, however, the real problem turns out to be different: the link is up, even the BGP/OSPF session is up, but at some point traffic is blackholed. And if detection timers are measured in minutes, the incident duration is measured in minutes too.
This post explains how to use BFD (Bidirectional Forwarding Detection) on edge/router servers running FRR to get a fast failure signal, and what operational risks come with it.
What does BFD solve?
BFD answers the question “is the other end actually alive?” at very short intervals. Protocols like BGP and OSPF consume this signal and converge faster than their own timers would allow.
Typical gain:
- Failover in seconds (or even hundreds of ms) instead of minutes
- Early detection in the “link up but no traffic” scenario
Where should you use it?
Field recommendation:
- Edge uplink (critical egress)
- Transit router ↔ edge critical peerings
- DC backbone environments (stable, low jitter)
Avoid:
- Wi-Fi / high jitter links
- Devices already running near CPU limits
Pre-check: how long does the current failover really take?
Take a baseline first:
- BGP: time from session down, through route withdrawal, until traffic is restored
- OSPF: dead interval behavior
- Application: error rate and latency impact
Without this baseline, you can’t clearly answer the question “did BFD actually make failover faster?”
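One way to turn the baseline into a number is to probe a destination behind the uplink during a failover test and measure the longest outage. The sketch below assumes a hypothetical probe log format (`<epoch_seconds> <ok|fail>`, one line per probe) and hypothetical helper names `probe_loop` and `longest_outage`. With a one-second ping loop the resolution is only about a second, so treat the result as a rough baseline rather than a precise measurement.

```shell
#!/bin/sh
# Hypothetical probe log format (an assumption for this sketch):
#   <epoch_seconds> <ok|fail>

# Emit one probe result per second for the target address in $1.
# Uses Linux iputils ping: one packet, one-second timeout.
probe_loop() {
  while :; do
    if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then s=ok; else s=fail; fi
    echo "$(date +%s) $s"
    sleep 1
  done
}

# Print the longest outage in seconds, measured from the first failed
# probe to the first successful probe after it. An outage that is still
# open when the log ends is not counted.
longest_outage() {
  awk '
    $2 == "fail" && !down { down = 1; start = $1 }
    $2 == "ok"   &&  down { down = 0; d = $1 - start; if (d > max) max = d }
    END { print max + 0 }
  ' "$@"
}
```

A typical run would be `probe_loop 203.0.113.10 > probe.log` in one terminal while the failover test runs (the address is a placeholder), followed by `longest_outage probe.log`. Repeating the same test after enabling BFD gives a like-for-like comparison.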
On the FRR side: a minimal enablement approach
Enabling BFD on FRR comes down to two steps:
- Create the BFD session (peer + timers)
- Bind BFD to the routing protocol (BGP neighbor / OSPF interface)
The exact command syntax may vary by FRR version; the operational flow does not.
Example (with vtysh, illustrative):
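# List BFD sessions and their state (|| true keeps a wrapping script going if the command fails)
sudo vtysh -c "show bfd peers" || true
# Confirm that the BGP sessions themselves are established
sudo vtysh -c "show ip bgp summary" || true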
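The two steps (create the session, bind it to the protocol) could look roughly like the sketch below for a BGP neighbor. The peer address `192.0.2.1`, the ASN `65001`, and the timer values are placeholders, and the helper names `bfd_bgp_config` and `vtysh_apply` are made up for this example. It assumes the `bfdd` daemon is already enabled in FRR's daemons file and that the BGP neighbor already exists. Check the syntax against the documentation for your FRR version before applying anything.

```shell
#!/bin/sh
# Print FRR config commands that create a BFD peer and bind it to an
# existing BGP neighbor.
# $1 = peer IP, $2 = local ASN, $3 = rx/tx interval in ms, $4 = multiplier
bfd_bgp_config() {
  cat <<EOF
configure terminal
bfd
peer $1
receive-interval $3
transmit-interval $3
detect-multiplier $4
exit
exit
router bgp $2
neighbor $1 bfd
EOF
}
# For OSPF, the binding step would instead be "ip ospf bfd" under the
# relevant "interface <name>" node.

# Read commands from stdin and pass each one to vtysh as a -c argument.
# With DRY_RUN=1 the resulting command line is printed instead of executed.
vtysh_apply() {
  set --
  while IFS= read -r line; do
    set -- "$@" -c "$line"
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf 'vtysh'; printf ' %s' "$@"; printf '\n'
  else
    sudo vtysh "$@"
  fi
}
```

For example, `bfd_bgp_config 192.0.2.1 65001 300 3 | DRY_RUN=1 vtysh_apply` prints what would be sent; dropping `DRY_RUN=1` applies it to the running configuration. Saving the running configuration afterwards is a separate, deliberate step.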
Timer selection: speed–stability balance
Simple rule:
- Inside DC, low jitter: more aggressive
- Internet/VPN: more conservative
Risk:
- Too aggressive timer → microburst/jitter → BFD down → route flap
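To reason about how aggressive a timer set really is, it helps to compute the detection time it implies. In BFD asynchronous mode (RFC 5880), the local detection time is the remote multiplier times the larger of the local receive interval and the remote transmit interval. The sketch below does that arithmetic; the function name `bfd_detect_ms` and the example values are illustrative, not recommendations.

```shell
#!/bin/sh
# Detection time in ms for BFD asynchronous mode.
# $1 = local required min RX interval (ms)
# $2 = remote desired min TX interval (ms)
# $3 = remote detect multiplier
bfd_detect_ms() {
  if [ "$1" -gt "$2" ]; then interval=$1; else interval=$2; fi
  echo $(( interval * $3 ))
}

bfd_detect_ms 300 300 3    # 300 ms both ways, multiplier 3
bfd_detect_ms 100 100 3    # illustrative low-jitter DC setting
bfd_detect_ms 1000 1000 5  # illustrative conservative Internet/VPN setting
```

The three calls print 900, 300, and 5000. A jitter spike or microburst only has to exceed that window once to take the session down, which is why the same numbers that look attractive inside a DC tend to cause flaps over the Internet or a VPN.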
Operational standard:
- After enabling BFD, monitor the flap rate (BFD down/up transitions per peer) for 24–72 hours
- If flaps occur, relax the timers or narrow the BFD scope
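A flap metric can be as simple as counting down transitions per peer over the observation window. The sketch below assumes the BFD events have already been normalized into a hypothetical log format of `<timestamp> <peer> <up|down>`, one event per line; the actual FRR log lines look different and would need to be converted first. The function name `count_flaps` is also made up for this example.

```shell
#!/bin/sh
# Hypothetical normalized event format (an assumption for this sketch):
#   <timestamp> <peer> <up|down>

# Print "<peer> <number of down events>" for every peer that went down
# at least once, sorted by peer.
count_flaps() {
  awk '$3 == "down" { n[$2]++ } END { for (p in n) print p, n[p] }' "$@" | sort
}
```

Running `count_flaps bfd-events.log` once a day during the first 24–72 hours gives a per-peer number to compare against whatever threshold the team agrees on. Any peer that appears in the output on a stable link is a candidate for a more relaxed timer.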
Validation: does it really work?
Checklist:
- Is the BFD session state Up?
- When BFD goes down, does the protocol actually withdraw the route?
- Does failover produce a visible improvement in application metrics?
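The first checklist item can be automated by scanning the `show bfd peers` output for sessions that are not up. The sketch below assumes the text output contains a `peer <address> ...` line followed by a `Status: <state>` line for each session, which matches the FRR documentation examples but may differ between versions. The function name `bfd_not_up` is made up for this example.

```shell
#!/bin/sh
# Read "show bfd peers" text output and print "<peer> <status>" for
# every session whose status is anything other than "up".
# Assumed layout: a "peer <address> ..." line, then a "Status: <state>" line.
bfd_not_up() {
  awk '
    $1 == "peer"                  { peer = $2 }
    $1 == "Status:" && $2 != "up" { print peer, $2 }
  ' "$@"
}
```

Usage would be `sudo vtysh -c "show bfd peers" | bfd_not_up`; empty output means every session reported up. The other two checklist items still need a failover test, because a session in the up state says nothing about whether the protocol reacts when it goes down.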
Practical test approach:
- Instead of physically cutting the uplink, first simulate a controlled traffic blackhole in lab/staging
- Then run a planned test in production during a maintenance window
Incident Runbook: when BFD flaps or false positives appear
- Log BFD down/up events (with timestamps)
- Check CPU and interface error counters
- Relax the timer (longer interval / higher multiplier)
- If necessary, keep BFD enabled only on critical peers
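The first two runbook steps are easier to follow under pressure if the evidence is collected by one command. The sketch below writes a timestamped snapshot of load, interface counters, and BFD state to a file. The function name `bfd_snapshot` and the output layout are made up for this example, and it assumes the caller is allowed to run `vtysh` (for example via `sudo` or group membership).

```shell
#!/bin/sh
# Collect a timestamped snapshot for a BFD flap investigation.
# $1 = interface name, $2 = output directory (both chosen by the caller).
# Prints the path of the snapshot file.
bfd_snapshot() {
  iface=$1
  mkdir -p "$2" || return 1
  out="$2/bfd-snapshot-$(date +%Y%m%dT%H%M%S).txt"
  {
    echo "== snapshot $(date) iface=$iface =="
    echo "== load =="
    uptime 2>&1 || true
    echo "== interface counters =="
    ip -s link show dev "$iface" 2>&1 || true
    echo "== bfd peers =="
    if command -v vtysh >/dev/null 2>&1; then
      vtysh -c "show bfd peers" -c "show bfd peers counters" 2>&1 || true
    else
      echo "vtysh not found"
    fi
  } > "$out"
  echo "$out"
}
```

Calling `bfd_snapshot eth0 /var/tmp/bfd-incident` at each flap (the interface and directory are placeholders) gives a series of files to compare, which makes it easier to decide between relaxing the timers and narrowing the BFD scope.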
Conclusion
BFD moves failure detection from protocol timers to a real liveness signal. Its value lies not in finding the most aggressive setting, but in shortening incident duration without sacrificing stability.