On the edge, BGP incidents tend to play out the same way: “some locations are gone”, “some prefixes disappeared”, “traffic fell into a blackhole”. The biggest problem is visibility. The router's show command output is valuable, but it reflects only the current state and keeps no history: once the event is over, the question “what happened?” is left hanging.
BMP (BGP Monitoring Protocol, RFC 7854) fills that gap: routers stream the BGP updates and withdrawals they receive from their peers to a central collector, which can turn them into a timeline. What I call route analytics is essentially turning that timeline into alarms and a triage practice.
What does BMP give you?
With BMP, the questions you want to be able to answer are:
- Which router, from which neighbor, received which update for which prefix and when?
- What was the AS-PATH / next-hop / community change?
- Is there a withdraw wave, a route flap, or a leak?
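To make those questions answerable, each BMP route change can be normalized into one flat record. The sketch below is only an illustration: the RouteEvent class, its field names, and the describe helper are hypothetical and do not come from any BMP library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class RouteEvent:
    """One normalized BGP route change seen through BMP (hypothetical schema)."""
    timestamp: datetime                 # when the change was observed
    router: str                         # monitored router that sent the BMP feed
    peer: str                           # BGP neighbor the update came from
    prefix: str                         # e.g. "203.0.113.0/24"
    action: str                         # "announce" or "withdraw"
    as_path: Tuple[int, ...] = ()       # empty for withdrawals
    next_hop: Optional[str] = None
    communities: Tuple[str, ...] = ()


def describe(event: RouteEvent) -> str:
    """Render one event as a single timeline line."""
    path = " ".join(str(asn) for asn in event.as_path) or "-"
    return (
        f"{event.timestamp.isoformat()} {event.router} <- {event.peer} "
        f"{event.action} {event.prefix} path=[{path}] nh={event.next_hop or '-'}"
    )
```

With records in this shape, "which router, which neighbor, which prefix, when" becomes a simple filter over a list or a table.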
This gives NOC/NetOps two critical wins:
- It lifts alarms from “link down” up to “routing behavior degraded”
- It produces evidence for postmortems
Architecture: Collector, storage, and dashboard
Minimum components:
- A collector layer ingesting BMP feeds from routers
- Storage for the events (think time-series + event store)
- Dashboards: prefix/peer/AS-path-centric views
- An alarm engine: thresholds and anomaly detection
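As a rough illustration of the first step in the collector layer, the sketch below frames a raw byte stream into BMP messages using the 6-byte common header from RFC 7854 (version, message length, message type). The function name split_bmp_messages is an assumption made for this example; decoding the per-peer header and the embedded BGP PDU is left out.

```python
import struct

# RFC 7854 common header: version (1 byte), message length (4 bytes,
# includes the header itself), message type (1 byte), network byte order.
BMP_HEADER = struct.Struct("!BIB")

BMP_MESSAGE_TYPES = {
    0: "route-monitoring",
    1: "statistics-report",
    2: "peer-down",
    3: "peer-up",
    4: "initiation",
    5: "termination",
    6: "route-mirroring",
}


def split_bmp_messages(buffer: bytes):
    """Split a byte buffer into (type_name, body) pairs.

    Returns (messages, remainder); remainder holds an incomplete trailing
    message that should be prepended to the next chunk read from the socket.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= BMP_HEADER.size:
        version, length, msg_type = BMP_HEADER.unpack_from(buffer, offset)
        if version != 3:
            raise ValueError(f"unsupported BMP version {version}")
        if length < BMP_HEADER.size:
            raise ValueError(f"invalid BMP message length {length}")
        if len(buffer) - offset < length:
            break  # wait for the rest of this message
        body = buffer[offset + BMP_HEADER.size : offset + length]
        name = BMP_MESSAGE_TYPES.get(msg_type, f"unknown-{msg_type}")
        messages.append((name, body))
        offset += length
    return messages, buffer[offset:]
```

In practice most deployments would use an existing BMP collector rather than hand-rolled parsing; the point here is only that the feed is a simple length-prefixed message stream that is easy to store and replay.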
Which signals are useful?
The most useful signals derived from BMP:
- Update rate spike: did the per-second update count suddenly explode?
- Withdraw wave: is there a bulk withdraw on a specific peer/prefix set?
- AS-PATH change: did an unexpected AS appear? (leak / hijack suspicion)
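- Next-hop change: a possible sign of blackholing or misrouting
- Community change: did the policy shift? (e.g. communities that set local preference or trigger RTBH)
The trick: do not alarm on every update. Alarm on updates in context, for example a rate far above a peer's own baseline or an AS that peer has never sent before.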
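One way to put "context" into an update-rate alarm is to compare each peer against its own recent history instead of a global threshold. The following is a minimal sketch under that assumption; the function name, the tuple layout, and the default thresholds are all hypothetical and would need tuning against real traffic.

```python
from collections import Counter, defaultdict
from statistics import median


def find_update_spikes(events, bucket_seconds=60, factor=5.0, min_count=100):
    """Flag time buckets where a peer's update count far exceeds its own baseline.

    events: iterable of (epoch_seconds, router, peer), one tuple per update.
    Returns a list of (router, peer, bucket_start, count).
    """
    buckets = defaultdict(Counter)
    for ts, router, peer in events:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[(router, peer)][bucket_start] += 1

    spikes = []
    for (router, peer), counts in buckets.items():
        ordered = sorted(counts.items())
        for i, (start, count) in enumerate(ordered):
            # Baseline: median of this peer's earlier non-empty buckets.
            history = [c for _, c in ordered[:i]]
            if not history:
                continue  # no context yet, so no alarm
            if count >= min_count and count > factor * median(history):
                spikes.append((router, peer, start, count))
    return spikes
```

The min_count floor keeps a quiet peer that goes from 1 to 6 updates per minute from paging anyone, while the per-peer median means a busy transit session is not compared against a small customer session.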
Incident triage: Three event types, three question sets
1) Suspected route leak
Questions:
- Which peer did the leak originate from?
- Just one edge, or multiple edges?
- Is there an unexpected hop in the AS-PATH?
Initial response approach:
- If the leak source is a peer: tighten the import policy, apply a temporary filter if needed
- If the source is internal: hunt down the wrong redistribution / wrong prefix-list / wrong community chain
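The three leak questions can be sketched as one pass over the collected announcements. This example assumes you maintain, per peer, a set of ASNs you expect to see in paths learned from that peer; that allow-list, the function names, and the tuple layout are hypothetical. It is a coarse heuristic, not a replacement for proper filtering.

```python
from collections import defaultdict


def unexpected_asns(as_path, expected):
    """Return ASNs in as_path that are not expected, in path order, without repeats."""
    seen = set()
    result = []
    for asn in as_path:
        if asn not in expected and asn not in seen:
            seen.add(asn)
            result.append(asn)
    return result


def leak_suspects(events, expected_by_peer):
    """Group suspicious announcements by the peer they arrived from.

    events: iterable of (router, peer, prefix, as_path).
    expected_by_peer: {peer: set of ASNs considered normal for that peer}.
    Peers without an entry are skipped because there is no baseline to compare.
    """
    suspects = defaultdict(lambda: {"routers": set(), "prefixes": set(), "asns": set()})
    for router, peer, prefix, as_path in events:
        expected = expected_by_peer.get(peer)
        if expected is None:
            continue
        odd = unexpected_asns(as_path, expected)
        if odd:
            entry = suspects[peer]
            entry["routers"].add(router)    # one edge, or several?
            entry["prefixes"].add(prefix)
            entry["asns"].update(odd)       # the unexpected hops
    return dict(suspects)
```

The keys of the result answer "which peer", the routers set answers "one edge or multiple", and the asns set lists the unexpected hops.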
2) Route flap
Questions:
- Is the flap on a single prefix or “many prefixes”?
- Is it a single peer or multiple peers?
- Does the flap overlap with a maintenance window?
Because the BMP timeline pinpoints the “start moment” of the flap, it dramatically narrows the root-cause window.
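A minimal way to extract that start moment is to count announce/withdraw transitions per (peer, prefix) and report the first window in which they exceed a threshold. The function name, the tuple layout, and the default thresholds below are assumptions for illustration only.

```python
from collections import defaultdict


def find_flaps(events, min_transitions=4, window_seconds=300):
    """Detect prefixes that toggle between announce and withdraw.

    events: iterable of (epoch_seconds, peer, prefix, action),
            where action is "announce" or "withdraw".
    Returns {(peer, prefix): (window_start_ts, transitions_in_window)} for the
    first window in which the transition count reaches min_transitions.
    """
    transitions = defaultdict(list)
    last_action = {}
    for ts, peer, prefix, action in sorted(events):
        key = (peer, prefix)
        previous = last_action.get(key)
        # Only a change of state counts; repeated announcements do not.
        if previous is not None and previous != action:
            transitions[key].append(ts)
        last_action[key] = action

    flaps = {}
    for key, times in transitions.items():
        start = 0
        for end in range(len(times)):
            # Shrink the window until it spans at most window_seconds.
            while times[end] - times[start] > window_seconds:
                start += 1
            if end - start + 1 >= min_transitions:
                flaps[key] = (times[start], end - start + 1)
                break
    return flaps
```

Counting the keys of the result per peer distinguishes a single flapping prefix from "many prefixes", and the reported timestamps can be compared directly against maintenance windows.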
3) Blackhole / asymmetric reachability
Questions:
- Did the next-hop change?
- Are only certain locations affected?
- Is the same prefix announced differently from different upstreams?
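The first two questions can be answered by replaying announcements in time order and recording every next-hop change per (router, peer, prefix). This is a sketch with hypothetical function names and tuple layout; it ignores withdrawals and assumes the events are already sorted by time.

```python
def next_hop_changes(events):
    """List next-hop changes seen per (router, peer, prefix).

    events: iterable of (epoch_seconds, router, peer, prefix, next_hop)
            for announcements only, in time order.
    Returns a list of (ts, router, peer, prefix, old_next_hop, new_next_hop).
    """
    current = {}
    changes = []
    for ts, router, peer, prefix, next_hop in events:
        key = (router, peer, prefix)
        old = current.get(key)
        if old is not None and old != next_hop:
            changes.append((ts, router, peer, prefix, old, next_hop))
        current[key] = next_hop
    return changes


def affected_routers(changes, prefix):
    """Return the sorted routers that saw a next-hop change for the given prefix."""
    return sorted({router for _, router, _, p, _, _ in changes if p == prefix})
```

If affected_routers returns only a subset of your edges for a prefix, that lines up with the "only certain locations are affected" symptom and tells you where to look first.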
Operational limits and risks
- Router CPU/memory impact: BMP export adds load on the router, especially under heavy update volume, so careful configuration is mandatory
- Data sensitivity: prefix and neighbor information can be sensitive for the organization; access control is mandatory
- Storage cost: do not keep “infinite logs”; set retention to what triage and postmortems actually need
Conclusion
Route analytics is not an expensive NMS project; it is a practice for making routing behavior measurable. Once you turn on BMP in the right place and wire alarms to the right signals, BGP incidents stop being “mysterious edge problems” and become a matter of evidence-based triage and disciplined postmortems.