Mustafa Erbay
Technology · 10 min read

Preventing Edge Outages with BGP Max-Prefix Limits

Designing, monitoring, and writing an incident runbook for the max-prefix guardrail that protects edge routers during route leaks and bad-prefix waves.

Every organization that speaks BGP at the edge carries a quiet risk: a wave of bad prefixes. Sometimes that wave is a route leak coming from upstream, sometimes it’s a configuration mistake. The outcome is usually the same:

  • RIB/FIB inflate, CPU climbs
  • The control plane lags
  • BGP sessions flap
  • The “real problem” turns into an “internet is down” incident in a heartbeat

In this article I’m walking through one of the highest-leverage guardrails for shrinking that class of incidents: the max-prefix limit.

Why max-prefix is an “operational” control

The max-prefix limit looks like just another BGP knob, but it actually answers an operational question:

“How many prefixes do I expect from this neighbor (peer/upstream), and what will I do if that number deviates?”

So max-prefix gives you:

  • Error prevention (an automatic brake during a leak wave)
  • Alarm generation (warning thresholds)
  • A runbook trigger (who calls whom)

1) First step: Establish a normal prefix baseline

Don’t turn max-prefix on with an arbitrary number. Start with the baseline:

  • Measure prefix counts for 7–14 days
  • Note weekly variance (trend) and anomalous days like “patch day”

Two numbers matter:

  • Normal (median)
  • Peak (95th/99th percentile)

The limit has to sit above the peak with some headroom, so normal variation never trips it, yet far enough below a leak-sized jump that it still catches a “leak.”
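The baseline step can be sketched in a few lines of Python. This is a minimal sketch, not a definitive method: the sample values are made up, the nearest-rank percentile is one of several valid definitions, and the 1.25 headroom factor and rounding to the nearest 1000 are assumptions you should replace with your own policy.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_limit(samples, headroom=1.25, round_to=1000):
    """Return (median, p95, p99, limit), where limit = p99 * headroom, rounded up.

    headroom and round_to are illustrative assumptions, not recommendations.
    """
    median = statistics.median(samples)
    p95 = percentile(samples, 95)
    p99 = percentile(samples, 99)
    limit = math.ceil(p99 * headroom / round_to) * round_to
    return median, p95, p99, limit


# Hypothetical received-prefix counts from one neighbor over the measurement window.
samples = [100_000, 101_000, 102_000, 103_000, 104_000,
           105_000, 106_000, 107_000, 108_000, 112_000]

median, p95, p99, limit = suggest_limit(samples)
print(f"median={median} p95={p95} p99={p99} suggested limit={limit}")
```

With real data you would feed in one sample per polling interval for the whole 7–14 day window, then sanity-check the suggested limit against the size of a realistic leak.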

2) Design: A 3-layer guardrail

The most stable model in the field:

  1. Warning threshold: Alarm at 80–90% of the hard limit
  2. Hard limit: A specific “trip” count
  3. Trip behavior: Should the session drop, should the routes go away, or just log?

Trip behavior depends on the organization’s risk appetite:

  • On some edges, “tear down the session” is better (don’t accept broken information)
  • On others, “keep the session up but stop accepting new prefixes” (if the vendor supports it)

What matters: this behavior should not be a surprise during an incident.
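To make the three layers concrete, here is a small Python sketch of how a monitoring script might evaluate a prefix count against them. The class name, the action labels, and the choice to trip only when the count strictly exceeds the limit are assumptions for illustration; real routers implement this logic themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    hard_limit: int                 # layer 2: the trip count
    warn_percent: int = 90          # layer 1: warning threshold, percent of hard_limit
    trip_action: str = "teardown"   # layer 3: hypothetical label (teardown / stop-accepting / log-only)

    @property
    def warn_count(self) -> int:
        """Prefix count at which the warning fires."""
        return self.hard_limit * self.warn_percent // 100

    def evaluate(self, prefix_count: int) -> str:
        """Classify a prefix count as ok, warning, or a trip with its action."""
        if prefix_count > self.hard_limit:
            return f"trip:{self.trip_action}"
        if prefix_count >= self.warn_count:
            return "warning"
        return "ok"


guardrail = Guardrail(hard_limit=140_000)
for count in (100_000, 126_000, 140_001):
    print(count, guardrail.evaluate(count))
```

Writing the trip action down as an explicit field is the point of the sketch: the behavior is decided at design time, not discovered during the incident.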

3) Practical configuration examples (Junos and IOS-XR)

The examples below illustrate the concept; adapt them to your own device.

Junos (example)

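set protocols bgp group TRANSIT neighbor <peer> family inet unicast prefix-limit maximum 140000
set protocols bgp group TRANSIT neighbor <peer> family inet unicast prefix-limit teardown 90 idle-timeout 60

In Junos, the number after teardown is the percentage of the maximum at which warning messages start being logged, and idle-timeout is the number of minutes the session stays down after a trip. The value 140000 is a placeholder; derive yours from the baseline.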

Cisco IOS-XR (example)

router bgp <asn>
 neighbor <peer>
  address-family ipv4 unicast
   maximum-prefix 140000 90 restart 1

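The key nuance here: 90 is the warning threshold, expressed as a percentage of the maximum. The router logs an “approaching the limit” warning at 90% of 140000 (126000 prefixes), and restart 1 brings the session back up 1 minute after a trip. A restart interval that short can make the session cycle repeatedly while a leak persists, so choose it deliberately.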
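To check what the numbers on such a line mean, a short Python sketch can split it into its fields and compute the warning count. It assumes only the form shown above, `maximum-prefix <max> [<threshold%>] [restart <minutes>]`, and ignores every other keyword the command accepts; the function name is hypothetical.

```python
def parse_maximum_prefix(line):
    """Parse 'maximum-prefix <max> [<threshold%>] [restart <minutes>]' into a dict.

    Only the form used in this article is handled; other options are ignored.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "maximum-prefix":
        raise ValueError(f"not a maximum-prefix line: {line!r}")

    result = {
        "maximum": int(tokens[1]),
        "threshold_percent": None,
        "warning_count": None,
        "restart_minutes": None,
    }
    rest = tokens[2:]
    # An optional bare number after the maximum is the warning threshold in percent.
    if rest and rest[0].isdigit():
        result["threshold_percent"] = int(rest[0])
        result["warning_count"] = result["maximum"] * result["threshold_percent"] // 100
        rest = rest[1:]
    # An optional 'restart <minutes>' pair follows.
    if len(rest) >= 2 and rest[0] == "restart":
        result["restart_minutes"] = int(rest[1])
    return result


print(parse_maximum_prefix("maximum-prefix 140000 90 restart 1"))
```

A check like this is handy in a config audit: it lets you compare the warning count on every neighbor against the baseline you measured.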

4) Monitoring: Which alarm actually helps?

Set three alarms cleanly:

  • Prefix count threshold exceeded (warning)
  • Session flap (stability)
  • Control-plane CPU / route processing duration (capacity)

When these alarms are read together, the question “is this a leak or normal growth?” can be answered much faster.
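One way to read the three alarms together is a simple correlation rule, sketched below in Python. Everything here is an assumption for illustration: the field names, the extra growth-rate signal derived from the prefix count, and the 10% per hour cutoff are hypothetical and would need tuning against your own telemetry.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    prefix_warning: bool     # prefix count crossed the warning threshold
    session_flaps: int       # session flaps seen in the observation window
    cpu_high: bool           # control-plane CPU / route processing alarm is active
    growth_per_hour: float   # prefix-count growth per hour, as a fraction of the baseline


def classify(signals, sudden_growth=0.10):
    """Return a coarse label; sudden_growth is an illustrative cutoff, not a standard."""
    if not signals.prefix_warning:
        return "no-prefix-alarm"
    stressed = signals.session_flaps > 0 or signals.cpu_high
    if signals.growth_per_hour >= sudden_growth and stressed:
        return "suspected-leak"
    if signals.growth_per_hour < sudden_growth and not stressed:
        return "likely-normal-growth"
    return "needs-human-review"


print(classify(Signals(True, 3, True, 0.40)))    # sudden jump with flaps and CPU load
print(classify(Signals(True, 0, False, 0.01)))   # slow growth, stable sessions
```

The rule is deliberately conservative: anything that does not clearly match either pattern goes to a human rather than being auto-classified.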

5) Incident runbook: When max-prefix trips

  1. Quick verification: Is it really max-prefix?
    • A “prefix limit exceeded”-style entry in the logs
    • Prefix count spike in NMS/telemetry
  2. Impact: Which services were affected? (internet egress, partner)
  3. Source: Which peer? (transit/IX/partner)
  4. Decision:
    • If the leak is upstream: escalation to the upstream + temporary filter
    • If it’s our side: last config diff, last maintenance, change id
  5. Temporary mitigation (with explicit risk acceptance):
    • Raise the limit briefly (only if there’s evidence that the growth is legitimate)
    • Or shut the session in a controlled way (so the broken route doesn’t enter)
  6. Permanent action:
    • Tighten the prefix-filter/policy
    • Update the max-prefix value against the new baseline
    • Postmortem: “why didn’t the alarm fire earlier?”
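The runbook above can also be kept as a small, testable function so that the on-call engineer gets the same ordered steps every time. This Python sketch is only one possible encoding: the function name, the "upstream"/"ours" labels, and the choice to require both the log entry and the telemetry spike before proceeding are assumptions.

```python
def triage(log_hit, telemetry_spike, origin):
    """Return the ordered runbook steps for a suspected max-prefix trip.

    origin is a hypothetical label: "upstream" or "ours".
    """
    if origin not in ("upstream", "ours"):
        raise ValueError(f"unknown origin: {origin!r}")

    # Step 1: do not proceed until both pieces of evidence are present.
    if not (log_hit and telemetry_spike):
        return ["verify: confirm the log entry and the telemetry spike first"]

    steps = [
        "impact: list affected services (internet egress, partner)",
        "source: identify the peer (transit/IX/partner)",
    ]
    if origin == "upstream":
        steps.append("decision: escalate to the upstream and apply a temporary filter")
    else:
        steps.append("decision: review last config diff, last maintenance, change id")
    steps.append("mitigate: raise the limit briefly or shut the session (explicit risk acceptance)")
    steps.append("permanent: tighten filters, re-baseline the limit, run the postmortem")
    return steps


for step in triage(log_hit=True, telemetry_spike=True, origin="upstream"):
    print(step)
```

Keeping the steps in code or in a versioned document matters less than keeping them in one place that is reviewed after every incident.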

Conclusion

The max-prefix limit is one of the lowest-cost, highest-impact guardrails at the edge. Its value shows up in the outage that never happens: when a route leak arrives, it protects the control plane, shrinks the incident blast radius, and reduces uncertainty at the moment of decision. What makes a difference in the field isn’t writing the command; it’s establishing the baseline, choosing the alarm thresholds correctly, and actually running the runbook.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.
