Mustafa Erbay

BGP Traffic Engineering Runbook for the Enterprise Edge

A practical runbook for steering traffic with localpref, community, prepend, and MED in multi-ISP and multi-POP environments — measurable and reversible.

The day you start speaking BGP at the enterprise edge, you don’t just open up “internet egress”; you also stand up a control plane that decides how your traffic flows. The problem: for most teams, BGP Traffic Engineering (TE) degenerates into “turning a few knobs.” The outcome is predictable:

  • Unintended inbound/outbound shifts
  • POP/ISP imbalance, capacity surprises
  • Panic-driven changes during incidents
  • “Permanent temporary fixes” with no rollback plan

This article presents the most useful TE tools at the enterprise edge (localpref, community, prepend, MED) in a runbook format, organized around one question: “which one, and when?”

1) The first distinction: outbound or inbound?

To run TE properly, separate two questions clearly:

  1. Outbound (egress): Which ISP/POP should outbound traffic from your organization use?
  2. Inbound (ingress): Which ISP/POP should incoming traffic from the internet arrive on?

This distinction is critical because most knobs only affect one direction:

  • LocalPref: typically outbound selection (internal policy)
  • AS-path prepend: typically inbound effect (the path others see)
  • MED: can affect inbound, but only between entry points into the same neighboring AS (by default, MED is not compared across different upstreams)
  • Community: a “do this for me” signal to the upstream (can be inbound/outbound)

2) Minimum observation set (before making a change)

The most expensive mistake in TE: you made the change but didn’t measure what you changed.

Minimum signals before the change:

  • BGP session health: session up/down, flap count
  • Prefix/route counts: accepted/announced prefixes, “expected vs observed”
  • Traffic distribution: bps/pps/flow per ISP/POP
  • Service impact: latency/timeout for critical services (internet egress affects them)
  • DNS/Anycast (if applicable): query/rcode distribution per POP

Operational practice: record a 15–30 minute observation window before the TE change, then record a window of the same length after the change and compare the two.
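The before/after comparison can be reduced to a few lines. The following is a minimal sketch, assuming you can export an average bps figure per uplink for each window; the function names and the sample numbers are hypothetical.

```python
from typing import Dict


def traffic_share(bps_by_link: Dict[str, float]) -> Dict[str, float]:
    """Return each link's share of total traffic as a percentage."""
    total = sum(bps_by_link.values())
    if total == 0:
        return {link: 0.0 for link in bps_by_link}
    return {link: round(100.0 * bps / total, 1) for link, bps in bps_by_link.items()}


def share_shift(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point change per link between two observation windows."""
    b, a = traffic_share(before), traffic_share(after)
    links = sorted(set(b) | set(a))
    return {link: round(a.get(link, 0.0) - b.get(link, 0.0), 1) for link in links}


# Hypothetical average bps per uplink over each 30-minute window.
before = {"ISP-A": 900e6, "ISP-B": 100e6}
after = {"ISP-A": 700e6, "ISP-B": 300e6}

print(traffic_share(after))
print(share_shift(before, after))
```

Comparing shares rather than raw bps keeps the result readable when total traffic differs between the two windows.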

3) Tool selection: which knob, when?

3.1 For outbound: LocalPref (the primary tool)

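LocalPref is the most deterministic tool for outbound selection: it is compared early in BGP best-path selection, ahead of AS-path length, and it is set entirely inside your own AS.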

When to use it:

  • “Egress through ISP-A, ISP-B as backup”
  • “POP-1 egress is saturated; bring POP-2 online”
  • “Different egress for specific destination ASes”

Runbook steps:

  1. Write the goal: a clear measure like “Outbound 70% ISP-A, 30% ISP-B”
  2. Apply only to a single class of routes: e.g. “default route” or “transit learned”
  3. Start at one POP (ring rollout)
  4. Rollback: keep the previous localpref value at the ready
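To see why LocalPref is deterministic and why keeping the previous value makes rollback trivial, here is a deliberately simplified model of outbound selection. It considers only LocalPref and AS-path length and ignores the remaining best-path steps; the route data and the ASNs (taken from the documentation range) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    via: str                  # uplink the route was learned from
    local_pref: int           # higher wins
    as_path: Tuple[int, ...]  # shorter wins, but only on a LocalPref tie


def best_outbound(routes: List[Route]) -> Route:
    """Simplified selection: highest LocalPref first, then shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


# Policy under test: prefer ISP-A even though its AS path is longer.
changed = [
    Route("ISP-A", local_pref=200, as_path=(64500, 64510, 64511)),
    Route("ISP-B", local_pref=100, as_path=(64501, 64511)),
]
# Previous state kept at the ready for rollback: equal LocalPref on both uplinks.
previous = [
    Route("ISP-A", local_pref=100, as_path=(64500, 64510, 64511)),
    Route("ISP-B", local_pref=100, as_path=(64501, 64511)),
]

print(best_outbound(changed).via)
print(best_outbound(previous).via)
```

With the higher LocalPref the longer path through ISP-A still wins; restoring the previous values hands the decision back to AS-path length.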

3.2 For inbound: Community (the cleanest tool, when available)

Upstreams typically provide controls of these kinds via communities:

  • Prepend to a specific POP/region
  • Blackhole / RTBH
  • Localpref manipulation (within the upstream)
  • Propagation limits like “no-export”

When to use it:

  • “I want control inside the ISP’s network”
  • “Temporarily de-prefer a POP on inbound”

Risk: if the community contract isn’t documented, or different teams interpret it differently, “one line of config” can have far larger effects than intended.
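One way to reduce that risk is to keep the community contract as data and refuse to tag routes with anything that is not in it. The sketch below assumes a hypothetical upstream with ASN 64500 and invented community meanings; only the two well-known values are standardized, and your upstream's real values must come from its own documentation.

```python
from typing import Iterable, List

# Standardized well-known communities.
WELL_KNOWN = {
    "65535:65281": "NO_EXPORT: do not advertise to external peers",
    "65535:666": "BLACKHOLE: request that traffic to the prefix be discarded",
}

# Hypothetical contract for one upstream; values and meanings are invented.
UPSTREAM_CONTRACT = {
    "64500:101": "prepend 1x toward region EU",
    "64500:102": "prepend 2x toward region EU",
    "64500:80": "lower LocalPref inside the upstream",
}


def undocumented(communities: Iterable[str]) -> List[str]:
    """Return the communities that appear in neither table."""
    known = WELL_KNOWN.keys() | UPSTREAM_CONTRACT.keys()
    return [c for c in communities if c not in known]


planned = ["64500:101", "64500:999", "65535:666"]
print(undocumented(planned))
```

A non-empty result is a reason to stop and check the upstream's documentation before the change goes out.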

3.3 For inbound: AS-path prepend (most common but most uncertain)

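Prepend influences remote path selection only indirectly: it lengthens the AS path that others see, but any remote network whose own policy (for example its LocalPref) prefers a different path will ignore it. So treat it as a coarse steering tool rather than a fine adjustment.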

When to use it:

  • When upstream community options are limited
  • When you want to create “preference” between two ISPs

Operational rules:

  • Don’t go aggressive in one shot; step it up gradually (e.g. +1, then +2)
  • Use a different prepend count per POP/ISP to match each link’s capacity
  • Don’t make it “permanent” without measuring its effect; revisit within 24 hours
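The gradual approach can be illustrated with a toy model of a single remote network that chooses purely on AS-path length. This is a sketch under strong assumptions: real remote networks apply their own LocalPref first, and the topology and documentation-range ASNs below are hypothetical.

```python
from typing import Dict, Optional, Tuple

LOCAL_AS = 64496                       # stands in for your own AS
ISP_A, ISP_B, TRANSIT = 64500, 64501, 64510


def announced(extra_prepends: int) -> Tuple[int, ...]:
    """AS path you announce: your ASN once, plus the extra prepends."""
    if extra_prepends < 0:
        raise ValueError("extra_prepends must be >= 0")
    return (LOCAL_AS,) * (1 + extra_prepends)


def shortest(paths: Dict[str, Tuple[int, ...]]) -> Optional[str]:
    """Name of the strictly shortest path, or None when the top two tie."""
    ranked = sorted(paths.items(), key=lambda item: len(item[1]))
    if len(ranked) > 1 and len(ranked[0][1]) == len(ranked[1][1]):
        return None
    return ranked[0][0]


def remote_pick(extra_on_isp_a: int) -> Optional[str]:
    """Entry link chosen by a remote AS that is one extra hop away via ISP-B."""
    paths = {
        "ISP-A": (ISP_A,) + announced(extra_on_isp_a),
        "ISP-B": (TRANSIT, ISP_B) + announced(0),
    }
    return shortest(paths)


for step in (0, 1, 2):
    print(step, remote_pick(step))
```

In this model the +1 step only produces a tie, which other tie-breakers outside your control will resolve, and the shift to ISP-B appears at +2. That is one reason to step up gradually and measure after each step.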

3.4 MED: only meaningful in the right context

MED is generally meaningful only for choosing between different entry points into the same upstream AS. In multi-ISP scenarios it usually doesn’t deliver the effect you expect, because routers do not compare MED across different neighboring ASes by default.

Rule: don’t use MED like a “lifesaver knob”; use it as a bounded signal.
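The bounded nature of MED can be expressed as a guard: the comparison only happens when both paths come from the same neighboring AS. This is a minimal sketch of the default behaviour, not of any vendor's full best-path algorithm; the entry names and ASNs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Path:
    entry: str        # entry point (POP or link) the path leads to
    neighbor_as: int  # AS the path was received from
    med: int          # lower is preferred


def med_preference(a: Path, b: Path) -> Optional[Path]:
    """Return the path MED prefers, or None when MED does not decide."""
    if a.neighbor_as != b.neighbor_as:
        return None  # different upstreams: MED is not compared by default
    if a.med == b.med:
        return None  # equal MED: later tie-breakers decide
    return a if a.med < b.med else b


pop1 = Path("POP-1", neighbor_as=64500, med=50)
pop2 = Path("POP-2", neighbor_as=64500, med=100)
other_isp = Path("POP-2", neighbor_as=64501, med=10)

print(med_preference(pop1, pop2))
print(med_preference(pop1, other_isp))
```

The second call returns None even though the other ISP carries the lowest MED value, which is the multi-ISP case where MED has no effect.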

4) Change flow (operational model)

The flow that has worked best for me in the field:

  1. Goal sentence: “20% of inbound traffic will shift to POP-2”
  2. Scope: which prefixes? entire internet, or a specific service?
  3. Observation: which dashboards/metrics are the decision criteria?
  4. Rollback: is one-command rollback possible?
  5. Time box: 30 min observation, then decision
  6. Record: change log (what, why, what happened)
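The six steps above can be captured as a small change record that blocks a change with missing fields and turns the time-boxed observation into a keep-or-rollback decision. This is a sketch only: the field names, the 5-point tolerance, and the rollback text are assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TEChange:
    goal: str                   # goal sentence
    scope: List[str]            # affected prefixes
    metrics: List[str]          # dashboards/metrics used as decision criteria
    rollback: str               # one-step rollback instruction
    observe_minutes: int = 30   # time box
    log: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of required fields that are still empty."""
        required = {
            "goal": self.goal,
            "scope": self.scope,
            "metrics": self.metrics,
            "rollback": self.rollback,
        }
        return [name for name, value in required.items() if not value]

    def decide(self, target_pct: float, observed_pct: float,
               tolerance_pct: float = 5.0) -> str:
        """Keep the change if the observed shift is within tolerance of the goal."""
        ok = abs(observed_pct - target_pct) <= tolerance_pct
        verdict = "keep" if ok else "rollback"
        self.log.append(f"target={target_pct}% observed={observed_pct}% -> {verdict}")
        return verdict


change = TEChange(
    goal="20% of inbound traffic will shift to POP-2",
    scope=["192.0.2.0/24"],
    metrics=["bps per POP"],
    rollback="",
)
print(change.missing())
change.rollback = "restore previous policy revision (hypothetical)"
print(change.missing())
print(change.decide(target_pct=20, observed_pct=18))
print(change.log)
```

The log list doubles as the change record: each decision leaves a line stating what was expected and what was observed.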

5) Incident triage: when traffic goes the “wrong” way

5.1 Initial checklist (5 minutes)

  • Are BGP sessions stable? Any flaps?
  • Has the announced prefix set changed? (missing/extra announcements)
  • Did route-map/policy ordering change?
  • Did intra-POP routing on the anycast/ECMP/IGP side break?
  • Is there maintenance or a policy change on the upstream side?
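The announced-prefix check is easy to automate. Here is a minimal sketch, assuming you can export the prefixes you expect to announce and the prefixes actually seen (for example from a looking glass or your own router); the function name and the sample prefixes from the documentation ranges are hypothetical.

```python
import ipaddress
from typing import Dict, Iterable, List


def diff_announcements(expected: Iterable[str],
                       observed: Iterable[str]) -> Dict[str, List[str]]:
    """Return prefixes that are missing from, or extra in, the observed set."""
    exp = {ipaddress.ip_network(p) for p in expected}
    obs = {ipaddress.ip_network(p) for p in observed}
    return {
        "missing": sorted(str(n) for n in exp - obs),
        "extra": sorted(str(n) for n in obs - exp),
    }


expected = ["192.0.2.0/24", "198.51.100.0/24"]
observed = ["192.0.2.0/24", "203.0.113.0/24"]
print(diff_announcements(expected, observed))
```

Parsing with `ipaddress` rather than comparing raw strings rejects malformed input early and works for IPv6 prefixes as well.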

5.2 Quick action (least risky)

The least-risky rollback is usually this:

  • Outbound problem: revert localpref to the previous value
  • Inbound problem: undo the community/prepend you added

The reflex of “fix by adding a new setting” extends the crisis. First return the system to its previous stable state, then look for the root cause.

6) Minimum viable TE checklist

  • Inbound/outbound goals written separately
  • TE change tested at one POP (ring rollout)
  • 30 min before/after observation window recorded
  • Rollback command at the ready
  • Upstream community contract documented
  • A “TE triage” runbook for incident time exists

At the enterprise edge, BGP TE isn’t about “making the network prettier”; it’s a job done to manage operational risk. Even when the knobs stay the same, the outcome changes when the approach changes: with a goal sentence, measurement, and rollback, TE stops being a source of surprises.


Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked across system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
