Mustafa Erbay
Technology · 12 min read

Edge Service Design with BGP Anycast: DNS and DDoS Resilience

A practical edge design guide that treats routing, health signals, capacity, and attack scenarios together, so that Anycast's real benefits actually show up in production.

When you tell people “I’m moving the service to the edge,” most teams reach for CDN or WAF first. But the most critical part of edge design is often the routing decision itself: how will you steer user traffic to the closest/healthiest POP (Point of Presence), how will you behave under capacity saturation, and how will you stay up under DDoS?

In this article I’m treating BGP Anycast not as a “magical proximity button,” but as an operational system that has to be designed together with health signals, automation, capacity, and attack scenarios.

What Anycast solves, and what it doesn’t

The problem it solves: Serving from multiple POPs with a single IP (or a prefix like a /24) and distributing traffic via routing. The most common use cases:

  • Authoritative DNS (Anycast NS)
  • Recursive DNS (enterprise resolver)
  • L4 edge (TCP/UDP proxy, DDoS scrubbing, game/voice services)
  • API gateway (especially for stateless and short-lived connections)

The problem it doesn’t solve: Anycast doesn’t automatically carry application-layer state. For long-lived TCP sessions, websockets, systems that need sticky sessions, or back-end calls where data consistency matters, “a closer POP” alone isn’t a win.

Core building blocks: Prefix, POP, and upstream

Think of a sound Anycast design as three layers:

  1. Anycast prefix: Typically a /24 in IPv4 or a /48 in IPv6, the longest prefixes most networks will accept in the global routing table.
  2. POP: At each POP, an edge (L4/L7) that binds this prefix to the service, plus a routing device.
  3. Upstream/IX: The transits or internet exchanges where the POP runs BGP peering.
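
Before a prefix goes into a POP config, it is worth checking that it is short enough to propagate. The sketch below is a minimal check, assuming the /24 and /48 conventions mentioned above; these are common filtering conventions rather than protocol limits, and the function name is hypothetical.

```python
import ipaddress

# Longest prefix lengths that most networks accept in the global table.
# Assumption: these are filtering conventions, not protocol limits.
MAX_PREFIXLEN = {4: 24, 6: 48}


def is_widely_accepted(prefix: str) -> bool:
    """Return True if the prefix is short enough to pass common filters."""
    net = ipaddress.ip_network(prefix, strict=True)
    return net.prefixlen <= MAX_PREFIXLEN[net.version]


if __name__ == "__main__":
    # Documentation prefixes (RFC 5737 / RFC 3849), used only as examples.
    for candidate in ("192.0.2.0/24", "192.0.2.0/25",
                      "2001:db8::/48", "2001:db8::/56"):
        print(candidate, is_widely_accepted(candidate))
```

A longer prefix (for example a /25) may still be usable on a private peering session, but it should not be relied on for global reachability.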

The critical questions for this setup:

  • In how many POPs will you announce the prefix?
  • Is there capacity asymmetry between POPs?
  • Is your upstream diversity and “path diversity” sufficient?
  • Which signals are needed for a POP to count as “fully healthy”?

Health signals: “BGP up” doesn’t mean healthy

The most common mistake: “If the BGP neighbor is up, the POP is healthy.” That leaves your edge open to partial failures:

  • The edge proxy is running but the back-end is unreachable
  • DNS is responding but authoritative zone sync is broken
  • CPU/disk is full, latency has spiked
  • DDoS mitigation is engaged but blocking everything via false positives

Build health signals at two levels:

  • Data-plane health: Latency, error rate, and TCP handshake success measured from real traffic.
  • Control-plane health: BGP session, route policy, config drift, certificate/zone freshness.

The practical approach: tie each POP’s upstream announcement to service health, under the principle “withdraw it if the service isn’t healthy.”
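
One way to encode that principle is a small gate that combines both health levels before deciding whether to announce. This is a minimal sketch: the field names and thresholds are hypothetical, and the command string is modeled on the text commands ExaBGP accepts from a helper process, so treat the exact syntax as an assumption to verify against your routing daemon.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataPlane:
    p95_latency_ms: float
    error_rate: float         # fraction of failed requests, 0.0 to 1.0
    handshake_success: float  # fraction of completed TCP handshakes


@dataclass
class ControlPlane:
    bgp_session_up: bool
    config_in_sync: bool
    zone_age_s: float         # seconds since the last successful zone sync


# Hypothetical thresholds; tune them per service.
MAX_P95_MS = 50.0
MAX_ERROR_RATE = 0.02
MIN_HANDSHAKE = 0.98
MAX_ZONE_AGE_S = 900.0


def failed_checks(dp: DataPlane, cp: ControlPlane) -> List[str]:
    """Return the names of all failing checks (empty list means healthy)."""
    failures = []
    if not cp.bgp_session_up:
        failures.append("bgp_down")
    if not cp.config_in_sync:
        failures.append("config_drift")
    if cp.zone_age_s > MAX_ZONE_AGE_S:
        failures.append("zone_stale")
    if dp.p95_latency_ms > MAX_P95_MS:
        failures.append("latency")
    if dp.error_rate > MAX_ERROR_RATE:
        failures.append("errors")
    if dp.handshake_success < MIN_HANDSHAKE:
        failures.append("handshake")
    return failures


def should_announce(dp: DataPlane, cp: ControlPlane) -> bool:
    """'BGP up' is necessary but not sufficient: every check must pass."""
    return not failed_checks(dp, cp)


def bgp_command(prefix: str, healthy: bool) -> str:
    """Build an ExaBGP-style text command (syntax is an assumption)."""
    verb = "announce" if healthy else "withdraw"
    return f"{verb} route {prefix} next-hop self"
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to log why a POP pulled its prefix.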

Anycast failover strategies (and their side effects)

Doing failover via “BGP withdraw” looks simple, but its impact is global. Three common patterns:

1) Hard withdraw (pull the prefix entirely)

  • Plus: The clearest signal; traffic flows to other POPs.
  • Minus: Global route churn; traffic is disrupted while routes converge, and the POPs that absorb the shifted traffic start with cold caches.

2) De-preference (be reluctant via BGP attributes)

E.g. AS-path prepending, MED tuning, or using communities to ask upstreams to lower their local preference for your route.

  • Plus: More controlled, gradual draining.
  • Minus: Upstream/policy dependency; not every part of the internet behaves the same way.

3) Partial withdraw / scope reduction

Pull from some upstreams, keep others.

  • Plus: Good for partial situations like “capacity saturation.”
  • Minus: Operational complexity goes up.
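
The three patterns can be arranged as an escalation ladder. The sketch below is one possible mapping, not a prescribed one: the saturation threshold is hypothetical, and the debounce step (requiring the same decision several times in a row before acting) is my own addition to limit the route churn noted above.

```python
from enum import Enum


class Action(Enum):
    ANNOUNCE = "announce"
    DEPREFER = "de-prefer"
    PARTIAL_WITHDRAW = "partial-withdraw"
    HARD_WITHDRAW = "hard-withdraw"


def choose_action(service_ok: bool, degraded: bool, utilization: float,
                  saturation_threshold: float = 0.85) -> Action:
    """Map POP state to a failover pattern (thresholds are hypothetical)."""
    if not service_ok:
        return Action.HARD_WITHDRAW
    if utilization >= saturation_threshold:
        return Action.PARTIAL_WITHDRAW
    if degraded:
        return Action.DEPREFER
    return Action.ANNOUNCE


class Debounced:
    """Switch actions only after `hold` consecutive identical candidates."""

    def __init__(self, hold: int = 3):
        self.hold = hold
        self.current = Action.ANNOUNCE
        self._candidate = Action.ANNOUNCE
        self._streak = 0

    def update(self, candidate: Action) -> Action:
        if candidate == self.current:
            # Back to the current action: discard any pending change.
            self._candidate, self._streak = candidate, 0
            return self.current
        if candidate == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = candidate, 1
        if self._streak >= self.hold:
            self.current, self._streak = candidate, 0
        return self.current
```

With `hold=3`, a single bad sample never moves routes; three in a row do. The trade-off is slower reaction to a real outage, so the hold count and sampling interval need to be chosen together.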

Anycast under DDoS: Strengths and pitfalls

Anycast’s biggest DDoS advantage is “spreading the attack out.” Instead of hitting a single POP, the attack fans out across many POPs that BGP picks, so each POP sees a smaller share of the load.

But there are two pitfalls:

  1. The weakest POP becomes the target: POPs with low capacity, or POPs with thin upstreams, fold first.
  2. Routing becomes an attack vector: An attacker can produce traffic from specific regions to turn certain POPs into “hotspots.”

So a design checklist for Anycast + DDoS:

  • Per-POP scrubbing/rate-limit capacity and instrumentation
  • Upstream diversity (no dependence on a single transit)
  • Blackhole communities (RTBH) and automation
  • Architecture that separates “clean” from “dirty” traffic (when feasible)
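
For the RTBH item, the automation can be as small as "find the addresses under attack, emit host routes tagged with the blackhole community." The sketch below assumes per-destination packet rates are already available from flow data; the function names, the rate limit, and the ExaBGP-style command format are illustrative assumptions. 65535:666 is the well-known BLACKHOLE community from RFC 7999, but many upstreams use their own values, so check their documentation.

```python
import ipaddress
from typing import Dict, List

BLACKHOLE_COMMUNITY = "65535:666"  # well-known BLACKHOLE community (RFC 7999)


def rtbh_candidates(pps_by_dst: Dict[str, int], anycast_prefix: str,
                    pps_limit: int) -> List[str]:
    """Return host routes inside the prefix whose packet rate exceeds the limit."""
    net = ipaddress.ip_network(anycast_prefix)
    victims = []
    for dst, pps in pps_by_dst.items():
        addr = ipaddress.ip_address(dst)
        in_prefix = addr.version == net.version and addr in net
        if in_prefix and pps > pps_limit:
            victims.append(f"{addr}/{addr.max_prefixlen}")
    return sorted(victims)


def rtbh_announcement(host_route: str, next_hop: str) -> str:
    """Build an ExaBGP-style command string (syntax is an assumption)."""
    return (f"announce route {host_route} next-hop {next_hop} "
            f"community [{BLACKHOLE_COMMUNITY}]")
```

Keep in mind that blackholing a destination drops legitimate traffic to it as well, so this belongs behind a human-approved or tightly rate-limited automation path.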

Capacity engineering: If POPs aren’t equal, you need policy

If your POPs have different capacities (which is the case in most enterprises), “every POP announces the same prefix” doesn’t translate into load balancing. BGP doesn’t know about capacity.

Practical techniques:

  • Tiered POP: Big POPs sit as “prefer,” small ones as “overflow.”
  • Region steering: Use upstream communities to be less visible in some regions.
  • Graduated prepend: Increase prepend on the small POP, decrease it on the big one.

Before applying these policies, accept this: there is no “perfect balancing” in Anycast; your goal should be predictability.
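
As an illustration of graduated prepend, the sketch below derives a prepend count from relative capacity: each halving of capacity compared with the largest POP adds one prepend, up to a cap. The heuristic, the cap, and the POP names are hypothetical; prepending shifts traffic in coarse, upstream-dependent steps and will not produce capacity-proportional load.

```python
import math
from typing import Dict


def prepend_plan(capacity_gbps: Dict[str, float],
                 max_prepend: int = 3) -> Dict[str, int]:
    """Suggest AS-path prepend counts from relative POP capacity."""
    if not capacity_gbps:
        return {}
    if any(cap <= 0 for cap in capacity_gbps.values()):
        raise ValueError("capacity must be positive for every POP")
    biggest = max(capacity_gbps.values())
    plan = {}
    for pop, cap in capacity_gbps.items():
        # One prepend per halving of capacity relative to the largest POP.
        halvings = math.floor(math.log2(biggest / cap))
        plan[pop] = min(max_prepend, halvings)
    return plan


if __name__ == "__main__":
    print(prepend_plan({"fra": 400.0, "ist": 100.0, "waw": 40.0}))
```

Treat the output as a starting point for a test, then correct it against the traffic share you actually measure.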

Observability: What do I look at to measure Anycast?

In Anycast, the most valuable metric is not “the metric of a single POP”; it’s the global distribution.

In the field, I track these signals together:

  • Per POP: p50/p95 latency, error rate, conn rate, saturation (CPU/mem), upstream packet loss
  • Global: POP traffic share (%), POP-shift rate (churn), withdraw/de-prefer event count
  • For DNS: NXDOMAIN ratio, SERVFAIL, zone serial drift, resolver cache hit/miss signal (if you can capture it)
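
Two of the global signals, POP traffic share and POP-shift rate, are simple to compute once the raw data exists. The sketch below assumes per-POP request counts and a set of external probes that report which POP answered them; both inputs and all names are hypothetical.

```python
from typing import Dict


def traffic_share(requests_by_pop: Dict[str, int]) -> Dict[str, float]:
    """Return each POP's share of total traffic, in percent."""
    total = sum(requests_by_pop.values())
    if total == 0:
        return {pop: 0.0 for pop in requests_by_pop}
    return {pop: 100 * count / total for pop, count in requests_by_pop.items()}


def pop_shift_rate(before: Dict[str, str], after: Dict[str, str]) -> float:
    """Fraction of probes (seen in both rounds) whose serving POP changed."""
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    moved = sum(1 for probe in common if before[probe] != after[probe])
    return moved / len(common)
```

A shift rate that stays high while no withdraw or de-prefer event was issued is a hint that the churn is coming from upstream routing rather than from your own policy.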

Test approach: A “withdraw” scenario without touching prod

You can’t finish an Anycast design “at the desk”; controlled testing is required.

A sample test plan:

  1. De-prefer the prefix on a single POP (prepend/community).
  2. Measure whether the traffic distribution shifted as expected.
  3. Hard-withdraw on the same POP and measure convergence time.
  4. When bringing it back, measure the “cold start” effect (cache, connection ramp-up).
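
For step 3, "convergence time" needs a concrete definition before it can be measured. The sketch below uses one possible definition: the time from the withdraw until the POP's traffic share drops to a residual level and stays there for the rest of the observation window. The sample format and the 1% residual are assumptions.

```python
from typing import List, Optional, Tuple


def convergence_seconds(samples: List[Tuple[float, float]],
                        withdraw_at: float,
                        residual_share: float = 1.0) -> Optional[float]:
    """Seconds from withdraw until the POP's share settles at the residual.

    samples: (timestamp_s, share_percent) pairs for the withdrawn POP,
    sorted by timestamp. Returns None if the share never settles.
    """
    settled_at = None
    for ts, share in samples:
        if ts < withdraw_at:
            continue
        if share <= residual_share:
            if settled_at is None:
                settled_at = ts
        else:
            # Traffic came back above the residual: not converged yet.
            settled_at = None
    return None if settled_at is None else settled_at - withdraw_at


if __name__ == "__main__":
    observed = [(0, 30.0), (10, 30.0), (20, 12.0), (30, 0.5),
                (40, 2.0), (50, 0.4), (60, 0.0)]
    print(convergence_seconds(observed, withdraw_at=10))
```

Requiring the share to stay below the residual matters: in the sample data the share dips at 30 s, bounces back at 40 s, and only settles at 50 s, so the measured convergence is 40 seconds rather than 20.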

Closing: Win Anycast through “operations,” not “the product”

Done right, Anycast gives edge services lower latency, resilience, and DDoS tolerance all at once. Done wrong, it produces a user experience that is “sometimes near, sometimes far, sometimes lost.”

In my view, a good Anycast design comes down to:

  • Measure health as “service is good,” not “BGP up”
  • Manage failover gradually (de-prefer + withdraw)
  • Make POP capacity asymmetry visible through routing policy
  • Test the DDoS scenario like it’s a normal day

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.