Edge Service Design with BGP Anycast: DNS and DDoS Resilience

When you tell people “I’m moving the service to the edge,” most teams reach for CDN or WAF first. But the most critical part of edge design is often the routing decision itself: how will you steer user traffic to the closest/healthiest POP (Point of Presence), how will you behave under capacity saturation, and how will you stay up under DDoS?

In this article I’m treating BGP Anycast not as a “magical proximity button,” but as an operational system that has to be designed together with its health signals, automation, capacity, and attack scenarios.

What Anycast solves, and what it doesn’t

The problem it solves: Serving from multiple POPs with a single IP (or a prefix like a /24) and distributing traffic via routing. The most common use cases:

Authoritative DNS (Anycast NS)
Recursive DNS (enterprise resolver)
L4 edge (TCP/UDP proxy, DDoS scrubbing, game/voice services)
API gateway (especially for stateless and short-lived connections)

The problem it doesn’t solve: Anycast doesn’t automatically carry application-layer state. If routing shifts in the middle of a session, packets can arrive at a POP that has no record of that connection. For long-lived TCP sessions, websockets, systems that need sticky sessions, or back-end calls where data consistency matters, “a closer POP” alone isn’t a win.

Core building blocks: Prefix, POP, and upstream

Think of a sound Anycast design as three layers:

Anycast prefix: Typically a /24 in IPv4 or a /48 in IPv6, the longest prefixes that most networks accept in the global routing table.
POP: At each POP, an edge (L4/L7) that binds this prefix to the service, plus a routing device.
Upstream/IX: The transits or internet exchanges where the POP runs BGP peering.
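
As a small illustration of the prefix layer, here is a sketch that checks whether a candidate prefix is short enough to be accepted widely. The function name and the MAX_PREFIX_LEN table are made up for this example, and the /24 and /48 limits are filtering conventions rather than protocol rules, so confirm them with your own upstreams.

```python
import ipaddress

# Longest prefix lengths commonly accepted in the global routing table.
# Assumption: your upstreams follow these conventions; they are not protocol limits.
MAX_PREFIX_LEN = {4: 24, 6: 48}


def is_globally_announceable(prefix: str) -> bool:
    """Return True if the prefix is no more specific than the usual filter limit."""
    # strict=True rejects inputs with host bits set, e.g. "192.0.2.1/24".
    net = ipaddress.ip_network(prefix, strict=True)
    return net.prefixlen <= MAX_PREFIX_LEN[net.version]


if __name__ == "__main__":
    for candidate in ("192.0.2.0/24", "192.0.2.0/25", "2001:db8::/48"):
        print(candidate, is_globally_announceable(candidate))
```

A check like this fits naturally into a pre-deployment lint step, before any route policy reaches a router.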

The critical questions for this setup:

In how many POPs will you announce the prefix?
Is there capacity asymmetry between POPs?
Is your upstream diversity and “path diversity” sufficient?
Which signals are needed for a POP to count as “fully healthy”?

Health signals: “BGP up” doesn’t mean healthy

The most common mistake: “If the BGP neighbor is up, the POP is healthy.” That leaves your edge open to partial failures:

The edge proxy is running but the back-end is unreachable
DNS is responding but authoritative zone sync is broken
The CPU is saturated or the disk is full, and latency has spiked
DDoS mitigation is engaged but blocking everything via false positives

Build health signals at two levels:

Data-plane health: Latency, error rate, and TCP handshake success measured from real traffic.
Control-plane health: BGP session, route policy, config drift, certificate/zone freshness.

The practical approach: tie the announcement to service health. A POP announces the prefix upstream only while its service is healthy, and withdraws it when it is not.
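
A minimal sketch of that gate, assuming a helper process that feeds announce/withdraw commands to a BGP speaker. All names and thresholds here are hypothetical. The command string imitates the text commands that route injectors such as ExaBGP read from a helper process; treat the exact syntax as an assumption and check your BGP speaker's documentation.

```python
from dataclasses import dataclass


@dataclass
class DataPlane:
    p95_latency_ms: float
    error_rate: float         # fraction of failed requests, 0.0 to 1.0
    handshake_success: float  # fraction of completed TCP handshakes, 0.0 to 1.0


@dataclass
class ControlPlane:
    bgp_session_up: bool
    config_in_sync: bool
    zone_age_s: float         # seconds since the last successful zone sync


def pop_is_healthy(dp: DataPlane, cp: ControlPlane,
                   max_latency_ms: float = 50.0,
                   max_error_rate: float = 0.02,
                   min_handshake: float = 0.98,
                   max_zone_age_s: float = 300.0) -> bool:
    """Healthy only if BOTH levels pass. Thresholds are illustrative defaults."""
    data_ok = (dp.p95_latency_ms <= max_latency_ms
               and dp.error_rate <= max_error_rate
               and dp.handshake_success >= min_handshake)
    control_ok = (cp.bgp_session_up
                  and cp.config_in_sync
                  and cp.zone_age_s <= max_zone_age_s)
    return data_ok and control_ok


def route_command(prefix: str, healthy: bool) -> str:
    """Build the command line for the BGP speaker (syntax is an assumption)."""
    verb = "announce" if healthy else "withdraw"
    return f"{verb} route {prefix} next-hop self"


if __name__ == "__main__":
    dp = DataPlane(p95_latency_ms=12.0, error_rate=0.30, handshake_success=0.99)
    cp = ControlPlane(bgp_session_up=True, config_in_sync=True, zone_age_s=20.0)
    # The BGP session is up, but the error rate is high, so the POP withdraws.
    print(route_command("192.0.2.0/24", pop_is_healthy(dp, cp)))
```

In practice you would also want damping (for example, several consecutive failed checks before withdrawing) so that a single bad sample does not cause a route flap.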

Anycast failover strategies (and their side effects)

Doing failover via “BGP withdraw” looks simple, but its impact is global. Three common patterns:

1) Hard withdraw (pull the prefix entirely)

Plus: The clearest signal; traffic flows to other POPs.
Minus: Global route churn. Traffic is disrupted while routes converge, and the POPs that absorb it start with cold caches.

2) De-preference (be reluctant via BGP attributes)

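E.g. AS-path prepending, MED tuning, or using communities to ask upstreams to lower the local-pref they assign to your route (local-pref is set inside each upstream’s network, so you can only influence it through the communities they offer).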

Plus: More controlled, gradual draining.
Minus: Upstream/policy dependency; not every part of the internet behaves the same way.

3) Partial withdraw / scope reduction

Pull from some upstreams, keep others.

Plus: Good for partial situations like “capacity saturation.”
Minus: Operational complexity goes up.
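
One way to keep the three patterns apart in automation is to make the choice explicit. The sketch below is only an illustration: the Action names, the draining flag, and the 0.85 saturation threshold are assumptions for this example, not values from a real deployment.

```python
from enum import Enum


class Action(Enum):
    ANNOUNCE = "announce"                  # normal operation
    DE_PREFER = "de-prefer"                # pattern 2: prepend / MED / communities
    PARTIAL_WITHDRAW = "partial-withdraw"  # pattern 3: pull from some upstreams
    HARD_WITHDRAW = "hard-withdraw"        # pattern 1: pull the prefix entirely


def choose_action(service_healthy: bool, utilization: float,
                  draining: bool = False,
                  saturation_threshold: float = 0.85) -> Action:
    """Pick the least disruptive action that still matches the POP's state."""
    if not service_healthy:
        # A broken service should not attract traffic at all.
        return Action.HARD_WITHDRAW
    if utilization >= saturation_threshold:
        # Healthy but saturated: shed part of the catchment, keep the rest.
        return Action.PARTIAL_WITHDRAW
    if draining:
        # Planned maintenance: drain gradually before any withdraw.
        return Action.DE_PREFER
    return Action.ANNOUNCE


if __name__ == "__main__":
    print(choose_action(service_healthy=True, utilization=0.92))
```

The ordering of the checks encodes a priority: correctness first, then capacity, then planned work.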

Anycast under DDoS: Strengths and pitfalls

Anycast’s biggest DDoS advantage is “spreading the attack out.” Instead of hitting a single POP, the attack fans out across many POPs that BGP picks, so each POP sees a smaller share of the load.

But there are two pitfalls:

The weakest POP becomes the target: POPs with low capacity, or POPs with thin upstreams, fold first.
Routing becomes an attack vector: An attacker can produce traffic from specific regions to turn certain POPs into “hotspots.”

So a design checklist for Anycast + DDoS:

Per-POP scrubbing/rate-limit capacity and instrumentation
Upstream diversity (no dependence on a single transit)
Blackhole communities (RTBH) and automation
Architecture that separates “clean” from “dirty” traffic (when feasible)
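
The “weakest POP” pitfall is easy to check with simple arithmetic. The following sketch assumes you can estimate what share of an attack each POP would attract and how much each POP can scrub; the POP names and all numbers are invented for illustration.

```python
def pops_over_capacity(attack_gbps: float,
                       share_by_pop: dict[str, float],
                       scrub_capacity_gbps: dict[str, float]) -> dict[str, float]:
    """Return the POPs whose share of the attack exceeds their scrubbing
    capacity, mapped to the excess in Gbps."""
    over = {}
    for pop, share in share_by_pop.items():
        load = attack_gbps * share
        capacity = scrub_capacity_gbps[pop]
        if load > capacity:
            over[pop] = round(load - capacity, 3)
    return over


if __name__ == "__main__":
    # Hypothetical three-POP deployment under a 100 Gbps attack.
    shares = {"fra": 0.5, "ist": 0.3, "sof": 0.2}
    capacity = {"fra": 80.0, "ist": 40.0, "sof": 10.0}
    print(pops_over_capacity(100.0, shares, capacity))
```

Even though the smallest POP receives the smallest share here, it is the only one that exceeds its capacity, which is exactly the case the checklist is meant to catch.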

Capacity engineering: If POPs aren’t equal, you need policy

If your POPs have different capacities (which is the case in most enterprises), “every POP announces the same prefix” doesn’t translate into load balancing. BGP doesn’t know about capacity.

Practical techniques:

Tiered POP: Big POPs sit as “prefer,” small ones as “overflow.”
Region steering: Use upstream communities to be less visible in some regions.
Graduated prepend: Increase prepend on the small POP, decrease it on the big one.

Before applying these policies, accept this: there is no “perfect balancing” in Anycast; your goal should be predictability.
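
As a starting point for graduated prepend, here is one possible heuristic: add one prepend for each halving of capacity relative to the largest POP, up to a cap. This rule, the function name, and the max_prepend default are assumptions for this sketch. Prepending is a blunt tool and will not produce traffic proportional to capacity, so treat the output as an initial setting to verify with measurement.

```python
import math


def prepend_counts(capacity_by_pop: dict[str, float],
                   max_prepend: int = 3) -> dict[str, int]:
    """Suggest an AS-path prepend count per POP from relative capacity."""
    if any(cap <= 0 for cap in capacity_by_pop.values()):
        raise ValueError("capacities must be positive")
    largest = max(capacity_by_pop.values())
    counts = {}
    for pop, cap in capacity_by_pop.items():
        # Number of times capacity halves between the largest POP and this one.
        halvings = math.floor(math.log2(largest / cap))
        counts[pop] = min(max_prepend, halvings)
    return counts


if __name__ == "__main__":
    # Hypothetical capacities in Gbps.
    print(prepend_counts({"fra": 100.0, "ist": 50.0, "sof": 10.0}))
```

The cap matters because very long prepends add little extra effect and can be filtered or ignored by some networks.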

Observability: What do I look at to measure Anycast?

In Anycast, the most valuable metric is not “the metric of a single POP”; it’s the global distribution.

In the field, I track these signals together:

Per POP: p50/p95 latency, error rate, conn rate, saturation (CPU/mem), upstream packet loss
Global: POP traffic share (%), POP-shift rate (churn), withdraw/de-prefer event count
For DNS: NXDOMAIN ratio, SERVFAIL, zone serial drift, resolver cache hit/miss signal (if you can capture it)
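
The two global signals, traffic share and POP-shift rate, can be computed from fairly simple inputs. This sketch assumes you have request counts per POP and, for a sample of clients, which POP served them in two consecutive measurement windows; the function and variable names are hypothetical.

```python
def traffic_share(requests_by_pop: dict[str, int]) -> dict[str, float]:
    """Percentage of total requests served by each POP."""
    total = sum(requests_by_pop.values())
    if total == 0:
        return {pop: 0.0 for pop in requests_by_pop}
    return {pop: 100.0 * n / total for pop, n in requests_by_pop.items()}


def pop_shift_rate(prev_pop_by_client: dict[str, str],
                   curr_pop_by_client: dict[str, str]) -> float:
    """Fraction of clients seen in both windows that changed POP."""
    common = prev_pop_by_client.keys() & curr_pop_by_client.keys()
    if not common:
        return 0.0
    moved = sum(1 for client in common
                if prev_pop_by_client[client] != curr_pop_by_client[client])
    return moved / len(common)


if __name__ == "__main__":
    print(traffic_share({"fra": 600, "ist": 300, "sof": 100}))
    prev = {"a": "fra", "b": "ist", "c": "sof", "d": "fra"}
    curr = {"a": "fra", "b": "fra", "c": "sof", "e": "ist"}
    print(pop_shift_rate(prev, curr))
```

A shift rate that rises with no withdraw or de-prefer event on your side usually points at a routing change somewhere upstream.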

Test approach: A “withdraw” scenario without touching prod

You can’t finish an Anycast design “at the desk”; controlled testing is required.

A sample test plan:

De-prefer the prefix on a single POP (prepend/community).
Measure whether the traffic distribution shifted as expected.
Hard-withdraw on the same POP and measure convergence time.
When bringing it back, measure the “cold start” effect (cache, connection ramp-up).
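
For the convergence step in the plan above, one rough way to put a number on it is to record per-POP request counts at fixed intervals and find when the withdrawn POP stops receiving traffic. The sample format, the function name, and the quiet_samples parameter are assumptions for this sketch.

```python
from typing import Optional


def convergence_seconds(samples: list[tuple[float, dict[str, int]]],
                        withdrawn_pop: str,
                        withdraw_ts: float,
                        quiet_samples: int = 3) -> Optional[float]:
    """Seconds from the withdraw until the POP first goes quiet and stays
    quiet for `quiet_samples` consecutive samples. Returns None if that
    never happens. `samples` is a time-sorted list of
    (timestamp_s, {pop: request_count})."""
    run = 0
    first_quiet_ts = None
    for ts, counts in samples:
        if ts < withdraw_ts:
            continue
        if counts.get(withdrawn_pop, 0) == 0:
            if run == 0:
                first_quiet_ts = ts
            run += 1
            if run >= quiet_samples:
                return first_quiet_ts - withdraw_ts
        else:
            # Traffic came back, so the earlier quiet period does not count.
            run = 0
            first_quiet_ts = None
    return None


if __name__ == "__main__":
    samples = [(0, {"sof": 50}), (10, {"sof": 40}), (20, {"sof": 5}),
               (30, {"sof": 0}), (40, {"sof": 0}), (50, {"sof": 0})]
    print(convergence_seconds(samples, "sof", withdraw_ts=10))
```

The resolution of the result is limited by the sampling interval, so sample more often than the convergence time you expect to see.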

Closing: Win Anycast through “operations,” not “the product”

Done right, Anycast gives edge services lower latency, resilience, and DDoS tolerance all at once. Done wrong, it produces a user experience that is “sometimes near, sometimes far, sometimes lost.”

In my view, a good Anycast design comes down to:

Measure health as “service is good,” not “BGP up”
Manage failover gradually (de-prefer + withdraw)
Make POP capacity asymmetry visible through routing policy
Test the DDoS scenario like it’s a normal day
