DDoS Scrubbing Center Design: GRE, BGP, and Failover

The biggest misconception about DDoS defense is this: “We’ll buy a device or service and turn it on during the attack.” In production, the real picture is different. What needs to work during an attack isn’t a product; it’s the combination of routing, tunnels, capacity, observability, and communication.

A scrubbing center design has three goals:

Get traffic to the service endpoints even under attack
Engage mitigation without producing breakage
Return safely to the prior equilibrium when disengaging

In this article I’m walking through one of the most practical patterns in an enterprise setup: diverting traffic via BGP and carrying clean traffic back over GRE.

1) Architecture: “Divert, scrub, carry back”

The simplest model:

Attack traffic arrives from the internet at the edge
The edge produces a “send this prefix to scrubbing” signal (BGP)
The scrubbing center scrubs the traffic
Clean traffic returns to the edge/service VLAN over a GRE tunnel

The advantages of this model:

Minimal touch on the application side
Cleaning traffic “in one place” (centralized control)
A runbook-friendly engage/disengage flow

The downsides:

Tunnel and MTU management (risk of fragmentation breakage)
Asymmetry and the risk of getting the return path wrong
Scrubbing capacity can become an “invisible single point of failure”

2) Design decisions: Start with operational questions

A scrubbing design isn’t done until the following questions are answered:

Which services will go to scrubbing? (everything, or just the critical prefixes?)
What is the “engage” threshold? (pps/bps/latency/5xx?)
What’s the failover plan? (scrubbing down → what then?)
What’s the most critical metric? (is clean traffic arriving, or are we just watching the attack?)

3) GRE tunnel design: Where things break

3.1 MTU and fragmentation risk

GRE overhead lowers the effective MTU. Practical advice:

Set the MTU at both tunnel ends deliberately (e.g. 1476 for plain GRE over IPv4 on a 1500-byte path, or a more conservative 1450, depending on the environment)
If PMTUD breaks, the symptom looks like “some clients work, some don’t”
On UDP services, fragmentation failures are harder to see; test them explicitly
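
To make the MTU numbers concrete, here is a minimal sketch of the arithmetic, assuming an IPv4 outer header with no options, a 1500-byte path, and TCP without options. The function names are mine, and the optional 4-byte GRE key is included only to show how the overhead grows.

```python
# Header sizes in bytes (IPv4 and TCP without options, basic GRE header).
IPV4_HEADER = 20
GRE_BASE_HEADER = 4
GRE_KEY_FIELD = 4  # present only when a GRE key is configured
TCP_HEADER = 20


def gre_tunnel_mtu(path_mtu: int = 1500, use_key: bool = False) -> int:
    """Largest inner packet that fits in one outer packet without fragmentation."""
    overhead = IPV4_HEADER + GRE_BASE_HEADER + (GRE_KEY_FIELD if use_key else 0)
    return path_mtu - overhead


def tcp_mss_for(tunnel_mtu: int) -> int:
    """TCP MSS that keeps an inner IPv4/TCP segment within the tunnel MTU."""
    return tunnel_mtu - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    for use_key in (False, True):
        mtu = gre_tunnel_mtu(1500, use_key)
        print(f"key={use_key}: tunnel MTU {mtu}, TCP MSS {tcp_mss_for(mtu)}")
```

On a 1500-byte path this gives a tunnel MTU of 1476 (1472 with a key) and a TCP MSS of 1436 (1432 with a key). If the path between the scrubbing center and your edge is smaller than 1500, the same subtraction applies from that lower number.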

3.2 Asymmetric path

The most common mistake: inbound traffic arrives via scrubbing, but outbound leaves by a different path, and stateful devices that see only one direction of a flow tend to drop it.

So:

If you have stateful security devices (FW/IDS), design the return path up front
Measure state-table behavior “while scrubbing is engaged”

3.3 Tunnel count and scale

In multi-POP setups, a single tunnel looks “easy” but carries risk. A better approach:

A separate tunnel per POP, or at minimum regional tunnel pairs
When one POP goes down, the others shouldn’t choke under the traffic displaced onto them
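
The "one POP down" question can be checked on paper before an attack does it for you. The sketch below is a rough N-1 check with hypothetical POP names and numbers. It assumes the displaced load can be rebalanced freely across the surviving POPs, which is optimistic; real redistribution depends on routing and is usually less even.

```python
from typing import Dict


def survives_single_pop_loss(capacity_gbps: Dict[str, float],
                             load_gbps: Dict[str, float]) -> Dict[str, bool]:
    """For each POP, report whether the remaining POPs can carry the total load.

    Optimistic assumption: load can be rebalanced freely across survivors.
    """
    total_load = sum(load_gbps.values())
    result = {}
    for failed in capacity_gbps:
        surviving_capacity = sum(
            cap for pop, cap in capacity_gbps.items() if pop != failed
        )
        result[failed] = total_load <= surviving_capacity
    return result


if __name__ == "__main__":
    # Hypothetical scrubbing capacity and attack-time load per POP, in Gbps.
    capacity = {"pop-a": 100.0, "pop-b": 100.0, "pop-c": 40.0}
    load = {"pop-a": 80.0, "pop-b": 70.0, "pop-c": 20.0}
    for pop, ok in survives_single_pop_loss(capacity, load).items():
        print(f"lose {pop}: {'ok' if ok else 'OVERLOAD'}")
```

In this example, losing either large POP leaves 140 Gbps of capacity for 170 Gbps of load, so the design only tolerates the loss of the smallest POP.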

4) BGP signaling: What will you steer with?

Two common methods for diverting traffic to scrubbing:

Upstream community (preferred)
Prefix announcement changes (via prepend/withdraw/route-map)

If communities are available, the runbook gets much cleaner: a “single-line” change diverts during the attack, and a single line returns it.
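
As a rough illustration of why the community approach keeps the runbook small, the sketch below models an announcement as a prefix plus a set of communities. Everything in it is hypothetical: the community value 64500:100 (64500 is an ASN reserved for documentation), the Announcement class, and the engage/disengage helpers. The real community is defined by your upstream or scrubbing provider, and the real change happens in router policy, not in Python.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical divert community; the real value comes from the provider.
DIVERT_COMMUNITY = "64500:100"


@dataclass(frozen=True)
class Announcement:
    prefix: str
    communities: FrozenSet[str] = frozenset()


def engage(a: Announcement) -> Announcement:
    """Return the announcement with the divert community attached."""
    return Announcement(a.prefix, a.communities | {DIVERT_COMMUNITY})


def disengage(a: Announcement) -> Announcement:
    """Return the announcement with the divert community removed."""
    return Announcement(a.prefix, a.communities - {DIVERT_COMMUNITY})


if __name__ == "__main__":
    normal = Announcement("203.0.113.0/24")  # documentation prefix
    diverted = engage(normal)
    print(sorted(diverted.communities))
    print(disengage(diverted) == normal)
```

The point of the model is that engage and disengage are exact inverses and both are idempotent: the prefix itself never changes, so returning to the prior state does not depend on remembering what was prepended or withdrawn.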

5) Runbook: Engage / disengage

5.1 Engage (during the attack)

Validate impact: did the service SLIs degrade? (timeout, 5xx, latency)
Classify the traffic: volumetric, L7, or mixed?
Confirm scrubbing capacity: current pps/bps + headroom
Apply the divert: community or policy change
Verify clean traffic: see packets/flows at the edge and in the service VLAN
Communicate: in the incident channel, share “mitigation active, validation metrics”
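
The engage steps above are ordered on purpose: each one gates the next. A minimal sketch of that gating follows, assuming each step can be expressed as a check that returns true or false. The step names and the lambda checks are placeholders; in practice each check would query your monitoring or require a human confirmation.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_engage(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (completed, log), where log records each step attempted.
    """
    log: List[str] = []
    for name, check in steps:
        ok = check()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log


if __name__ == "__main__":
    # Placeholder checks; replace with real queries or manual confirmations.
    steps: List[Step] = [
        ("validate impact", lambda: True),
        ("classify traffic", lambda: True),
        ("confirm scrubbing capacity", lambda: False),
        ("apply divert", lambda: True),
        ("verify clean traffic", lambda: True),
    ]
    completed, log = run_engage(steps)
    print("\n".join(log))
    print("engaged" if completed else "stopped before divert")
```

With the capacity check failing, the run stops before the divert is applied, which is the behavior you want: diverting into a scrubbing center without headroom moves the outage rather than removing it.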

5.2 Disengage (calming down)

Disengage not because “the attack ended,” but because signals stabilized:

SLIs stable for 15–30 min
Drop ratio at scrubbing back to normal
Edge capacity and state tables normal
Rollback plan ready (so re-engage is fast if needed)
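
The disengage criteria can be turned into a simple gate. The sketch below assumes one sample per minute and uses hypothetical thresholds (a 2% baseline drop ratio, 70% state-table utilization); only the 15-minute window comes from the list above. The field and function names are my own.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MinuteSample:
    sli_ok: bool             # service SLIs within target for this minute
    drop_ratio: float        # fraction of traffic dropped at scrubbing (0.0-1.0)
    state_table_util: float  # edge state-table utilization (0.0-1.0)


def safe_to_disengage(samples: Sequence[MinuteSample],
                      rollback_ready: bool,
                      window_min: int = 15,
                      baseline_drop_ratio: float = 0.02,  # hypothetical
                      max_state_util: float = 0.70) -> bool:  # hypothetical
    """True only if every one of the last `window_min` samples is calm."""
    if not rollback_ready or len(samples) < window_min:
        return False
    recent = samples[-window_min:]
    return all(
        s.sli_ok
        and s.drop_ratio <= baseline_drop_ratio
        and s.state_table_util <= max_state_util
        for s in recent
    )


if __name__ == "__main__":
    calm = MinuteSample(sli_ok=True, drop_ratio=0.01, state_table_util=0.4)
    noisy = MinuteSample(sli_ok=True, drop_ratio=0.60, state_table_util=0.4)
    print(safe_to_disengage([noisy] * 5 + [calm] * 15, rollback_ready=True))
    print(safe_to_disengage([calm] * 14 + [noisy], rollback_ready=True))
```

A single noisy minute inside the window resets the decision, and a missing rollback plan blocks it outright, so the gate errs toward staying engaged.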

6) Observability: Don’t just measure the attack — measure clean traffic

Minimum dashboard:

bps/pps coming from the internet
bps/pps going into scrubbing
“clean” bps/pps coming out of scrubbing
bps/pps entering the service side at the edge (the actual truth)
Service SLI: latency/timeout/5xx
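
Two ratios derived from these counters tell most of the story. The sketch below is one way to compute them, with hypothetical function names and example numbers: the drop ratio shows how much the scrubbing center is removing, and the delivery ratio shows whether what comes out of scrubbing actually reaches the service side.

```python
def scrub_drop_ratio(into_scrubbing_bps: float, clean_out_bps: float) -> float:
    """Fraction of traffic entering scrubbing that is dropped there."""
    if into_scrubbing_bps <= 0:
        return 0.0
    return 1.0 - clean_out_bps / into_scrubbing_bps


def delivery_ratio(clean_out_bps: float, service_side_bps: float) -> float:
    """Fraction of clean traffic that arrives on the service side of the edge."""
    if clean_out_bps <= 0:
        return 0.0
    return service_side_bps / clean_out_bps


if __name__ == "__main__":
    # Hypothetical readings during an attack, in bits per second.
    into_scrubbing = 40e9
    clean_out = 2e9
    service_side = 1.9e9
    print(f"drop ratio: {scrub_drop_ratio(into_scrubbing, clean_out):.2f}")
    print(f"delivery ratio: {delivery_ratio(clean_out, service_side):.2f}")
```

A high drop ratio with a delivery ratio near 1 means mitigation is working. A delivery ratio that falls while the drop ratio looks healthy points at the return path (tunnel, MTU, or a stateful device), not at the attack.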

Success in scrubbing center design doesn’t come from building the tunnel; it comes from operating the engage/disengage discipline. Organizations can’t always prevent an attack, but with the right design they can make service behavior under attack manageable.
