The biggest misconception about DDoS protection is this: “We’ll buy a device or service and turn it on during the attack.” Production looks different. What has to work during an attack isn’t a product; it’s the combination of routing, tunnels, capacity, observability, and communication.
A scrubbing center design has three goals:
- Get traffic to the service endpoints even under attack
- Engage mitigation without producing breakage
- Return safely to the prior equilibrium when disengaging
In this article I walk through one of the most practical patterns in an enterprise setup: diverting traffic via BGP and carrying clean traffic back over GRE.
1) Architecture: “Divert, scrub, carry back”
The simplest model:
- Attack traffic arrives from the internet at the edge
- The edge produces a “send this prefix to scrubbing” signal (BGP)
- The scrubbing center scrubs the traffic
- Clean traffic returns to the edge/service VLAN over a GRE tunnel
The advantages of this model:
- Minimal touch on the application side
- Cleaning traffic “in one place” (centralized control)
- A runbook-friendly engage/disengage flow
The downsides:
- Tunnel and MTU management (risk of fragmentation breakage)
- Asymmetry and the risk of getting the return path wrong
- Scrubbing capacity can become an “invisible” single point of failure
2) Design decisions: Start with operational questions
A scrubbing design isn’t done until the following questions are answered:
- Which services will go to scrubbing? (everything, or just the critical prefixes?)
- What is the “engage” threshold? (pps/bps/latency/5xx?)
- What’s the failover plan? (scrubbing down → what then?)
- What’s the most critical metric? (is clean traffic arriving, or are we just watching the attack?)
3) GRE tunnel design: Where things break
3.1 MTU and fragmentation risk
GRE overhead lowers the effective MTU. Practical advice:
- Set the MTU at both tunnel ends deliberately (e.g. 1476 for plain GRE over IPv4 on a 1500-byte path, or lower, such as 1450, when the path adds further encapsulation)
- If PMTUD breaks, the symptom looks like “some clients work, some don’t”
- On UDP services, fragmentation breaks more invisibly; test it
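The MTU arithmetic can be sketched in a few lines of Python. This assumes GRE over IPv4 with a 20-byte outer header and the 4-byte base GRE header; each optional GRE field (key, sequence number, checksum) adds 4 more bytes. The function names are illustrative, not part of any tool.

```python
# Header sizes for GRE over IPv4 (outer header without IP options).
OUTER_IPV4_HEADER = 20
GRE_BASE_HEADER = 4
GRE_OPTIONAL_FIELD = 4  # key, sequence number, and checksum each add 4 bytes


def tunnel_mtu(path_mtu: int, optional_fields: int = 0) -> int:
    """Largest inner packet that fits in one outer packet without fragmentation."""
    overhead = OUTER_IPV4_HEADER + GRE_BASE_HEADER + optional_fields * GRE_OPTIONAL_FIELD
    return path_mtu - overhead


def tcp_mss(mtu: int) -> int:
    """TCP payload size for an inner IPv4 packet with no IP or TCP options."""
    return mtu - 20 - 20


print(tunnel_mtu(1500))            # 1476: plain GRE
print(tunnel_mtu(1500, 1))         # 1472: GRE with a key
print(tcp_mss(tunnel_mtu(1500)))   # 1436: MSS that fits the plain GRE tunnel
```

If the environment stacks more encapsulation on top (IPsec, for example), the overhead grows and the usable MTU shrinks accordingly, which is one reason values such as 1450 show up in practice.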
3.2 Asymmetric path
The most common mistake: inbound traffic arrives through scrubbing, but outbound traffic leaves by a different path, and stateful devices that see only one direction of a flow may drop it.
So:
- If you have stateful security devices (FW/IDS), design the return path up front
- Measure state-table behavior “while scrubbing is engaged”
3.3 Tunnel count and scale
In multi-POP setups, a single tunnel looks “easy” but carries risk. A better approach:
- A separate tunnel per POP, or at minimum regional tunnel pairs
- When one POP goes down, the others shouldn’t choke under the traffic displaced onto them
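A rough way to check the “one POP down” case is an N-1 capacity calculation. The sketch below assumes displaced traffic can spread freely across the surviving POPs, which is a simplification (real catchments depend on routing); POP names and capacities are hypothetical.

```python
def survives_pop_loss(pop_capacity_gbps: dict[str, float],
                      peak_attack_gbps: float) -> dict[str, bool]:
    """For each POP, report whether the remaining POPs can absorb the peak if it fails."""
    total = sum(pop_capacity_gbps.values())
    return {
        pop: (total - capacity) >= peak_attack_gbps
        for pop, capacity in pop_capacity_gbps.items()
    }


# Hypothetical capacities in Gbps.
pops = {"pop-a": 100.0, "pop-b": 100.0, "pop-c": 50.0}
print(survives_pop_loss(pops, peak_attack_gbps=160.0))
# {'pop-a': False, 'pop-b': False, 'pop-c': True}
```

In this example, losing either large POP leaves 150 Gbps against a 160 Gbps peak, so the design only tolerates the loss of the smallest site.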
4) BGP signaling: What will you steer with?
Two common methods for diverting traffic to scrubbing:
- Upstream community (preferred)
- Prefix announcement changes (prepend, withdraw, or a route-map change)
If communities are available, the runbook gets much cleaner: a “single-line” change diverts during the attack, and a single line returns it.
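The “single line in, single line out” property can be modeled as a pair of pure functions over the announcement. This is only a model of the intent, not router configuration: the community value `65000:100` and the `Announcement` type are hypothetical, and the real divert community comes from the upstream or scrubbing provider’s documentation.

```python
from dataclasses import dataclass, replace

# Hypothetical value; the real one is defined by the upstream/scrubbing provider.
DIVERT_COMMUNITY = "65000:100"


@dataclass(frozen=True)
class Announcement:
    prefix: str
    communities: frozenset[str] = frozenset()


def engage(a: Announcement) -> Announcement:
    """Tag the prefix so the upstream steers it to scrubbing."""
    return replace(a, communities=a.communities | {DIVERT_COMMUNITY})


def disengage(a: Announcement) -> Announcement:
    """Remove only the divert tag, leaving every other attribute untouched."""
    return replace(a, communities=a.communities - {DIVERT_COMMUNITY})


baseline = Announcement("203.0.113.0/24", frozenset({"65000:200"}))
diverted = engage(baseline)
assert DIVERT_COMMUNITY in diverted.communities
assert disengage(diverted) == baseline  # rollback restores the exact baseline
```

The useful property is the last assertion: disengaging restores exactly the prior state, which is harder to guarantee when the divert is done by editing prepends or withdrawing prefixes by hand.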
5) Runbook: Engage / disengage
5.1 Engage (during the attack)
- Validate impact: did the service SLIs degrade? (timeout, 5xx, latency)
- Classify the traffic: volumetric, L7, or mixed?
- Confirm scrubbing capacity: current pps/bps + headroom
- Apply the divert: community or policy change
- Verify clean traffic: see packets/flows at the edge and in the service VLAN
- Communicate: in the incident channel, share “mitigation active, validation metrics”
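The first checks in the engage flow can be written down as a small decision function, so the call during an incident is less of a judgment made under stress. This is a sketch under stated assumptions: the field names, the 20% headroom reserve, and the wording of the outcomes are illustrative, not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class EngageInputs:
    sli_degraded: bool           # timeouts, 5xx, or latency outside the SLO
    attack_gbps: float           # current attack volume in Gbps
    attack_mpps: float           # current attack rate in Mpps
    scrub_capacity_gbps: float
    scrub_capacity_mpps: float
    headroom: float = 0.2        # assumed reserve: keep 20% of capacity free


def engage_decision(x: EngageInputs) -> tuple[bool, str]:
    """Return (divert?, next step) following the runbook order."""
    if not x.sli_degraded:
        return False, "no service impact; keep observing"
    usable_gbps = x.scrub_capacity_gbps * (1 - x.headroom)
    usable_mpps = x.scrub_capacity_mpps * (1 - x.headroom)
    if x.attack_gbps > usable_gbps or x.attack_mpps > usable_mpps:
        return False, "attack exceeds scrubbing headroom; go to the failover plan"
    return True, "apply divert, then verify clean traffic at the edge"


print(engage_decision(EngageInputs(True, 40.0, 10.0, 100.0, 50.0)))
# (True, 'apply divert, then verify clean traffic at the edge')
```

Checking both bps and pps matters because a small-packet flood can exhaust packet-rate capacity long before it fills the links.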
5.2 Disengage (calming down)
Disengage not because “the attack ended,” but because signals stabilized:
- SLIs stable for 15–30 min
- Drop ratio at scrubbing back to normal
- Edge capacity and state tables normal
- Rollback plan ready (so re-engage is fast if needed)
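The disengage conditions above can also be expressed as a single check. The sketch assumes one SLI sample per minute (newest last) and a hypothetical tolerance for how close the drop ratio must be to its baseline; both are parameters to tune, not recommendations.

```python
def ready_to_disengage(
    sli_ok_per_minute: list[bool],
    drop_ratio: float,
    baseline_drop_ratio: float,
    edge_normal: bool,
    rollback_ready: bool,
    stable_minutes: int = 15,
    drop_tolerance: float = 0.05,
) -> bool:
    """True only when every disengage condition holds at the same time."""
    window = sli_ok_per_minute[-stable_minutes:]
    # Require a full window of healthy samples, not just "healthy so far".
    sli_stable = len(window) == stable_minutes and all(window)
    drop_normal = drop_ratio <= baseline_drop_ratio + drop_tolerance
    return sli_stable and drop_normal and edge_normal and rollback_ready


print(ready_to_disengage([True] * 15, 0.02, 0.01, True, True))            # True
print(ready_to_disengage([True] * 14 + [False], 0.02, 0.01, True, True))  # False
```

Requiring all conditions together keeps a quiet attack graph from triggering a disengage while the state tables or the rollback path are not yet ready.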
6) Observability: Don’t just measure the attack; measure clean traffic
Minimum dashboard:
- bps/pps coming from the internet
- bps/pps going into scrubbing
- “clean” bps/pps coming out of scrubbing
- bps/pps entering the service side at the edge (the actual truth)
- Service SLI: latency/timeout/5xx
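The four traffic counters become more useful when reduced to a few ratios. A minimal sketch, assuming all four values are sampled over the same interval and in the same unit; the metric names are illustrative.

```python
def scrub_health(internet_bps: float, scrub_in_bps: float,
                 scrub_out_bps: float, edge_service_bps: float) -> dict[str, float]:
    """Derive dashboard ratios from the four traffic counters."""

    def ratio(numerator: float, denominator: float) -> float:
        # Avoid division by zero when a counter is idle.
        return numerator / denominator if denominator else 0.0

    return {
        # Share of internet traffic actually reaching scrubbing.
        "divert_ratio": ratio(scrub_in_bps, internet_bps),
        # Share of diverted traffic that scrubbing drops.
        "drop_ratio": 1.0 - ratio(scrub_out_bps, scrub_in_bps) if scrub_in_bps else 0.0,
        # Clean traffic lost between scrubbing and the service side (tunnel/MTU trouble).
        "return_loss": 1.0 - ratio(edge_service_bps, scrub_out_bps) if scrub_out_bps else 0.0,
    }


print(scrub_health(100e9, 100e9, 20e9, 19e9))
```

A `return_loss` that rises while `drop_ratio` stays flat points at the GRE return path rather than at the attack itself.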
Success in scrubbing center design doesn’t come from building the tunnel; it comes from disciplined engage/disengage operations. Organizations can’t always prevent an attack, but with the right design they can keep service behavior under attack manageable.