Mustafa Erbay
Technology · 12 min read

DDoS Scrubbing Center Design: GRE, BGP, and Failover

GRE tunnels, BGP signaling, capacity, and an operational runbook to keep the service up by diverting traffic to scrubbing during an attack.


The biggest misconception when DDoS comes up is this: “We’ll buy a device or service and turn it on during the attack.” In production the real picture is different. What needs to work during an attack isn’t a product; it’s the combination of routing, tunnels, capacity, observability, and communication.

The goal of a scrubbing center design is:

  • Get traffic to the service endpoints even under attack
  • Engage mitigation without producing breakage
  • Return safely to the prior equilibrium when disengaging

In this article I’m walking through one of the most practical patterns in an enterprise setup: routing via BGP + carrying clean traffic back via GRE.

1) Architecture: “Divert, scrub, carry back”

The simplest model:

  1. Attack traffic arrives from the internet at the edge
  2. The edge produces a “send this prefix to scrubbing” signal (BGP)
  3. The scrubbing center scrubs the traffic
  4. Clean traffic returns to the edge/service VLAN over a GRE tunnel

The advantages of this model:

  • Minimal touch on the application side
  • Cleaning traffic “in one place” (centralized control)
  • A runbook-friendly engage/disengage flow

The downsides:

  • Tunnel and MTU management (risk of fragmentation breakage)
  • Asymmetry and the risk of getting the return path wrong
  • Scrubbing capacity can become an “invisible” single point of failure

2) Design decisions: Start with operational questions

A scrubbing design isn’t done until the following questions are answered:

  • Which services will go to scrubbing? (everything, or just the critical prefixes?)
  • What is the “engage” threshold? (pps/bps/latency/5xx?)
  • What’s the failover plan? (scrubbing down → what then?)
  • What’s the most critical metric? (is clean traffic arriving, or are we just watching the attack?)
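
One way to make these answers concrete is to record them as data and evaluate them mechanically. The following is a minimal sketch, not a recommended policy: the class `ScrubbingPolicy`, the function `should_engage`, the field names, and every threshold value are hypothetical placeholders for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScrubbingPolicy:
    """Hypothetical record of the design answers; all values are placeholders."""
    prefixes: tuple[str, ...]   # which prefixes may be diverted to scrubbing
    engage_pps: int             # engage threshold in packets per second
    engage_bps: int             # engage threshold in bits per second
    engage_5xx_ratio: float     # engage threshold as a share of 5xx responses
    failover_action: str        # what to do when scrubbing itself is down


def should_engage(policy: ScrubbingPolicy, pps: int, bps: int,
                  ratio_5xx: float) -> list[str]:
    """Return the names of the thresholds that are currently exceeded."""
    reasons = []
    if pps >= policy.engage_pps:
        reasons.append("pps")
    if bps >= policy.engage_bps:
        reasons.append("bps")
    if ratio_5xx >= policy.engage_5xx_ratio:
        reasons.append("5xx")
    return reasons
```

An empty list means no threshold was crossed. Returning the reasons instead of a bare boolean keeps the decision explainable in the incident channel.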

3) GRE tunnel design: Where things break

3.1 MTU and fragmentation risk

GRE overhead lowers the effective MTU. Practical advice:

  • Set the MTU at the tunnel ends deliberately (e.g. 1476 for plain GRE over IPv4 on a 1500-byte path, or a lower value such as 1450 when the environment adds more encapsulation)
  • If PMTUD breaks, the symptom looks like “some clients work, some don’t”
  • On UDP services, fragmentation breaks more invisibly; test it
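
The MTU arithmetic above can be sketched as follows. This is a small illustration that assumes GRE over IPv4 with a 20-byte outer header and no IP options; the function names are made up for this example. The header sizes are the standard ones: 4 bytes for the base GRE header, plus 4 bytes for each of the optional checksum, key, and sequence-number fields.

```python
OUTER_IPV4_HEADER = 20   # outer IPv4 header without options
GRE_BASE_HEADER = 4      # base GRE header
GRE_OPTION_FIELD = 4     # checksum, key, and sequence number add 4 bytes each
INNER_IPV4_HEADER = 20   # inner IPv4 header without options
TCP_HEADER = 20          # TCP header without options


def gre_tunnel_mtu(path_mtu: int = 1500, use_key: bool = False,
                   use_checksum: bool = False,
                   use_sequence: bool = False) -> int:
    """Largest inner packet that fits in one outer packet on this path."""
    options = sum([use_key, use_checksum, use_sequence])
    overhead = OUTER_IPV4_HEADER + GRE_BASE_HEADER + options * GRE_OPTION_FIELD
    return path_mtu - overhead


def tcp_mss_for(tunnel_mtu: int) -> int:
    """TCP MSS that keeps a full-size segment within the tunnel MTU."""
    return tunnel_mtu - INNER_IPV4_HEADER - TCP_HEADER


def needs_fragmentation(inner_packet_size: int, tunnel_mtu: int) -> bool:
    """True if an inner packet of this size does not fit in the tunnel."""
    return inner_packet_size > tunnel_mtu
```

With the defaults, `gre_tunnel_mtu()` gives 1476 and `tcp_mss_for(1476)` gives 1436. A full 1500-byte packet from a client therefore does not fit, which is where the PMTUD dependency comes from.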

3.2 Asymmetric path

The most common mistake: inbound comes from scrubbing, but outbound leaves elsewhere (and stateful devices don’t like that).

So:

  • If you have stateful security devices (FW/IDS), design the return path up front
  • Measure state-table behavior “while scrubbing is engaged”
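
One possible way to look for asymmetry is to check which flows a stateful device sees in only one direction. The sketch below assumes you can export flow records from that device as `(src, sport, dst, dport, proto)` tuples; that record format and the function names are assumptions for illustration, not the output of any particular product.

```python
from collections import defaultdict


def canonical_key(src, sport, dst, dport, proto):
    """Direction-independent key: both directions of a flow map to the same key."""
    a, b = (src, sport), (dst, dport)
    return (proto, min(a, b), max(a, b))


def one_way_flows(records):
    """Return the keys of flows that were observed in a single direction only.

    `records` is an iterable of (src, sport, dst, dport, proto) tuples as seen
    by one stateful device (hypothetical export format).
    """
    senders = defaultdict(set)
    for src, sport, dst, dport, proto in records:
        senders[canonical_key(src, sport, dst, dport, proto)].add((src, sport))
    # A symmetric flow has two distinct senders; a one-way flow has only one.
    return {key for key, seen in senders.items() if len(seen) == 1}
```

A rising share of one-way flows while scrubbing is engaged suggests the return path bypasses the device. Some one-way flows are normal (dropped attack traffic, for example), so compare against a baseline instead of expecting zero.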

3.3 Tunnel count and scale

In multi-POP setups, a single tunnel looks “easy” but carries risk. A better approach:

  • A separate tunnel per POP, or at minimum regional tunnel pairs
  • When one POP goes down, the others shouldn’t choke on the traffic displaced from it; size scrubbing capacity for that case
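
The POP-loss concern can be checked with simple arithmetic. The sketch below assumes displaced traffic can be spread freely across the surviving POPs, which is a simplification; the function name, POP names, and capacity figures in the usage note are invented for illustration.

```python
def survives_pop_loss(pop_capacity_gbps: dict, expected_load_gbps: float) -> dict:
    """For each POP, report whether the remaining POPs could absorb the
    expected load if that POP failed.

    Simplifying assumption: load redistributes freely across survivors.
    """
    total = sum(pop_capacity_gbps.values())
    return {
        pop: (total - capacity) >= expected_load_gbps
        for pop, capacity in pop_capacity_gbps.items()
    }
```

For example, with hypothetical capacities `{"pop-a": 100, "pop-b": 100, "pop-c": 40}` and an expected load of 150 Gbps, losing `pop-c` is survivable but losing either of the larger POPs is not.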

4) BGP signaling: What will you steer with?

Two common methods for diverting traffic to scrubbing:

  1. Upstream community (preferred)
  2. Prefix announcement changes (via prepend/withdraw/route-map)

If communities are available, the runbook gets much cleaner: a “single-line” change diverts during the attack, and a single line returns it.
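
The “single-line” property can be modeled as adding and removing one community on an announcement. This is a vendor-neutral sketch, not router configuration: the community value `64500:100` is a made-up example using a documentation ASN, the real value comes from your upstream’s documentation, and the class and function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical divert community; use the value your upstream documents.
DIVERT_COMMUNITY = "64500:100"


@dataclass(frozen=True)
class Announcement:
    prefix: str
    communities: frozenset = frozenset()

    def __post_init__(self):
        # Raises ValueError if the prefix is malformed or has host bits set.
        ipaddress.ip_network(self.prefix)


def engage(a: Announcement) -> Announcement:
    """Return the announcement with the divert community added."""
    return Announcement(a.prefix, a.communities | {DIVERT_COMMUNITY})


def disengage(a: Announcement) -> Announcement:
    """Return the announcement with the divert community removed."""
    return Announcement(a.prefix, a.communities - {DIVERT_COMMUNITY})
```

Because both operations touch only one community, `disengage(engage(a))` returns the original announcement and leaves any other communities untouched. That symmetry is what makes rollback predictable.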

5) Runbook: Engage / disengage

5.1 Engage (during the attack)

  • Validate impact: did the service SLIs degrade? (timeout, 5xx, latency)
  • Classify the traffic: volumetric, L7, or mixed?
  • Confirm scrubbing capacity: current pps/bps + headroom
  • Apply the divert: community or policy change
  • Verify clean traffic: see packets/flows at the edge and in the service VLAN
  • Communicate: in the incident channel, share “mitigation active, validation metrics”
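
The engage steps above can be expressed as an ordered, gated checklist, so that the divert is never applied before impact and capacity are confirmed. This is a minimal sketch: `run_engage` and the step callables are hypothetical, and in practice each check would call your own monitoring or change tooling.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_engage(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, check in steps:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None
```

A usage example with stub checks, where the capacity check fails and the divert is therefore not applied:

```python
steps = [
    ("validate impact", lambda: True),
    ("classify traffic", lambda: True),
    ("confirm capacity", lambda: False),   # stub: not enough headroom
    ("apply divert", lambda: True),
    ("verify clean traffic", lambda: True),
    ("communicate", lambda: True),
]
completed, failed = run_engage(steps)
# completed == ["validate impact", "classify traffic"], failed == "confirm capacity"
```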

5.2 Disengage (calming down)

Disengage not because “the attack ended,” but because signals stabilized:

  • SLIs stable for 15–30 min
  • Drop ratio at scrubbing back to normal
  • Edge capacity and state tables normal
  • Rollback plan ready (so re-engage is fast if needed)
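
The disengage criteria can be checked the same way. The sketch below assumes one sample per minute and uses placeholder limits; the `Sample` fields, the default thresholds, and the function name are all hypothetical and need tuning for a real environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    minute: int              # minute index of the sample
    sli_ok: bool             # service SLIs within target
    drop_ratio: float        # share of traffic dropped at scrubbing
    state_table_util: float  # edge state-table utilization, 0.0 to 1.0


def ready_to_disengage(samples: List[Sample], window_min: int = 15,
                       max_drop_ratio: float = 0.05,
                       max_state_util: float = 0.7,
                       rollback_ready: bool = False) -> bool:
    """True only if every signal was healthy for the whole window
    and a rollback plan is in place. Assumes one sample per minute."""
    if not rollback_ready or not samples:
        return False
    latest = max(s.minute for s in samples)
    window = [s for s in samples if s.minute > latest - window_min]
    if len(window) < window_min:
        return False  # not enough history to call the signals stable
    return all(
        s.sli_ok
        and s.drop_ratio <= max_drop_ratio
        and s.state_table_util <= max_state_util
        for s in window
    )
```

Requiring a full window of healthy samples is deliberate: a short quiet period during an attack should not trigger a disengage.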

6) Observability: Don’t just measure the attack — measure clean traffic

Minimum dashboard:

  • bps/pps coming from the internet
  • bps/pps going into scrubbing
  • “clean” bps/pps coming out of scrubbing
  • bps/pps entering the service side at the edge (the actual truth)
  • Service SLI: latency/timeout/5xx
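
The counters above become more useful when compared with each other. The sketch below derives three ratios from them; the function name and the key names are made up for this example, and the inputs are assumed to be rates measured over the same interval.

```python
def traffic_summary(internet_bps: float, scrub_in_bps: float,
                    scrub_out_bps: float, edge_in_bps: float) -> dict:
    """Derive ratios from the four traffic measurements on the dashboard."""
    # Share of internet traffic that actually reaches scrubbing.
    diverted = scrub_in_bps / internet_bps if internet_bps else 0.0
    # Share of diverted traffic that scrubbing drops.
    dropped = 1 - scrub_out_bps / scrub_in_bps if scrub_in_bps else 0.0
    # Share of clean traffic that arrives at the service side of the edge.
    delivered = edge_in_bps / scrub_out_bps if scrub_out_bps else 0.0
    return {
        "divert_ratio": diverted,
        "drop_ratio": dropped,
        "return_delivery_ratio": delivered,
    }
```

A `return_delivery_ratio` well below 1.0 means clean traffic is being lost between scrubbing and the service, for example in the tunnel. The attack counters alone would not show that.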

Success in scrubbing center design doesn’t come from building the tunnel; it comes from operating the engage/disengage discipline. Organizations can’t always prevent the attack, but with the right design they can make service behavior under attack manageable.


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
