The biggest mistake in DDoS response is treating the technical fix as “a single move.” In practice, a good response combines a decision tree, verification, and rollback steps. Designed correctly, RTBH (Remote Triggered Black Hole) and BGP FlowSpec take effect quickly during an event; designed incorrectly, they cut the wrong traffic and produce a second incident.
Rather than “the big ISP-level story,” this post focuses on a practical runbook approach used in the field at the corporate network and edge layer.
Prerequisite: topology and ownership boundaries must be clear
Before talking about RTBH and FlowSpec, the answers to three questions must be on paper:
- Who is speaking BGP? (Edge router, transit/peering device, or cloud router?)
- Where is blackholed traffic dropped? (Null route inside the device, scrubbing center, or upstream blackhole?)
- Who decides? (NOC, NetOps, SecOps, Incident Commander?)
If these answers are not crisp, “who pushes the command?” turns into a debate during the event and burns time.
Decision tree: RTBH or FlowSpec?
The most practical split in the field:
- If the target service is going down completely and you have reached the point where shutting it down is the better option → RTBH
- If part of the traffic is bad and can be filtered out → FlowSpec
Quick metrics to inform that call:
| Signal | Favors RTBH | Favors FlowSpec |
|---|---|---|
| L7 fully collapsed | Yes | No |
| Attack on a single target IP | Yes | Yes |
| Attack on specific port/proto | Partly | Yes |
| False-positive risk | Low | Medium/High |
| Application tolerance | “Shut it” acceptable | “Stay up” target |
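
The split above can be written down as a small function so that the decision is made in advance rather than argued during the event. This is a minimal sketch, not a definitive policy: the `Triage` fields, the `Action` names, and the `OTHER` fallback are illustrative assumptions, and a real decision tree would use more of the signals from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RTBH = "rtbh"
    FLOWSPEC = "flowspec"
    OTHER = "other"  # assumption: anything that needs a human decision


@dataclass(frozen=True)
class Triage:
    """Hypothetical triage summary filled in by the on-call engineer."""
    service_collapsed: bool     # target is effectively down already
    shutdown_acceptable: bool   # owner accepts "shut it" for this target
    attack_is_filterable: bool  # bad traffic matches a port/proto pattern


def choose_mitigation(t: Triage) -> Action:
    """Map triage signals to a mitigation, following the split above."""
    # Service is sinking and "shut it down" is acceptable: blackhole it.
    if t.service_collapsed and t.shutdown_acceptable:
        return Action.RTBH
    # Only part of the traffic is bad and it can be matched: filter it.
    if t.attack_is_filterable:
        return Action.FLOWSPEC
    # Neither condition holds: hand the decision to a human.
    return Action.OTHER
```

Keeping the fallback explicit means an unclear situation escalates instead of silently defaulting to one of the two tools.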
RTBH: minimum safe usage template
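The point of RTBH is to advertise a route for the target prefix with a “blackhole next-hop” so that the traffic is dropped upstream, before it reaches your links. This drops all traffic to the prefix, legitimate included, which is why RTBH fits the “shut it down” case. I recommend three controls: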
- Trigger only on a specific community
- Accept only specific prefix sizes (narrow targets like /32)
- Set an expiry (a maximum duration) on every blackhole as the standard operational rollback
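
The three controls can be enforced by a pre-flight check that runs before any route is announced. The sketch below is an assumption-laden illustration: `validate_rtbh_request` and `MAX_HOLD` are hypothetical names, the one-hour limit is an arbitrary example, and the community shown is the well-known BLACKHOLE value from RFC 7999, which your upstream may replace with its own.

```python
import ipaddress
from datetime import timedelta

# Well-known BLACKHOLE community (RFC 7999); assumption: your upstream honors it.
TRIGGER_COMMUNITY = "65535:666"
# Assumption: local policy caps a blackhole at one hour.
MAX_HOLD = timedelta(hours=1)


def validate_rtbh_request(prefix: str, community: str, hold: timedelta) -> list[str]:
    """Return a list of problems; an empty list means the request may proceed."""
    problems = []
    # Control 1: trigger only on the agreed community.
    if community != TRIGGER_COMMUNITY:
        problems.append(f"unexpected community: {community}")
    try:
        net = ipaddress.ip_network(prefix)
    except ValueError:
        return problems + [f"invalid prefix: {prefix}"]
    # Control 2: accept only host routes (/32 for IPv4, /128 for IPv6).
    if net.prefixlen != net.max_prefixlen:
        problems.append(f"prefix too wide: /{net.prefixlen}")
    # Control 3: every blackhole carries a bounded duration.
    if not timedelta(0) < hold <= MAX_HOLD:
        problems.append(f"hold time out of range: {hold}")
    return problems
```

For example, `validate_rtbh_request("203.0.113.0/24", "65535:666", timedelta(minutes=30))` reports the prefix as too wide, so a typo cannot blackhole a whole subnet.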
Verification steps
Verification after RTBH is not just “traffic dropped”:
- Verify on the edge router that the relevant prefix points to the blackhole
- Verify on the upstream/IX side that the route propagated (looking glass, if available)
- Measure that CPU/conntrack/interrupt pressure on the target service has actually dropped
- Label the resulting alarm storm in monitoring as “expected”; do not silence it
FlowSpec: surgical filtering, surgical risk
FlowSpec is powerful because it lets you write filters on fields such as port, protocol, and TCP flags. The risk is that a wrong rule cuts production traffic too.
Two safe usage patterns I rely on in the field:
- Rate-limit (slow down instead of drop)
- Only a narrow match (single target IP + single port + short duration)
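
One way to hold rules to these two patterns is a review step that rejects anything that is neither a rate-limit nor a narrow match. The sketch below assumes a hypothetical `FlowRule` record and an example 30-minute ceiling; it is not a FlowSpec encoder and does not talk to any router.

```python
import ipaddress
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Assumption: "short duration" means at most 30 minutes in this example.
MAX_DURATION = timedelta(minutes=30)


@dataclass(frozen=True)
class FlowRule:
    """Hypothetical description of a FlowSpec rule before it is pushed."""
    dst_prefix: str
    dst_port: Optional[int]        # None means any port
    rate_limit_bps: Optional[int]  # None means the action is drop
    duration: timedelta


def is_narrow(rule: FlowRule) -> bool:
    """Single target IP + single port + short duration."""
    net = ipaddress.ip_network(rule.dst_prefix)
    single_ip = net.prefixlen == net.max_prefixlen
    single_port = rule.dst_port is not None
    return single_ip and single_port and rule.duration <= MAX_DURATION


def follows_safe_pattern(rule: FlowRule) -> bool:
    """Accept a rule if it slows traffic down or if its match is narrow."""
    return rule.rate_limit_bps is not None or is_narrow(rule)
```

A drop rule on a /24 with no port fails both checks, which is exactly the kind of rule that tends to cut production traffic.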
Verification steps
After applying FlowSpec, watch the following two groups of metrics together:
- Service metrics: error rate, latency, saturation
- Network metrics: PPS/BPS drop, drop counters, policer counters
If the network metrics drop but the service metrics do not recover, you are intervening at the wrong layer (for example, the attack is at L7, in the application).
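
Reading the two groups of metrics together can be reduced to a small comparison. The sketch below is illustrative only: the function names, the use of PPS and error rate as the representative metrics, and the 50% threshold are assumptions, and the `NOT_MATCHING` verdict is an inference added here for the case where the filter removes little traffic.

```python
from enum import Enum


class Verdict(Enum):
    EFFECTIVE = "network and service both improved"
    WRONG_LAYER = "network improved, service did not"
    NOT_MATCHING = "filter is not removing traffic"  # assumption, not from the text


def relative_drop(before: float, after: float) -> float:
    """Fraction by which a value fell; 0.0 if it did not fall."""
    if before <= 0:
        return 0.0
    return max(0.0, (before - after) / before)


def assess(pps_before: float, pps_after: float,
           err_before: float, err_after: float,
           min_drop: float = 0.5) -> Verdict:
    """Compare a network metric (PPS) with a service metric (error rate)."""
    network_ok = relative_drop(pps_before, pps_after) >= min_drop
    service_ok = relative_drop(err_before, err_after) >= min_drop
    if not network_ok:
        return Verdict.NOT_MATCHING
    return Verdict.EFFECTIVE if service_ok else Verdict.WRONG_LAYER
```

The useful output is `WRONG_LAYER`: it tells the on-call engineer to stop tuning the network rule and look at the application instead.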
Operational runbook: step by step
During the event, “who does what” must be short and clear:
- Triage (5 min): attack vector, target(s), impact (SLO), decision (RTBH/FlowSpec/other)
- Change record (2 min): who, when, which rule/prefix, target duration
- Apply (1–3 min): push the rule/prefix
- Verify (5 min): service + network metrics
- Rollback (planned): remove when the duration is up; collect evidence for the postmortem
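
The change record and the planned rollback fit together naturally: if the record carries the target duration, the rollback time can be computed instead of remembered. A minimal sketch, assuming a hypothetical `ChangeRecord` structure whose fields mirror the list above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical record written before the rule or prefix is pushed."""
    operator: str                # who
    applied_at: datetime         # when
    tool: str                    # "rtbh" or "flowspec"
    target: str                  # which rule/prefix
    planned_duration: timedelta  # target duration

    @property
    def rollback_due(self) -> datetime:
        """Moment at which the mitigation should be removed."""
        return self.applied_at + self.planned_duration

    def is_overdue(self, now: datetime) -> bool:
        """True once the planned duration has elapsed."""
        return now >= self.rollback_due
```

A periodic job or a chat reminder could call `is_overdue` with the current time so that a forgotten blackhole does not outlive the attack.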
Postmortem: a real improvement list after a DDoS
Even when RTBH/FlowSpec succeed, the to-do list after the event is very clear:
- Edge capacity: PPS/BPS, conntrack, interrupt tuning
- Application resilience: caching, queue, circuit breaker
- Observability: netflow/sflow, WAF logs, upstream telemetry
- Process: rule templates, on-call authority matrix, drills
Conclusion
Designed correctly, RTBH and FlowSpec save time during DDoS events; designed incorrectly, they hurt production traffic. That is why the decision tree and the rollback standard must be part of the runbook, just as much as the technical commands. On the operational leadership side, the biggest win is making “under which condition do we activate which tool?” a decision made in advance, not a question asked in panic.