In network operations there are two extremes: either you stare at a single “interface utilization” graph, or you try to capture every packet and drown in the data. The third path, the one that’s actually sustainable in production, is flow telemetry: with IPFIX/NetFlow/sFlow you can answer “who talked to whom, how much, and when” with enough fidelity to operate.
In this post, I walk through a realistic flow pipeline design that supports DDoS triage and capacity/peering decisions.
What does flow telemetry buy you?
Flow data pays off most in these scenarios:
- DDoS: attack vector (protocol/port), top talkers, target prefix/service
- Capacity: which applications fill the link, what hours they spike
- Anomaly: a new destination country/ASN, an unexpected port, “high fan-out” behavior
- Incident: fast evidence for the question “which segment was talking?”
Pipeline components (minimal but sufficient)
The minimum architecture that has worked for me in practice:
- Exporter: IPFIX/NetFlow on the router/switch/firewall
- Collector: receives the UDP exports and normalizes records (run it HA where possible)
- Enrichment: ASN/GeoIP, prefix, application labels
- Storage: fast querying (usually a columnar DB)
- Dashboard/Alert: ready-made panels for DDoS triage and capacity
On the exporter side: right place, right rate
Where you export flow from is critical:
- Edge uplink: DDoS and transit/peering visibility
- DC core: east-west density, critical segments
- Firewall: correlation with policy/zone context (vendor dependent)
Be deliberate about sampling:
- For DDoS and volumetric visibility, packet sampling (e.g. 1 in 1000) is usually sufficient, as long as you scale the counters back up by the sampling rate.
- For low-volume but critical flows (auth/management), aggressive sampling can cause you to miss the signal.
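To make the sampling trade-off concrete, here is a minimal sketch. The `SampledFlow` type and both helper functions are my own illustrative names, not part of any exporter or collector API, and the detection estimate assumes each packet is sampled independently with probability 1/N, which real exporters only approximate.

```python
from dataclasses import dataclass


@dataclass
class SampledFlow:
    sampled_bytes: int    # bytes counted in the sampled packets only
    sampled_packets: int  # number of sampled packets for this flow
    duration_s: float     # observation window in seconds


def estimate_rates(flow: SampledFlow, sampling_rate: int) -> tuple[float, float]:
    """Scale sampled counters by the 1-in-N rate; returns (bps, pps) estimates."""
    if flow.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    est_bytes = flow.sampled_bytes * sampling_rate
    est_packets = flow.sampled_packets * sampling_rate
    return est_bytes * 8 / flow.duration_s, est_packets / flow.duration_s


def detection_probability(flow_packets: int, sampling_rate: int) -> float:
    """Chance that at least one packet of a flow is sampled (independent sampling assumed)."""
    return 1.0 - (1.0 - 1.0 / sampling_rate) ** flow_packets


bps, pps = estimate_rates(SampledFlow(1_500_000, 1_000, 60.0), sampling_rate=1000)
print(f"estimated {bps:.0f} bps, {pps:.0f} pps")
print(f"20-packet flow seen with p={detection_probability(20, 1000):.3f}")
```

A volumetric flow survives 1-in-1000 sampling easily, while a 20-packet management session is very likely to be missed entirely, which is the reason to sample less aggressively on those segments.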
On the collector side: UDP reality and resilience
Production realities of a flow collector:
- UDP packet loss happens; treat it as a “design assumption.”
- If the collector runs out of capacity, it loses data silently.
- For that reason, instrument the collector itself: ingest_qps, dropped_packets, queue_depth, cpu, disk.
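One way to keep loss from being silent is to count every datagram the collector cannot accept. The sketch below is a hypothetical in-process buffer, not the API of any real collector; `BoundedIngest` and `CollectorStats` are names I made up for illustration, and ingest_qps would be derived by sampling `received` over time.

```python
import queue
from dataclasses import dataclass


@dataclass
class CollectorStats:
    received: int = 0         # datagrams offered to the buffer
    dropped_packets: int = 0  # datagrams rejected because the buffer was full


class BoundedIngest:
    """Hypothetical ingest buffer that counts drops instead of losing them silently."""

    def __init__(self, max_depth: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_depth)
        self.stats = CollectorStats()

    def offer(self, datagram: bytes) -> bool:
        """Try to enqueue a datagram; record a drop if the buffer is full."""
        self.stats.received += 1
        try:
            self._queue.put_nowait(datagram)
            return True
        except queue.Full:
            self.stats.dropped_packets += 1
            return False

    def queue_depth(self) -> int:
        """Current number of datagrams waiting to be processed."""
        return self._queue.qsize()


buf = BoundedIngest(max_depth=2)
for payload in (b"a", b"b", b"c"):
    buf.offer(payload)
print(buf.stats, buf.queue_depth())
```

Exporting these counters next to cpu and disk gives you an alertable signal for the moment the collector starts falling behind.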
Two practical approaches for HA:
- If your exporters support two collector targets (active/active), use it.
- If not: anycast VIP + stateless collectors (you still have to account for loss and deduplication).
Enrichment: raw flow alone is not enough
Enrichments that increase operational value:
- ASN/GeoIP: a change in source/destination ASN produces an anomaly signal
- Prefix map: speeds up the “which service/prefix is the target” question
- Port map: 443 isn’t always “HTTPS,” but it’s a good baseline
- Device/zone tag: which edge/DC/segment
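As a rough illustration of the prefix and port maps, the sketch below labels a flow record using Python's standard `ipaddress` module. The prefixes, service labels, and the dict-shaped flow record are all hypothetical; a real pipeline would load these maps from your IPAM or service inventory.

```python
import ipaddress

# Hypothetical maps; documentation prefixes are used on purpose.
PREFIX_MAP = {
    "203.0.113.0/24": "edge-lb",
    "203.0.113.128/25": "api-vip",
    "198.51.100.0/24": "mail",
}
PORT_MAP = {53: "dns", 123: "ntp", 443: "https"}

# Sort most-specific first so the first match is the longest prefix match.
_NETWORKS = sorted(
    ((ipaddress.ip_network(prefix), label) for prefix, label in PREFIX_MAP.items()),
    key=lambda item: item[0].prefixlen,
    reverse=True,
)


def enrich(flow: dict) -> dict:
    """Return a copy of the flow with dst_service and app labels attached."""
    dst = ipaddress.ip_address(flow["dst_ip"])
    service = next((label for net, label in _NETWORKS if dst in net), "unknown")
    app = PORT_MAP.get(flow["dst_port"], "other")
    return {**flow, "dst_service": service, "app": app}


print(enrich({"dst_ip": "203.0.113.200", "dst_port": 443}))
```

The linear scan is fine for a handful of prefixes; with a full routing table you would want a trie or a database-side join instead.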
Query model: design around triage questions
The questions I most often ask during DDoS triage:
- What’s the top dst_ip/dst_prefix at the target?
- What does the top protocol/port distribution look like?
- What are the top src_asn/src_country?
- Compared to “normal baseline,” where did the increase begin?
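The first three questions are all the same "top N by bytes, grouped by some fields" query. In production this would normally be a GROUP BY in the flow store; the in-memory sketch below only shows the shape of the aggregation, and the record fields and sample values are assumptions for illustration.

```python
from collections import Counter


def top_n(flows, key_fields, n=5, weight="bytes"):
    """Sum `weight` per combination of key_fields and return the n largest groups."""
    totals = Counter()
    for flow in flows:
        key = tuple(flow[field] for field in key_fields)
        totals[key] += flow[weight]
    return totals.most_common(n)


# Hypothetical, already-enriched flow records.
flows = [
    {"dst_ip": "203.0.113.10", "protocol": "udp", "dst_port": 123, "src_asn": 64500, "bytes": 9000},
    {"dst_ip": "203.0.113.10", "protocol": "udp", "dst_port": 123, "src_asn": 64501, "bytes": 6000},
    {"dst_ip": "203.0.113.10", "protocol": "tcp", "dst_port": 443, "src_asn": 64500, "bytes": 1000},
    {"dst_ip": "203.0.113.20", "protocol": "tcp", "dst_port": 443, "src_asn": 64502, "bytes": 500},
]

print(top_n(flows, ("dst_ip",)))
print(top_n(flows, ("protocol", "dst_port")))
print(top_n(flows, ("src_asn",)))
```

Keeping one parameterized query for all three questions makes it easy to turn each of them into a saved dashboard panel.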
For fast answers, presets like “last 15 min, 1 hour, 24 hours” and pre-built queries are essential.
Alert logic: “fast signal, low noise”
Simple but useful alert examples:
- Threshold breach on bps or pps for a specific prefix/service (against baseline)
- A newly-seen dst_port (suddenly rising when never present in prod)
- Excessive surge from a single src_asn
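The three alert types can be sketched as small pure functions. The multiplier, the absolute floor, and the 50% share limit below are placeholder values I picked for the example, not recommendations; tune them against your own baseline.

```python
def breaches_baseline(current_bps: float, baseline_bps: float,
                      factor: float = 3.0, floor_bps: float = 10_000_000) -> bool:
    """True when traffic exceeds factor x baseline and is above an absolute floor.

    The floor keeps tiny prefixes from alerting on noise.
    """
    return current_bps >= floor_bps and current_bps > factor * baseline_bps


def newly_seen_ports(current_ports, known_ports) -> set:
    """Ports present in the current window that were never seen before."""
    return set(current_ports) - set(known_ports)


def dominant_asns(bytes_by_asn: dict, max_share: float = 0.5) -> list:
    """Source ASNs whose share of total bytes exceeds max_share."""
    total = sum(bytes_by_asn.values())
    if total == 0:
        return []
    return sorted(asn for asn, b in bytes_by_asn.items() if b / total > max_share)


print(breaches_baseline(90e6, 20e6))
print(newly_seen_ports({443, 53, 11211}, {443, 53, 80}))
print(dominant_asns({64500: 800, 64501: 100, 64502: 100}))
```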
Runbook: produce a DDoS picture in 5 minutes with flow
My practical “first 5 minutes” sequence:
- Identify the target prefix/service (LB VIP, anycast prefix, app subnet)
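- Pull top dst_port/protocol for the last 5–10 min
- Pull top src_asn and top src_country
- If you see known vectors like udp/53 (DNS), udp/123 (NTP), or udp/1900 (SSDP), name them as such when you talk to the upstream. In reflection attacks these ports usually show up as the source port of the traffic hitting the target, so check src_port as well.
- Make the mitigation decision: RTBH/FlowSpec/scrubbing/WAF (depending on service type)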
With this discipline, flow produces “evidence” instead of “I had a feeling there was an attack.”
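To turn the port breakdown into wording an upstream recognizes, a small lookup is enough. This is a sketch: the table only covers the three vectors named above, and feeding it (protocol, port, bytes) tuples taken from the source-port column is my assumption about how the triage query is shaped.

```python
# Well-known UDP reflection/amplification services, keyed by (protocol, port).
KNOWN_VECTORS = {
    ("udp", 53): "DNS reflection/amplification",
    ("udp", 123): "NTP reflection/amplification",
    ("udp", 1900): "SSDP reflection/amplification",
}


def label_vectors(top_ports):
    """Map (protocol, port, bytes) rows to (vector name, percent of total bytes)."""
    rows = list(top_ports)
    total = sum(byte_count for _, _, byte_count in rows)
    labeled = []
    for protocol, port, byte_count in rows:
        name = KNOWN_VECTORS.get((protocol, port))
        if name and total:
            labeled.append((name, round(100 * byte_count / total, 1)))
    return labeled


print(label_vectors([("udp", 123, 7000), ("tcp", 443, 2000), ("udp", 53, 1000)]))
```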
Conclusion
A telemetry pipeline based on IPFIX/NetFlow lets you make faster and more accurate decisions during DDoS, and strengthens capacity and anomaly visibility in normal times. It isn’t as heavy as packet capture, and it isn’t as blind as an SNMP graph. With the right sampling, good enrichment, and clear triage questions, flow telemetry becomes one of the most efficient signal sources in network operations.