The most critical part of a router or switch is not the forwarding ASIC but the control plane: routing adjacencies, management access, ARP/ND, protocol timers. When the control plane is under stress, the data plane may still “look like it’s working”, but the system rapidly becomes unstable: adjacencies flap, management access is lost, and a full outage can follow.
CoPP/CPP (the name varies by vendor) is therefore not just a “security” control but also a resilience control: it classifies, prioritizes, and rate-limits the traffic destined for the control plane.
Threat model: who stresses the control plane?
The most common sources of control-plane stress I see in the field:
- Scan/flood: ICMP, TCP SYN, scanning of management ports (SSH/SNMP)
- Bad telemetry: aggressive polling (SNMP), runaway trap storms
- Routing churn: route flaps, unexpected or misconfigured neighbors, LSA/LSP storms
- L2 anomalies: ARP/ND bursts, loops, broadcast storms
- Bad ACL/punt: unexpected traffic gets punted to the CPU
CoPP is not a “drop everything” approach to this traffic; it guarantees capacity for the critical classes and trims the noise.
Design principles: CoPP is not a “policy set” but a living model
The CoPP principles I’ve seen work in the field:
- Routing first: protocols like BGP/OSPF/IS-IS go in the highest-priority class.
- Management constrained but stable: SSH/SNMP stay reachable, but under rate limits and resource caps.
- ICMP handled smartly: don’t block it entirely; just rate-limit the floods.
- Default drop / low rate: cap unknown punt traffic with a low threshold.
- Observation is mandatory: counters + CPU + drops are wired into the alarm set.
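The rate limits in these principles are typically enforced by a per-class policer. Below is a minimal sketch of a token-bucket policer in Python, assuming a packets-per-second rate and a burst size. The `TokenBucket` class, the `simulate_flood` helper, and the 100 pps / burst 20 figures are illustrative assumptions, not any vendor's implementation.

```python
class TokenBucket:
    """Illustrative packets-per-second policer (hypothetical, not vendor code)."""

    def __init__(self, rate_pps: float, burst: float):
        self.rate_pps = rate_pps   # sustained refill rate, in tokens per second
        self.burst = burst         # bucket depth: packets admitted back to back
        self.tokens = burst        # start full so an initial burst is tolerated
        self.last = 0.0            # arrival time of the previous packet (seconds)
        self.conformed = 0         # counters, since observation is mandatory
        self.dropped = 0

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` conforms."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            self.conformed += 1
            return True
        self.dropped += 1
        return False


def simulate_flood() -> tuple[int, int]:
    """Push 1000 packets in one second through a 100 pps / burst 20 policer."""
    policer = TokenBucket(rate_pps=100, burst=20)
    for i in range(1000):
        policer.allow(now=i / 1000.0)   # one packet every millisecond
    return policer.conformed, policer.dropped
```

In this simulated flood the policer admits roughly the burst plus one second of sustained rate and drops the rest, which is the "cut the flood, keep the service" behavior described above. The two counters are the kind of per-class data the alarm set would consume.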
Class design: a minimum viable control-plane policy
Vendor-agnostic minimum classes:
1) Routing control
- BGP, OSPF/IS-IS, BFD, VRRP/HSRP (whichever you use)
- Goal: keep adjacencies stable; don’t lock the CPU even during a flap storm
2) Management access
- SSH, API, SNMPv3, TACACS/RADIUS (whatever your management plane is)
- Goal: don’t lose access to the device during an incident
3) L2/L3 infrastructure signals
- ARP/ND, DHCP relay control, NTP (depending on use)
- Goal: keep core infrastructure signals flowing while limiting them during a burst
4) ICMP and diagnostics
- Necessary traffic like ping/traceroute
- Goal: keep triage possible while cutting floods
5) Default punt / unknown
- Anything unexpectedly punted
- Goal: don’t let unknown traffic crush the CPU
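The five classes can be pictured as a lookup from packet attributes to a class name. The sketch below is a simplified Python model under stated assumptions: the `PuntedPacket` type, the class labels, and the protocol strings are hypothetical, matching is on destination port only, and only well-known ports are listed (BGP 179/tcp, BFD single-hop 3784/udp, HSRP 1985/udp, SSH 22/tcp, SNMP 161/udp, TACACS+ 49/tcp, RADIUS 1812/udp, DHCP 67/udp, NTP 123/udp). A real policy would also match source ports, addresses, and interfaces.

```python
from dataclasses import dataclass
from typing import Optional

ROUTING = "routing-control"
MGMT = "management-access"
INFRA = "infrastructure"
DIAG = "icmp-diagnostics"
DEFAULT = "default-punt"


@dataclass(frozen=True)
class PuntedPacket:
    """Hypothetical, heavily simplified view of a packet punted to the CPU."""
    proto: str                      # e.g. "tcp", "udp", "ospf", "icmp", "arp"
    dst_port: Optional[int] = None  # only meaningful for tcp/udp


# Matches on (protocol, destination port).
PORT_CLASSES = {
    ("tcp", 179): ROUTING,   # BGP
    ("udp", 3784): ROUTING,  # BFD single-hop control
    ("udp", 1985): ROUTING,  # HSRP
    ("tcp", 22): MGMT,       # SSH
    ("udp", 161): MGMT,      # SNMP
    ("tcp", 49): MGMT,       # TACACS+
    ("udp", 1812): MGMT,     # RADIUS authentication
    ("udp", 67): INFRA,      # DHCP server/relay
    ("udp", 123): INFRA,     # NTP
}

# Matches on protocol alone (no ports involved).
PROTO_CLASSES = {
    "ospf": ROUTING,
    "isis": ROUTING,
    "vrrp": ROUTING,
    "arp": INFRA,
    "nd": INFRA,             # neighbor discovery, modeled as its own label here
    "icmp": DIAG,
}


def classify(pkt: PuntedPacket) -> str:
    """Return the class for a punted packet; unknown traffic falls to DEFAULT."""
    by_port = PORT_CLASSES.get((pkt.proto, pkt.dst_port))
    if by_port is not None:
        return by_port
    return PROTO_CLASSES.get(pkt.proto, DEFAULT)
```

For example, `classify(PuntedPacket("tcp", 179))` lands in the routing class, while an unlisted UDP port falls through to the default punt class, which is where the lowest rate limit belongs.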
Setting thresholds: answer “how many pps?” with a baseline
The most critical question: “What should the rate limit be?” The answer depends less on the device model and more on your normal traffic profile.
A practical method:
- Watch the punt traffic counters under normal operation for at least 7 days
- Extract the 95th and 99th percentile values (p95/p99)
- Set the threshold “a bit above normal” with a burst tolerance
- Alarm on two signals: traffic approaching the limit, and a rise in drops
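The baseline method can be sketched in a few lines of Python. This is one possible reading of it, not a sizing rule: the function names, the 25% headroom over p99, the half-second burst allowance, and the 80% warning ratio are all assumptions chosen for illustration. The percentile uses the nearest-rank definition.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_policer(samples_pps: list[float],
                    headroom: float = 1.25,        # assumed: 25% above p99
                    burst_seconds: float = 0.5) -> dict:
    """Derive a rate and burst for one class from baseline pps samples."""
    p95 = percentile(samples_pps, 95)
    p99 = percentile(samples_pps, 99)
    rate = math.ceil(p99 * headroom)               # "a bit above normal"
    burst = math.ceil(rate * burst_seconds)        # burst tolerance, in packets
    return {"p95": p95, "p99": p99, "rate_pps": rate, "burst_packets": burst}


def alarm_reasons(observed_pps: float, rate_pps: float, drops_delta: int,
                  warn_ratio: float = 0.8) -> list[str]:
    """Return which of the two alarm signals are currently firing."""
    reasons = []
    if observed_pps >= warn_ratio * rate_pps:
        reasons.append("approaching-limit")
    if drops_delta > 0:
        reasons.append("drops-rising")
    return reasons
```

With a synthetic baseline of `list(range(1, 101))` pps samples, `suggest_policer` yields p95 = 95, p99 = 99, a rate of 124 pps, and a burst of 62 packets. Real baselines should come from each class's own counters, since a single device-wide number hides exactly the per-class differences CoPP relies on.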
Rollout plan: canary + waves + rollback
The safest operating model when bringing CoPP online:
- Pick a canary device (a critical node, but not a single point of failure)
- Start the policy “monitor-heavy”, beginning with the low-risk classes
- 24–48 hours of observation: drops, CPU, adjacency
- Then roll it out wave by wave
- Rollback: have a one-command disable/rollback procedure ready
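The canary observation step can be reduced to an explicit go/no-go check so that each wave starts from the same criteria. The Python sketch below is only an illustration: the `CanaryObservation` fields, the 24-hour minimum, and the 70% CPU ceiling are assumed values that would need to be replaced with your own.

```python
from dataclasses import dataclass


@dataclass
class CanaryObservation:
    """Hypothetical summary of what was seen on the canary device."""
    hours_observed: float
    peak_cpu_percent: float
    critical_class_drops: int   # drops in the routing and management classes
    adjacency_flaps: int


def canary_verdict(obs: CanaryObservation,
                   min_hours: float = 24.0,     # assumed lower bound of the window
                   cpu_limit: float = 70.0) -> str:
    """Return "rollback", "hold", or "proceed" for the next rollout wave."""
    # Any damage to the two things CoPP is meant to protect means roll back.
    if obs.critical_class_drops > 0 or obs.adjacency_flaps > 0:
        return "rollback"
    # High CPU or too short a window: keep observing, do not widen the rollout.
    if obs.peak_cpu_percent > cpu_limit or obs.hours_observed < min_hours:
        return "hold"
    return "proceed"
```

A verdict of "rollback" is where the one-command disable procedure comes in; "hold" simply extends the observation window before the next wave.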
Closing
CoPP/CPP turns the control plane from “a port everyone speaks on” into a classified service plane. What this changes in the field: the two things you need most during an incident (routing stability + management access) become less brittle.