The quickest route to false confidence I see on SD-WAN projects is this one: “We have two circuits, SD-WAN is already smart, it’ll just switch over.” In production, though, the problem is rarely a link going down. More often the link still looks “up” while latency, jitter, or loss degrades, the application breaks, and teams burn hours arguing whether it’s “the ISP, the overlay, or the internal network.”
This post is about how I structure the SLA probe approach for path selection on SD-WAN, and how I speed up triage when an incident hits.
What does an SLA probe actually solve?
An SLA probe sends regular small packets to specific targets and produces:
- RTT / latency (p50/p95)
- jitter (especially critical for voice/video)
- packet loss (degradation thresholds vary by application)
With these measurements, path selection stops being about a “default route” and becomes a matter of picking the best path for each application class.
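To make those three outputs concrete, here is a minimal sketch of how one probe window could be summarized. The function name `summarize_probes`, the nearest-rank percentile, and the jitter definition (mean absolute difference between consecutive replies) are my assumptions for illustration; real SD-WAN platforms compute these in their own ways.

```python
import math
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one probe window. None marks a probe that got no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if len(replies) < 2:
        # Not enough replies to say anything about latency or jitter.
        return {"p50_ms": None, "p95_ms": None, "jitter_ms": None, "loss_pct": loss_pct}

    ordered = sorted(replies)

    def percentile(p: float) -> float:
        # Nearest-rank percentile: smallest value covering p percent of replies.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    # Simplification: jitter is the mean absolute difference between
    # consecutive replies, ignoring the gaps left by lost probes.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    return {
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "jitter_ms": sum(deltas) / len(deltas),
        "loss_pct": loss_pct,
    }


window = [20.0, 22.0, None, 21.0, 40.0, 20.0, 21.0, 22.0, 20.0, 21.0]
summary = summarize_probes(window)
```

Note how a single 40 ms spike leaves the p50 untouched but dominates the p95 and the jitter figure, which is why the median alone is not enough for voice/video decisions.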
First decision: Probe targets (wrong target = wrong decision)
Don’t pin probe targets to a single IP labelled “the internet.” Pick targets at two layers:
- Underlay targets (inside the ISP / edge router): to measure circuit quality
- Overlay targets (hub, DC, cloud edge): to measure the actual service path
Sample target set:
- Branch -> SD-WAN hub (overlay)
- Branch -> DC edge (overlay)
- Branch -> cloud region edge (overlay)
- Branch -> ISP gateway (underlay)
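As a rough illustration, the sample set above could be kept as data and checked so that no branch ships with only one layer covered. The `ProbeTarget` type, the `missing_layers` helper, and the addresses (taken from the documentation ranges) are hypothetical, not any vendor's schema.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ProbeTarget:
    name: str
    address: str  # placeholder addresses from documentation ranges
    layer: str    # "underlay" or "overlay"


BRANCH_TARGETS = [
    ProbeTarget("sdwan-hub", "192.0.2.10", "overlay"),
    ProbeTarget("dc-edge", "192.0.2.20", "overlay"),
    ProbeTarget("cloud-region-edge", "192.0.2.30", "overlay"),
    ProbeTarget("isp-gateway", "198.51.100.1", "underlay"),
]


def missing_layers(targets: Sequence[ProbeTarget]) -> List[str]:
    """Return the layers that have no probe target at all."""
    present = {t.layer for t in targets}
    return sorted({"underlay", "overlay"} - present)
```

A branch whose target list returns a non-empty result here can tell you that something is wrong, but not whether the circuit or the service path is to blame.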
Application classes: Don’t use a single SLA threshold set
A common mistake: applying the same thresholds to all traffic.
Example (enterprise practice):
- Voice/Video: sensitive to jitter and loss (even small degradation hurts)
- ERP/Interactive: sensitive to latency
- Bulk/Backup: tolerates loss but cares about throughput
So your “application policy” should include these components:
- DSCP class
- SLA thresholds (latency/jitter/loss)
- Failover behaviour (fast switch or sticky?)
- Recovery behaviour (hysteresis to prevent flapping)
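A minimal sketch of such a policy, and of how the same measurements can lead to different path choices per class. The `AppPolicy` fields mirror the components listed above; the threshold numbers, DSCP values, path names, and the "lowest latency among compliant paths" rule are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AppPolicy:
    name: str
    dscp: str
    max_latency_ms: float
    max_jitter_ms: float
    max_loss_pct: float
    sticky: bool  # failover behaviour: stay put if possible, or switch fast


def meets_sla(policy: AppPolicy, m: Dict[str, float]) -> bool:
    return (
        m["latency_ms"] <= policy.max_latency_ms
        and m["jitter_ms"] <= policy.max_jitter_ms
        and m["loss_pct"] <= policy.max_loss_pct
    )


def pick_path(policy: AppPolicy, paths: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the lowest-latency path that meets the policy, or None if none does."""
    compliant = [name for name, m in paths.items() if meets_sla(policy, m)]
    if not compliant:
        return None
    return min(compliant, key=lambda name: paths[name]["latency_ms"])


# Illustrative thresholds only; tune them to your own applications.
VOICE = AppPolicy("voice", "EF", max_latency_ms=150, max_jitter_ms=30, max_loss_pct=1.0, sticky=False)
ERP = AppPolicy("erp", "AF21", max_latency_ms=100, max_jitter_ms=50, max_loss_pct=1.0, sticky=True)

paths = {
    "mpls": {"latency_ms": 40, "jitter_ms": 5, "loss_pct": 0.1},
    "broadband": {"latency_ms": 25, "jitter_ms": 45, "loss_pct": 0.5},
}
```

With these numbers, voice lands on "mpls" because the broadband jitter breaks its threshold, while ERP prefers "broadband" for its lower latency. One threshold set for all traffic would hide that difference.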
Flap management: Hysteresis and hold-down are mandatory
If the path-selection engine “ping-pongs” while the SLA degrades and recovers, the user experiences it as “the internet keeps cutting out.”
Minimums I recommend:
- Degrade threshold: e.g. 3 consecutive bad measurements
- Recovery threshold: e.g. 10 consecutive good measurements
- Hold-down: stay on a path for X minutes after a switch
This trio prevents “decision chaos” during an incident.
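The three rules can be sketched as a small state tracker per path. This is a simplified model under my own assumptions: the class name `PathHealth` is hypothetical, the hold-down blocks any state change (degrade or recover) until it expires, and the 300-second default is only a placeholder for the X above.

```python
class PathHealth:
    """Tracks whether a path is considered healthy, with hysteresis and hold-down."""

    def __init__(self, degrade_after: int = 3, recover_after: int = 10,
                 hold_down_s: float = 300.0):  # 300 s is a placeholder value
        self.degrade_after = degrade_after
        self.recover_after = recover_after
        self.hold_down_s = hold_down_s
        self.healthy = True
        self._bad_streak = 0
        self._good_streak = 0
        self._last_change = float("-inf")

    def observe(self, sla_ok: bool, now: float) -> bool:
        """Feed one measurement (taken at time `now`, in seconds); return the state."""
        if sla_ok:
            self._good_streak += 1
            self._bad_streak = 0
        else:
            self._bad_streak += 1
            self._good_streak = 0

        in_hold_down = (now - self._last_change) < self.hold_down_s
        if not in_hold_down:
            if self.healthy and self._bad_streak >= self.degrade_after:
                self.healthy = False
                self._last_change = now
            elif not self.healthy and self._good_streak >= self.recover_after:
                self.healthy = True
                self._last_change = now
        return self.healthy
```

The asymmetry is deliberate: leaving a bad path takes 3 bad samples, returning to it takes 10 good ones plus an expired hold-down, so a path that recovers for a few seconds does not pull traffic back.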
Operations: Triage runbook (classify within 15 minutes)
When degradation begins, the first goal is not “root cause”; it’s to classify the degradation.
1) Which class is affected?
- Only voice/video?
- Only ERP?
- All traffic?
The affected class tells you which metric to check first, and the degraded metric usually points at the root cause (jitter -> bufferbloat/queueing, loss -> physical layer/ISP, latency -> route change).
2) What do the probe results say?
- Underlay good, overlay bad -> look at the hub/DC side
- Underlay bad, overlay bad -> ISP / last mile
- Underlay good, overlay good but the user is complaining -> internal LAN/Wi-Fi/endpoint
3) Make the failover decision deliberately
- Even if “auto failover” is enabled, you may need a manual “freeze” during large-scale incidents
- During a large ISP incident, all branches switching to the second circuit at once can saturate that circuit too
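The probe matrix from step 2 is simple enough to encode, which helps when the on-call engineer is not the person who designed the probes. The function name `classify_degradation` and the returned labels are mine; the "underlay bad, overlay good" case is not part of the matrix above, so the sketch only flags it for review.

```python
def classify_degradation(underlay_ok: bool, overlay_ok: bool,
                         users_complaining: bool = True) -> str:
    """Map probe results to the first place to look."""
    if underlay_ok and not overlay_ok:
        return "hub/DC side"
    if not underlay_ok and not overlay_ok:
        return "ISP / last mile"
    if underlay_ok and overlay_ok:
        if users_complaining:
            return "internal LAN/Wi-Fi/endpoint"
        return "no degradation observed"
    # Underlay bad but overlay good is outside the matrix above;
    # treat it as a prompt to re-check the probe targets themselves.
    return "unclassified: review probe targets"
```

For example, `classify_degradation(underlay_ok=True, overlay_ok=False)` returns "hub/DC side", which is the cue to stop calling the ISP.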
Observability: Aggregate SD-WAN telemetry in one place
Don’t leave SLA probe output trapped in “the controller’s screen.” Centrally monitor these:
- Per-branch latency/jitter/loss trend
- Path change events
- Preferred path per application class
This data is vendor-independent, and it raises the quality of your incident postmortems.
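Once path change events sit in a central store, even a small query pays off. The sketch below assumes a hypothetical event shape (a dict with `branch` and `ts` keys) and an arbitrary threshold of 4 changes per window; neither comes from a specific controller.

```python
from collections import Counter
from typing import Dict, Iterable, List


def flapping_branches(events: Iterable[Dict], window_start: float,
                      window_end: float, max_changes: int = 4) -> List[str]:
    """Return branches with more than `max_changes` path changes in the window."""
    counts = Counter(
        e["branch"] for e in events
        if window_start <= e["ts"] < window_end
    )
    return sorted(branch for branch, n in counts.items() if n > max_changes)


events = [{"branch": "branch-a", "ts": t} for t in range(6)]
events.append({"branch": "branch-b", "ts": 3})
noisy = flapping_branches(events, window_start=0, window_end=10)
```

A branch that keeps showing up in this list is a candidate for stricter recovery thresholds or a longer hold-down.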
Final word
The “intelligence” of SD-WAN doesn’t render your operational reflexes unnecessary. With SLA probes and the right target/threshold design, path selection becomes truly application-centric, incident triage drops from hours to minutes, and failover decisions become more deliberate.