Path Selection and Incident Triage with SLA Probes in SD-WAN

The false confidence I run into most often on SD-WAN projects is this one: “We have two circuits, SD-WAN is already smart, it’ll just switch over.” In production, however, the problem is rarely a link going down outright. The link looks “up” while latency, jitter, or loss degrades, the application breaks, and teams burn hours arguing whether it’s “the ISP, the overlay, or the internal network.”

This post is about how I structure the SLA probe approach for path selection on SD-WAN, and how I speed up triage when an incident hits.

What does an SLA probe actually solve?

An SLA probe sends small packets to specific targets at regular intervals and produces:

RTT / latency (p50/p95)
jitter (especially critical for voice/video)
packet loss (degradation thresholds vary by application)

With these metrics, path selection is no longer about a “default route”; it’s about picking the best path for each application class.
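
To make the three metrics concrete, here is a minimal Python sketch that turns one window of probe results into p50/p95 latency, jitter, and loss. The function name summarize_probes and the convention that None marks a lost probe are my assumptions for illustration. Jitter here is the mean absolute difference between consecutive replies, which is a simplification; your SD-WAN vendor may compute it differently (for example with the smoothed estimator from RFC 3550).

```python
from statistics import quantiles
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarise one probe window. None marks a probe that got no reply."""
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0

    # Percentiles and jitter need at least two replies to be meaningful.
    if len(received) < 2:
        return {"p50_ms": None, "p95_ms": None, "jitter_ms": None, "loss_pct": loss_pct}

    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(received, n=100, method="inclusive")

    # Simplified jitter: mean absolute difference between consecutive replies.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]

    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "jitter_ms": sum(diffs) / len(diffs),
        "loss_pct": loss_pct,
    }


window = [20.0, 22.0, None, 21.0, 25.0]  # one lost probe out of five
summary = summarize_probes(window)
```

For this window the loss is 20% (one of five probes missing), while p50 stays close to the typical RTT. That is exactly the “link is up but degraded” case a plain up/down check would miss.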

First decision: Probe targets (wrong target = wrong decision)

Don’t pin probe targets to a single IP labelled “the internet.” Pick targets at two layers:

Underlay targets (inside the ISP / edge router): to measure circuit quality
Overlay targets (hub, DC, cloud edge): to measure the actual service path

Sample target set:

Branch -> SD-WAN hub (overlay)
Branch -> DC edge (overlay)
Branch -> cloud region edge (overlay)
Branch -> ISP gateway (underlay)
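
The target set above can be written down as data so that a missing layer is caught before it produces a wrong decision. This is only a sketch: ProbeTarget and validate_targets are hypothetical names, and the addresses come from the reserved documentation ranges, not from any real network.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeTarget:
    name: str
    layer: str    # "underlay" or "overlay"
    address: str  # placeholder documentation addresses, not real hosts


TARGETS = [
    ProbeTarget("sdwan-hub", "overlay", "198.51.100.10"),
    ProbeTarget("dc-edge", "overlay", "198.51.100.20"),
    ProbeTarget("cloud-region-edge", "overlay", "203.0.113.10"),
    ProbeTarget("isp-gateway", "underlay", "192.0.2.1"),
]


def validate_targets(targets: Iterable[ProbeTarget]) -> None:
    """Raise ValueError unless both layers have at least one target."""
    layers = {t.layer for t in targets}
    missing = {"underlay", "overlay"} - layers
    if missing:
        raise ValueError(f"no probe target for layer(s): {sorted(missing)}")


validate_targets(TARGETS)  # passes: both layers are covered
```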

Application classes: Don’t use a single SLA threshold set

A common mistake: applying the same thresholds to all traffic.

Example (enterprise practice):

Voice/Video: sensitive to jitter and loss (even small degradation hurts)
ERP/Interactive: sensitive to latency
Bulk/Backup: tolerates moderate loss and latency, but cares about sustained throughput

So your “application policy” should include these components:

DSCP class
SLA thresholds (latency/jitter/loss)
Failover behaviour (fast switch or sticky?)
Recovery behaviour (hysteresis to prevent flapping)
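
As a sketch, the four components can live in one record per application class. Everything below is illustrative: the AppPolicy type, the violations helper, and above all the threshold numbers are placeholders I picked for the example, not recommendations. The DSCP values are the standard code points for EF (46), AF21 (18), and AF11 (10); which class maps to which code point is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AppPolicy:
    name: str
    dscp: int              # DSCP class used to match the traffic
    max_latency_ms: float  # SLA thresholds
    max_jitter_ms: float
    max_loss_pct: float
    sticky: bool           # failover behaviour: True = avoid moving unless needed
    recover_after: int     # recovery behaviour: good samples before switching back


# Placeholder thresholds for illustration only; tune them per environment.
VOICE = AppPolicy("voice-video", 46, 150.0, 30.0, 1.0, sticky=False, recover_after=10)
ERP = AppPolicy("erp-interactive", 18, 100.0, 50.0, 2.0, sticky=False, recover_after=10)
BULK = AppPolicy("bulk-backup", 10, 400.0, 100.0, 5.0, sticky=True, recover_after=20)


def violations(policy: AppPolicy, latency_ms: float, jitter_ms: float, loss_pct: float) -> List[str]:
    """Return the names of the SLA thresholds this measurement breaks."""
    broken = []
    if latency_ms > policy.max_latency_ms:
        broken.append("latency")
    if jitter_ms > policy.max_jitter_ms:
        broken.append("jitter")
    if loss_pct > policy.max_loss_pct:
        broken.append("loss")
    return broken


# The same measurement is a problem for voice but not for bulk traffic.
voice_result = violations(VOICE, latency_ms=80.0, jitter_ms=45.0, loss_pct=0.2)
bulk_result = violations(BULK, latency_ms=80.0, jitter_ms=45.0, loss_pct=0.2)
```

With these placeholder numbers, 45 ms of jitter violates the voice policy and leaves bulk untouched, which is the whole point of not sharing one threshold set.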

Flap management: Hysteresis and hold-down are mandatory

If the path-selection engine “ping-pongs” while the SLA degrades and recovers, the user experiences it as “the internet keeps cutting out.”

Minimums I recommend:

Degrade threshold: e.g. 3 consecutive bad measurements
Recovery threshold: e.g. 10 consecutive good measurements
Hold-down: stay on a path for X minutes after a switch

This trio prevents “decision chaos” during an incident.
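
Here is a minimal sketch of how the three controls interact, assuming one verdict per path and one boolean per measurement (“did this sample meet the SLA?”). PathGuard and its parameters are hypothetical names; real controllers expose these knobs under vendor-specific terms. One deliberate simplification: the hold-down below blocks every change, including a second degradation, so a hard link-down event would need separate handling.

```python
from typing import Optional


class PathGuard:
    """Health verdict for one path with hysteresis and a hold-down timer."""

    def __init__(self, degrade_after: int = 3, recover_after: int = 10,
                 hold_down_s: float = 300.0) -> None:
        self.degrade_after = degrade_after
        self.recover_after = recover_after
        self.hold_down_s = hold_down_s
        self.healthy = True
        self._bad = 0
        self._good = 0
        self._last_change: Optional[float] = None

    def observe(self, sample_ok: bool, now: float) -> bool:
        """Feed one measurement and return the current verdict."""
        # Count consecutive results; any opposite sample resets the streak.
        if sample_ok:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0

        # Hold-down: no verdict change until the timer since the last one expires.
        if self._last_change is not None and now - self._last_change < self.hold_down_s:
            return self.healthy

        if self.healthy and self._bad >= self.degrade_after:
            self._flip(False, now)
        elif not self.healthy and self._good >= self.recover_after:
            self._flip(True, now)
        return self.healthy

    def _flip(self, healthy: bool, now: float) -> None:
        self.healthy = healthy
        self._last_change = now
        self._bad = 0
        self._good = 0
```

Note the asymmetry: three bad samples are enough to leave a path, but ten good ones plus an expired hold-down are needed to return. That asymmetry is what stops the ping-pong.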

Operations: Triage runbook (classify within 15 minutes)

When degradation begins, the first goal is not “root cause”; it’s to classify the degradation.

1) Which class is affected?

Only voice/video?
Only ERP?
All traffic?

The affected class usually points at the root cause (jitter -> bufferbloat/queue, loss -> physical/ISP, latency -> route change).

2) What do the probe results say?

Underlay good, overlay bad -> look at the hub/DC side
Underlay bad, overlay bad -> ISP / last mile
Underlay good, overlay good but the user is complaining -> internal LAN/Wi-Fi/endpoint
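
The three cases above fit in a small decision function, which is handy to embed in a runbook or a chat-ops command. This is a sketch with a hypothetical function name. The fourth combination (underlay bad, overlay good) is not covered by the list above, so the sketch only flags it as inconclusive; that branch is my addition.

```python
def classify_degradation(underlay_ok: bool, overlay_ok: bool,
                         user_complaint: bool = True) -> str:
    """Map probe verdicts to the first place to look."""
    if underlay_ok and not overlay_ok:
        return "hub/DC side"
    if not underlay_ok and not overlay_ok:
        return "ISP / last mile"
    if underlay_ok and overlay_ok:
        # Probes look fine: the problem is probably closer to the user.
        return "internal LAN/Wi-Fi/endpoint" if user_complaint else "no action"
    # Underlay bad but overlay good is not in the runbook above (assumption):
    # treat it as a reason to re-check the probe targets themselves.
    return "inconclusive: re-check probe targets"


first_look = classify_degradation(underlay_ok=True, overlay_ok=False)
```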

3) Make the failover decision deliberately

Even if “auto failover” is enabled, you may need a manual “freeze” during a large-scale incident
During a large ISP incident, all branches switching to the second circuit at once can saturate that circuit too
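
The saturation risk is simple arithmetic, and it is worth doing before the incident rather than during it. The sketch below assumes you know each branch's typical demand and the shared capacity on the backup side; plan_failover, the 80% utilisation ceiling, and the sample numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def plan_failover(demand_mbps: Dict[str, float], backup_capacity_mbps: float,
                  max_utilisation: float = 0.8) -> Tuple[List[str], List[str]]:
    """Split branches into those that may move and those to freeze.

    Branches are considered in dict order, so list the most critical first.
    """
    budget = backup_capacity_mbps * max_utilisation
    used = 0.0
    move: List[str] = []
    freeze: List[str] = []
    for branch, demand in demand_mbps.items():
        if used + demand <= budget:
            move.append(branch)
            used += demand
        else:
            freeze.append(branch)  # would push the backup past the ceiling
    return move, freeze


demand = {"hq": 300.0, "branch-a": 200.0, "branch-b": 400.0, "branch-c": 100.0}
move, freeze = plan_failover(demand, backup_capacity_mbps=1000.0)
```

In this example the total demand is 1000 Mbps against an 800 Mbps budget, so branch-b stays frozen on the degraded circuit while the others move. Whether that trade-off is acceptable is exactly the deliberate decision this step asks for.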

Observability: Aggregate SD-WAN telemetry in one place

Don’t leave SLA probe output trapped in “the controller’s screen.” Centrally monitor these:

Per-branch latency/jitter/loss trend
Path change events
Preferred path per application class

Whichever vendor you run, having this data in one place raises the quality of your incident postmortems.
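
As a sketch of what “one place” can mean for path change events, the snippet below normalises them into a single record type and derives two of the views listed above: how often each branch/class pair switched, and which path it currently prefers. The PathChange fields and the function name are assumptions; map them to whatever your controller actually exports.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (branch, application class)


@dataclass(frozen=True)
class PathChange:
    ts: float        # event time, seconds since epoch
    branch: str
    app_class: str
    old_path: str
    new_path: str


def summarize_path_changes(events: Iterable[PathChange]) -> Tuple[Counter, Dict[Key, str]]:
    """Return switch counts and the latest preferred path per branch/class."""
    changes: Counter = Counter()
    preferred: Dict[Key, str] = {}
    for ev in sorted(events, key=lambda e: e.ts):  # replay in time order
        key = (ev.branch, ev.app_class)
        changes[key] += 1
        preferred[key] = ev.new_path
    return changes, preferred


events = [
    PathChange(200.0, "branch-a", "voice", "lte", "mpls"),
    PathChange(100.0, "branch-a", "voice", "mpls", "lte"),
    PathChange(150.0, "branch-b", "erp", "mpls", "inet"),
]
changes, preferred = summarize_path_changes(events)
```

A high count for one key over a short period is a flapping signal worth alerting on, independently of what the controller UI shows.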

Final word

The “intelligence” of SD-WAN doesn’t render your operational reflexes unnecessary. With SLA probes and the right target/threshold design, path selection becomes truly application-centric, incident triage drops from hours to minutes, and failover decisions become more deliberate.
