Mustafa Erbay
Technology · 12 min read

Path Selection and Incident Triage with SLA Probes in SD-WAN

Choosing the right path for application classes via active probes that measure latency/jitter/loss; rapid diagnosis during degradation and a controlled…

The first piece of false confidence I run into on SD-WAN projects is this one: “We have two circuits, SD-WAN is already smart, it’ll just switch over.” In production, however, the problem is rarely a link going down. More often the link looks “up” while latency, jitter, or loss degrades, the application breaks, and teams burn hours arguing whether the culprit is “the ISP, the overlay, or the internal network.”

This post is about how I structure the SLA probe approach for path selection on SD-WAN, and how I speed up triage when an incident hits.

What does an SLA probe actually solve?

An SLA probe sends regular small packets to specific targets and produces:

  • RTT / latency (p50/p95)
  • jitter (especially critical for voice/video)
  • packet loss (degradation thresholds vary by application)

With these measurements, path selection stops being a question of the “default route” and becomes a question of which path is best for each application class.
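
To make the three outputs concrete, here is a minimal sketch of how one probe window could be summarized. The function name `summarize_probes` and the jitter definition (mean absolute difference between consecutive RTTs) are my assumptions for illustration; vendors compute jitter in different ways, so treat the numbers as comparable only within one method.

```python
from statistics import quantiles


def summarize_probes(rtts_ms, sent):
    """Summarize one probe window (hypothetical helper, not a vendor API).

    rtts_ms: RTTs in ms of the probes that were answered, in send order.
    sent:    total number of probes sent in the window.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    loss_pct = 100.0 * (sent - len(rtts_ms)) / sent
    if len(rtts_ms) < 2:
        # Not enough answers to compute percentiles or jitter.
        return {"p50_ms": None, "p95_ms": None,
                "jitter_ms": None, "loss_pct": loss_pct}
    # 99 cut points; index 49 is the median, index 94 is the 95th percentile.
    cuts = quantiles(rtts_ms, n=100, method="inclusive")
    # Assumed jitter definition: mean absolute delta between consecutive RTTs.
    deltas = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "jitter_ms": sum(deltas) / len(deltas),
        "loss_pct": loss_pct,
    }


window = summarize_probes([10, 12, 11, 13, 10, 12, 11, 13], sent=10)
```

Reporting p95 next to p50 matters because a path whose median looks fine can still have a tail that hurts interactive traffic.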

First decision: Probe targets (wrong target = wrong decision)

Don’t pin probe targets to a single IP labelled “the internet.” Pick targets at two layers:

  1. Underlay targets (inside the ISP / edge router): to measure circuit quality
  2. Overlay targets (hub, DC, cloud edge): to measure the actual service path

Sample target set:

  • Branch -> SD-WAN hub (overlay)
  • Branch -> DC edge (overlay)
  • Branch -> cloud region edge (overlay)
  • Branch -> ISP gateway (underlay)
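
The sample set above could be written down as data, with a small check that neither layer has been forgotten. This is only a sketch: the `ProbeTarget` type, the target names, and the addresses (reserved documentation ranges) are placeholders I made up, not values from a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTarget:
    name: str
    layer: str    # "underlay" or "overlay"
    address: str  # placeholder documentation addresses, replace with your own


TARGETS = [
    ProbeTarget("sdwan-hub", "overlay", "192.0.2.10"),
    ProbeTarget("dc-edge", "overlay", "192.0.2.20"),
    ProbeTarget("cloud-region-edge", "overlay", "192.0.2.30"),
    ProbeTarget("isp-gateway", "underlay", "198.51.100.1"),
]


def missing_layers(targets):
    """Return the layers that have no probe target at all."""
    return {"underlay", "overlay"} - {t.layer for t in targets}
```

A check like `missing_layers(TARGETS)` returning an empty set is a cheap way to catch a branch template that only probes one layer.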

Application classes: Don’t use a single SLA threshold set

A common mistake: applying the same thresholds to all traffic.

Example (enterprise practice):

  • Voice/Video: sensitive to jitter and loss (even small degradation hurts)
  • ERP/Interactive: sensitive to latency
  • Bulk/Backup: tolerates latency, jitter, and modest loss, but cares about sustained throughput

So your “application policy” should include these components:

  • DSCP class
  • SLA thresholds (latency/jitter/loss)
  • Failover behaviour (fast switch or sticky?)
  • Recovery behaviour (hysteresis to prevent flapping)
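
As a sketch of what such a policy could look like as data, the example below bundles the four components per class. The `AppPolicy` type and every threshold value are illustrative assumptions of mine, not recommendations; the DSCP values are the standard code points for EF (46), AF31 (26), and AF11 (10), but which class you map to which code point is a local decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppPolicy:
    name: str
    dscp: int                 # DSCP code point used to classify the traffic
    max_latency_ms: float     # SLA thresholds (illustrative values only)
    max_jitter_ms: float
    max_loss_pct: float
    fast_failover: bool       # True = switch quickly, False = sticky
    recovery_good_count: int  # consecutive good probes before switching back


POLICIES = {
    "voice": AppPolicy("voice", 46, 150.0, 30.0, 1.0, True, 10),
    "erp": AppPolicy("erp", 26, 100.0, 50.0, 2.0, True, 10),
    "bulk": AppPolicy("bulk", 10, 500.0, 200.0, 5.0, False, 10),
}


def violations(policy, latency_ms, jitter_ms, loss_pct):
    """Return the names of the metrics that breach the policy's thresholds."""
    checks = [
        ("latency", latency_ms, policy.max_latency_ms),
        ("jitter", jitter_ms, policy.max_jitter_ms),
        ("loss", loss_pct, policy.max_loss_pct),
    ]
    return [name for name, value, limit in checks if value > limit]
```

The same measurement can be a violation for one class and acceptable for another: with these placeholder numbers, 45 ms of jitter breaches the voice policy but not the bulk one.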

Flap management: Hysteresis and hold-down are mandatory

If the path-selection engine “ping-pongs” while the SLA degrades and recovers, the user experiences it as “the internet keeps cutting out.”

Minimums I recommend:

  • Degrade threshold: e.g. 3 consecutive bad measurements
  • Recovery threshold: e.g. 10 consecutive good measurements
  • Hold-down: stay on a path for X minutes after a switch

This trio prevents “decision chaos” during an incident.
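
Here is a minimal sketch of how the trio can interact, assuming a simplified two-path world (one primary, one backup that is always usable). The `PathSelector` class and its parameter names are hypothetical; real SD-WAN engines expose these knobs under their own names.

```python
class PathSelector:
    """Hysteresis plus hold-down for a primary/backup pair (illustrative)."""

    def __init__(self, degrade_after=3, recover_after=10, hold_down_s=300.0):
        self.degrade_after = degrade_after
        self.recover_after = recover_after
        self.hold_down_s = hold_down_s
        self.active = "primary"
        self.primary_healthy = True
        self._bad_streak = 0
        self._good_streak = 0
        self._last_switch = None  # time of the last switch, None = never

    def observe(self, now_s, primary_ok):
        """Feed one probe verdict for the primary path; return the active path."""
        if primary_ok:
            self._good_streak += 1
            self._bad_streak = 0
        else:
            self._bad_streak += 1
            self._good_streak = 0

        # Hysteresis: degrading takes few bad probes, recovering takes many good ones.
        if self.primary_healthy and self._bad_streak >= self.degrade_after:
            self.primary_healthy = False
        elif not self.primary_healthy and self._good_streak >= self.recover_after:
            self.primary_healthy = True

        wanted = "primary" if self.primary_healthy else "backup"
        # Hold-down: no further switch until hold_down_s has passed since the last one.
        held = (self._last_switch is not None
                and now_s - self._last_switch < self.hold_down_s)
        if wanted != self.active and not held:
            self.active = wanted
            self._last_switch = now_s
        return self.active
```

Note the trade-off in this sketch: the hold-down applies in both directions, so a path that degrades again right after a switch-back is tolerated until the timer expires. Whether that is acceptable depends on the application class.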

Operations: Triage runbook (classify within 15 minutes)

When degradation begins, the first goal is not “root cause”; it’s to classify the degradation.

1) Which class is affected?

  • Only voice/video?
  • Only ERP?
  • All traffic?

The affected class usually hints at the likely cause: jitter suggests bufferbloat or queueing, loss suggests a physical or ISP problem, and added latency suggests a route change.

2) What do the probe results say?

  • Underlay good, overlay bad -> look at the hub/DC side
  • Underlay bad, overlay bad -> ISP / last mile
  • Underlay good, overlay good but the user is complaining -> internal LAN/Wi-Fi/endpoint
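
The three rules above fit in a small decision function. This is a sketch with a hypothetical name, `classify_degradation`; the fourth branch (underlay bad, overlay good) is not in the list above and is my own addition to keep the function total.

```python
def classify_degradation(underlay_ok, overlay_ok, users_complaining):
    """Map probe verdicts to the area to investigate first."""
    if underlay_ok and not overlay_ok:
        return "hub/DC side"
    if not underlay_ok and not overlay_ok:
        return "ISP / last mile"
    if underlay_ok and overlay_ok:
        if users_complaining:
            return "internal LAN/Wi-Fi/endpoint"
        return "no degradation detected"
    # Assumption: underlay bad but overlay good is contradictory, so the
    # probe targets themselves deserve a second look.
    return "inconsistent result: check probe targets"
```

Encoding the table this way also makes it easy to put the same first answer into the alert text that reaches the on-call engineer.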

3) Make the failover decision deliberately

  • Even if “auto failover” is enabled, you may need a manual “freeze” during a widespread incident
  • During a large ISP incident, all branches switching to the second circuit at once can saturate that circuit too
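
A quick capacity check before a mass switch could look like the sketch below. It assumes the branches share one capacity figure on the second circuit (for example a hub-side uplink), which may not match your topology; the function name `plan_failover` and the Mbps numbers are hypothetical.

```python
def plan_failover(shared_capacity_mbps, demand_by_branch):
    """Admit branches in the given order while capacity remains.

    demand_by_branch: dict of branch name -> expected demand in Mbps,
    ordered by priority (dicts keep insertion order).
    Returns (allowed, frozen) lists of branch names.
    """
    allowed, frozen = [], []
    remaining = shared_capacity_mbps
    for branch, demand in demand_by_branch.items():
        if demand <= remaining:
            allowed.append(branch)
            remaining -= demand
        else:
            # Freeze this branch on its current path instead of saturating
            # the second circuit for everyone.
            frozen.append(branch)
    return allowed, frozen


allowed, frozen = plan_failover(100, {"a": 40, "b": 50, "c": 30, "d": 10})
```

With these made-up numbers, branch "c" stays frozen while "a", "b", and "d" move. The order of the dictionary is the priority order, so the decision about who moves first stays explicit.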

Observability: Aggregate SD-WAN telemetry in one place

Don’t leave SLA probe output trapped in “the controller’s screen.” Centrally monitor these:

  • Per-branch latency/jitter/loss trend
  • Path change events
  • Preferred path per application class

Regardless of vendor, this data raises the quality of your incident postmortems.
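
As a sketch of what “one place” can mean in practice, path change events from any controller can be normalized into one record shape and then counted per branch. The `PathChange` fields and the `flapping_branches` helper are assumptions of mine, not a vendor schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PathChange:
    ts: float        # event time, seconds since epoch
    branch: str
    app_class: str
    old_path: str
    new_path: str


def flapping_branches(events, since_ts, max_changes):
    """Return branches with more than max_changes path changes since since_ts."""
    counts = Counter(e.branch for e in events if e.ts >= since_ts)
    return sorted(b for b, n in counts.items() if n > max_changes)


events = [
    PathChange(100.0, "branch-01", "voice", "isp-a", "isp-b"),
    PathChange(160.0, "branch-01", "voice", "isp-b", "isp-a"),
    PathChange(220.0, "branch-01", "voice", "isp-a", "isp-b"),
    PathChange(130.0, "branch-02", "erp", "isp-a", "isp-b"),
]
```

A query like `flapping_branches(events, since_ts=0.0, max_changes=2)` is the kind of question that is awkward to answer from a controller screen and trivial once the events are in one store.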

Final word

The “intelligence” of SD-WAN doesn’t render your operational reflexes unnecessary. With SLA probes and the right target/threshold design, path selection becomes truly application-centric, incident triage drops from hours to minutes, and failover decisions become more deliberate.
