Mustafa Erbay
Technology · 11 min read

DSCP and QoS on the WAN: End-to-End Prioritization

A guide to running QoS not as a magic wand but as an operational discipline managed with end-to-end measurement and a real trust boundary.

Once a WAN bottleneck appears, the first reflex is usually “let’s enable QoS so that critical traffic survives”. Configured correctly, QoS really does save the day; configured wrong, it hides the problem, makes diagnosis harder and misleads the team.

I’ll skip the long theory and write the way I run it in the field: DSCP marking + trust boundary + queueing policy + measurement.

1) Define classes first (by flow, not by team)

Define QoS classes by flow behavior, not by “department” or “application name”:

  • Interactive: low latency/jitter (VoIP, VDI, terminal)
  • Transactional: short bursts, error-sensitive (API, payment, login)
  • Bulk: large transfers, latency-tolerant (backup, artifact, replication)
  • Control-plane: routing, keepalive, monitoring

My recommendation: don’t go past 3–5 classes early on. Many classes plus many exceptions equals operational debt.

2) DSCP: not a “label”, a contract

Think of DSCP this way:

  • You mark at the edge (marking)
  • You preserve it through the network (trust boundary)
  • You convert it into behavior at the bottleneck (queueing/shaping)
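
On a host you control, marking can be as small as setting the IP TOS byte on a socket. The sketch below assumes a platform that exposes the `IP_TOS` socket option (Linux does) and uses EF (46) only as an example value; the helper names are illustrative. The network edge may still rewrite whatever the host sets.

```python
import socket

EF = 46  # Expedited Forwarding code point, used here only as an example


def dscp_to_tos(dscp: int) -> int:
    """DSCP is the upper 6 bits of the TOS byte; the lower 2 bits carry ECN."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Return a UDP socket whose outgoing IPv4 packets request the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

The shift by two bits is the part people most often get wrong: DSCP 46 is TOS byte 184 (0xB8), not 46.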

When you choose a DSCP class, write a short “contract” document:

  • The DSCP value
  • Which flows it covers
  • Where it gets marked (ingress)
  • Where it gets rewritten (remark)
  • Where it is “trusted”
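
One way to keep such a contract reviewable is to store it as data. This is a minimal sketch: the DSCP values (EF, AF21, AF11, CS6) are common choices I am assuming here, not values from a specific deployment, and the location strings are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DscpContract:
    name: str
    dscp: int                # the DSCP value
    flows: tuple[str, ...]   # which flows it covers
    marked_at: str           # where it gets marked (ingress)
    remarked_at: str         # where it gets rewritten
    trusted_at: str          # where it is trusted


# Assumed example values; replace with your own contract.
CONTRACTS = (
    DscpContract("interactive", 46, ("VoIP", "VDI", "terminal"),
                 "access ingress", "SD-WAN edge", "WAN core"),
    DscpContract("transactional", 18, ("API", "payment", "login"),
                 "access ingress", "SD-WAN edge", "WAN core"),
    DscpContract("bulk", 10, ("backup", "artifact", "replication"),
                 "access ingress", "SD-WAN edge", "WAN core"),
    DscpContract("control-plane", 48, ("routing", "keepalive", "monitoring"),
                 "network devices", "SD-WAN edge", "WAN core"),
)


def validate(contracts) -> dict[int, str]:
    """Reject out-of-range or duplicated DSCP values; return dscp -> class name."""
    seen: dict[int, str] = {}
    for c in contracts:
        if not 0 <= c.dscp <= 63:
            raise ValueError(f"{c.name}: DSCP {c.dscp} out of range")
        if c.dscp in seen:
            raise ValueError(f"{c.name} and {seen[c.dscp]} share DSCP {c.dscp}")
        seen[c.dscp] = c.name
    return seen
```

Running `validate(CONTRACTS)` in CI keeps two classes from silently claiming the same code point.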

No QoS without a trust boundary

If every device “trusts” DSCP, a single mis-marking service can hold the entire WAN hostage.

A practical rule:

  • On the access/host side do not trust DSCP — rewrite it
  • Put the trust boundary on the DC edge / SD-WAN edge
  • On the WAN core, work with the “trusted classes”
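
The rule above can be sketched as a small decision function. The port roles, the set of trusted values and the fallback to best effort (DSCP 0) are assumptions for illustration; a real device expresses this in its own policy language.

```python
TRUSTED_DSCP = {46, 18, 10, 48}  # assumed contract values
BEST_EFFORT = 0


def dscp_after_ingress(port_role: str, incoming_dscp: int, classified_dscp: int) -> int:
    """Decide which DSCP a packet carries after ingress.

    port_role       -- "access" (untrusted), "edge" (trust boundary) or "core"
    incoming_dscp   -- the value the sender put in the packet
    classified_dscp -- the value the network's own classifier would assign
    """
    if port_role == "access":
        # Never trust host markings: always rewrite from the classifier.
        return classified_dscp
    if incoming_dscp in TRUSTED_DSCP:
        # Edge and core keep only values that exist in the contract.
        return incoming_dscp
    return BEST_EFFORT
```

With this shape, a host that marks everything as 46 gets whatever the classifier decides, and an unknown value arriving at the edge falls back to best effort instead of landing in a priority queue.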

3) Map the bottlenecks first

Applying QoS only on the WAN link is often not enough. Bottlenecks also live in:

  • The internet egress
  • VPN/IPsec tunnels (encryption CPU, MTU, fragmentation)
  • Cloud interconnect (rate limit / policing)
  • Firewall/NGFW throughput

Which is why a “QoS rollout” should start with a bottleneck map:

  • Link capacity
  • Real usage (p95)
  • Drop reason (queue tail drop or policer?)
  • MTU/fragment indicators
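
For the "real usage (p95)" entry, a nearest-rank percentile over interface samples is usually enough. A minimal sketch, assuming the samples are throughput readings in Mbps collected at a fixed interval; the function names are illustrative.

```python
def p95(samples) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def utilization_p95(samples_mbps, capacity_mbps: float) -> float:
    """p95 throughput as a fraction of link capacity."""
    return p95(samples_mbps) / capacity_mbps
```

A link whose p95 utilization sits near 1.0 is a capacity problem that QoS can only soften, which is worth knowing before touching any policy.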

4) Queueing policy: not “priority” but “fair share”

Two common mistakes:

  1. Marking everything as “high priority”
  2. Giving high priority unlimited bandwidth

My approach:

  • Give the interactive class a small, guaranteed and capped share on a low-latency queue
  • Give the transactional class guaranteed + burst
  • Give the bulk class the leftover bandwidth and aggressive shaping
  • Give control-plane a small but untouchable slice
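
The approach above can be sketched as a sanity check on the bandwidth plan. The percentages and the 33% ceiling on priority traffic are assumptions chosen for illustration, not recommendations from this article.

```python
# Assumed example policy: percent of link guaranteed per class.
POLICY = {
    "interactive": {"guaranteed_pct": 10, "priority": True},
    "transactional": {"guaranteed_pct": 30, "priority": False},
    "control": {"guaranteed_pct": 5, "priority": False},
}


def plan_queues(link_mbps: float, policy: dict, max_priority_pct: int = 33) -> dict:
    """Return Mbps per class; bulk receives whatever is not reserved."""
    reserved = sum(c["guaranteed_pct"] for c in policy.values())
    if reserved >= 100:
        raise ValueError("nothing left for bulk: reserved %d%%" % reserved)
    priority = sum(c["guaranteed_pct"] for c in policy.values() if c["priority"])
    if priority > max_priority_pct:
        # Guards against mistake 2: a priority queue with no upper bound.
        raise ValueError("priority classes exceed the allowed ceiling")
    plan = {name: link_mbps * c["guaranteed_pct"] / 100 for name, c in policy.items()}
    plan["bulk"] = link_mbps * (100 - reserved) / 100
    return plan
```

On a 100 Mbps link this example policy leaves 55 Mbps for bulk, which slows it down under contention without starving it.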

5) MTU and tunnels: the quiet killer of QoS

If QoS looks correct on the WAN but users still complain, my checklist:

  • After IPsec overhead, what is the effective MTU?
  • Is MSS clamping in place?
  • Are fragments/drops climbing?
  • Is DSCP being lost at the tunnel (remarked, or not copied to the outer header, during encapsulation)?
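
The first two questions are plain arithmetic. A minimal sketch for IPv4 and TCP without options; the tunnel overhead is a parameter because real IPsec overhead depends on mode, cipher, padding and NAT traversal, and the 73 bytes in the usage note is only a placeholder.

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def effective_mtu(link_mtu: int, tunnel_overhead: int) -> int:
    """Largest inner packet that fits without fragmentation."""
    return link_mtu - tunnel_overhead


def clamped_mss(link_mtu: int, tunnel_overhead: int) -> int:
    """TCP MSS to clamp to so that full-size segments fit inside the tunnel."""
    return effective_mtu(link_mtu, tunnel_overhead) - IPV4_HEADER - TCP_HEADER
```

With a 1500-byte link MTU and an assumed 73 bytes of overhead, `effective_mtu(1500, 73)` is 1427 and `clamped_mss(1500, 73)` is 1387. Measure the actual overhead on your own tunnels before fixing a value.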

Prove that DSCP is carried across tunnel ingress/egress with a short packet capture (inner + outer header).
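
Once you have the two headers out of the capture, the comparison itself is trivial. A sketch assuming raw IPv4 header bytes: the inner header is taken on the clear-text side before encapsulation, the outer header on the WAN side. The function names are illustrative.

```python
def dscp_of(ipv4_header: bytes) -> int:
    """Extract DSCP from the second byte of an IPv4 header."""
    if len(ipv4_header) < 20 or ipv4_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    return ipv4_header[1] >> 2  # drop the two ECN bits


def dscp_preserved(outer_header: bytes, inner_header: bytes) -> bool:
    """True when the tunnel copied the inner DSCP to the outer header."""
    return dscp_of(outer_header) == dscp_of(inner_header)
```

If `dscp_preserved` is false, the queues on the WAN path see the outer value, so the classes you defined never reach the bottleneck.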

6) Rollout plan (operationally safe)

Do not enable this globally in one shot; roll it out ring by ring:

  1. Define the classes (document + ownership)
  2. Mark at the edge (passive: mark only, no queueing yet)
  3. Measure: DSCP distribution, mis-marked flows
  4. At egress, enable queueing/shaping (canary site)
  5. Set SLOs: p95 latency/jitter targets
  6. Runbook: rollback (single command / single policy)
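
Step 3 can be sketched from exported flow records. The record shape `(dst_port, dscp, bytes)` and the port-to-DSCP expectations are hypothetical; in practice they would come from your flow exporter and your contract document.

```python
from collections import Counter

# Hypothetical expectations: destination port -> DSCP the contract assigns.
EXPECTED_DSCP_BY_PORT = {5060: 46, 443: 18, 873: 10}


def dscp_distribution(flows) -> dict:
    """Share of total bytes per DSCP value."""
    totals = Counter()
    for _port, dscp, nbytes in flows:
        totals[dscp] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {dscp: nbytes / grand_total for dscp, nbytes in totals.items()}


def mismarked(flows) -> list:
    """Flows whose DSCP differs from what the contract expects for their port."""
    return [
        flow for flow in flows
        if flow[0] in EXPECTED_DSCP_BY_PORT and flow[1] != EXPECTED_DSCP_BY_PORT[flow[0]]
    ]
```

Running this while marking is still passive shows who would have landed in the wrong queue before any queue exists.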

7) Success criteria (measurable)

What I count as a successful QoS rollout:

  • During a bottleneck, interactive p95/jitter is preserved
  • The transactional error rate does not climb (timeouts/5xx)
  • Bulk transfers slow down but do not get “killed”
  • There is an alert for QoS misclassification (anomalies)
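
The misclassification alert can start as a comparison of per-class traffic share against a baseline. A minimal sketch; the 10-point tolerance and the class names are assumptions to tune against your own traffic.

```python
def share_anomalies(baseline: dict, current: dict, tolerance: float = 0.10) -> dict:
    """Return classes whose share of traffic moved more than `tolerance`.

    Both inputs map class name -> fraction of total bytes (0.0 to 1.0).
    The result maps class name -> signed change in share.
    """
    alerts = {}
    for name in baseline.keys() | current.keys():
        delta = current.get(name, 0.0) - baseline.get(name, 0.0)
        if abs(delta) > tolerance:
            alerts[name] = delta
    return alerts
```

An interactive class that jumps from 10% to 35% of traffic is more likely a mis-marking service than a sudden rise in phone calls, and this check flags it.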

Closing: don’t let QoS postpone the capacity conversation

QoS is a good seatbelt, but it is not a brake. If you keep “rescuing” things with QoS, it is time to put capacity planning, path diversification or application-level degrade/load shedding on the agenda.

If I had to compress this article into one sentence: manage QoS not as a “policy” but as an “operational contract”.
