Once a WAN bottleneck appears, the first reflex is usually “let’s enable QoS so that critical traffic survives”. Configured correctly, QoS really does save the day; configured wrong, it hides the problem, makes diagnosis harder and misleads the team.
I’ll skip the long theory and write the way I run it in the field: DSCP marking + trust boundary + queueing policy + measurement.
1) Define classes first (by flow, not by team)
QoS classes should not be defined by “department” or “application name” — they should be defined by flow behavior:
- Interactive: low latency/jitter (VoIP, VDI, terminal)
- Transactional: short bursts, error-sensitive (API, payment, login)
- Bulk: large transfers, latency-tolerant (backup, artifact, replication)
- Control-plane: routing, keepalive, monitoring
My recommendation: don’t go past 3–5 classes early on. Many classes plus many exceptions equals operational debt.
2) DSCP: not a “label”, a contract
Think of DSCP this way:
- You mark at the edge (marking)
- You preserve it through the network (trust boundary)
- You convert it into behavior at the bottleneck (queueing/shaping)
When you choose a DSCP class, write a short “contract” document:
- The DSCP value
- Which flows it covers
- Where it gets marked (ingress)
- Where it gets rewritten (remark)
- Where it is “trusted”
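To make the contract concrete, here is a minimal sketch of how I might keep it as data rather than as a wiki page. The DSCP code points (EF, AF21, AF11, CS6) are the standard values; the class-to-DSCP mapping, the location names and the `DscpContract` / `contract_for` names are hypothetical and only there for illustration.

```python
from dataclasses import dataclass

# Standard DSCP code points (RFC 2474 / RFC 2597 / RFC 3246).
EF, AF21, AF11, CS6 = 46, 18, 10, 48


@dataclass(frozen=True)
class DscpContract:
    name: str
    dscp: int          # the DSCP value (0-63)
    flows: tuple       # which flows it covers
    marked_at: str     # where it gets marked (ingress)
    remarked_at: str   # where it gets rewritten (remark)
    trusted_at: str    # where it is trusted

    def __post_init__(self):
        # DSCP is a 6-bit field, so anything outside 0-63 is a typo.
        if not 0 <= self.dscp <= 63:
            raise ValueError(f"DSCP must fit in 6 bits, got {self.dscp}")


# Assumption: one possible mapping of the four classes; yours may differ.
CONTRACTS = [
    DscpContract("interactive", EF, ("voip", "vdi"), "access-switch", "dc-edge", "wan-core"),
    DscpContract("transactional", AF21, ("api", "login"), "access-switch", "dc-edge", "wan-core"),
    DscpContract("bulk", AF11, ("backup",), "access-switch", "dc-edge", "wan-core"),
    DscpContract("control", CS6, ("routing",), "router", "dc-edge", "wan-core"),
]


def contract_for(dscp):
    """Return the contract that owns a DSCP value, or None if nobody does."""
    return next((c for c in CONTRACTS if c.dscp == dscp), None)
```

A DSCP value that shows up on the wire but returns `None` here has no owner, which is usually the first thing worth investigating.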
No QoS without a trust boundary
If every device “trusts” DSCP, a single mis-marking service can hold the entire WAN hostage.
A practical rule:
- On the access/host side do not trust DSCP — rewrite it
- Put the trust boundary on the DC edge / SD-WAN edge
- On the WAN core, work with the “trusted classes”
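The rule above can be sketched as a single decision function. The zone names and the allowed DSCP set are assumptions, and resetting unknown values to best effort inside the trusted zone is my own choice, not something every platform does by default.

```python
# Hypothetical zone names; anything not listed is treated as untrusted.
TRUSTED_ZONES = {"dc-edge", "sdwan-edge", "wan-core"}

# Assumption: the DSCP values your contract document actually defines.
ALLOWED_DSCP = {46, 18, 10, 48}
BEST_EFFORT = 0


def effective_dscp(zone, received_dscp, classified_dscp):
    """Decide which DSCP value a device should put on a packet.

    received_dscp   -- what the packet arrived with
    classified_dscp -- what our own classifier thinks the flow deserves
    """
    if zone not in TRUSTED_ZONES:
        # Access/host side: ignore the sender's claim and rewrite.
        return classified_dscp
    # Inside the trust boundary: keep known classes, reset everything else.
    return received_dscp if received_dscp in ALLOWED_DSCP else BEST_EFFORT
```

For example, a host that marks its backup traffic as EF gets rewritten at the access layer, so `effective_dscp("access", 46, 10)` returns 10.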
3) Where is the bottleneck? (Hint: it’s not just the WAN link)
Applying QoS only on the WAN link is often not enough. Bottlenecks also live in:
- The internet egress
- VPN/IPsec tunnels (encryption CPU, MTU, fragmentation)
- Cloud interconnect (rate limit / policing)
- Firewall/NGFW throughput
Which is why a “QoS rollout” should start with a bottleneck map:
- Link capacity
- Real usage (p95)
- Drop reason (queue tail drop or policer?)
- MTU/fragment indicators
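One row of such a bottleneck map could be computed roughly like this. It is a sketch under stated assumptions: the counters are passed in as plain numbers (how you collect them depends on your platform), p95 uses the nearest-rank method, and the function and field names are hypothetical.

```python
def p95(samples):
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def bottleneck_row(name, capacity_mbps, usage_mbps, tail_drops, policer_drops, fragments):
    """Summarise one link: capacity, real usage (p95), drop reason, MTU hint."""
    usage_p95 = p95(usage_mbps)
    if tail_drops == 0 and policer_drops == 0:
        reason = "none"
    elif policer_drops > tail_drops:
        reason = "policer"
    else:
        reason = "queue tail drop"
    return {
        "link": name,
        "capacity_mbps": capacity_mbps,
        "usage_p95_mbps": usage_p95,
        "utilisation": usage_p95 / capacity_mbps,
        "drop_reason": reason,
        "fragmenting": fragments > 0,
    }
```

The point of separating the drop reason is that a policer drop and a tail drop call for different fixes: the first is a contract or rate-limit problem, the second is a queueing problem.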
4) Queueing policy: not “priority” but “fair share”
Two common mistakes:
- Marking everything as “high priority”
- Giving high priority unlimited bandwidth
My approach:
- Give the interactive class a small, guaranteed but capped share of bandwidth on a low-latency queue
- Give the transactional class a guaranteed share plus room to burst
- Give the bulk class whatever bandwidth is left over, and shape it aggressively
- Give control-plane a small but untouchable slice
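To see how these four rules interact, here is a toy allocation model. It is not how any particular vendor's scheduler works: the percentages, the small floor for bulk, and the order in which leftover bandwidth is handed out are all assumptions I picked for the example.

```python
# (class, guaranteed percent of the link, may borrow leftover bandwidth)
# Assumption: illustrative numbers, not a recommendation.
POLICY = [
    ("control", 5, False),         # small but untouchable slice
    ("interactive", 15, False),    # low-latency queue, capped at its guarantee
    ("transactional", 40, True),   # guaranteed + burst
    ("bulk", 5, True),             # tiny floor, otherwise lives on leftovers
]


def allocate(link_kbps, demand_kbps):
    """Split a link between classes: guarantees first, then leftovers."""
    alloc = {}
    for name, pct, _ in POLICY:
        guaranteed = link_kbps * pct // 100
        alloc[name] = min(demand_kbps.get(name, 0), guaranteed)

    leftover = link_kbps - sum(alloc.values())
    for name, _, may_borrow in POLICY:
        if not may_borrow:
            continue  # capped classes never grow past their guarantee
        extra = min(leftover, demand_kbps.get(name, 0) - alloc[name])
        alloc[name] += extra
        leftover -= extra
    return alloc
```

With a 10 Mbit/s link and demands of 100 / 3000 / 3000 / 20000 kbit/s for control, interactive, transactional and bulk, the interactive class is held at 1500 kbit/s, transactional gets its full 3000, and bulk ends up with 5400: slowed down, but not starved.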
5) MTU and tunnels: the quiet killer of QoS
If QoS looks correct on the WAN but users still complain, my checklist:
- After IPsec overhead, what is the effective MTU?
- Is MSS clamping in place?
- Are fragments/drops climbing?
- Is DSCP being lost inside the tunnel (reset or re-marked at encapsulation)?
Prove that DSCP is carried across tunnel ingress/egress with a short packet capture (inner + outer header).
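Two of these checks are simple enough to sketch. The first is the MSS arithmetic; the tunnel overhead is a parameter because it depends on cipher and mode, so the 60 bytes in the usage note is a placeholder, not a measured value. The second reads the DSCP out of raw IPv4 header bytes taken from a capture, so that outer and inner headers can be compared. Function names are hypothetical, and the sketch assumes IPv4 only.

```python
def tcp_mss(link_mtu, tunnel_overhead, ip_header=20, tcp_header=20):
    """Largest TCP payload that fits without fragmentation (IPv4, no options)."""
    return link_mtu - tunnel_overhead - ip_header - tcp_header


def dscp_from_ipv4(header: bytes) -> int:
    """Extract DSCP from the first bytes of an IPv4 header.

    Byte 1 is the DS field: the upper 6 bits are DSCP, the lower 2 are ECN.
    """
    if len(header) < 20 or header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    return header[1] >> 2


def dscp_preserved(outer_header: bytes, inner_header: bytes) -> bool:
    """True if the tunnel copied the inner DSCP to the outer header."""
    return dscp_from_ipv4(outer_header) == dscp_from_ipv4(inner_header)
```

For example, `tcp_mss(1500, 60)` gives 1400, which is the value you would clamp to if your measured overhead really were 60 bytes.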
6) Rollout plan (operationally safe)
This is not “global enable in one shot” — it has to be done ring by ring:
- Define the classes (document + ownership)
- Enable marking at the edge (passive mode: mark only, no queueing yet)
- Measure: DSCP distribution, mis-marked flows
- At egress, enable queueing/shaping (canary site)
- Set SLOs: p95 latency and jitter targets
- Runbook: rollback (single command / single policy)
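The measurement step in the passive phase can be sketched as follows. I assume flow records are already available as dictionaries with `app`, `dscp` and `bytes` keys (for example exported from a flow collector); that record shape and the `EXPECTED` mapping are hypothetical.

```python
from collections import Counter

# Assumption: which DSCP each application is supposed to carry.
EXPECTED = {"voip": 46, "api": 18, "backup": 10}


def dscp_distribution(flows):
    """Share of total bytes carried by each DSCP value."""
    totals = Counter()
    for flow in flows:
        totals[flow["dscp"]] += flow["bytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {dscp: count / grand_total for dscp, count in totals.items()}


def mismarked(flows):
    """Flows whose DSCP differs from the contract (unknown apps expect 0)."""
    return [f for f in flows if EXPECTED.get(f["app"], 0) != f["dscp"]]
```

If EF suddenly carries 40% of the bytes on a site with a handful of phones, the distribution alone tells you something is mis-marked before any queue is enabled.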
7) Success criteria (measurable)
The picture I call “QoS successful”:
- During a bottleneck, interactive p95 latency and jitter stay within their SLO targets
- The transactional error rate does not climb (timeouts/5xx)
- Bulk transfers slow down but do not get “killed”
- There is an alert for QoS misclassification (anomalies)
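The first criterion can be turned into a check along these lines. Jitter is computed here as the mean absolute difference between consecutive latency samples, which is one common simplification and not the only definition; the function names and targets are placeholders.

```python
def p95(samples):
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    return ordered[(95 * len(ordered) + 99) // 100 - 1]


def mean_jitter(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)


def interactive_ok(latencies_ms, p95_target_ms, jitter_target_ms):
    """True if samples taken during congestion still meet both targets."""
    return (p95(latencies_ms) <= p95_target_ms
            and mean_jitter(latencies_ms) <= jitter_target_ms)
```

The samples should come from a window in which the link was actually congested; the same check on an idle link proves nothing about QoS.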
Closing: don’t let QoS postpone the capacity conversation
QoS is a good seatbelt, but it is not a brake. If you keep “rescuing” things with QoS, it is time to put capacity planning, path diversification, or application-level degradation and load shedding on the agenda.
If I had to compress this article into one sentence: manage QoS not as a “policy” but as an “operational contract”.