One of the hardest classes of production incident to diagnose is this: the system looks generally up, yet it works for some users and not for others. Typical symptoms:
- TLS handshakes hang for some clients,
- API calls succeed with a small payload but time out with a large one,
- the same endpoint behaves well from some networks and poorly from others.
This picture is often read as “an application bug.” Yet a frequent root cause is an MTU/PMTUD blackhole.
Concept: what does PMTUD solve?
Path MTU Discovery (PMTUD) finds the largest packet size that can cross the path between two endpoints without fragmentation (the path MTU, where MTU stands for maximum transmission unit). It typically breaks like this:
- The endpoint sends a large packet with the DF (Don’t Fragment) bit set
- A device along the path can’t forward the packet (a smaller MTU is required)
- The device should respond with an ICMP error: “fragmentation needed” (ICMPv4 type 3, code 4) or, for IPv6, “Packet Too Big” (ICMPv6 type 2)
- If that ICMP is blocked, the sender never learns it must send smaller packets → large packets are silently dropped → blackhole
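The sequence above can be sketched as a toy model. This is only an illustration and sends no packets; the function name `pmtud_sim` and its arguments are made up for this example. It walks a list of per-hop MTUs and shows how the outcome changes depending on whether the ICMP error reaches the sender.

```shell
# Toy model of PMTUD (no network traffic).
# Usage: pmtud_sim <icmp_delivered: yes|no> <hop_mtu> [<hop_mtu> ...]
pmtud_sim() {
    icmp_delivered=$1; shift
    size=$1                          # sender starts at its own link MTU (first hop)
    while :; do
        bottleneck=""
        for mtu in "$@"; do          # find the first hop that cannot forward this size
            if [ "$mtu" -lt "$size" ]; then bottleneck=$mtu; break; fi
        done
        if [ -z "$bottleneck" ]; then
            echo "delivered size=$size"; return 0
        fi
        if [ "$icmp_delivered" = "yes" ]; then
            size=$bottleneck         # the ICMP error tells the sender the next-hop MTU
        else
            echo "blackhole size=$size"; return 1   # sender keeps retrying the same size
        fi
    done
}
```

For example, `pmtud_sim yes 1500 1400 1500` settles on 1400 bytes, while `pmtud_sim no 1500 1400 1500` stays stuck at 1500 bytes.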
The most common triggers
The transitions where this most often shows up in the field:
- IPsec/GRE tunnels (overlay header + crypto overhead)
- SD-WAN/MPLS edge transitions
- Cloud interconnect / transit gateway / firewall zone transitions
- Mixing jumbo-frame segments with 1500 MTU segments
- An “MSS clamp” applied on some paths but not others, leaving the fix half-done
Incident triage: a 15-minute fast diagnosis
Goal: quickly answer “is this an application issue, or path MTU/PMTUD?”
1) Tie the symptom to packet size
- If small payloads work and large payloads break, an MTU problem becomes more likely.
- HTTP/2 or gRPC issues can produce a similar pattern, so confirm that the failures actually track packet size.
2) Ping with DF (mind the Linux/macOS/Windows differences)
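Run a DF test toward the remote endpoint (or a test endpoint near the suspect hop). The flags differ by OS: Linux uses ping -M do -s <size>, macOS uses ping -D -s <size>, and Windows uses ping -f -l <size>. Example (Linux):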
ping -M do -s 1472 -c 3 <target-ip>
ping -M do -s 1400 -c 3 <target-ip>
- 1472 bytes of payload + 28 bytes of IPv4/ICMP headers = 1500 bytes, the standard Ethernet MTU.
- If the large packet fails and the small packet works, you have a “smaller MTU on the path” signal.
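To go from "1472 fails, 1400 works" to an actual number, the two manual probes can be turned into a binary search. The following is a minimal sketch, assuming Linux iputils ping and IPv4; the function names `probe_df` and `find_path_mtu` and the default search bounds are illustrative choices, not part of any tool.

```shell
# probe_df <payload_bytes> <target>: succeeds if a DF-flagged ping of that size gets a reply.
# Assumes Linux iputils flags; redefine this function for macOS/Windows or for testing.
probe_df() {
    ping -M do -s "$1" -c 1 -W 2 "$2" >/dev/null 2>&1
}

# find_path_mtu <target> [low_payload] [high_payload]
# Binary-searches the largest payload that passes, then adds 28 bytes (IPv4 + ICMP headers).
find_path_mtu() {
    target=$1; lo=${2:-1200}; hi=${3:-1472}
    if ! probe_df "$lo" "$target"; then
        echo "payload $lo already fails; lower the floor or check reachability" >&2
        return 1
    fi
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe_df "$mid" "$target"; then lo=$mid; else hi=$((mid - 1)); fi
    done
    echo $((lo + 28))
}
```

A single lost reply looks the same as a size failure here, so on a lossy path it is worth rerunning the search or raising the probe count before trusting the result.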
3) tracepath / traceroute for “pmtu” hints
tracepath <target-ip> | head -n 20
If the reported pmtu value drops below your local interface MTU somewhere along the path, or you see messages like “too big,” the case is even stronger.
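When the output is long, a small filter can pull out the lowest value reported. This sketch assumes the output contains tokens of the form "pmtu N", as iputils tracepath prints; the helper name `min_pmtu` is made up for this example.

```shell
# min_pmtu: reads tracepath output on stdin and prints the smallest "pmtu N" value seen.
min_pmtu() {
    grep -o 'pmtu [0-9][0-9]*' |
        awk '{ v = $2 + 0; if (min == "" || v < min) min = v }
             END { if (min != "") print min }'
}

# Usage (placeholder target):
#   tracepath -n <target-ip> | min_pmtu
```

If the printed value is lower than the MTU of the sending interface, some hop on the path is the bottleneck.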
4) On the host, look for the TCP signal: retransmissions and stalls
If you have access, take a short capture on the affected host:
sudo tcpdump -nn -i any -c 200 'host <target-ip> and tcp'
Look for:
- The same segment being sent over and over (retransmission)
- The TCP handshake (SYN/SYN-ACK/ACK) completes, but the connection stalls at the first full-size segment, for example during the TLS handshake (could be MSS/MTU breakage)
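Alongside the TCP capture, it can help to check whether the ICMP errors PMTUD depends on are reaching the host at all. The filter below is a sketch using standard pcap filter syntax; the variable and function names are illustrative, and the IPv6 part assumes no extension headers precede the ICMPv6 header.

```shell
# ICMPv4 type 3 code 4 = "fragmentation needed"; ICMPv6 type 2 = "packet too big".
# ip6[40] is the ICMPv6 type byte when no IPv6 extension headers are present.
PMTUD_FILTER='(icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4) or (icmp6 and ip6[40] == 2)'

# capture_pmtud_icmp [count]: show up to <count> matching packets (needs root).
capture_pmtud_icmp() {
    sudo tcpdump -nn -i any -c "${1:-20}" "$PMTUD_FILTER"
}
```

If large segments are being retransmitted and this capture stays empty, that is consistent with a blackhole. If the errors do arrive, look instead at why the host is not acting on them.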
Mitigation: fast, safe first moves
During an incident, the least risky intervention is usually one of these (depending on the environment):
1) TCP MSS clamping (temporary relief)
Applying MSS clamping at the tunnel or edge firewall can restore service while PMTUD is broken. But it isn’t a “set and forget” solution: it only helps TCP, and it can mask the root cause.
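As a rough sketch of what a clamp looks like on a Linux firewall using iptables: the MSS is the path MTU minus the IP and TCP base headers. The helper names below are made up for this example, and the rule is only printed for review, not applied.

```shell
# mss_for_mtu <path_mtu> [ipv4|ipv6]: MTU minus IP and TCP base headers (no options).
mss_for_mtu() {
    case "${2:-ipv4}" in
        ipv4) echo $(( $1 - 40 )) ;;   # 20 bytes IPv4 + 20 bytes TCP
        ipv6) echo $(( $1 - 60 )) ;;   # 40 bytes IPv6 + 20 bytes TCP
        *) echo "unknown family: $2" >&2; return 1 ;;
    esac
}

# print_mss_clamp_rule <path_mtu>: prints (does not run) an iptables rule for IPv4 traffic.
print_mss_clamp_rule() {
    echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$1")"
}
```

For a measured path MTU of 1400, `print_mss_clamp_rule 1400` prints a rule with `--set-mss 1360`. iptables also offers `--clamp-mss-to-pmtu` in place of a fixed value; which one fits depends on the environment.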
2) Set tunnel/overlay MTU correctly
Encapsulation such as IPsec or GRE adds headers to every packet, so the effective MTU for the inner traffic drops. Consider the tunnel interface MTU and endpoint PMTUD behavior together.
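The arithmetic is simple: tunnel MTU = underlay MTU minus encapsulation overhead. The sketch below uses basic GRE over IPv4 as the worked case (20-byte outer IP header plus 4-byte GRE header, with no key or sequence options). IPsec overhead varies with cipher, mode, and NAT traversal, so measure it instead of assuming a constant. The helper names and the interface name `gre1` are placeholders.

```shell
# tunnel_mtu <underlay_mtu> <overhead_bytes>
tunnel_mtu() { echo $(( $1 - $2 )); }

# Basic GRE over IPv4: 20 bytes outer IPv4 header + 4 bytes GRE header.
GRE_IPV4_OVERHEAD=24

# print_set_mtu <ifname> <underlay_mtu> <overhead_bytes>: prints the command for review.
print_set_mtu() {
    echo "ip link set dev $1 mtu $(tunnel_mtu "$2" "$3")"
}
```

With a 1500-byte underlay, `print_set_mtu gre1 1500 "$GRE_IPV4_OVERHEAD"` prints `ip link set dev gre1 mtu 1476`.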
3) Allowlist the ICMP that PMTUD needs
The goal isn’t to fully open ICMP; it’s to let the types/codes required for PMTUD pass under control and log them.
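A narrow allowlist might look like the following on an iptables-based firewall. This is a sketch, not a drop-in policy: the function name, the default chain, and the log prefixes are assumptions for illustration, and the rules are printed for review instead of being applied.

```shell
# print_pmtud_allow_rules [chain]: prints log + accept rules for the PMTUD ICMP types only.
# ICMPv4 "fragmentation-needed" is type 3 code 4; ICMPv6 "packet-too-big" is type 2.
print_pmtud_allow_rules() {
    chain=${1:-FORWARD}
    cat <<EOF
iptables -A $chain -p icmp --icmp-type fragmentation-needed -j LOG --log-prefix "pmtud-v4: "
iptables -A $chain -p icmp --icmp-type fragmentation-needed -j ACCEPT
ip6tables -A $chain -p ipv6-icmp --icmpv6-type packet-too-big -j LOG --log-prefix "pmtud-v6: "
ip6tables -A $chain -p ipv6-icmp --icmpv6-type packet-too-big -j ACCEPT
EOF
}
```

Where these rules sit relative to existing drop rules matters, so review the ordering in the target chain before applying anything.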
Root-cause closure: lasting countermeasures
1) Add an MTU test to the change checklist
Especially for these changes:
- new tunnel/overlay
- new firewall transition
- moving to a different provider/POP
- jumbo-frame rollout
Standardize the “do large packets pass through?” test.
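One way to standardize that test is a small gate that reads "target expected-MTU" pairs and fails if any DF probe at the expected size does not get through. The sketch below assumes Linux iputils ping and IPv4; the names `df_probe` and `check_mtu_matrix` and the input format are illustrative.

```shell
# df_probe <payload_bytes> <target>: DF-flagged ping (Linux iputils flags assumed).
df_probe() { ping -M do -s "$1" -c 3 -W 2 "$2" >/dev/null 2>&1 </dev/null; }

# check_mtu_matrix: reads "<target> <expected_mtu>" lines on stdin.
# Prints PASS/FAIL per line and returns non-zero if any probe fails.
check_mtu_matrix() {
    failed=0
    while read -r target mtu; do
        case $target in ''|'#'*) continue ;; esac      # skip blanks and comments
        if df_probe "$((mtu - 28))" "$target"; then    # 28 = IPv4 + ICMP headers
            echo "PASS $target mtu=$mtu"
        else
            echo "FAIL $target mtu=$mtu"; failed=1
        fi
    done
    return "$failed"
}
```

Running it before and after a change, against the same list of endpoints, makes an MTU regression visible as a changed PASS/FAIL line instead of a user report.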
2) Observability: catch MTU-driven incidents with metrics
Good signals:
- A rise in TCP retransmission rate
- The TCP handshake completes but the TLS handshake on top of it doesn’t
- A clear latency + timeout uptick on specific paths/segments
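For the first signal, Linux exposes cumulative TCP counters in /proc/net/snmp. The sketch below computes retransmitted segments as a percentage of segments sent; the function name is made up for this example, and the counters are totals since boot, so a real alert would sample twice and compare the difference.

```shell
# tcp_retrans_pct [snmp_file]: RetransSegs / OutSegs as a percentage (cumulative counters).
# The first "Tcp:" line holds column names, the second holds the values.
tcp_retrans_pct() {
    awk '
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "Tcp:" {
            out = $(col["OutSegs"]) + 0; re = $(col["RetransSegs"]) + 0
            if (out > 0) printf "%.2f\n", 100 * re / out
        }
    ' "${1:-/proc/net/snmp}"
}
```

This is host-wide, so it works best as a trend line per host or per segment; a per-path breakdown needs flow-level data from elsewhere.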
3) Documentation: make “MTU facts” visible
In production, MTU is often an “assumption” no one owns. Write down the effective MTU per segment (especially after tunnel/encryption).
Conclusion
An MTU/PMTUD blackhole prolongs incidents by masquerading as an application bug. To diagnose it correctly, tie the symptom to packet size and narrow it down with quick DF tests; then back the temporary mitigation (MSS clamp) with a permanent fix (correct MTU, correct ICMP allowlist, change tests). Operationally, success comes less from a highly technical narrative than from a repeatable runbook and the discipline to close out the root cause.