When users complain about a “slow internet” inside the corporate network, the trail usually ends at DNS: the application is the same and the link is the same, but the resolver is far away, the cache is empty, and the upstream takes its time answering. Fixing DNS isn’t about “let’s stand up another server”; it’s about treating the resolver tier as a first-class architectural component.
In this post I walk through the approach that has held up in the field for me: edge resolver + Anycast + BGP control + health signaling.
1) Get the concept straight first: recursive, not authoritative
The target here is:
- User/service → internal recursive resolver (Unbound/BIND/PowerDNS Recursor, etc.)
- Resolver → (when needed) upstream (ISP, public resolver, internal authoritative, forwarder)
We usually see Anycast on the authoritative side, but putting the recursive resolver behind Anycast inside the enterprise measurably cuts the branch/POP latency and lifts cache effectiveness.
2) What you gain from Anycast (and what you might lose)
Gains
- Lower p95 DNS latency on the branch/POP side
- Cache running “closer” means a higher hit ratio
- When a resolver fails, traffic naturally drifts to another node (with the right signal)
Risks
- A bad BGP advertisement → traffic to the wrong place (the most expensive class of mistake)
- Cache warm-up wave (load on upstreams after a failover)
- “Nearest” is not always “best” (on some links the shortest AS-path can be the worst path)
3) Reference architecture: a two-tier resolver
The simple model I like to deploy at enterprises:
- Edge resolver pool (Anycast VIP)
  A small pool near the POP/branch: 2–3 nodes (VM/metal). Each announces the VIP via BGP.
- Core upstream
  Larger capacity at the data center/headquarters with richer policy: internal zones, split-horizon, record management, logging.
On the edge resolver side the goal isn’t “every policy lives here”; it’s a close cache plus a fast answer.
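To make the "close cache plus a fast answer" idea concrete, here is a minimal sketch of an edge cache sitting in front of a core lookup. It is not how Unbound or any real resolver is implemented; the class name EdgeCache, the upstream callable, and the injected clock are all hypothetical, chosen only so the hit/miss behavior is easy to see.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Tiny TTL cache in front of an upstream lookup (illustrative only)."""

    def __init__(
        self,
        upstream: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream  # hypothetical: returns (answer, ttl_seconds)
        self._clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expires_at)
        self.hits = 0
        self.misses = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._store.get(name)
        if entry is not None and entry[1] > now:
            self.hits += 1  # answered locally, no trip to the core tier
            return entry[0]
        # Miss or expired entry: ask the core tier and cache for the TTL.
        self.misses += 1
        answer, ttl = self._upstream(name)
        self._store[name] = (answer, now + ttl)
        return answer

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A freshly started node begins with an empty store, so every first lookup is a miss that lands on the core tier. That is the cache warm-up wave listed under the risks in section 2.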
4) BGP design: safe defaults
When you announce the Anycast resolver via BGP, the “minimum safe set” is:
- Prefix: only the resolver VIP /32 (IPv4) or /128 (IPv6)
- Scope: iBGP (same AS) + controlled eBGP (between POPs)
- Guardrails:
  - max-prefix limit
  - prefix-list + route-map so only that VIP gets advertised
  - where possible RPKI/ROA plus disciplined upstream filters
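The guardrails above live in the router or routing daemon configuration, but the same rules can be checked before a change is pushed. The sketch below is one way to do that, assuming you can export the list of prefixes a node intends to advertise. The function name audit_announcements, the example VIPs, and the default limit of 2 are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def audit_announcements(
    candidates: Iterable[str],
    allowed_vips: Iterable[str],
    max_prefixes: int = 2,  # assumption: one IPv4 VIP plus one IPv6 VIP
) -> List[str]:
    """Return a list of problems; an empty list means the set looks safe."""
    problems: List[str] = []
    allowed = [ipaddress.ip_network(v) for v in allowed_vips]

    # The allow-list itself must contain host routes only (/32 or /128).
    for net in allowed:
        if net.prefixlen != net.max_prefixlen:
            problems.append(f"{net} is not a host route (/32 or /128)")

    parsed = [ipaddress.ip_network(c, strict=False) for c in candidates]

    # Mirror of a max-prefix limit: too many routes is itself suspicious.
    if len(parsed) > max_prefixes:
        problems.append(f"max-prefix limit exceeded: {len(parsed)} > {max_prefixes}")

    # Mirror of a prefix-list: anything not explicitly allowed is rejected.
    allowed_set = set(allowed)
    for net in parsed:
        if net not in allowed_set:
            problems.append(f"{net} is not an allowed VIP")
    return problems
```

For example, audit_announcements(["10.53.0.53/32", "10.0.0.0/8"], ["10.53.0.53/32"]) flags the /8 as a leak. A check like this complements the router-side filters; it does not replace them.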
A recommendation: don’t anycast to the public internet on day one. Mature it first via internal-only routing.
5) Health signaling: tie “is DNS alive?” to routing
The critical piece in Anycast is this: when the service on a node breaks, if the route isn’t withdrawn, traffic keeps flowing into a “close but broken” node.
A practical health-check approach:
- A local health endpoint on the resolver, or a tiny script
- Verify the answers to these questions:
  - Does dig @127.0.0.1 example.com A answer with p95 < X ms?
  - Is an internal zone (e.g. corp.local) responding?
  - Are upstream timeouts trending up?
- If unhealthy:
  - Pull the BGP announcement automatically (BIRD/FRR + healthcheck integration)
  - Or lower preference via route-map (the riskier second option)
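The health-to-routing link can be sketched as a probe plus a small state machine with hysteresis, so a single lost packet does not flap the route. This is an illustration under assumptions, not a BIRD or FRR integration: AnnounceGate, probe_with_dig, and both thresholds are hypothetical, and acting on the returned "announce"/"withdraw" decision is left to whatever mechanism your routing daemon provides. The latency check (p95 < X ms) is omitted for brevity.

```python
import subprocess


def probe_with_dig(server: str = "127.0.0.1", name: str = "example.com",
                   timeout_s: int = 1) -> bool:
    """True if dig gets a NOERROR reply. Requires the dig binary on the node."""
    try:
        result = subprocess.run(
            ["dig", f"@{server}", name, "A", f"+time={timeout_s}", "+tries=1"],
            capture_output=True, text=True, timeout=timeout_s + 2,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "status: NOERROR" in result.stdout


class AnnounceGate:
    """Turns a stream of probe results into announce/withdraw decisions."""

    def __init__(self, fail_threshold: int = 3, rise_threshold: int = 5) -> None:
        self.fail_threshold = fail_threshold  # consecutive failures to withdraw
        self.rise_threshold = rise_threshold  # consecutive successes to re-announce
        self.announced = True
        self._fails = 0
        self._oks = 0

    def observe(self, healthy: bool) -> str:
        """Feed one probe result; returns 'announce', 'withdraw', or 'hold'."""
        if healthy:
            self._oks += 1
            self._fails = 0
            if not self.announced and self._oks >= self.rise_threshold:
                self.announced = True
                return "announce"
        else:
            self._fails += 1
            self._oks = 0
            if self.announced and self._fails >= self.fail_threshold:
                self.announced = False
                return "withdraw"
        return "hold"
```

In a loop you would call gate.observe(probe_with_dig()) every few seconds and act only on "announce" or "withdraw". Making the rise threshold higher than the fail threshold is a deliberate choice here: it is cheaper to keep a recovering node out of the pool a little longer than to send traffic back to a node that is still unstable.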
6) Operations: you can’t run Anycast without measuring
The minimum metric set I track on edge resolvers:
- DNS latency: p50/p95/p99 (UDP/TCP separately)
- RCODE distribution: NOERROR/NXDOMAIN/SERVFAIL/REFUSED
- Upstream timeout ratio
- Cache hit ratio + evictions
- QPS + concurrency
- Top queried names + top NXDOMAIN names (for anomaly hunting)
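As a rough illustration of how a few of these numbers fall out of raw query records, the sketch below computes latency percentiles, the RCODE distribution, and the cache hit ratio. The record shape (latency_ms, rcode, cache_hit) and the function names are assumptions; in practice these values usually come from the resolver's own statistics or an exporter rather than hand-rolled code.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Query = Tuple[float, str, bool]  # hypothetical record: (latency_ms, rcode, cache_hit)


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(queries: Iterable[Query]) -> Dict[str, object]:
    """Summarize one measurement window of query records."""
    records = list(queries)
    if not records:
        raise ValueError("no queries in window")
    latencies = [q[0] for q in records]
    rcodes = Counter(q[1] for q in records)
    hits = sum(1 for q in records if q[2])
    total = len(records)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "rcode_ratio": {rc: n / total for rc, n in rcodes.items()},
        "cache_hit_ratio": hits / total,
    }
```

To keep UDP and TCP separate, as the list suggests, run summarize once per transport instead of mixing both into one window.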
Sample alarms:
- SERVFAIL ratio jumps from 1% to 10% over 5 minutes
- Upstream timeout ratio and latency climbing together
- Only one node remains in the Anycast pool (loss of redundancy)
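The three sample alarms can be expressed as a comparison between two consecutive windows. The sketch below is one possible encoding. The Window fields, the alarm names, the 10% SERVFAIL limit, and the 1.5x growth factor are assumptions to tune for your environment, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Window:
    """Aggregated metrics for one 5-minute window (hypothetical shape)."""
    servfail_ratio: float
    upstream_timeout_ratio: float
    p95_ms: float


def evaluate_alarms(
    prev: Window,
    curr: Window,
    announcing_nodes: int,
    servfail_limit: float = 0.10,  # assumption: alarm at 10% SERVFAIL
    growth: float = 1.5,           # assumption: "climb" means 1.5x the previous window
) -> List[str]:
    alarms: List[str] = []

    # SERVFAIL ratio reached the limit and is rising versus the previous window.
    if curr.servfail_ratio >= servfail_limit and curr.servfail_ratio > prev.servfail_ratio:
        alarms.append("servfail_spike")

    # Upstream timeouts and latency climbing together point at the upstream.
    timeouts_up = curr.upstream_timeout_ratio > prev.upstream_timeout_ratio * growth
    latency_up = curr.p95_ms > prev.p95_ms * growth
    if timeouts_up and latency_up:
        alarms.append("upstream_degraded")

    # A single announcing node means the next failure is an outage.
    if announcing_nodes <= 1:
        alarms.append("anycast_redundancy_lost")
    return alarms
```

Requiring both timeouts and latency to climb before raising upstream_degraded keeps the alarm from firing on a latency blip that has nothing to do with the upstream.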
7) Runbook: a 15-minute triage during a “DNS is slow” incident
- Client side: how many POPs are affected at the same time?
- Resolver tier: are the POPs hitting the same Anycast VIP?
- Resolver node health: compare dig @vip against dig @node-ip
- Upstream: is there a timeout/latency increase?
- Intervention options:
  - Fastest: withdraw the BGP advertisement on the bad node
  - Medium: turn on resolver cache parameters (serve-stale, etc.)
  - Permanent: improve upstream/topology
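For the VIP-versus-node comparison in the runbook, dig is the natural tool. If you want the same check scripted without external binaries, the sketch below sends one UDP query using only the standard library and times the reply. The function names are hypothetical, it is IPv4-only, and it does not validate the transaction ID in the reply, so treat it as a triage aid rather than a resolver client.

```python
import socket
import struct
import time
from typing import Dict, Optional, Tuple


def build_query(name: str, txid: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def time_query(server: str, name: str,
               timeout_s: float = 2.0) -> Tuple[Optional[float], Optional[int]]:
    """Return (elapsed_ms, rcode) for one UDP query, or (None, None) on failure."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes timeouts and unreachable hosts
            return None, None
        elapsed_ms = (time.monotonic() - start) * 1000
    # RCODE is the low 4 bits of the fourth header byte.
    rcode = reply[3] & 0x0F if len(reply) >= 4 else None
    return elapsed_ms, rcode


def compare(vip: str, node_ip: str,
            name: str = "example.com") -> Dict[str, Tuple[Optional[float], Optional[int]]]:
    """Query the Anycast VIP and one node directly, side by side."""
    return {"vip": time_query(vip, name), "node": time_query(node_ip, name)}
```

If the VIP is slow but a direct query to the local node is fast, routing is probably steering the POP to a distant node. If both are slow, the node or its upstream is the more likely suspect.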
8) Closing thought
Anycast DNS inside the enterprise is not “scaling DNS up”; it is turning DNS into a manageable platform tier. When you build routing, observability, and runbook discipline together, user complaints drop, incidents shrink, and the infrastructure runs more calmly.