Mustafa Erbay
Technology · 11 min read

Enterprise Edge Resolver Architecture with Anycast DNS

An approach for placing the in-house DNS resolver tier near the POP/branch using Anycast, cutting latency while improving operability.

When users complain about “slow internet” inside the corporate network, the trail often ends at DNS: the application is the same and the link is the same, but the resolver is far away, the cache is cold, and the upstream is slow to answer. Fixing DNS isn’t a matter of “let’s stand up another server”; it means treating the resolver tier as a first-class architectural component.

In this post I walk through the approach that has held up in the field for me: edge resolver + Anycast + BGP control + health signaling.

1) Get the concept straight first: recursive, not authoritative

The target here is:

  • User/service → internal recursive resolver (Unbound/BIND/PowerDNS Recursor, etc.)
  • Resolver → (when needed) upstream (ISP, public resolver, internal authoritative, forwarder)

Anycast is more commonly seen on the authoritative side, but putting the recursive resolver behind Anycast inside the enterprise can cut branch/POP latency and improve cache effectiveness, because queries land on a nearby node instead of crossing the WAN.

2) What you gain from Anycast (and what you might lose)

Gains

  • Lower p95 DNS latency on the branch/POP side
  • Cache running “closer” means a higher hit ratio
  • When a resolver fails, traffic naturally drifts to another node (with the right signal)
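
To make the latency gain concrete, here is a back-of-the-envelope sketch. It models the mean lookup time (not p95) as one client-to-resolver round trip plus the upstream time on a cache miss. The function name and all numbers are hypothetical and only illustrate the arithmetic.

```python
def expected_latency_ms(hit_ratio: float, resolver_rtt_ms: float, upstream_ms: float) -> float:
    """Mean lookup time: every query pays the resolver RTT, misses also pay the upstream time."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return resolver_rtt_ms + (1.0 - hit_ratio) * upstream_ms


# Hypothetical numbers: a central resolver 40 ms away vs. an edge resolver 2 ms away
# whose upstream (the core tier) is farther and therefore slower per miss.
central = expected_latency_ms(hit_ratio=0.80, resolver_rtt_ms=40.0, upstream_ms=60.0)
edge = expected_latency_ms(hit_ratio=0.90, resolver_rtt_ms=2.0, upstream_ms=100.0)
print(f"central: {central:.1f} ms, edge: {edge:.1f} ms")
```

With these assumed inputs the central resolver averages about 52 ms and the edge resolver about 12 ms: the short round trip dominates even though each miss at the edge is more expensive.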

Risks

  • A bad BGP advertisement → traffic to the wrong place (the most expensive class of mistake)
  • Cache warm-up wave (load on upstreams after a failover)
  • “Nearest” is not always “best” (on some links the shortest AS-path can be the worst path)

3) Reference architecture: a two-tier resolver

The simple model I like to deploy at enterprises:

  1. Edge resolver pool (Anycast VIP)
    A small pool near the POP/branch: 2–3 nodes (VM or bare metal). Each announces the VIP via BGP.
  2. Core upstream
    Larger capacity at the data center/headquarters with richer policy: internal zones, split-horizon, record management, logging.

On the edge resolver side the goal isn’t “every policy lives here”; it’s a close cache plus a fast answer.
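
The division of labor between the two tiers can be sketched as a toy model. This is not how Unbound, BIND, or PowerDNS Recursor are implemented; `EdgeResolver` and the `core` callable are hypothetical names, and the model only shows that the edge holds a TTL cache while every policy decision stays in the core.

```python
import time
from typing import Callable, Dict, Tuple

# The core tier is modelled as a function: name -> (answer, ttl_seconds).
Core = Callable[[str], Tuple[str, float]]


class EdgeResolver:
    """Edge tier: a TTL cache in front of the core upstream, with no policy of its own."""

    def __init__(self, core: Core, clock: Callable[[], float] = time.monotonic):
        self._core = core
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expires_at)
        self.hits = 0
        self.misses = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            self.hits += 1          # fast path: answered locally
            return entry[0]
        self.misses += 1            # slow path: ask the core, then remember the answer
        answer, ttl = self._core(name)
        self._cache[name] = (answer, now + ttl)
        return answer
```

Only the miss path touches the core, which is why split-horizon, internal zones, and logging can live there without slowing down the common case.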

4) BGP design: safe defaults

When you announce the Anycast resolver via BGP, the “minimum safe set” is:

  • Prefix: only the resolver VIP /32 (IPv4) or /128 (IPv6)
  • Scope: iBGP (same AS) + controlled eBGP (between POPs)
  • Guardrails:
    • max-prefix limit
    • prefix-list + route-map so only that VIP gets advertised
    • where possible RPKI/ROA plus disciplined upstream filters

A recommendation: don’t anycast to the public internet on day one. Mature it first via internal-only routing.
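
The real guardrails belong in the router or routing daemon configuration. As a complementary pre-deployment check, the sketch below validates a list of prefixes a node intends to announce against the "VIP host routes only" rule and a max-prefix style limit. The function name, the limit, and the addresses (taken from documentation ranges) are assumptions for illustration.

```python
import ipaddress
from typing import Iterable, List, Tuple


def filter_announcements(candidates: Iterable[str], allowed_vips: Iterable[str],
                         max_prefixes: int = 2) -> Tuple[List[str], List[str]]:
    """Split candidate prefixes into (accepted, rejected).

    A prefix is accepted only if it is a host route (/32 or /128) and is one of
    the allowed VIPs. More accepted prefixes than max_prefixes is treated as an error.
    """
    allowed = {ipaddress.ip_network(v, strict=False) for v in allowed_vips}
    accepted: List[str] = []
    rejected: List[str] = []
    for text in candidates:
        net = ipaddress.ip_network(text, strict=False)
        is_host_route = net.prefixlen == net.max_prefixlen
        if is_host_route and net in allowed:
            accepted.append(str(net))
        else:
            rejected.append(str(net))
    if len(accepted) > max_prefixes:
        raise ValueError(f"{len(accepted)} prefixes exceed the limit of {max_prefixes}")
    return accepted, rejected


# Hypothetical VIPs from documentation address space.
vips = ["192.0.2.53/32", "2001:db8::53/128"]
ok, bad = filter_announcements(
    ["192.0.2.53/32", "192.0.2.0/24", "2001:db8::53/128", "198.51.100.7/32"], vips)
print("accepted:", ok)
print("rejected:", bad)
```

A check like this could run in CI against generated routing configuration, so a stray /24 is caught before it ever reaches a BGP session.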

5) Health signaling: tie “is DNS alive?” to routing

The critical piece in Anycast is this: when the service on a node breaks, if the route isn’t withdrawn, traffic keeps flowing into a “close but broken” node.

A practical health-check approach:

  • A local health endpoint on the resolver, or a tiny script
  • Verify the answers to these questions:
    • Does dig @127.0.0.1 example.com A answer, with p95 latency below X ms?
    • Is an internal zone (e.g. corp.local) responding?
    • Are upstream timeouts trending up?
  • If unhealthy:
    • Pull the BGP announcement automatically (BIRD/FRR + healthcheck integration)
    • Or lower preference via route-map (the riskier second option)
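
The decision logic behind such a health check can be sketched as follows. Collecting the probe data (running dig, reading resolver statistics) and actually withdrawing the route in BIRD or FRR are deliberately left out, because they depend on the deployment. The `Probe` fields, the thresholds, and the fall/rise hysteresis are assumptions added here so that a single slow probe does not flap the route.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    latencies_ms: List[float]       # local lookups that succeeded, e.g. against 127.0.0.1
    internal_zone_ok: bool          # did the internal zone answer?
    upstream_timeout_ratio: float   # upstream timeouts / upstream queries in the window


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile; the rank is computed with integer math."""
    ordered = sorted(samples_ms)
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def is_healthy(probe: Probe, max_p95_ms: float, max_timeout_ratio: float) -> bool:
    if not probe.latencies_ms or not probe.internal_zone_ok:
        return False
    if probe.upstream_timeout_ratio > max_timeout_ratio:
        return False
    return p95(probe.latencies_ms) < max_p95_ms


class AnnounceGate:
    """Turns a stream of health results into 'announce' / 'withdraw' decisions."""

    def __init__(self, fall: int = 3, rise: int = 5):
        self.fall = fall            # consecutive failures before withdrawing
        self.rise = rise            # consecutive successes before re-announcing
        self.announced = True
        self._streak = 0            # consecutive results that disagree with the current state

    def update(self, healthy: bool) -> Optional[str]:
        if healthy == self.announced:
            self._streak = 0
            return None
        self._streak += 1
        if self._streak < (self.rise if healthy else self.fall):
            return None
        self.announced = healthy
        self._streak = 0
        return "announce" if healthy else "withdraw"
```

A small loop would call `is_healthy` every few seconds and feed the result into `AnnounceGate.update`; whenever it returns "withdraw" or "announce", the integration layer applies that action to the routing daemon. Requiring more successes to come back than failures to leave is a conservative choice that keeps a recovering node out of the pool a little longer.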

6) Operations: you can’t run Anycast without measuring

The minimum metric set I track on edge resolvers:

  • DNS latency: p50/p95/p99 (UDP/TCP separately)
  • RCODE distribution: NOERROR/NXDOMAIN/SERVFAIL/REFUSED
  • Upstream timeout ratio
  • Cache hit ratio + evictions
  • QPS + concurrency
  • Top queried names + top NXDOMAIN names (for anomaly hunting)
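
As an illustration of how several of these metrics fall out of a window of query records, here is a minimal sketch. The `Query` record layout is hypothetical; in practice the data would come from the resolver's own statistics or a query log, and the percentile uses the simple nearest-rank method.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    transport: str      # "udp" or "tcp"
    rcode: str          # "NOERROR", "NXDOMAIN", "SERVFAIL", "REFUSED", ...
    latency_ms: float
    cache_hit: bool


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(queries: Iterable[Query]) -> dict:
    """Latency percentiles per transport, RCODE ratios, and cache hit ratio for one window."""
    queries = list(queries)
    if not queries:
        raise ValueError("no queries in window")
    by_transport = defaultdict(list)
    for q in queries:
        by_transport[q.transport].append(q.latency_ms)
    total = len(queries)
    rcodes = Counter(q.rcode for q in queries)
    return {
        "latency_ms": {t: {f"p{p}": percentile(v, p) for p in (50, 95, 99)}
                       for t, v in by_transport.items()},
        "rcode_ratio": {code: n / total for code, n in rcodes.items()},
        "cache_hit_ratio": sum(q.cache_hit for q in queries) / total,
    }
```

Keeping UDP and TCP in separate buckets matters because TCP lookups pay for connection setup, and mixing them would hide a regression on either transport.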

Sample alarms:

  • SERVFAIL ratio jumps from 1% to 10% over 5 minutes
  • Upstream timeouts and latency climbing together
  • Only one node remains in the Anycast pool (loss of redundancy)
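
These three alarms can be expressed as a comparison between two consecutive 5-minute windows. The sketch below is one possible encoding; the `Window` fields, the 10% SERVFAIL threshold, and the 1.5x growth factor are assumptions to be tuned per environment, and a real deployment would normally express the same rules in its monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    servfail_ratio: float           # SERVFAIL answers / all answers
    upstream_timeout_ratio: float   # upstream timeouts / upstream queries
    p95_ms: float                   # client-facing p95 latency


def evaluate_alarms(previous: Window, current: Window, nodes_announcing: int,
                    servfail_threshold: float = 0.10, growth: float = 1.5) -> List[str]:
    alarms: List[str] = []
    # SERVFAIL crossed the threshold during this window (for example 1% -> 10%).
    if current.servfail_ratio >= servfail_threshold > previous.servfail_ratio:
        alarms.append("servfail_jump")
    # Timeouts and latency both grew: points at the upstream rather than the node.
    timeouts_up = current.upstream_timeout_ratio > growth * previous.upstream_timeout_ratio
    latency_up = current.p95_ms > growth * previous.p95_ms
    if timeouts_up and latency_up:
        alarms.append("upstream_degraded")
    # One node (or none) still announcing the VIP means redundancy is gone.
    if nodes_announcing <= 1:
        alarms.append("redundancy_lost")
    return alarms
```

Requiring timeouts and latency to rise together is what separates an upstream problem from a single noisy metric.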

7) Runbook: a 15-minute triage during a “DNS is slow” incident

  1. Client side: how many POPs are affected at the same time?
  2. Resolver tier: are the POPs hitting the same Anycast VIP?
  3. Resolver node health: compare dig @vip against dig @node-ip
  4. Upstream: is there a timeout/latency increase?
  5. Intervention options:
    • Fastest: withdraw the BGP advertisement on the bad node
    • Medium: tune resolver cache parameters (enable serve-stale, etc.)
    • Permanent: improve upstream/topology
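
Step 3 (comparing the VIP against each node address) can be turned into a small decision helper. The sketch assumes the latencies were already measured, for example with dig, and that a timeout is recorded as None. The function names, the 50 ms threshold, and the wording of the verdicts are hypothetical; this is a triage heuristic, not a diagnosis.

```python
from typing import Dict, List, Optional


def suspect_nodes(node_ms: Dict[str, Optional[float]], slow_ms: float) -> List[str]:
    """Nodes that timed out (None) or answered slower than slow_ms when queried directly."""
    return sorted(n for n, ms in node_ms.items() if ms is None or ms > slow_ms)


def triage(vip_ms: Optional[float], node_ms: Dict[str, Optional[float]],
           slow_ms: float = 50.0) -> str:
    bad = suspect_nodes(node_ms, slow_ms)
    vip_slow = vip_ms is None or vip_ms > slow_ms
    if bad and len(bad) < len(node_ms):
        # Some nodes are fine, so pulling the bad ones out of the pool is the fastest fix.
        return "withdraw: " + ", ".join(bad)
    if bad:
        # Every node is slow: the common factor is likely the upstream.
        return "check upstream: every node is slow"
    if vip_slow:
        # Nodes answer quickly on their own addresses, so suspect the path to the VIP.
        return "check routing: nodes are fine but the VIP path is slow"
    return "resolver tier looks healthy"


# Hypothetical measurements from one affected POP.
print(triage(180.0, {"edge-a": 3.0, "edge-b": None, "edge-c": 4.0}))
```

The order of the checks mirrors the intervention options above: isolate a bad node first, and only then look at the upstream or the routing path.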

8) Closing thought

Anycast DNS inside the enterprise is not “scaling DNS up”; it is turning DNS into a manageable platform tier. When you build routing, observability, and runbook discipline together, user complaints drop, incidents shrink, and the infrastructure runs more calmly.


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have been working on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.
