In enterprise infrastructures, service discovery often gets reduced to “add a DNS record.” Then the following problems show up:
- When you move a service, DNS answers lag behind the change because of TTLs and caching.
- When a node fails, DNS keeps returning it, because plain DNS has no health awareness.
- Teams keep asking “where was this service running again?” by digging through wikis, spreadsheets, and tickets.
A discovery layer like Consul isn’t just “another product” here; it’s a control plane that turns the variability of the infrastructure into something manageable. In this article I focus on building an operable, field-ready model built on two things: health checks and the DNS interface.
1) Frame the problem correctly: discovery, or routing?
Discovery’s goal isn’t “moving traffic”; it’s finding the right target.
- The routing/LB layer carries the traffic (L4/L7).
- The discovery layer answers the question “which instance is healthy?”
If you don’t make this distinction, you load discovery with expectations it was never meant to meet, and the design balloons.
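A minimal sketch of that split, with no Consul involved: `discover` only answers which instances are healthy, and a separate router decides where the next request goes. The `Instance` type, the addresses, and the round-robin policy are illustrative assumptions, not part of any Consul API.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Instance:
    address: str
    port: int
    healthy: bool


def discover(catalog: list[Instance]) -> list[Instance]:
    """Discovery: answer 'which instances are healthy?' and nothing more."""
    return [inst for inst in catalog if inst.healthy]


def make_router(targets: list[Instance]):
    """Routing: decide where the next request goes (plain round-robin here)."""
    ring = cycle(targets)
    return lambda: next(ring)


# Hypothetical catalog: one of three instances is unhealthy.
catalog = [
    Instance("10.0.0.11", 8080, healthy=True),
    Instance("10.0.0.12", 8080, healthy=False),
    Instance("10.0.0.13", 8080, healthy=True),
]

targets = discover(catalog)   # discovery layer's job
pick = make_router(targets)   # routing layer's job
picked = [pick().address for _ in range(4)]
print(picked)
```

Swapping the round-robin for weighted or latency-aware selection changes only the router; the discovery function stays untouched, which is the point of keeping the two concerns apart.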
2) Where should Consul sit? (minimum viable)
Core building blocks:
- A Consul server cluster (odd count, e.g. 3/5)
- Consul agents (on each node)
- Service registration model (catalog)
- Health checks (the most important piece)
- DNS interface (for clients)
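To make the registration and health-check blocks concrete, here is a sketch that builds the JSON body a local agent accepts on `PUT /v1/agent/service/register`. The service name `orders`, the address, the `/health` path, and the interval values are assumptions chosen for illustration; the code only builds and prints the payload and does not contact an agent.

```python
import json


def build_registration(name: str, service_id: str, address: str, port: int,
                       health_path: str = "/health") -> dict:
    """Build a service registration body with one HTTP health check attached."""
    return {
        "Name": name,
        "ID": service_id,
        "Address": address,
        "Port": port,
        "Check": {
            # The agent polls this URL; a 2xx response counts as passing.
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",
            "Timeout": "2s",
            # Drop instances that stay critical for a long time (assumed value).
            "DeregisterCriticalServiceAfter": "30m",
        },
    }


# Hypothetical service used only for this example.
payload = build_registration("orders", "orders-1", "10.0.0.11", 8080)
print(json.dumps(payload, indent=2))
```

Once registered under these assumptions, clients would look the service up by its name through the DNS interface rather than by node address.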
3) Health-check design: not ping, but ability to do work
Split health checks into three classes:
- Process check: is the service process up?
- Port check: is the port listening?
- Functional check: can the service actually do work? (critical path)
Push the discovery decision toward class 3 wherever possible, because a false “healthy” verdict is the most expensive kind of failure: it returns errors to users and prolongs triage.
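The three classes can be sketched as three small probes. This is an illustration under assumptions, not how Consul implements its checks: the process probe is POSIX-only, and the functional probe assumes the service exposes an HTTP endpoint that exercises its critical path and returns 200 only when real work succeeds.

```python
import http.client
import os
import socket
import urllib.request

# Health probes target local or internal addresses, so bypass any HTTP proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def process_check(pid: int) -> bool:
    """Class 1: does a process with this PID exist? POSIX only (signal 0 probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def port_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Class 2: does something accept TCP connections on this port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def functional_check(url: str, timeout: float = 2.0) -> bool:
    """Class 3: does the critical-path endpoint answer 200 within the timeout?"""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses (HTTPError).
        return False
```

A service can pass the first two probes while its database connection is broken; only the third would report it as unhealthy, which is why it is the one worth wiring into the discovery decision.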
4) The DNS interface: making peace with TTL and caching
The DNS interface gives you broad and pragmatic client compatibility. But by the nature of DNS:
- There is caching
- TTL behavior is not the same everywhere
So for DNS-based discovery there are two pragmatic approaches:
- Low TTL + observation: changes propagate quickly but load goes up
- Medium TTL + stability: less load, slower propagation
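The right approach I’ve seen in the field isn’t “the lowest TTL”; it’s an operable TTL: one whose query load your resolvers and Consul servers can sustain and whose staleness window you can tolerate. Also, for “high churn” services (pods/instances that change very often), client-side discovery (for example, a sidecar) may fit better than DNS.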
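The trade-off can be put into rough numbers. The model below is a deliberate simplification and an assumption of this sketch: it treats the worst-case window during which a client can still resolve a dead instance as check interval plus check timeout plus TTL, and it assumes every caching client re-resolves about once per TTL. It ignores propagation delay inside the cluster and clients that do not honor TTLs.

```python
def worst_case_stale_seconds(check_interval_s: float, check_timeout_s: float,
                             ttl_s: float) -> float:
    """Rough upper bound on how long a dead instance can keep being resolved."""
    return check_interval_s + check_timeout_s + ttl_s


def dns_queries_per_second(caching_clients: int, ttl_s: float) -> float:
    """Approximate steady-state query rate if each client re-resolves once per TTL."""
    if ttl_s <= 0:
        raise ValueError("model needs a positive TTL; TTL 0 means no caching at all")
    return caching_clients / ttl_s


# Hypothetical fleet: 2000 caching clients, 10 s check interval, 2 s check timeout.
for ttl in (1, 5, 30):
    stale = worst_case_stale_seconds(10, 2, ttl)
    qps = dns_queries_per_second(2000, ttl)
    print(f"TTL {ttl:>2}s: stale window <= {stale:.0f}s, ~{qps:.0f} queries/s")
```

With these assumed inputs, dropping the TTL from 30 s to 1 s cuts the stale window from 42 s to 13 s but multiplies query load by thirty, and the check interval, not the TTL, becomes the dominant term.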
5) Runbook: what do I do during a “traffic going to the wrong target” incident?
Symptom: some requests get errors, others don’t; load is fluctuating.
- Which instances are DNS answers returning? (sample it)
- Is the health-check verdict actually correct?
- Pull the problematic instance out of the catalog (temporarily) and find the root cause
- Are stale answers still circulating because of TTL/cache?
- Has Consul server/agent latency gone up? (Raft, disk, network)
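The first two steps of this runbook can be partly automated by comparing sampled DNS answers against the health view. The sketch below assumes the health data has the shape returned by Consul's `/v1/health/service/<name>` endpoint (a list of entries with `Node`, `Service`, and `Checks`) and that the DNS samples were collected separately, for example by repeatedly querying the service name from an affected client. All addresses and the sample data are hypothetical.

```python
from collections import Counter


def passing_addresses(health_entries: list[dict]) -> set[str]:
    """Addresses of instances whose checks are all in the 'passing' state."""
    ok = set()
    for entry in health_entries:
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            # An empty Service.Address means the node address is used instead.
            ok.add(entry["Service"]["Address"] or entry["Node"]["Address"])
    return ok


def suspicious_answers(dns_samples: list[str], health_entries: list[dict]) -> Counter:
    """Count sampled DNS answers that point at instances not currently passing."""
    ok = passing_addresses(health_entries)
    return Counter(addr for addr in dns_samples if addr not in ok)


# Hypothetical incident data: .12 is critical but still shows up in answers.
health = [
    {"Node": {"Address": "10.0.0.11"}, "Service": {"Address": ""},
     "Checks": [{"Status": "passing"}]},
    {"Node": {"Address": "10.0.0.12"}, "Service": {"Address": "10.0.0.12"},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
    {"Node": {"Address": "10.0.0.13"}, "Service": {"Address": ""},
     "Checks": [{"Status": "passing"}]},
]
samples = ["10.0.0.11", "10.0.0.12", "10.0.0.12", "10.0.0.13"]

stale = suspicious_answers(samples, health)
print(stale)
```

If the suspicious answers come from a client-side resolver while Consul itself no longer returns the instance, the likely cause is TTL or cache staleness; if every instance in the answers is passing yet requests still fail, the health-check verdict itself deserves scrutiny.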
6) Security: discovery = inventory + target map
The information in the Consul catalog is also valuable to an attacker. So:
- Move UI/API access into the management network
- Use the ACL/policy model
- Wire audit logs into the central log/SIEM pipeline
Don’t let discovery, “for the sake of convenience,” turn into an inventory leak surface.
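As one possible starting point for the ACL model, the sketch below builds the body for creating a read-only discovery policy through Consul's ACL policy endpoint (`PUT /v1/acl/policy`). The policy name, the description, and the decision to grant read on all services and nodes are assumptions for illustration; a real deployment would likely scope the prefixes more narrowly. Nothing is sent to a server here.

```python
import json

# Rules in Consul's policy language: read-only on every service and node.
# The empty prefix matches everything; narrow it per team where possible.
READ_ONLY_DISCOVERY_RULES = """
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
"""


def build_policy(name: str, rules: str, description: str = "") -> dict:
    """Build the JSON body for an ACL policy creation request."""
    return {"Name": name, "Description": description, "Rules": rules.strip()}


# Hypothetical policy name used only for this example.
policy = build_policy(
    "discovery-read-only",
    READ_ONLY_DISCOVERY_RULES,
    description="Lets clients resolve services without write access",
)
print(json.dumps(policy, indent=2))
```

Tokens issued to ordinary clients would carry only a policy like this one, so a leaked client token exposes the target map but cannot alter it.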
7) Closing
Service discovery with Consul is more than DNS record management: it produces a target list driven by health signals. The right design accepts the TTL/caching reality, pushes health checks closer to the ability to do real work, and operates the discovery layer as a critical service. Once that discipline is in place, “where is the service?” stops being a ticket and becomes a question the system answers.