Service Discovery with Consul: Health Checks and the DNS Interface

In enterprise infrastructures, service discovery often gets reduced to “add a DNS record.” Then the following problems show up:

When you move a service, DNS lags behind reality (TTL/caching).
When a node breaks, plain DNS keeps returning it (no health awareness).
Teams answer “where was this service running again?” by digging through wikis, spreadsheets, and tickets.

A discovery layer like Consul isn’t just “another product” here; it’s a control plane that turns the variability of the infrastructure into something manageable. In this article I focus on building an operable, field-ready model around two pieces in particular: health checks and the DNS interface.

1) Frame the problem correctly: discovery, or routing?

Discovery’s goal isn’t “moving traffic”; it’s finding the right target.

The routing/LB layer carries the traffic (L4/L7).
The discovery layer answers the question “which instance is healthy?”

If you don’t make this distinction, you end up expecting discovery to do the routing layer’s job, and the design balloons.
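
To make the split concrete, here is a minimal sketch that keeps the two questions in separate functions. The `Instance` type, the sample addresses, and the random pick are my own illustrative assumptions; they are not Consul APIs and say nothing about how a real load balancer chooses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical catalog entry, for illustration only."""
    address: str
    port: int
    healthy: bool


def discover(catalog):
    """Discovery: answer 'which instances are healthy?' and nothing more."""
    return [inst for inst in catalog if inst.healthy]


def route(candidates, rng=random):
    """Routing: decide where this one request goes (random pick as a stand-in)."""
    if not candidates:
        raise LookupError("no healthy instance to route to")
    return rng.choice(candidates)


catalog = [
    Instance("10.0.0.1", 8080, healthy=True),
    Instance("10.0.0.2", 8080, healthy=False),
]
target = route(discover(catalog))
print(f"{target.address}:{target.port}")
```

The point of the separation is that `discover` can change (DNS, API, sidecar) without touching how traffic is actually carried.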

2) Where should Consul sit? (minimum viable)

Core building blocks:

A Consul server cluster (odd count, e.g. 3 or 5)
Consul agents (on each node)
Service registration model (catalog)
Health checks (the critical piece)
DNS interface (for clients)
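
As a rough illustration of how registration and a health check fit together, the sketch below builds a service definition of the kind the local agent's HTTP API accepts at `/v1/agent/service/register`. The service name, ID, port, and the `/healthz` path are hypothetical, and the interval and timeout values are placeholders rather than recommendations.

```python
import json


def build_service_definition(name, service_id, port, health_path,
                             interval="10s", timeout="2s"):
    """Build a registration payload carrying a single HTTP health check."""
    return {
        "Name": name,
        "ID": service_id,
        "Port": port,
        "Check": {
            # Assumption: the service exposes a health endpoint on localhost.
            "HTTP": f"http://127.0.0.1:{port}{health_path}",
            "Interval": interval,
            "Timeout": timeout,
        },
    }


definition = build_service_definition(
    "billing-api", "billing-api-1", 8080, "/healthz"
)
print(json.dumps(definition, indent=2))
```

In practice the same definition often lives in a file that the agent loads, so registration follows the deployment instead of depending on a manual step.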

3) Health-check design: not ping, but ability to do work

Split health checks into three classes:

Process check: is the service process up?
Port check: is the port listening?
Functional check: can the service actually do work? (critical path)

Push the discovery decision toward the third class (functional checks) wherever possible, because a wrong “healthy” verdict is the most expensive type of problem: it returns errors to users and stretches triage out.
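
A functional check can be as small as a script that exercises a real code path and reports the outcome through its exit code. Consul script checks treat exit code 0 as passing, 1 as warning, and anything else as critical. The sketch below follows that convention; the `/healthz/deep` URL and the 0.5 second warning threshold are assumptions of mine, not values from any real service.

```python
import sys
import time
import urllib.error
import urllib.request

PASSING, WARNING, CRITICAL = 0, 1, 2  # 0 passing, 1 warning, anything else critical


def verdict(http_status, latency_s, warn_after_s=0.5):
    """Map one probe result to a check exit code."""
    if http_status is None or not 200 <= http_status < 300:
        return CRITICAL
    if latency_s > warn_after_s:
        return WARNING  # it works, but slowly enough to be worth flagging
    return PASSING


def probe(url, timeout_s=2.0):
    """Call the endpoint once; return (status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # connection refused, timeout, DNS failure, ...
    return status, time.monotonic() - start


def main(argv):
    # Hypothetical endpoint that touches the service's critical path.
    url = argv[1] if len(argv) > 1 else "http://127.0.0.1:8080/healthz/deep"
    return verdict(*probe(url))

# When saved as a script check, end the file with: sys.exit(main(sys.argv))
```

The endpoint behind the URL matters more than the script: it should touch the dependencies the service needs to do its work, otherwise this is just a port check with extra steps.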

4) The DNS interface: making peace with TTL and caching

The DNS interface gives you broad and pragmatic client compatibility. But by the nature of DNS:

There is caching
TTL behavior is not the same everywhere

So for DNS-based discovery there are two pragmatic approaches:

Low TTL + observation: changes propagate quickly but load goes up
Medium TTL + stability: less load, slower propagation

The right approach I’ve seen in the field isn’t “the lowest TTL”; it’s an operable TTL. Also, for “high churn” services (pods/instances that change very often), client-side discovery (like a sidecar) may fit better than DNS.
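
One way to pick an operable TTL is to put a number on the trade-off. The sketch below estimates how long a client could keep using a dead instance: the time for checks to notice the failure, plus the DNS TTL. This is a simplified model of my own. It assumes a single caching layer that honors the TTL and ignores propagation delay inside the cluster, and `failures_needed` stands for however many consecutive failures your setup requires before an instance is marked unhealthy.

```python
def worst_case_stale_seconds(check_interval_s, check_timeout_s, dns_ttl_s,
                             failures_needed=1):
    """Rough upper bound on how long a dead instance can keep receiving traffic.

    Worst case: the instance dies right after a successful check, so detection
    takes `failures_needed` intervals plus one timeout, and an answer cached
    just before detection stays valid for a full TTL afterwards.
    """
    detection_s = failures_needed * check_interval_s + check_timeout_s
    return detection_s + dns_ttl_s


# Compare a few TTL choices for a 10 s check interval and a 2 s timeout.
for ttl in (0, 5, 30, 300):
    print(f"ttl={ttl:>3}s -> up to {worst_case_stale_seconds(10, 2, ttl)}s stale")
```

If the resulting window is longer than the service's users can tolerate, lowering the TTL only helps up to the point where check timing dominates; past that, tighten the checks or move that service to client-side discovery.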

5) Runbook: what do I do during a “going to the wrong target” incident?

Symptom: some requests get errors, others don’t; load is fluctuating.

Which instances are DNS answers returning? (sample it)
Is the health-check verdict actually correct?
Pull the problematic instance out of the catalog (temporarily) and find the root cause
Are stale answers still circulating because of TTL/cache?
Has Consul server/agent latency gone up? (raft, disk, network)
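
The first step, sampling DNS answers, can be sketched as below. It resolves the name repeatedly through the OS resolver, counts which addresses come back, and flags any that are not in the set you believe is healthy. The service name and IP addresses are hypothetical; `<service>.service.consul` is the name form Consul's DNS interface uses with the default domain.

```python
import socket
from collections import Counter


def system_resolve(name):
    """Resolve via the OS resolver, which is what a client on this host sees."""
    infos = socket.getaddrinfo(
        name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def sample_answers(name, rounds=20, resolve=system_resolve):
    """Resolve `name` repeatedly and count how often each address is returned."""
    counts = Counter()
    for _ in range(rounds):
        counts.update(resolve(name))
    return counts


def unexpected_targets(counts, healthy):
    """Addresses DNS handed out that are not in the expected healthy set."""
    return sorted(set(counts) - set(healthy))


# Usage on an affected client (hypothetical name and addresses):
#   counts = sample_answers("billing-api.service.consul")
#   print(unexpected_targets(counts, ["10.0.0.1", "10.0.0.2"]))
```

Run it from an affected client, not from a Consul server: the resolver and cache chain in front of the client is part of what you are trying to observe.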

6) Security: discovery = inventory + target map

The information in the Consul catalog is also valuable to an attacker. So:

Move UI/API access into the management network
Use the ACL/policy model
Wire audit logs into the central log/SIEM pipeline

Don’t let discovery, “for the sake of convenience,” turn into an inventory leak surface.

7) Closing

Service discovery with Consul is more than DNS record management: it produces a target list driven by health signals. The right design accepts the TTL/caching reality, pushes health checks closer to the ability to do real work, and operates the discovery layer as a critical service. Once that discipline is in place, “where is the service?” stops being a ticket and becomes an answer the system gives.
