Nginx's Sneaky DNS Trap: Failing to Reach Docker Containers

It was last Thursday morning when one of the API endpoints for hesapciyiz.com suddenly started returning a 502 Bad Gateway error. My first thought, of course, was that the backend application had crashed or there was a database connection issue. However, when I ran the docker logs command, I saw that the relevant container was running healthily.

When I checked the Nginx error logs, the situation became even more interesting: [error] 31#31: *12345 host not found in upstream "my-api-service". Sometimes I would also see messages like upstream prematurely closed connection. Since I manage more than 13 Docker containers on my own VPS, these kinds of network and access issues crop up from time to time, but this one was a bit sneakier, because the service was actually alive.

Symptoms and Initial Observations: What Was Going Wrong?

Why was the hesapciyiz.com API giving a 502? First, here were the symptoms:

502 Bad Gateway: Users were receiving this error when making requests to the API.
Nginx Error Logs: I was seeing DNS resolution errors like host not found in upstream "my-api-service" or could not resolve host. Occasionally, I encountered upstream prematurely closed connection, which, while appearing more general, could stem from the same underlying DNS issue.
Intermittent Failure: The most annoying part was that the problem wasn’t constant. Sometimes it would fix itself, then start again. This suggested the issue might be related to caching, TTL, or dynamic IP assignments.
Container was Healthy: When I checked with the docker ps command, the API container was in the Up state, and docker logs showed no application errors.
Direct Access from Host: When I connected to the server via SSH and tried curl http://my-api-service:8000/health (using the port and name inside the container), I saw that the API responded successfully. This confirmed that the application was indeed running and accessible within the Docker network.

My first thought was whether there was an issue with the Docker network. I checked the network settings using docker network inspect bridge, but everything looked normal. I even took the container’s IP address and tried it directly in the Nginx config, and it worked that way. But since Docker containers receive dynamic IPs, this wasn’t a permanent solution. Nginx needed to resolve based on the hostname.

I could ping from the host, and I could reach the outside world from inside the container, but Nginx behaved as if it were in another world, unable to find the name my-api-service. This situation reminded me once again of how Nginx handles DNS resolution and how that clashes with Docker’s dynamic network structure.
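One way to narrow down this kind of mismatch is to compare the address a proxy is still using with what the name resolves to right now. The following is a minimal Python sketch of that check, not a definitive tool: resolve_ipv4 and is_stale are hypothetical helper names, and the result is only meaningful when the script runs somewhere that shares Nginx’s view of DNS (for example, inside the same container).

```python
import socket


def resolve_ipv4(host, port, getaddrinfo=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses `host` resolves to right now."""
    infos = getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def is_stale(configured_ip, host, port, getaddrinfo=socket.getaddrinfo):
    """True when the IP a proxy is still using is no longer among the current answers."""
    return configured_ip not in resolve_ipv4(host, port, getaddrinfo)
```

For example, is_stale("172.18.0.5", "my-api-service", 8000) would return True if the container had since moved to another address (the IP here is made up for illustration). The getaddrinfo parameter exists only so the lookup can be swapped out when experimenting.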

Docker Network’s Dynamic Nature and Nginx’s Static DNS Mindset

At the root of this problem lies a mismatch between Nginx’s DNS resolution habits and Docker’s dynamic network management. First, let’s look at the Docker side:

Docker connects containers to bridge networks it creates. Within these networks, each container is dynamically assigned an IP address. Additionally, on user-defined networks, Docker’s embedded DNS server (reachable from inside containers at 127.0.0.11) lets containers on the same network reach each other by service name (like my-api-service); the default bridge network does not provide this name resolution. This is a great feature that makes life easier for developers. However, when a container is restarted, there is no guarantee that it will receive the same IP address.

So, how does Nginx deal with this dynamism? This is where the trouble starts. By default, when Nginx sees a literal hostname in a proxy_pass directive or an upstream block, it queries DNS for that hostname only when Nginx starts or when the configuration is reloaded, and then keeps using the resulting IP address until the next start or reload.
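To make the resolve-once behavior concrete, here is a small toy model in Python. It is a sketch of the idea only, not how Nginx is implemented: FakeDockerDNS and StaticProxy are hypothetical names, and the IP addresses are made up.

```python
# Toy model: a proxy that resolves its upstream hostname once, at startup.
# FakeDockerDNS and StaticProxy are illustrative names, not real APIs.


class FakeDockerDNS:
    """Stands in for Docker's embedded DNS: maps service names to current IPs."""

    def __init__(self):
        self.records = {}

    def register(self, name, ip):
        self.records[name] = ip

    def lookup(self, name):
        return self.records[name]


class StaticProxy:
    """Resolves the upstream when it is created and never asks again on its own."""

    def __init__(self, dns, upstream):
        self.dns = dns
        self.upstream = upstream
        self.cached_ip = dns.lookup(upstream)  # one-time resolution

    def target_ip(self):
        return self.cached_ip

    def reload(self):
        # Comparable in spirit to reloading the configuration: resolve again.
        self.cached_ip = self.dns.lookup(self.upstream)


dns = FakeDockerDNS()
dns.register("my-api-service", "172.18.0.5")
proxy = StaticProxy(dns, "my-api-service")

# The container restarts and receives a new address.
dns.register("my-api-service", "172.18.0.9")

stale_ip = proxy.target_ip()  # still the address seen at startup
proxy.reload()
fresh_ip = proxy.target_ip()  # updated only after the reload

print(stale_ip, fresh_ip)
```

Until reload() is called, every request in this model keeps going to the address that was valid at startup, which is the failure mode described above.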

Since I run more than 13 containers on my VPS, this dynamism and these restarts happen quite frequently. A container exceeding its memory limit and getting OOM-killed (like when I wrote sleep 360 and got OOM-killed last month), a new deployment, or even a simple system update can cause containers to restart. Every restart means a potential IP change and, therefore, a problem for Nginx.

It was as if Nginx had asked for the mailman’s address once and then kept trying to send letters to the old address without ever learning that the mailman had moved. My hesapciyiz.com API had fallen into this “moved mailman” situation, and Nginx couldn’t find it.

Root Cause: Deep Dive into Nginx and DNS Resolution

This “one-time DNS resolution” behavior of Nginx is actually an optimization designed for high performance. Instead of performing a DNS query for every request, it reduces latency by reusing the resolved IP. However, in dynamic environments, especially when using Docker or orchestration tools like Kubernetes, this optimization can turn into a liability.

When you use a hostname inside an upstream block or a proxy_pass directive, Nginx, by default, resolves this hostname at startup using the system resolver, that is, /etc/hosts and the DNS servers defined in /etc/resolv.conf. If the hostname cannot be resolved at that point, Nginx fails the configuration check with a host not found in upstream error and refuses to start. But in our case, everything was fine at the start; the problem emerged later when the container IP changed.

This also happened to me with the backend service for islistesi.com on my own VPS. After a deployment, the container restarted, got a new IP, and Nginx kept sending requests to the old IP. When I realized the situation, restarting Nginx was a temporary fix, but it didn’t solve the root of the problem. Manually restarting Nginx after every deployment was not a sustainable operation.

The core of the issue is that, by default, Nginx has no mechanism to re-resolve a hostname that was resolved when the configuration was loaded. While this behavior is acceptable for static IPs or hostnames that change very rarely, in environments like Docker that assign dynamic IPs and where services can restart frequently, this can lead to serious outages.

Towards a Solution: The Nginx Resolver Directive

To overcome these types of dynamic DNS resolution issues, Nginx’s resolver directive comes into play. The resolver directive allows you to specify which DNS server Nginx should use to resolve certain hostnames and how long these resolutions should be kept in the cache (TTL - Time To Live).

This directive is a way of telling Nginx, “Hey Nginx, ask this DNS server for hostnames at request time and remember each answer for this long.” This way, once a cached answer expires, Nginx queries the DNS server again and picks up the most up-to-date IP address.

There are two important points to consider when using the resolver directive:

DNS Server Address: We must specify which DNS server to use. In a Docker environment, we usually use Docker’s own internal DNS server at 127.0.0.11 for name resolution between containers. For container names, this is the fastest and most reliable option. Note that this address is only reachable from inside a container attached to a user-defined Docker network, so it applies when Nginx itself runs as a container on the same network as the service. Alternatively, you can use the DNS servers your host uses (e.g., 8.8.8.8 or 1.1.1.1), but those know nothing about Docker’s internal names, so a service name like my-api-service will not resolve through them.
valid Parameter: This parameter overrides the TTL (Time To Live) of the DNS response and determines how long Nginx caches a resolution. Without it, Nginx caches each answer for the TTL returned by the DNS server. In dynamic environments, it’s important to keep this duration short since IP addresses can change frequently. I usually use short durations like valid=5s or valid=10s. This means Nginx treats a cached record as stale after 5 or 10 seconds and queries DNS again the next time it needs that name.

By correctly setting these two parameters, we can make Nginx more flexible and resilient against the dynamic IP changes of Docker containers.
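The effect of valid can be illustrated with a small time-based cache. This is a hedged Python sketch of the general technique (a TTL cache with an injectable clock), not Nginx’s actual resolver code; TtlResolver, the fake clock, and the IP addresses are all assumptions made for illustration.

```python
# Toy model of a resolver cache whose entries expire after `valid_seconds`.
# TtlResolver is an illustrative name; the clock is injected so time can be faked.


class TtlResolver:
    def __init__(self, lookup, valid_seconds, clock):
        self.lookup = lookup          # function: name -> current IP
        self.valid = valid_seconds    # how long an answer stays cached
        self.clock = clock            # function returning the current time in seconds
        self.cache = {}               # name -> (ip, expires_at)

    def resolve(self, name):
        now = self.clock()
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]           # cached answer is still valid
        ip = self.lookup(name)        # expired or missing: ask DNS again
        self.cache[name] = (ip, now + self.valid)
        return ip


records = {"my-api-service": "172.18.0.5"}
now = [0.0]
resolver = TtlResolver(records.__getitem__, valid_seconds=5, clock=lambda: now[0])

first = resolver.resolve("my-api-service")       # t=0: queried and cached until t=5

records["my-api-service"] = "172.18.0.9"         # container restarts with a new IP

now[0] = 3.0
within_ttl = resolver.resolve("my-api-service")  # t=3: still the cached old address

now[0] = 5.0
after_ttl = resolver.resolve("my-api-service")   # t=5: entry expired, new address
```

In this model the old address can be served for at most valid_seconds after the change, which is why a short value such as 5 seconds bounds how long requests can go to a dead IP.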

Nginx Configuration: Implementing `resolver`

Now let’s look at the Nginx configuration. We can define the resolver directive globally within the http block or locally within specific server or location blocks. If you host multiple services on a single VPS like I do, it’s generally more practical to define it globally within the http block.

Here is a simplified example of the Nginx configuration I implemented for hesapciyiz.com:

http {
    # Docker's internal DNS server and 5-second TTL
    resolver 127.0.0.11 valid=5s; 

    # Alternatively, public DNS (cannot resolve Docker service names):
    # resolver 8.8.8.8 8.8.4.4 valid=30s; # Google DNS

    server {
        listen 80;
        server_name hesapciyiz.com www.hesapciyiz.com;

        location /api/ {
            # To enable dynamic resolution, the hostname must be assigned to a variable.
            # When Nginx sees a variable in proxy_pass, it uses the resolver.
            set $upstream_docker_service "my-api-service:8000"; 

            proxy_pass http://$upstream_docker_service;
            
            # Standard proxy_set_header directives
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            
            # Connection and read timeout settings
            proxy_connect_timeout 5s;
            proxy_send_timeout 5s;
            proxy_read_timeout 15s;
        }

        # Other location blocks (e.g., static files or other applications)
        # ...
    }
}

There are several critical points in this configuration:

resolver 127.0.0.11 valid=5s;: This directive tells Nginx to use Docker’s own DNS server to resolve service names within the Docker network and to keep this resolution in the cache for only 5 seconds. This ensures that even if the my-api-service container restarts and its IP changes, Nginx will learn the new IP address within 5 seconds at the latest.
set $upstream_docker_service "my-api-service:8000";: This step is very important. If Nginx sees a literal hostname in the proxy_pass directive (proxy_pass http://my-api-service:8000;), it resolves that hostname once, when the configuration is loaded. However, when proxy_pass contains a variable ($upstream_docker_service), Nginx resolves the resulting hostname at request time through the configured resolver, reusing the cached answer until its valid period expires. This is a design “quirk” of Nginx and is a detail that must be known for dynamic DNS resolution.
proxy_connect_timeout, proxy_send_timeout, proxy_read_timeout: Adding these timeout settings can help reduce connection issues, especially when Docker containers are just starting up or are under heavy load. The initial connection can sometimes be slow, so it’s good to keep these timeouts a bit higher. Due to resource intensity (CPU, RAM) on my VPS, containers can sometimes respond slowly, so these timeouts can be lifesavers.
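Since the variable trick is easy to forget in one location block out of many, a quick scan of the configuration can help. Below is a rough Python sketch under simplifying assumptions: it treats the file as plain lines, strips anything after a # as a comment, and flags every proxy_pass without a variable (including ones that point at a plain IP, which do not need DNS). The function names and the /legacy/ location in the sample are hypothetical.

```python
import re


def find_static_proxy_passes(config_text):
    """Return proxy_pass targets that contain no variable (resolved once at load time)."""
    findings = []
    for line in config_text.splitlines():
        code = line.split("#", 1)[0]  # naive comment stripping
        match = re.search(r"\bproxy_pass\s+([^;]+);", code)
        if match and "$" not in match.group(1):
            findings.append(match.group(1).strip())
    return findings


def has_resolver(config_text):
    """True when a resolver directive appears outside of comments."""
    for line in config_text.splitlines():
        code = line.split("#", 1)[0]
        if re.search(r"^\s*resolver\s+\S+", code):
            return True
    return False


sample = """
http {
    resolver 127.0.0.11 valid=5s;
    server {
        location /api/ {
            set $upstream_docker_service "my-api-service:8000";
            proxy_pass http://$upstream_docker_service;
        }
        location /legacy/ {
            proxy_pass http://my-api-service:8000;  # resolved once at startup
        }
    }
}
"""

static_targets = find_static_proxy_passes(sample)
resolver_declared = has_resolver(sample)
```

Here static_targets contains only the /legacy/ target, since the /api/ block goes through the variable. This is a line-based heuristic rather than a real Nginx config parser, so treat its output as a hint.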

After applying these changes, Nginx needs to be reloaded or restarted.

`systemd` and Nginx Restart Strategies

When you make a change in the Nginx configuration file, you need to reload or restart the Nginx service for those changes to take effect. Two basic commands come into play here:

sudo systemctl reload nginx: This command applies the new configuration with zero downtime. The master process checks the new configuration and starts new worker processes with it, while the old workers stop accepting new connections, finish handling their active ones, and then exit. New connections are handled by the new workers. In most cases, especially for small configuration changes, this is preferred. However, until the old workers exit, they keep using the configuration and resolved addresses they started with, so a change such as adding the resolver directive may not take effect for long-lived connections right away.
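A common safeguard is to validate the configuration before reloading. The sketch below wraps that idea in Python, assuming nginx and systemctl are on the PATH and the script runs with sufficient privileges; validate_then_reload is a hypothetical helper, and the run parameter exists only so the commands can be replaced when experimenting.

```python
import subprocess


def validate_then_reload(run=subprocess.run):
    """Run `nginx -t`; reload via systemd only if the configuration test passes."""
    test = run(["nginx", "-t"], capture_output=True, text=True)
    if test.returncode != 0:
        # Configuration is invalid: leave the running Nginx untouched.
        return False
    run(["systemctl", "reload", "nginx"], check=True)
    return True
```

If the configuration test fails, the function returns False without touching the running service, so a typo in nginx.conf does not turn into an outage.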