Self-Healing Services with systemd Watchdog

One of the most dangerous failure classes in production is “the process is alive but not doing any work.” CPU is low, the port is open, the health check is green… yet the queue grows, latency creeps up, and the system “dies slowly.” It’s hard to catch this kind of stall with Restart=always alone, because the process never actually crashes.

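The systemd watchdog is a solid tool against precisely this class: if the service doesn’t send an “I’m alive” notification within the configured interval, systemd terminates it (with SIGABRT by default) and marks it as failed. Combined with a suitable Restart= setting, the service is then restarted.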

What does the watchdog solve and what doesn’t it solve?

What it solves:

deadlocks / event loop lockups
“infinite hang” while waiting on an external dependency
loss of “progress” inside the service

What it doesn’t solve:

incorrect business logic (the service replies wrong but is still “alive”)
dependency issues (if the DB is down, restart loops can make things worse)

Baseline unit file (Type=notify + WatchdogSec)

For the watchdog to work, your service has to emit READY/WATCHDOG signals via sd_notify. The systemd-side baseline:

[Unit]
Description=My Service (with watchdog)
After=network-online.target
Wants=network-online.target

# Start rate limiting is configured in [Unit], not [Service]
StartLimitIntervalSec=120s
StartLimitBurst=10

[Service]
Type=notify
NotifyAccess=main

ExecStart=/usr/local/bin/my-service

WatchdogSec=30s
Restart=on-failure
RestartSec=3s

TimeoutStartSec=30s
TimeoutStopSec=20s

[Install]
WantedBy=multi-user.target

What this unit is saying:

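if the service doesn’t send “WATCHDOG=1” for more than 30 seconds, systemd aborts it and Restart=on-failure brings it back after 3 seconds
if it crashes (non-zero exit, fatal signal, or timeout), it gets restarted the same way
if it is started more than 10 times within 120 seconds, systemd applies the brakes: further starts are refused and the unit stays failed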
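To make the 30-second budget concrete on the service side, here is a minimal sketch of how a process might derive its ping cadence. It relies on the WATCHDOG_USEC and WATCHDOG_PID environment variables that systemd sets when WatchdogSec= is configured; the function name watchdog_ping_interval is hypothetical, and pinging at half the timeout follows the usual recommendation rather than anything this unit file enforces.

```python
import os


def watchdog_ping_interval(environ=os.environ):
    """Return seconds between WATCHDOG=1 pings, or None if the watchdog is off.

    systemd exports WATCHDOG_USEC (in microseconds) when WatchdogSec= is set.
    """
    raw = environ.get("WATCHDOG_USEC")
    if not raw:
        return None
    pid = environ.get("WATCHDOG_PID")
    if pid and int(pid) != os.getpid():
        # The watchdog was enabled for a different process (e.g. our parent).
        return None
    # Ping at half the timeout so one delayed ping does not trigger a restart.
    return int(raw) / 1_000_000 / 2
```

With WatchdogSec=30s this yields a ping every 15 seconds. Reading the value from the environment keeps the application in step with the unit file when the timeout is tuned later.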

Service side: how do we generate the health signal?

Two practical approaches:

1) Adding notify inside the application (the most correct path)

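The thread or event loop that does the service’s real work emits sd_notify("WATCHDOG=1") at regular intervals, ideally at about half of WatchdogSec. Sending the ping from the work loop itself, rather than from a separate timer thread, matters: if the loop stalls, the pings stop too. The READY signal should be sent at the moment “I can really take traffic now.”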
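As an illustration, the notification protocol is small enough to sketch with the standard library alone: systemd passes a Unix datagram socket path in NOTIFY_SOCKET, and the service writes strings such as READY=1 or WATCHDOG=1 to it. The helper names sd_notify and run below, and the step callback, are assumptions made for this sketch and not part of any particular library.

```python
import os
import socket
import time


def sd_notify(state, environ=os.environ):
    """Send a state string such as "READY=1" to systemd.

    Returns False when NOTIFY_SOCKET is unset (not started by systemd).
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract socket (leading NUL byte).
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def run(step, ping_every, environ=os.environ):
    """Call step() until it returns False, pinging from the same loop.

    Call this only once setup is finished, since READY=1 is sent first.
    """
    sd_notify("READY=1", environ)
    last_ping = time.monotonic()
    while step():
        now = time.monotonic()
        if now - last_ping >= ping_every:
            # Reached only if step() returned, i.e. the loop is making progress.
            sd_notify("WATCHDOG=1", environ)
            last_ping = now
```

One design consequence: step() needs to return regularly even when idle (for example, by waiting on its queue with a timeout shorter than ping_every), otherwise a quiet but healthy service would be restarted.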

2) The wrapper-based partial approach (limited but sometimes enough)

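If the application can’t be modified, a wrapper process can send the notifications on its behalf, ideally only after checking some external sign of “progress.” This is limited: a wrapper that pings unconditionally only proves that the wrapper is alive, not that the application is doing work. For true watchdog behavior you still need notify integration, so the long-term goal is for the application itself to support notify.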
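One possible progress probe for such a wrapper, sketched under a strong assumption: the unmodified application already updates some file as it works (a log or spool file, for instance). The function name file_is_fresh and the idea of using file modification time are illustrative choices, not something the application or systemd provides.

```python
import os
import time


def file_is_fresh(path, max_age_s, now=None):
    """Return True if `path` was modified within the last `max_age_s` seconds."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No file at all counts as "no progress".
        return False
    if now is None:
        now = time.time()
    return (now - mtime) <= max_age_s
```

A wrapper would send WATCHDOG=1 only while this returns True. The weakness is visible in the code: it measures file activity, which is a proxy for progress and can stay green while the real work is stuck.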

Operational tuning: keep the restart loop in check

Once the watchdog is enabled, the first thing that gets worse is “noise”: extra restarts and the alerts they trigger. Put these controls in place:

a brake via StartLimit*
write the restart reason to the service logs (in the application’s own logs)
only escalate the alarm in cases of “too-frequent restarts”

Sample triage commands:

systemctl status my-service --no-pager
journalctl -u my-service -n 200 --no-pager
systemctl show my-service -p NRestarts -p RestartUSec -p WatchdogUSec
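The “too-frequent restarts” rule can be automated on top of the last command. The sketch below parses the KEY=VALUE lines that systemctl show prints and compares NRestarts against the value seen at the previous poll. The function names and the threshold of 3 are assumptions for illustration, and how the previous value is stored is left open.

```python
def parse_show(text):
    """Parse `systemctl show` KEY=VALUE output into a dict of strings."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def should_escalate(previous_restarts, props, threshold=3):
    """Return True if NRestarts grew by at least `threshold` since the last poll."""
    current = int(props.get("NRestarts", 0))
    return current - previous_restarts >= threshold
```

Comparing against the previous poll matters because NRestarts is a running counter: a single restart last week should not keep an alarm raised today.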

Wiring alarms/automation via OnFailure

A more advanced but very useful pattern: triggering an automated action when the service fails (ticket, Slack, script). Example:

[Unit]
OnFailure=my-service-failure@%n.service

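The goal here isn’t to “panic on every restart”; it’s to make recurring failures visible. This fits how systemd treats the setting: a service that uses Restart= enters the failed state, and therefore triggers OnFailure=, only after the start limit has been reached.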

Wrap-up

The watchdog reduces the “didn’t die but got stuck” class in production and lowers MTTR. Success isn’t just turning on WatchdogSec; it’s defining the READY/WATCHDOG signal correctly, braking the restart-storm risk, and tying alarms to the right thresholds. From an operational leadership angle, this also makes “service is up” metrics more honest.
