One of the most dangerous failure classes in production is “the process is alive but not doing any work.” CPU is low, the port is open, the health check is green… yet the queue grows, latency creeps up, and the system “dies slowly.” It’s hard to catch this kind of stall with Restart=always alone, because the process never actually crashes.
The systemd watchdog is a solid tool against precisely this class: if the service doesn’t emit an “I’m alive” signal at a certain cadence, systemd restarts the service.
What does the watchdog solve and what doesn’t it solve?
What it solves:
- deadlocks / event loop lockups
- “infinite hang” while waiting on an external dependency
- loss of “progress” inside the service
What it doesn’t solve:
- incorrect business logic (the service replies wrong but is still “alive”)
- dependency issues (if the DB is down, restart loops can make things worse)
Baseline unit file (Type=notify + WatchdogSec)
For the watchdog to work, your service has to emit READY/WATCHDOG signals via sd_notify. The systemd-side baseline:
[Unit]
Description=My Service (with watchdog)
After=network-online.target
Wants=network-online.target
# Start rate limiting belongs in [Unit]; current systemd expects it here, not in [Service].
StartLimitIntervalSec=120s
StartLimitBurst=10
[Service]
Type=notify
NotifyAccess=main
ExecStart=/usr/local/bin/my-service
WatchdogSec=30s
Restart=on-failure
RestartSec=3s
TimeoutStartSec=30s
TimeoutStopSec=20s
[Install]
WantedBy=multi-user.target
What this unit is saying:
- if the service sends no “WATCHDOG=1” message for more than 30 seconds, systemd treats that as a failure, terminates the process (SIGABRT by default), and Restart=on-failure brings it back
- if it crashes, it gets restarted
- if it restarts too frequently (burst), systemd applies the brakes
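It is worth checking which restart loops the brake can actually catch. The sketch below is back-of-the-envelope arithmetic on the values from the unit above, under simplifying assumptions: a crash happens immediately after start, a hang begins right after READY, and stop time is ignored. The function name is made up for illustration.

```python
import math


def max_starts_in_window(window_s: float, cycle_s: float) -> int:
    """Count starts at t = 0, cycle_s, 2*cycle_s, ... that fall inside one window."""
    return math.ceil(window_s / cycle_s)


WINDOW_S = 120    # StartLimitIntervalSec
BURST = 10        # StartLimitBurst
RESTART_S = 3     # RestartSec
WATCHDOG_S = 30   # WatchdogSec

# Crash loop: each cycle costs only RestartSec (assumption: instant crash).
crash_loop = max_starts_in_window(WINDOW_S, RESTART_S)

# Hang loop: each cycle costs at least WatchdogSec plus RestartSec.
hang_loop = max_starts_in_window(WINDOW_S, WATCHDOG_S + RESTART_S)

print(crash_loop > BURST)  # True: the start limit stops a fast crash loop
print(hang_loop > BURST)   # False: a pure watchdog loop stays under the limit
```

With these numbers the brake stops fast crash loops, but a service that hangs on every start would be restarted roughly every 33 seconds indefinitely. If that matters, the interval and burst need to be sized against the watchdog cycle, not only against crash cycles.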
Service side: how do we generate the health signal?
Two practical approaches:
1) Adding notify inside the application (the most correct path)
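The thread or event loop that does the real work sends sd_notify("WATCHDOG=1") at regular intervals, ideally at about half of WatchdogSec (systemd passes the configured timeout to the process in the WATCHDOG_USEC environment variable). Send READY=1 only at the moment “I can really take traffic now,” not as soon as the process starts.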
2) The wrapper-based partial approach (limited but sometimes enough)
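If the application can’t be modified, there are patterns that try to produce a “progress” signal from a wrapper process. Two caveats apply: with NotifyAccess=main, systemd only accepts messages from the main process, so a helper that pings on the application’s behalf needs a wider NotifyAccess setting; and a wrapper can only observe the application from the outside, so it can’t prove that real work is progressing. For true watchdog behavior you still need notify integration, so the long-term goal is for the application itself to support notify.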
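Approach 1 can be sketched in Python without extra dependencies, since the notify protocol is a datagram sent to the Unix socket named in NOTIFY_SOCKET. This is a minimal sketch, not a reference implementation: the Progress class, the heartbeat function, and the max_age parameter are hypothetical names introduced here. The key design choice is that the heartbeat pings only while the worker is still reporting progress, so a stuck worker causes silence and the watchdog fires.

```python
import os
import socket
import threading
import time


def sd_notify(message: str) -> bool:
    """Send one datagram to systemd's notify socket; False if none is configured."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not running under systemd with notify enabled
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def watchdog_interval(default: float = 0.0) -> float:
    """Half of WatchdogSec in seconds, read from WATCHDOG_USEC; default if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 1_000_000 / 2 if usec else default


class Progress:
    """Hypothetical progress marker: the worker calls touch() after each unit of work."""

    def __init__(self) -> None:
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def touch(self) -> None:
        with self._lock:
            self._last = time.monotonic()

    def age(self) -> float:
        with self._lock:
            return time.monotonic() - self._last


def heartbeat(progress: Progress, interval: float, max_age: float,
              stop: threading.Event) -> None:
    """Ping systemd every `interval` seconds, but only while progress is fresh."""
    while not stop.wait(interval):
        if progress.age() <= max_age:
            sd_notify("WATCHDOG=1")
        # Otherwise stay silent and let WatchdogSec expire.
```

In a real service, the main function would call sd_notify("READY=1") once listeners and connections are up, start heartbeat in a background thread with interval set to watchdog_interval(), and have the worker call progress.touch() after each processed item. An idle worker should also call touch() on each empty poll, otherwise a quiet queue would look like a hang.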
Operational tuning: keep the restart loop in check
Once the watchdog is active, the first thing that gets worse is noise: every automatic restart is a potential alert. Put these controls in place:
- a brake via StartLimit*
- write the restart reason to the service logs (in the application’s own logs)
- only escalate the alarm in cases of “too-frequent restarts”
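The “too-frequent restarts” rule can be expressed as a sliding window over restart timestamps. The sketch below is one possible shape for that logic; the class name, threshold, and window are assumptions to be tuned per service, and where the timestamps come from (journal, metrics, a hook script) is left open.

```python
from collections import deque


class RestartEscalator:
    """Escalate only when `threshold` restarts fall inside `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one restart; return True if the alarm should escalate."""
        self._events.append(timestamp)
        # Drop restarts that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold


escalator = RestartEscalator(threshold=3, window_s=600)
print(escalator.record(0))     # False: first restart, just log it
print(escalator.record(100))   # False: second restart inside the window
print(escalator.record(200))   # True: third restart within 10 minutes
print(escalator.record(1000))  # False: earlier restarts have aged out
```

A single restart stays a log line; only a cluster of restarts turns into a page.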
Sample triage commands:
systemctl status my-service --no-pager
journalctl -u my-service -n 200 --no-pager
systemctl show my-service -p NRestarts -p RestartUSec -p WatchdogUSec
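The last command prints KEY=VALUE lines, which makes it easy to feed into a script or a metrics exporter. A minimal sketch follows, with hypothetical helper names; the sample text is illustrative, not captured from a real host.

```python
import subprocess


def parse_show(output: str) -> dict[str, str]:
    """Turn `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def read_unit_props(unit: str, names: list[str]) -> dict[str, str]:
    """Run `systemctl show UNIT -p NAME ...` and parse the result."""
    cmd = ["systemctl", "show", unit]
    for name in names:
        cmd += ["-p", name]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_show(result.stdout)


# Illustrative sample in the same KEY=VALUE shape.
sample = "NRestarts=2\nRestartUSec=3s\nWatchdogUSec=30s\n"
props = parse_show(sample)
print(int(props["NRestarts"]))  # 2
```

On a host with systemd, read_unit_props("my-service", ["NRestarts", "RestartUSec", "WatchdogUSec"]) would return the live values for the unit.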
Wiring alarms/automation via OnFailure
A more advanced but very useful pattern: triggering an automated action when the service fails (ticket, Slack, script). Example:
[Unit]
OnFailure=my-service-failure@%n.service
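The goal here isn’t to “panic on every restart”; it’s to make recurring failures visible. That matches how systemd behaves: OnFailure fires when the unit enters the failed state, and a service with Restart= configured only reaches that state once the start limit has been hit.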
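The template unit referenced above needs something to run. As an assumption for illustration, suppose my-service-failure@.service is a Type=oneshot unit whose ExecStart calls a small script with %i (the failed unit’s name) as its only argument. The script path, function names, and message format below are all hypothetical; delivery to a ticket system or chat is deliberately left out.

```python
import subprocess
import sys


def build_alert(unit: str, result: str, restarts: str) -> str:
    """Format a one-line alert for a failed unit."""
    return f"[systemd] {unit} entered failed state: result={result}, restarts={restarts}"


def unit_property(unit: str, name: str) -> str:
    """Read a single property value via `systemctl show --value`."""
    proc = subprocess.run(
        ["systemctl", "show", unit, "-p", name, "--value"],
        capture_output=True, text=True,
    )
    return proc.stdout.strip()


# Runs only when invoked as a script with exactly one argument (the unit name).
if __name__ == "__main__" and len(sys.argv) == 2:
    failed_unit = sys.argv[1]
    print(build_alert(
        failed_unit,
        unit_property(failed_unit, "Result"),
        unit_property(failed_unit, "NRestarts"),
    ))
```

Printing to stdout is enough for a first version, because output from the handler unit lands in the journal. The Result property separates a watchdog timeout from a plain crash, which is the first thing a responder wants to know.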
Wrap-up
The watchdog reduces the “didn’t die but got stuck” class in production and lowers MTTR. Success isn’t just turning on WatchdogSec; it’s defining the READY/WATCHDOG signal correctly, braking the restart-storm risk, and tying alarms to the right thresholds. From an operational leadership angle, this also makes “service is up” metrics more honest.