Mustafa Erbay
Tutorials · Written by human · 8 min read

Self-Healing Services with systemd Watchdog

Reduce 'stuck but not dead' failures with systemd WatchdogSec + notify: unit configuration, restart policy, and alarm integration.

One of the most dangerous failure classes in production is “the process is alive but not doing any work.” CPU is low, the port is open, the health check is green… yet the queue grows, latency creeps up, and the system “dies slowly.” Restart=always alone cannot catch this kind of stall, because systemd only restarts a service after its process exits, and a stalled process never does.

The systemd watchdog is a solid tool against precisely this class: if the service doesn’t emit an “I’m alive” signal within the configured interval, systemd treats it as failed, terminates it (SIGABRT by default), and restarts it if the unit’s Restart= policy allows.

What does the watchdog solve and what doesn’t it solve?

What it solves:

  • deadlocks / event loop lockups
  • “infinite hang” while waiting on an external dependency
  • loss of “progress” inside the service

What it doesn’t solve:

  • incorrect business logic (the service replies wrong but is still “alive”)
  • dependency issues (if the DB is down, restart loops can make things worse)

Baseline unit file (Type=notify + WatchdogSec)

For the watchdog to work, your service has to send READY=1 once it is up and WATCHDOG=1 at regular intervals via sd_notify. The systemd-side baseline:

[Unit]
Description=My Service (with watchdog)
After=network-online.target
Wants=network-online.target

# Start rate limiting is configured in [Unit], not [Service]
StartLimitIntervalSec=120s
StartLimitBurst=10

[Service]
Type=notify
NotifyAccess=main

ExecStart=/usr/local/bin/my-service

# Fail the service if no WATCHDOG=1 arrives within 30 seconds
WatchdogSec=30s
# on-failure also covers watchdog timeouts
Restart=on-failure
RestartSec=3s

TimeoutStartSec=30s
TimeoutStopSec=20s

[Install]
WantedBy=multi-user.target

What this unit is saying:

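  • if the service sends no “WATCHDOG=1” for more than 30 seconds, systemd aborts it and, through Restart=on-failure, starts it again after 3 seconds
  • if it crashes (non-zero exit or fatal signal), it gets restarted the same way
  • if it is started more than 10 times within 120 seconds, systemd stops restarting it and leaves the unit in the failed state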

Service side: how do we generate the health signal?

Two practical approaches:

1) Adding notify inside the application (the most correct path)

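The thread/event loop that does the real work calls sd_notify("WATCHDOG=1") at regular intervals, commonly every half of WatchdogSec; systemd passes the configured timeout to the service in the WATCHDOG_USEC environment variable. Send the READY signal only at the moment “I can really take traffic now.”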
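For services without a native libsystemd binding, the notification can be sent directly over the datagram socket that systemd exposes in NOTIFY_SOCKET. The following is a minimal sketch, not a definitive implementation: the helper names sd_notify and watchdog_interval_seconds are illustrative, and it assumes a Linux host.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one notification datagram to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract namespace socket.
        address = b"\0" + path[1:].encode("utf-8")
    else:
        address = path.encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


def watchdog_interval_seconds(default: float = 0.0) -> float:
    """Return half of WATCHDOG_USEC in seconds, or `default` if it is unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return default
    return int(usec) / 1_000_000 / 2
```

A typical use is to call sd_notify("READY=1") once startup has finished, then call sd_notify("WATCHDOG=1") from inside the main work loop at least every watchdog_interval_seconds(). Pinging from the loop that does the work, rather than from a separate timer thread, is what lets a stalled loop go silent.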

2) The wrapper-based partial approach (limited but sometimes enough)

If the application can’t be modified, a wrapper process can send the notifications on its behalf, for example only while a probe of the application succeeds. This is limited: the wrapper sees the application from outside, so it can miss stalls that the probe does not exercise. The long-term goal is still for the application itself to support notify.
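One possible shape for such a wrapper is sketched below. Everything here is an assumption for illustration: supervise is a hypothetical function, probe stands for whatever external check fits the application (an HTTP request, a queue-depth query), and notify is any callable that delivers a message to systemd.

```python
import subprocess
import time
from typing import Callable, Sequence


def supervise(
    cmd: Sequence[str],
    probe: Callable[[], bool],
    notify: Callable[[str], object],
    interval: float = 10.0,
) -> int:
    """Run cmd as a child and forward liveness only while probe() passes."""
    child = subprocess.Popen(cmd)
    ready_sent = False
    while True:
        code = child.poll()
        if code is not None:
            return code  # child exited; hand its exit status to the caller
        if probe():
            if not ready_sent:
                # First successful probe: the service can take traffic.
                notify("READY=1")
                ready_sent = True
            notify("WATCHDOG=1")
        # A failing probe sends nothing, so WatchdogSec eventually expires.
        time.sleep(interval)
```

Because the wrapper is the process systemd starts, it is the main process of the unit and NotifyAccess=main is enough. The interval should stay well below WatchdogSec, and a probe that itself hangs has the same effect as a failing one: no ping is sent.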

Operational tuning: keep the restart loop in check

Once the watchdog kicks in, the first side effect is usually “noise”: restarts and alerts that nobody acts on. Put these controls in place:

  • a brake via StartLimit*
  • write the restart reason to the service logs (in the application’s own logs)
  • only escalate the alarm in cases of “too-frequent restarts”
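The “too-frequent restarts” rule can be expressed as a sliding-window count. This is a sketch under stated assumptions: should_escalate is a hypothetical helper, the restart timestamps are assumed to come from your own monitoring, and the 10-minute window and threshold of 3 are placeholder values, not recommendations.

```python
from typing import Iterable


def should_escalate(
    restart_times: Iterable[float],
    now: float,
    window_s: float = 600.0,
    threshold: int = 3,
) -> bool:
    """Return True when at least `threshold` restarts fall in the last window."""
    recent = [t for t in restart_times if 0 <= now - t <= window_s]
    return len(recent) >= threshold
```

A single restart then stays a log entry, while a cluster of restarts inside the window raises the alarm.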

Sample triage commands:

systemctl status my-service --no-pager
journalctl -u my-service -n 200 --no-pager
systemctl show my-service -p NRestarts -p RestartUSec -p WatchdogUSec
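To feed these values into a script or a metrics exporter, the KEY=VALUE output of systemctl show can be parsed directly. The helper names below are illustrative; the sketch assumes systemctl is on PATH when show_properties is called.

```python
import subprocess
from typing import Dict, Sequence


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "="
            props[key] = value
    return props


def show_properties(unit: str, names: Sequence[str]) -> Dict[str, str]:
    """Run `systemctl show` for the given properties and return them as a dict."""
    cmd = ["systemctl", "show", unit]
    for name in names:
        cmd += ["-p", name]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_show_output(result.stdout)
```

For example, show_properties("my-service", ["NRestarts", "WatchdogUSec"]) would return the restart counter and the configured watchdog timeout as strings.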

Wiring alarms/automation via OnFailure

A more advanced but very useful pattern: triggering an automated action when the service fails (ticket, Slack, script). Example:

[Unit]
OnFailure=my-service-failure@%n.service

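The goal here isn’t to “panic on every restart”; it’s to make recurring failures visible. A service that uses Restart= enters the failed state, and so triggers OnFailure=, only after its start limit is reached. The referenced my-service-failure@.service template unit has to exist; %n expands to the full name of the failing unit.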
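What the failure unit actually runs is up to you. As one hedged illustration, a small handler script could turn the failing unit's name into a JSON payload for a chat or ticketing webhook. The function names, the payload fields, and the assumption that the template passes its instance name (%i) as the first argument are all hypothetical.

```python
import json
import socket
import sys
from typing import Sequence


def build_alert(failed_unit: str, hostname: str) -> str:
    """Build a JSON alert body describing the failed unit."""
    payload = {
        "host": hostname,
        "text": f"{failed_unit} entered the failed state on {hostname}",
        "unit": failed_unit,
    }
    return json.dumps(payload, sort_keys=True)


def main(argv: Sequence[str]) -> int:
    """Print the alert for the unit named in argv[1]; delivery is left out."""
    if len(argv) < 2:
        print("usage: handler <failed-unit>", file=sys.stderr)
        return 2
    print(build_alert(argv[1], socket.gethostname()))
    return 0
```

A script built on this would end with sys.exit(main(sys.argv)) and send the printed body to the alerting endpoint of your choice.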

Wrap-up

The watchdog reduces the “didn’t die but got stuck” class of failures in production and lowers MTTR. Success isn’t just turning on WatchdogSec; it’s defining the READY/WATCHDOG signals correctly, putting a brake on restart storms, and tying alarms to the right thresholds. From an operational leadership angle, this also makes “service is up” metrics more honest.


Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security since 2006. On this blog I share technical experience grounded in real-world practice.
