In enterprise infrastructure, logging always gets squeezed between two extremes:
- “Let’s drop an agent on every host and ship everything out” → cost and complexity grow
- “Just keep it local” → during an incident the evidence vanishes and correlation becomes painful
The systemd ecosystem is already there on most Linux distros. In this post I’ll walk through the practical path I use to ship journald logs centrally with systemd-journal-upload + systemd-journal-remote, secured by mTLS, and run with a disciplined retention/disk budget.
1) Architectural call: where should journal-remote live?
The most stable topology I’ve seen in the field:
- Every host: journald (already there) + systemd-journal-upload
- Center: 2 log gateways (HA) + disk budget + backpressure
That gateway tier is responsible for:
- Terminating mTLS
- Enforcing an allow list (who is even allowed to push logs?)
- Keeping the “raw evidence” in local storage (for incident use)
- Optionally forwarding downstream (Loki/ELK/SIEM)
2) Security: mTLS and identity model
The most common mistake is leaving the log endpoint open because “we’re on the internal network.” Log ingest is an attack surface too.
Minimum model I aim for:
- TLS mandatory at the gateway
- Client cert for identity (mTLS)
- Host identity derived from cert CN/SAN (e.g. host=web-12.prod)
- Rate limit / connection limit (so you survive a log storm without folding)
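To make the identity step concrete, here is a minimal sketch of deriving a host name from a certificate subject and checking it against an allow list. The function names and the allow-list file (one host name per line) are my own placeholders, not part of systemd. Where such a check runs (a fronting proxy, an admission script, the certificate issuance pipeline) is left open, and the parsing is deliberately naive: it does not handle escaped commas inside subject values.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for mapping a client cert to a host identity.

# Print the subject line of a PEM certificate (requires openssl).
subject_of_cert() {
  openssl x509 -in "$1" -noout -subject
}

# Extract the CN from a subject string such as
# "subject=CN=web-12.prod,O=Example" or "subject=CN = web-12.prod, O = Example".
# Exits non-zero when no CN is present.
cn_from_subject() {
  printf '%s\n' "${1#subject=}" \
    | tr ',' '\n' \
    | sed -n 's/^ *CN *= *//p' \
    | head -n 1 \
    | grep .
}

# Succeed only if the host appears as an exact, whole line in the allow list.
is_allowed() {
  local host="$1" allow_file="$2"
  grep -Fxq -- "$host" "$allow_file"
}
```

A typical use would be `host="$(cn_from_subject "$(subject_of_cert client.pem)")" && is_allowed "$host" /etc/loggw/allowed-hosts`, where both paths are illustrative. Matching on whole lines (`-x`) matters here: a prefix match would let `web-12.prod.evil` through.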
3) Setup (high-level steps)
Commands vary by distribution. The point here is the runbook flow, not the exact syntax.
A) Gateway: systemd-journal-remote
- HTTPS listener
- Storage directory
- Certificate/key
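As a rough sketch of those three items, the helper below writes a `journal-remote.conf` into a directory of your choice. The option names are the ones documented for the `[Remote]` section, and the certificate paths shown are placeholders to adapt to your PKI layout. I am assuming the packaged socket-activated unit, which on the distributions I have used listens on port 19532 and writes under /var/log/journal/remote/; verify both on your system.

```shell
#!/usr/bin/env bash
# Sketch only: write a gateway config. Paths are placeholders.
write_remote_conf() {
  local conf_dir="$1"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/journal-remote.conf" <<'EOF'
[Remote]
# One journal file per sending host.
SplitMode=host
# Key and certificate the gateway presents to clients.
ServerKeyFile=/etc/ssl/private/journal-remote.pem
ServerCertificateFile=/etc/ssl/certs/journal-remote.pem
# CA bundle used to verify client certificates (the mTLS part).
TrustedCertificateFile=/etc/ssl/ca/trusted.pem
EOF
}

# Example (run as root on the gateway):
#   write_remote_conf /etc/systemd
#   systemctl enable --now systemd-journal-remote.socket
```

Writing to a parameterized directory makes it easy to render the file into a staging path and diff it before touching /etc.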
Sanity check:
systemctl status systemd-journal-remote
ss -lntp | grep -E "19532|journal" || true
B) Client: systemd-journal-upload
- Gateway URL
- Client certificate
- Retry/backoff
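The client side can be sketched the same way. The `[Upload]` option names below are the documented ones; the gateway URL and certificate paths are placeholders. For retry I am assuming the simplest approach, letting systemd restart the service after a failure via a drop-in. Whether your packaged unit already ships a restart policy depends on the systemd version, so check `systemctl cat systemd-journal-upload` first.

```shell
#!/usr/bin/env bash
# Sketch only: write the client config and a restart drop-in.
write_upload_conf() {
  local conf_dir="$1" gateway_url="$2"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/journal-upload.conf" <<EOF
[Upload]
URL=${gateway_url}
ServerKeyFile=/etc/ssl/private/journal-upload.pem
ServerCertificateFile=/etc/ssl/certs/journal-upload.pem
TrustedCertificateFile=/etc/ssl/ca/trusted.pem
EOF
}

# Fixed-delay retry: restart the uploader 30s after a failed exit.
write_retry_dropin() {
  local dropin_dir="$1"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/10-retry.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=30s
EOF
}

# Example (run as root on a client; the host name is hypothetical):
#   write_upload_conf /etc/systemd https://loggw.example.internal:19532
#   write_retry_dropin /etc/systemd/system/systemd-journal-upload.service.d
#   systemctl daemon-reload && systemctl enable --now systemd-journal-upload
```

A fixed 30-second delay is not true exponential backoff, but it keeps a fleet of clients from hammering a gateway that is down.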
Sanity check:
systemctl status systemd-journal-upload
journalctl -u systemd-journal-upload -n 50 --no-pager
4) Retention: there is no “infinite disk”, only a policy
Retention is the most consequential decision in a centralized log tier:
- How many days of raw logs do we keep? (e.g. 7/14/30)
- What happens when the disk fills up? (drop, rotate, or apply backpressure?)
- Is there a compliance scope? (do we need a separate WORM / S3 Object Lock tier?)
A pragmatic approach:
- Short retention at the gateway (just enough for incident evidence)
- If long retention is required, hand it off to downstream archiving (object storage)
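To turn the retention question into a number, a back-of-the-envelope sketch helps: disk needed is roughly daily ingest times retention days, plus headroom. The helper names and the example figures are mine, purely illustrative. The second function leans on `journalctl --vacuum-time`, which only removes archived journal files, so active files are untouched.

```shell
#!/usr/bin/env bash
# Sketch only: size the gateway disk and enforce a time-based retention.

# Required disk in GiB = daily ingest * days * (1 + headroom%), rounded up.
disk_budget_gib() {
  local daily_gib="$1" days="$2" headroom_pct="$3"
  echo $(( (daily_gib * days * (100 + headroom_pct) + 99) / 100 ))
}

# Drop archived journal files older than N days from the gateway store.
vacuum_remote_store() {
  local dir="$1" days="$2"
  journalctl --directory="$dir" --vacuum-time="${days}d"
}

# Example: 20 GiB/day, 14 days, 30% headroom
#   disk_budget_gib 20 14 30    # prints 364
#   vacuum_remote_store /var/log/journal/remote 14
```

Running the vacuum from a timer gives the "short retention at the gateway" policy a concrete enforcement point, independent of whatever the downstream archive keeps.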
5) Operations: what signals do I actually watch?
- Gateway disk usage + inode usage
- Upload queue/backpressure (where applicable)
- TLS handshake error rate (catches certificate rotation problems)
- Client failed-upload count (the real “evidence loss” risk)
These signals are what “logging is working” actually translates to in practice.
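The disk and inode signals are the easiest to script. Below is a minimal sketch that reads both from `df` and maps them to ok/warn/crit. The thresholds and function names are assumptions of mine, and the parsing assumes a filesystem that reports a numeric inode percentage (some, such as btrfs, print `-` instead).

```shell
#!/usr/bin/env bash
# Sketch only: disk and inode usage for the gateway store.

# Usage percent of the filesystem holding a directory (POSIX output format).
disk_pct()  { df -P  "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'; }
# Inode usage percent (-i is widely supported but not POSIX).
inode_pct() { df -Pi "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'; }

# Map a percentage to a severity given warn and crit thresholds.
classify_pct() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -ge "$crit" ]; then
    echo crit
  elif [ "$value" -ge "$warn" ]; then
    echo warn
  else
    echo ok
  fi
}

# Example:
#   classify_pct "$(disk_pct /var/log/journal/remote)" 80 90
#   classify_pct "$(inode_pct /var/log/journal/remote)" 80 90
```

Inodes deserve their own check because a gateway that splits journals per host can run out of them well before it runs out of bytes.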
6) Incident runbook: when “logs aren’t coming through”
- Client side:
  - Is systemd-journal-upload running?
  - Any TLS errors? (cert expiry, chain issues)
  - Is DNS/route in place? (gateway reachability)
- Gateway side:
  - Service up?
  - Disk full?
  - Are we hitting connection limits?
- Mitigation:
  - If under disk pressure, temporarily tighten retention
  - If a cert is the problem, fall back to a known-good cert immediately
- Permanent fix:
  - Automate certificate rotation
  - Disk budget + alarm
  - Downstream archive (when compliance needs it)
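The client-side checks in this runbook can be scripted as a first-pass triage. This is a sketch under assumptions: the gateway host name, port, and certificate path are placeholders, the reachability probe is a plain TCP connect (it does not validate TLS), and it relies on bash's /dev/tcp plus `timeout` from coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: client-side triage for "logs aren't coming through".

# Is the uploader running?
check_service() {
  systemctl is-active --quiet systemd-journal-upload
}

# Is the client cert still valid N days from now?
# openssl -checkend exits non-zero if the cert expires within the window.
check_cert() {
  local cert="$1" days="${2:-7}"
  openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Can we open a TCP connection to the gateway?
check_reachable() {
  local host="$1" port="${2:-19532}"
  timeout 5 bash -c ">/dev/tcp/$host/$port" 2>/dev/null
}

# Run a check and print a one-line verdict.
report() {
  local label="$1"
  shift
  if "$@"; then echo "OK   $label"; else echo "FAIL $label"; fi
}

# Example (host name and cert path are hypothetical):
#   report "upload service active"  check_service
#   report "client cert valid 7d+"  check_cert /etc/ssl/certs/journal-upload.pem 7
#   report "gateway reachable"      check_reachable loggw.example.internal 19532
```

Keeping each check as its own function means the same pieces can feed a monitoring probe later, not just an on-call terminal.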
Wrap-up
Centralized logging via systemd-journal-remote is a low-friction way to harden your evidence chain without spinning up “yet another agent” project. The real value in the field isn’t in standing the service up — it’s in operating the mTLS identity model, the retention/disk budget, and the incident runbook together as one discipline. Logs aren’t just for debugging; they’re proof of operational reality.