One of the most expensive sentences uttered after an incident is this: “The logs never made it.” Network device logs are the evidence layer: they record who logged in, which command was executed, which interface flapped, and which ACL got hit. But in the syslog world, three classic problems show up over and over:
- UDP loss: under load, packets drop and the evidence is gone.
- Log storm: a single failure (e.g. a flap) generates thousands of lines and drowns the pipeline.
- Trust: without TLS, logs can be observed or tampered with in transit; the risk is even bigger on the management network.
In this post, I treat network device logging not as “a setting” but as a resilient architecture.
The goal: uninterrupted answers to three questions
In the field, a good syslog architecture is measured by how it answers three questions:
- When the collector goes down, do logs disappear, or do they queue up?
- During a log storm, does the pipeline collapse, or is it throttled in a controlled way?
- Are the logs transported with encryption and authenticated identity?
TLS: the architecture works even when not every device supports it
In the real world, some network devices simply cannot send syslog over TLS. In that case, there are two practical approaches:
- Local relay: device → (UDP/TCP) → relay in the same segment → (TLS) → central collector
- Out-of-band management: device → (UDP/TCP) → collector over the management network, with tight ACLs
Using TLS (preferably mTLS) on the relay reduces the “eavesdropping in transit” risk and makes source validation easier on the collector side.
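The relay-to-collector leg could be sketched roughly as below. This is a minimal sketch, not a production relay: it assumes the collector accepts octet-counted frames as described in RFC 5425 on port 6514, and the function names, certificate file paths, and host are hypothetical placeholders.

```python
import socket
import ssl


def frame_octet_counted(message: bytes) -> bytes:
    """Prefix a syslog message with its byte length and a space (RFC 5425 framing)."""
    return str(len(message)).encode("ascii") + b" " + message


def build_mtls_context(ca_file: str, cert_file: str, key_file: str) -> ssl.SSLContext:
    """Client-side context: verify the collector and present the relay's own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Presenting a client certificate is what makes this mutual TLS.
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def forward(messages, host: str, ctx: ssl.SSLContext, port: int = 6514) -> None:
    """Send messages already received from devices to the collector over TLS."""
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            for msg in messages:
                tls.sendall(frame_octet_counted(msg))
```

The relay's certificate gives the collector an identity to check, but it identifies the relay, not the device behind it, so the relay should preserve the original source address or hostname in the message itself.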
Buffering / Queue: what happens when the collector is down?
In production, collector outages are inevitable (maintenance, full disk, network problems). Because of that:
- Use a disk-backed queue on the relay/agent
- Set a maximum disk size and a drop policy for the queue, so a long outage cannot fill the relay's disk
- Alert on “queue is filling up” so that it fires before the “no logs” alarm does
This approach breaks the “collector down → log loss” chain.
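To make the queue behavior concrete, here is a small sketch of a disk-backed spool with a byte cap and a drop-oldest policy. It is an illustration under assumptions, not how any particular syslog daemon implements its queue: the `DiskQueue` class, the SQLite storage, and the choice of drop-oldest are all mine.

```python
import sqlite3


class DiskQueue:
    """FIFO spool on disk with a byte cap and a drop-oldest policy."""

    def __init__(self, path: str, max_bytes: int):
        self.max_bytes = max_bytes
        self.dropped = 0  # expose this as a metric
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS spool "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, msg BLOB NOT NULL)"
        )
        self.db.commit()

    def size_bytes(self) -> int:
        row = self.db.execute(
            "SELECT COALESCE(SUM(LENGTH(msg)), 0) FROM spool"
        ).fetchone()
        return row[0]

    def fill_ratio(self) -> float:
        """Alert on this before the queue is full."""
        return self.size_bytes() / self.max_bytes

    def put(self, msg: bytes) -> None:
        self.db.execute("INSERT INTO spool (msg) VALUES (?)", (msg,))
        # Drop policy: discard the oldest entries until we fit under the cap.
        while self.size_bytes() > self.max_bytes:
            oldest = self.db.execute("SELECT MIN(id) FROM spool").fetchone()[0]
            self.db.execute("DELETE FROM spool WHERE id = ?", (oldest,))
            self.dropped += 1
        self.db.commit()

    def drain(self, send) -> int:
        """Deliver in order; stop at the first failure and keep the rest queued."""
        sent = 0
        rows = self.db.execute("SELECT id, msg FROM spool ORDER BY id").fetchall()
        for row_id, msg in rows:
            if not send(msg):
                break
            self.db.execute("DELETE FROM spool WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

Drop-oldest keeps the most recent events when the cap is reached; drop-newest would instead preserve the start of the incident. Which one fits depends on what the evidence is needed for, so it is worth deciding explicitly rather than accepting a default.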
Log storm: manage the flood without turning it into “noise”
Typical sources of log storms:
- Interface flap (especially fiber/edge)
- Routing adjacency flap
- Authentication attempts (brute force / misconfig)
- ACL hit explosion (DDoS / scan)
Two layers against a log storm:
- Limiting at the source: severity and facility filtering, plus sampling on the device side (where the platform supports it)
- Limiting in the pipeline: per-source rate limit, burst tolerance, separate queues
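The pipeline-side limit can be illustrated with a per-source token bucket. This is a sketch under assumptions: the `PerSourceLimiter` name, the rate and burst values, and keying by source name are illustrative choices, and the clock is injectable only so the behavior can be tested.

```python
import time
from collections import defaultdict


class PerSourceLimiter:
    """Token bucket per source: `rate` messages/second sustained, `burst` at once."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = defaultdict(lambda: burst)  # each source starts with a full bucket
        self.last = {}
        self.suppressed = defaultdict(int)  # count what was dropped, per source

    def allow(self, source: str) -> bool:
        now = self.clock()
        elapsed = now - self.last.get(source, now)
        self.last[source] = now
        # Refill for the time elapsed, capped at the burst size.
        self.tokens[source] = min(self.burst, self.tokens[source] + elapsed * self.rate)
        if self.tokens[source] >= 1:
            self.tokens[source] -= 1
            return True
        self.suppressed[source] += 1
        return False
```

Because each source has its own bucket, one flapping switch exhausts only its own budget and the other devices keep flowing. Emitting the `suppressed` counters periodically keeps the storm itself visible as evidence even when individual lines are dropped.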
Timestamp: no NTP means no syslog
In syslog, time is just as important as the event itself. So:
- Devices should be tied into the NTP/chrony hierarchy
- Time drift alarms should be part of the syslog pipeline
- The gap between ingest time and event time should be observable on the collector side
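The gap between event time and ingest time can be computed on the collector along these lines. A rough sketch: it assumes the device timestamp is ISO 8601 with an explicit offset (as in RFC 5424 style messages), and the function names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_event_time(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def ingest_lag(event_time: datetime, ingest_time: datetime) -> timedelta:
    """Positive lag: the message arrived after the device says it happened."""
    return ingest_time - event_time


def classify_lag(
    lag: timedelta,
    max_lag: timedelta = timedelta(minutes=5),
    max_future: timedelta = timedelta(seconds=2),
) -> str:
    if lag < -max_future:
        return "clock_ahead"  # event "from the future": device clock is wrong
    if lag > max_lag:
        return "late_or_clock_behind"  # queued replay, or device clock lagging
    return "ok"
```

A large positive lag is not always drift: messages drained from a relay queue after an outage legitimately arrive late. That is why it helps to keep both timestamps on the stored record rather than overwriting one with the other.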
A minimum “evidence set”: which logs are critical for incident and audit?
The “let’s collect everything” approach is expensive in production. My minimum evidence set:
- AAA login/logout, failed attempts
- Configuration changes (commit/save, user, source)
- Routing adjacency up/down
- Uplink interface up/down
- CPU/memory critical thresholds (when the device supports it)
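One way to apply such a list in the pipeline is to tag each line with an evidence category and route tagged lines to long-retention storage. The sketch below is only an illustration: the patterns are modeled on Cisco IOS-style message mnemonics and are assumptions, other vendors need their own, and the category names are hypothetical. Narrowing interface events to uplinks would also need an interface inventory, which is not shown.

```python
import re

# Assumed IOS-style mnemonics; adjust per vendor and platform.
EVIDENCE_PATTERNS = {
    "aaa": re.compile(r"%SEC_LOGIN-\d-LOGIN_(SUCCESS|FAILED)"),
    "config_change": re.compile(r"%SYS-\d-CONFIG_I"),
    "routing_adjacency": re.compile(r"%(OSPF-\d-ADJCHG|BGP-\d-ADJCHANGE)"),
    "interface_updown": re.compile(r"%(LINK|LINEPROTO)-\d-UPDOWN"),
}


def classify(line: str):
    """Return the evidence category for a log line, or None if it is not in the set."""
    for category, pattern in EVIDENCE_PATTERNS.items():
        if pattern.search(line):
            return category
    return None
```

Lines that return `None` are not necessarily discarded; they can go to a cheaper tier with shorter retention while the evidence set is kept for audit.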
Test: not once, but as a regular drill
The best way to validate this architecture is a simple drill:
- Disconnect the collector (in a controlled way)
- Generate logs for 10 minutes (e.g. by flapping a test interface)
- Bring the connection back
- Watch the logs “flow back” and observe ordering/corruption behavior
Without this drill, “resilient syslog” is just a belief.
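The "observe" step of the drill can be made measurable if the test messages carry a sequence number. The checker below is a sketch under that assumption: the numbering scheme (1 to N embedded in each test message) and the `check_drill` function are hypothetical, not part of any syslog tool.

```python
def check_drill(sent_count: int, received: list) -> dict:
    """Compare sequence numbers 1..sent_count against the order they arrived in."""
    seen = set()
    duplicates = []
    out_of_order = 0
    highest = 0
    for seq in received:
        if seq in seen:
            duplicates.append(seq)  # replayed more than once
            continue
        seen.add(seq)
        if seq < highest:
            out_of_order += 1  # arrived after a later message
        highest = max(highest, seq)
    missing = [s for s in range(1, sent_count + 1) if s not in seen]
    return {"missing": missing, "duplicates": duplicates, "out_of_order": out_of_order}
```

For example, `check_drill(5, [1, 2, 4, 3, 3])` reports message 5 as missing, message 3 as duplicated, and one out-of-order arrival. Tracking these three numbers from drill to drill turns "the logs flowed back" into something that can regress visibly.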
Closing
For security and operations leadership, the syslog architecture for network devices is a critical “visibility contract”. With TLS, buffering, and log storm management, syslog stops being just an output and becomes an evidence channel you can trust during an incident.