Most teams leave time synchronization alone on the assumption that “it just works,” until certificate verification breaks, Kerberos sessions drop, log correlation drifts, or ordering problems appear in distributed systems. In production, clock errors are rarely the primary failure; they are a silent multiplier that affects many layers at once.
In this post I describe how I built an enterprise NTP hierarchy with Chrony, and especially how I turned drift/loss conditions into alerts.
Why Chrony?
Chrony provides practical advantages in variable network conditions and in environments like VMs/cloud, where clock drift can be high. The most critical points for me:
- It models offset/drift better
- Operational visibility through chronyc is easy
- Server/client modes are managed clearly
Architecture: not a single layer but a hierarchy
In an enterprise design, think of at least three layers:
- Source layer: External trusted time sources (per organizational policy)
- NTP core: A small number of well-protected Chrony servers in the internal network
- Clients: Servers, devices, cluster nodes
This hierarchy solves two problems: it reduces internet dependency and prevents every client from “going outside.”
Core NTP server: example chrony.conf
The example below is a good starting point for a basic “core” install (file path may differ by distribution):
# Upstream time sources
pool pool.ntp.org iburst maxsources 4
# Local clock as last resort (depends on the ops decision)
local stratum 10
# Allow only internal networks
allow 10.0.0.0/8
allow 192.168.0.0/16
# Hardening
cmdport 0
# Drift and logs
driftfile /var/lib/chrony/drift
logdir /var/log/chrony
log tracking measurements statistics
Notes:
- cmdport 0: Reduces the attack surface by closing Chrony’s UDP command port. Local chronyc access over the Unix domain socket keeps working; if you need remote chronyc for operations, I prefer to enable it only from the management network and in a controlled fashion.
- local stratum 10: Keeps the hierarchy serving time “when everything is cut off,” but if used incorrectly it lets a drifting local clock stand in for true time. Decide based on the organization’s risk appetite.
Client configuration: single target or multiple?
There are two approaches for clients:
- Single core target: simple, but risky during a core failure
- At least 2–3 cores: safer, but management requires a bit more care
Client example:
server ntp-core-1.example.local iburst
server ntp-core-2.example.local iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
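- makestep 1.0 3: Steps the clock when the offset is larger than 1 second, but only during the first three clock updates after chronyd starts; after that, corrections are slewed gradually. In production, large jumps require a more controlled policy, but it’s a lifesaver in the first-boot scenario.
- rtcsync: Periodically copies the system time to the RTC (on Linux the kernel does this roughly every 11 minutes; behavior is platform dependent).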
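To make the makestep semantics concrete, here is a minimal sketch of the step-versus-slew decision. It is a simplified model of the documented behavior, not chronyd's actual code; the function name correction_mode and the 1-based update_number argument are my own assumptions for illustration.

```python
def correction_mode(offset_seconds: float, update_number: int,
                    threshold: float = 1.0, limit: int = 3) -> str:
    """Return "step" or "slew" for a single clock update.

    Simplified model of `makestep threshold limit`:
    - update_number is 1-based (1 = first clock update after start),
    - a negative limit means stepping is always allowed.
    """
    within_limit = limit < 0 or update_number <= limit
    if abs(offset_seconds) > threshold and within_limit:
        return "step"
    return "slew"


if __name__ == "__main__":
    # A VM that boots 2.5 s off: stepped early, slewed later.
    for n in (1, 3, 4):
        print(n, correction_mode(2.5, n))
```

The point of the model: a large offset at the fourth update or later is no longer jumped away, so it has to be caught by alerting instead.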
Firewall and segmentation
Minimum network rules:
- UDP/123 only from client networks to core NTP
- Direct outbound NTP to the internet from clients is blocked
- Management commands (if any) only from the management segment
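The client-facing part of these rules can be written down as a small, testable policy check before it is translated into real firewall syntax. This is only a sketch: the client networks are taken from the allow lines in the core config above, while the core server addresses (10.0.0.10, 10.0.0.11) and the function name are hypothetical.

```python
import ipaddress

# Client networks, matching the `allow` directives on the core servers.
CLIENT_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Hypothetical addresses of the internal core NTP servers.
CORE_SERVERS = {
    ipaddress.ip_address("10.0.0.10"),
    ipaddress.ip_address("10.0.0.11"),
}


def client_ntp_flow_allowed(src: str, dst: str,
                            protocol: str = "udp", port: int = 123) -> bool:
    """Allow only UDP/123 from a client network to a core NTP server."""
    if protocol != "udp" or port != 123:
        return False
    source = ipaddress.ip_address(src)
    destination = ipaddress.ip_address(dst)
    from_client = any(source in net for net in CLIENT_NETWORKS)
    return from_client and destination in CORE_SERVERS


if __name__ == "__main__":
    print(client_ntp_flow_allowed("10.1.2.3", "10.0.0.10"))    # client to core
    print(client_ntp_flow_allowed("10.1.2.3", "203.0.113.5"))  # client to internet
```

It covers only client traffic; the core servers' own upstream access and the management segment need separate rules.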
Operations: how do you monitor Chrony health?
Two commands give a quick picture of Chrony’s state during an incident:
chronyc tracking
chronyc sources -v
In the tracking output, watch especially:
- Last offset
- RMS offset
- Frequency
- Leap status
From these you can tell whether drift is a “slowly growing” issue or a “source loss.”
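For monitoring, those fields can be pulled out of the tracking output with a few lines of Python. The sketch below assumes the usual "Name : value" layout that chronyc tracking prints; the sample text is illustrative rather than captured from a real host, and the helper names are my own.

```python
import subprocess

# Illustrative sample in the layout `chronyc tracking` prints.
SAMPLE_TRACKING = """\
Reference ID    : CB00710F (ntp-core-1.example.local)
Stratum         : 3
Ref time (UTC)  : Fri Jan 27 09:49:17 2017
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
RMS offset      : 0.000035822 seconds
Frequency       : 3.225 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.129 ppm
Root delay      : 0.013639022 seconds
Root dispersion : 0.001100737 seconds
Update interval : 64.2 seconds
Leap status     : Normal
"""


def read_tracking() -> str:
    """Run `chronyc tracking` locally and return its text output."""
    result = subprocess.run(["chronyc", "tracking"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_tracking(text: str) -> dict:
    """Split each 'Name : value' line on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text: str) -> dict:
    """Keep only the fields that matter for drift alerting."""
    fields = parse_tracking(text)
    return {
        "last_offset_s": float(fields["Last offset"].split()[0]),
        "rms_offset_s": float(fields["RMS offset"].split()[0]),
        "frequency": fields["Frequency"],  # kept as text, e.g. "3.225 ppm slow"
        "leap_status": fields["Leap status"],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_TRACKING))
```

On a real host, summarize(read_tracking()) would replace the sample. If the command port is disabled with cmdport 0, chronyc most likely has to run as a user that can reach chronyd's Unix socket (typically root).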
Practical thresholds for drift alarms
There’s no single “correct threshold,” but the following works as a starting baseline:
- Offset > 50ms: warning (lower for some systems)
- Offset > 200ms: critical (identity/certificate effects may begin)
- Source count < 2: warning
- Leap status is not Normal: critical
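The baseline above can be sketched as a small classifier that a monitoring check could call. The thresholds are the ones listed; everything else is an assumption for illustration: the function names, the sample sources text, and the choice to count only sources marked * (selected) or + (combined) as usable.

```python
SEVERITY_RANK = {"ok": 0, "warning": 1, "critical": 2}

# Illustrative sample in the layout `chronyc sources` prints.
SAMPLE_SOURCES = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp-core-1.example.local      2   6   377    23   -923us[ -924us] +/-   43ms
^+ ntp-core-2.example.local      2   6   377    25   +120us[ +118us] +/-   51ms
^? ntp-core-3.example.local      0   6     0     -     +0ns[   +0ns] +/-    0ns
"""


def count_usable_sources(sources_output: str) -> int:
    """Count source lines whose state is '*' (selected) or '+' (combined)."""
    count = 0
    for line in sources_output.splitlines():
        # Column 1 is the mode (^, =, #), column 2 is the state.
        if len(line) >= 2 and line[0] in "^=#" and line[1] in "*+":
            count += 1
    return count


def classify(offset_seconds: float, source_count: int, leap_status: str,
             warn_offset: float = 0.050, crit_offset: float = 0.200,
             min_sources: int = 2):
    """Return (severity, reasons) using the baseline thresholds."""
    findings = []
    magnitude = abs(offset_seconds)
    if magnitude > crit_offset:
        findings.append(("critical", f"offset {magnitude:.3f}s > {crit_offset}s"))
    elif magnitude > warn_offset:
        findings.append(("warning", f"offset {magnitude:.3f}s > {warn_offset}s"))
    if source_count < min_sources:
        findings.append(("warning", f"only {source_count} usable source(s)"))
    if leap_status.strip().lower() != "normal":
        findings.append(("critical", f"leap status is {leap_status!r}"))
    worst = max((level for level, _ in findings),
                key=SEVERITY_RANK.get, default="ok")
    return worst, [message for _, message in findings]


if __name__ == "__main__":
    usable = count_usable_sources(SAMPLE_SOURCES)
    print(classify(0.060, usable, "Normal"))
```

Keeping the thresholds as parameters makes it easy to run a stricter profile for systems that need a lower warning level.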
Log correlation: how do you catch a clock issue?
Time issues usually come with these symptoms:
- Certificate errors (mTLS/HTTPS)
- “token expired / not yet valid”
- Kerberos skew errors
- “Events from the future” in distributed logs
For this reason, “clock skew” alerting on the SIEM/observability side should draw not just on NTP metrics, but also on application error patterns.
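One of these symptoms, “events from the future,” is easy to check mechanically: network delay can only make a log event arrive later than it was stamped, so an event stamped ahead of its receive time points at the sender's clock (assuming the collector's own clock is correct). The sketch below is hypothetical: the event tuple layout, the function name, and the 200 ms default tolerance (borrowed from the critical offset baseline) are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def hosts_with_future_events(events, tolerance=timedelta(milliseconds=200)):
    """Return {host: largest amount an event was ahead of its receive time}.

    events is an iterable of (host, event_time, received_time) tuples with
    timezone-aware datetimes. Only skews beyond the tolerance are reported.
    """
    worst = {}
    for host, event_time, received_time in events:
        ahead = event_time - received_time
        if ahead > tolerance and ahead > worst.get(host, timedelta(0)):
            worst[host] = ahead
    return worst


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    sample = [
        ("app-1", t0 + timedelta(seconds=3), t0),        # stamped in the future
        ("app-2", t0 - timedelta(milliseconds=40), t0),  # normal delivery delay
        ("app-1", t0 + timedelta(seconds=5), t0),
    ]
    print(hosts_with_future_events(sample))
```

A host that shows up here while its NTP metrics look healthy is a hint that the metrics themselves are stale or missing.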
Conclusion
When you set up NTP with the right hierarchy through Chrony, time synchronization stops being an “invisible risk” and becomes a manageable service. The real difference comes not from the configuration lines, but from binding drift/loss conditions to alerts and runbooks. Reliability in production often starts with the correct design of these “small” services.