On the network side, “configuration drift” is unavoidable: emergency fixes, vendor differences, on-site pressure… Rather than trying to “ban” drift outright, the sustainable answer is to detect it and remediate it in a controlled way.
In this article I walk through a practical flow:
NetBox (source of truth) → Nornir (execute) → Git PR (approval) → rollout (rings).
Target architecture (minimum viable)
The leanest version of this flow runs on these components:
- NetBox: device/interface/IP/VLAN/tenant inventory
- Git repo: the “desired state” (templates + variables)
- Nornir job:
  - pull the inventory from NetBox
  - fetch running-config from each device
  - render the “expected” output via templates
  - produce a diff (the report)
  - apply after approval (commit + tag)
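The render-and-diff core of that job can be sketched with the Python standard library alone. This is a minimal illustration, not the actual job: the template, the variable names, and the device name `edge-01` are hypothetical, and in a real run the running config would come from the device through Nornir rather than from a string literal.

```python
import difflib
from string import Template

# Hypothetical template; in the flow above, templates and variables live in the Git repo.
NTP_TEMPLATE = Template("ntp server $ntp_primary\nntp server $ntp_secondary\n")


def render_expected(variables: dict) -> str:
    """Render the desired-state snippet from template variables."""
    return NTP_TEMPLATE.substitute(variables)


def config_diff(running: str, expected: str, device: str) -> list:
    """Return a unified diff from running to expected config; an empty list means no drift."""
    return list(
        difflib.unified_diff(
            running.splitlines(),
            expected.splitlines(),
            fromfile=f"{device}/running",
            tofile=f"{device}/expected",
            lineterm="",
        )
    )


expected = render_expected({"ntp_primary": "10.0.0.1", "ntp_secondary": "10.0.0.2"})
running = "ntp server 10.0.0.1\nntp server 192.0.2.99\n"  # stand-in for a fetched config
diff = config_diff(running, expected, "edge-01")
print("\n".join(diff))
```

Lines prefixed with `-` exist on the device but not in the desired state, and lines prefixed with `+` are the reverse, so the diff doubles as a preview of what an apply would change.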
Step 1 — Make the NetBox inventory “automation-friendly”
Practices that smooth out the drift flow on the NetBox side:
- Assign roles to devices (core/edge/access)
- Use site/region fields consistently
- Align the VLAN/VRF model with what’s actually deployed
- Add an “automation ring” custom field (canary/pilot/prod)
The goal: be able to slice the Nornir inventory by these fields (role, site, ring) as well as by tags.
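To make the slicing idea concrete, here is a small sketch over plain dictionaries. The device records and the field names (`role`, `site`, `ring`) are assumptions about how the NetBox data might be flattened; the same selection would normally be expressed through Nornir's own inventory filtering.

```python
# Hypothetical device records, shaped roughly like a flattened NetBox export.
DEVICES = [
    {"name": "core-01", "role": "core", "site": "fra1", "ring": "prod"},
    {"name": "edge-01", "role": "edge", "site": "fra1", "ring": "canary"},
    {"name": "acc-01", "role": "access", "site": "ams2", "ring": "pilot"},
    {"name": "acc-02", "role": "access", "site": "ams2", "ring": "prod"},
]


def select(devices, **criteria):
    """Return the names of devices whose fields match every given criterion."""
    return [
        d["name"]
        for d in devices
        if all(d.get(key) == value for key, value in criteria.items())
    ]


print(select(DEVICES, ring="canary"))
print(select(DEVICES, role="access", site="ams2"))
```

If a field is filled in inconsistently in NetBox, the device silently drops out of the selection, which is why the consistency practices above matter before any automation runs.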
Step 2 — Nornir inventory: NetBox as the source
Two important details on the Nornir side:
- Start with a read-only NetBox API token (for the report stage)
- Use a separate token/identity for the “apply after approval” stage
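This separation keeps the two risks apart: the identity that can “produce a report” cannot “apply changes,” so a leaked or misused report credential cannot touch device configuration.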
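One way to enforce the split in code is to resolve the credential per stage and refuse to fall back to the other one. This is a sketch under assumptions: the environment variable names are hypothetical, and a real setup would more likely read from a secret store or the CI system.

```python
import os

# Hypothetical variable names; substitute whatever your secret store or CI provides.
TOKEN_ENV_BY_STAGE = {
    "report": "NETBOX_TOKEN_READONLY",
    "apply": "APPLY_STAGE_TOKEN",
}


def token_for_stage(stage: str, env=os.environ) -> str:
    """Return the credential for one stage; never fall back to the other stage's token."""
    if stage not in TOKEN_ENV_BY_STAGE:
        raise ValueError(f"unknown stage: {stage!r}")
    var_name = TOKEN_ENV_BY_STAGE[stage]
    token = env.get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set; refusing to run stage {stage!r}")
    return token


# A report-only environment: the apply stage must fail instead of reusing the read-only token.
report_env = {"NETBOX_TOKEN_READONLY": "example-readonly-token"}
print(token_for_stage("report", report_env))
```

With this shape, a CI job that only has the read-only variable set cannot reach the apply stage by accident.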
Step 3 — Drift report: produce a per-device diff
The report stage aims to answer:
- Which devices are drifting?
- What class of drift is it? (ACL, routing, interface, NTP, SNMP, syslog…)
- Is the drift “expected” (a planned change) or a surprise?
I prefer producing the report in two formats:
- For humans: a Markdown summary plus the most critical diffs
- For machines: JSON (CI gate / metrics)
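The sketch below shows one possible shape for both outputs. It assumes each device's diff is a list of unified-diff lines, and it classifies drift with a naive prefix table. The prefixes, the `planned` set, and the report fields are all hypothetical; real classification needs vendor-specific rules (for example, lines nested under an interface block fall into "other" here).

```python
import json

# Hypothetical prefix-to-class mapping; real configs need vendor-specific rules.
CLASS_PREFIXES = {
    "ntp ": "NTP",
    "snmp-server ": "SNMP",
    "logging ": "syslog",
    "access-list ": "ACL",
    "ip access-list ": "ACL",
    "router ": "routing",
    "interface ": "interface",
}


def classify(diff_lines):
    """Return the sorted drift classes touched by added or removed diff lines."""
    classes = set()
    for line in diff_lines:
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue  # skip file headers, hunk markers, and context lines
        text = line[1:].strip()
        classes.add(
            next((c for p, c in CLASS_PREFIXES.items() if text.startswith(p)), "other")
        )
    return sorted(classes)


def build_report(diffs, planned=()):
    """Build the machine-readable report from {device: diff_lines}."""
    devices = [
        {"device": name, "classes": classify(lines), "expected": name in planned}
        for name, lines in sorted(diffs.items())
        if lines  # an empty diff means the device is not drifting
    ]
    return {"drifting": len(devices), "devices": devices}


def to_markdown(report):
    """Render the human-readable summary table."""
    rows = ["| Device | Classes | Planned |", "| --- | --- | --- |"]
    for d in report["devices"]:
        planned = "yes" if d["expected"] else "no"
        rows.append(f"| {d['device']} | {', '.join(d['classes'])} | {planned} |")
    return "\n".join(rows)


diffs = {
    "edge-01": ["--- edge-01/running", "+++ edge-01/expected", "@@ -1,2 +1,2 @@",
                " ntp server 10.0.0.1", "-ntp server 192.0.2.99", "+ntp server 10.0.0.2"],
    "core-01": [],
}
report = build_report(diffs, planned={"core-01"})
print(json.dumps(report, indent=2))
print(to_markdown(report))
```

A CI gate can then key off the JSON, for example failing the pipeline when any device has `"expected": false`.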
Step 4 — PR workflow: “an approved drift remediation”
Standardize this information inside the PR:
- The list of affected devices (by ring)
- Type of change (routing/ACL/…)
- Expected impact (risk)
- Rollback command/plan
- Change window (if any)
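Standardizing is easier when the PR description is generated and validated rather than typed by hand. The following sketch assumes a plain dictionary of fields; the field names are hypothetical, and the change window is treated as optional because the list above marks it "if any".

```python
REQUIRED_FIELDS = ("devices_by_ring", "change_type", "expected_impact", "rollback_plan")


def render_pr_body(fields: dict) -> str:
    """Render a PR description, refusing to proceed if a required field is missing."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing PR fields: {', '.join(missing)}")
    lines = ["Affected devices:"]
    for ring, devices in fields["devices_by_ring"].items():
        lines.append(f"- {ring}: {', '.join(devices)}")
    lines.append(f"Type of change: {fields['change_type']}")
    lines.append(f"Expected impact: {fields['expected_impact']}")
    lines.append(f"Rollback: {fields['rollback_plan']}")
    # The change window is optional, so it falls back to "none".
    lines.append(f"Change window: {fields.get('change_window') or 'none'}")
    return "\n".join(lines)


body = render_pr_body({
    "devices_by_ring": {"canary": ["edge-01"], "prod": ["acc-02", "core-01"]},
    "change_type": "NTP",
    "expected_impact": "low; no forwarding-plane change",
    "rollback_plan": "re-apply previous NTP servers from the tagged commit",
})
print(body)
```

Rejecting a PR body without a rollback plan at generation time is cheaper than discovering the gap during review.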
Step 5 — Rollout: ring by ring
For rollout discipline, I follow this sequence:
- Canary: 1–3 devices
- Pilot: a small site/tenant
- Prod: the remainder
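The ordering can be encoded so that a rollout plan is rejected when the ring data does not match this sequence. This is a sketch under assumptions: the ring names mirror the custom field values mentioned earlier, and requiring at least one canary device is my reading of the "1–3 devices" guideline, not a rule stated by any tool.

```python
RING_ORDER = ("canary", "pilot", "prod")
MAX_CANARY = 3  # upper bound taken from the canary guideline above


def plan_rollout(ring_by_device: dict) -> list:
    """Return [(ring, [devices])] in rollout order, rejecting invalid ring setups."""
    batches = {ring: [] for ring in RING_ORDER}
    for device, ring in sorted(ring_by_device.items()):
        if ring not in batches:
            raise ValueError(f"{device}: unknown ring {ring!r}")
        batches[ring].append(device)
    if not 1 <= len(batches["canary"]) <= MAX_CANARY:
        raise ValueError(f"canary ring must hold 1 to {MAX_CANARY} devices")
    return [(ring, batches[ring]) for ring in RING_ORDER if batches[ring]]


plan = plan_rollout({
    "edge-01": "canary",
    "acc-01": "pilot",
    "acc-02": "prod",
    "core-01": "prod",
})
for ring, devices in plan:
    print(ring, devices)
```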
Measure at every stage:
- Routing adjacency flap?
- Packet loss / latency?
- ACL hitcount anomaly?
- CPU spike?
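These checks can become a gate between rings by comparing a snapshot taken before the change with one taken after it. The metric names and thresholds below are hypothetical placeholders; how the numbers are collected (SNMP, streaming telemetry, CLI scraping) is outside this sketch.

```python
# Hypothetical thresholds: the maximum allowed increase per metric after a change.
THRESHOLDS = {
    "adjacency_flaps": 0,
    "packet_loss_pct": 0.5,
    "latency_ms": 5.0,
    "acl_deny_hits": 100,
    "cpu_pct": 15.0,
}


def gate(before: dict, after: dict, thresholds=THRESHOLDS) -> list:
    """Return one violation message per metric whose increase exceeds its threshold."""
    violations = []
    for metric, limit in thresholds.items():
        delta = after[metric] - before[metric]
        if delta > limit:
            violations.append(f"{metric}: +{delta:g} exceeds {limit:g}")
    return violations


before = {"adjacency_flaps": 0, "packet_loss_pct": 0.0, "latency_ms": 12.0,
          "acl_deny_hits": 40, "cpu_pct": 20.0}
after = {"adjacency_flaps": 0, "packet_loss_pct": 0.0, "latency_ms": 13.0,
         "acl_deny_hits": 45, "cpu_pct": 50.0}
violations = gate(before, after)
print(violations or "gate passed; continue to the next ring")
```

An empty list means the ring passed and the next one may start; anything else should stop the rollout and trigger the rollback plan from the PR.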
Step 6 — Rollback has to be real
Rollback can’t exist only in theory; it has to be runnable in practice:
- Treat the applied config change like a “transaction”
- If the vendor supports it, lean on commit/confirm features
- Keep the changes small and atomic
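The transaction idea can be sketched as a list of small steps, each paired with its own undo, that are reverted in reverse order when anything fails. This is an illustration with hypothetical callables and an in-memory dictionary standing in for a device; where the platform has a commit-confirm feature, the undo side can be delegated to the device's own confirmation timer instead.

```python
def apply_with_rollback(steps, verify):
    """Apply (apply_fn, undo_fn) steps in order; undo completed steps in reverse on failure.

    Note: a step that raises midway is not itself undone, which is one more
    reason to keep each step small and atomic.
    """
    done = []
    try:
        for apply_fn, undo_fn in steps:
            apply_fn()
            done.append(undo_fn)
        if not verify():
            raise RuntimeError("post-change verification failed")
    except Exception:
        for undo_fn in reversed(done):
            undo_fn()
        return False
    return True


# In-memory stand-in for a device config; real steps would push config via Nornir.
config = {"ntp": "192.0.2.99", "syslog": "192.0.2.50"}
steps = [
    (lambda: config.update(ntp="10.0.0.2"), lambda: config.update(ntp="192.0.2.99")),
    (lambda: config.update(syslog="10.0.0.9"), lambda: config.update(syslog="192.0.2.50")),
]
ok = apply_with_rollback(steps, verify=lambda: False)  # simulate a failed health check
print(ok, config)
```

Because the undo functions are collected only as steps succeed, a failure after the first step reverts exactly what was applied and nothing more.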
Closing: make drift visible first, then reduce it
The first win from this flow is not “less drift,” it is making drift visible. What is visible becomes manageable; what is manageable can be standardized.
A natural next step is to layer “risk scoring” and “automatic maintenance window selection” on top of this flow, as gates that adjust to the class of drift.