The most expensive class of network mistake is this: the change looks correct, but somewhere it creates a reachability or ACL side effect. By the time you catch it in production, the answer is “roll back,” yet in some environments rolling back is also risky (multiple dependencies, simultaneous changes, state).
In this post I describe a flow that has saved me repeatedly in the field:
- take a config snapshot
- run a question set with Batfish
- if the result is not “as expected,” block the merge
Goal: the same six questions for every PR
My minimum set:
- With this change, which prefixes became unreachable?
- Did a new ACL/route-policy cut off any traffic?
- Has default route / next-hop behavior shifted?
- Are BGP/OSPF adjacencies behaving as expected?
- Has any leak between VRFs appeared?
- Did the permission you opened “only here” expand somewhere else?
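One possible way to pin these six questions down is to map each one to a pybatfish question. The pairing below is an assumption for illustration, not a fixed recipe: the question names exist in pybatfish, but which one fits each check depends on your network and your Batfish version.

```python
# Hypothetical mapping from the six PR questions to pybatfish question names.
# The pairing is an illustration; adjust it to your own checks.
PR_QUESTIONS = {
    "unreachable prefixes": ["differentialReachability"],
    "ACL / route-policy cut-off": ["searchFilters", "compareFilters"],
    "default route / next-hop shift": ["routes"],
    "BGP/OSPF adjacencies": ["bgpSessionStatus", "ospfSessionCompatibility"],
    "VRF leak": ["routes"],  # compare routes per VRF between snapshots
    "permission expanded elsewhere": ["compareFilters"],
}


def questions_to_run(mapping):
    """Return the de-duplicated, sorted list of Batfish questions the set needs."""
    return sorted({name for names in mapping.values() for name in names})
```

Keeping the mapping as data makes it easy to review in the same PR process as the configs themselves.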
Setup: bring up Batfish in a container
The most practical approach is Docker.
docker run --rm -d --name batfish -p 9997:9997 -p 9996:9996 batfish/allinone
You work with two pieces (the details can vary by version):
- service: the analysis engine, running in the container
- client: pybatfish, the Python library you use to talk to the service
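The container needs a little time to start, so the first client call can fail if it runs right after `docker run`. A minimal sketch of a retrying connect follows. The `retry` helper and its defaults are hypothetical. As far as I know, constructing a pybatfish `Session` loads the question templates from the service and therefore fails while the service is still starting, but treat that as an assumption to verify for your version.

```python
import time


def retry(fn, attempts: int = 30, delay_s: float = 2.0):
    """Call fn() until it succeeds; re-raise the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)


def connect(host: str = "localhost"):
    """Open a pybatfish session, retrying while the container is still starting."""
    # Imported lazily so the retry helper stays usable without pybatfish installed.
    from pybatfish.client.session import Session

    return retry(lambda: Session(host=host))
```

Calling `bf = connect()` at the top of a script gives it one place to wait, instead of a fixed `sleep` in the pipeline.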
Snapshot structure: config + environment
A Batfish snapshot fundamentally contains:
- Device configurations (in the vendor format)
- (Optional) interface/status/env data
A suggested repo layout:
network/
  snapshots/
    prod/
      configs/
        r1.cfg
        r2.cfg
      hosts/
        host1.json
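This way the “prod” snapshot stays fixed, and each PR produces a new snapshot next to it (for example network/snapshots/pr, the path the workflow further down passes to the script).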
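A small sanity check on the snapshot directory catches an empty or misplaced `configs/` folder before Batfish is involved at all. The sketch below assumes the layout above. The function names, the network name "prod-network", and the snapshot names "prod" and "pr" are placeholders of mine.

```python
from pathlib import Path


def check_snapshot_layout(snapshot_dir) -> list:
    """Return a list of problems; an empty list means the layout looks usable."""
    root = Path(snapshot_dir)
    problems = []
    configs = root / "configs"
    if not configs.is_dir():
        problems.append(f"{configs} is missing")
    elif not any(p.is_file() for p in configs.iterdir()):
        problems.append(f"{configs} contains no config files")
    return problems


def load_snapshots(bf, prod_dir, pr_dir, network="prod-network"):
    """Upload both snapshots into one Batfish network so they can be compared.

    `bf` is a pybatfish Session; the network and snapshot names are placeholders.
    """
    bf.set_network(network)
    bf.init_snapshot(str(prod_dir), name="prod", overwrite=True)
    bf.init_snapshot(str(pr_dir), name="pr", overwrite=True)
```

Loading both snapshots into the same network is what later lets a question be answered for "pr", for "prod", or for one relative to the other.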
Question set: 3 critical tests that block the PR
1) Reachability: does the expected flow exist?
Sample question: “Does TCP/5432 work from the app VLAN to the DB VLAN?”
- source: app subnet
- destination: db subnet
- protocol/port: tcp/5432
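The sample question above could be expressed with pybatfish's `reachability` question roughly as follows. This is a sketch under assumptions: the subnets, the device and interface in the start location, and the `FlowCheck` helper are all hypothetical, and `bf` is an existing pybatfish Session with a snapshot named "pr" loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowCheck:
    """One flow that must keep working (all values here are illustrative)."""
    name: str
    start_location: str
    src_ips: str
    dst_ips: str
    dst_port: int
    protocol: str = "TCP"

    def describe(self) -> str:
        return (f"{self.name}: {self.src_ips} -> {self.dst_ips} "
                f"{self.protocol.lower()}/{self.dst_port}")


# Hypothetical app and DB subnets; replace with your own addressing.
APP_TO_DB = FlowCheck("app-to-db", "@enter(r1[Vlan10])",
                      "10.0.10.0/24", "10.0.20.0/24", 5432)


def flow_is_delivered(bf, check: FlowCheck, snapshot: str = "pr") -> bool:
    """True if Batfish finds at least one matching flow that is delivered."""
    from pybatfish.datamodel import HeaderConstraints, PathConstraints

    frame = (
        bf.q.reachability(
            pathConstraints=PathConstraints(startLocation=check.start_location),
            headers=HeaderConstraints(
                srcIps=check.src_ips,
                dstIps=check.dst_ips,
                ipProtocols=[check.protocol],
                dstPorts=str(check.dst_port),
            ),
            actions="SUCCESS",
        )
        .answer(snapshot=snapshot)
        .frame()
    )
    return not frame.empty
```

Note that this only proves that some flow in the header space succeeds. To check that no flow in that space fails, one option is to ask the same question with `actions="FAILURE"` and expect an empty result.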
2) ACL: did an unexpected deny appear?
Especially in policies that “deny by default,” a rule placed in the wrong order can silently drop traffic that used to pass and produce a major incident.
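One way to look for an unexpected deny is pybatfish's `searchFilters` question with `action="deny"`, restricted to a header space that must stay permitted. The sketch below is illustrative only. The function names are mine, and the column names read from the answer (`Node`, `Filter_Name`, `Line_Content`) should be checked against the output of your Batfish version.

```python
def find_denies(bf, filters, src_ips, dst_ips, dst_port, snapshot="pr"):
    """Return rows for flows in the given header space that some filter denies."""
    from pybatfish.datamodel import HeaderConstraints

    frame = (
        bf.q.searchFilters(
            headers=HeaderConstraints(
                srcIps=src_ips,
                dstIps=dst_ips,
                ipProtocols=["TCP"],
                dstPorts=str(dst_port),
            ),
            filters=filters,  # filter specifier, e.g. an ACL name
            action="deny",
        )
        .answer(snapshot=snapshot)
        .frame()
    )
    return frame.to_dict("records")


def summarize_denies(rows):
    """Turn answer rows into one line per deny: device, ACL, matching line."""
    lines = []
    for row in rows:
        content = row.get("Line_Content") or "(no matching line)"
        lines.append(f"{row.get('Node', '?')} {row.get('Filter_Name', '?')}: {content}")
    return lines
```

An empty result means no flow in that header space is denied by the selected filters; any row is a candidate for blocking the merge.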
3) Routing: did the next hop change?
When a BGP local-pref, a route-map, or an IGP metric is touched, traffic can end up on a different next hop than before, and an unexpected hairpin can show up.
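A next-hop check can be sketched by pulling the `routes` answer for both snapshots and diffing them in plain Python. The helpers below are hypothetical, and the column names (`Node`, `VRF`, `Network`, `Next_Hop_IP`) are assumptions to verify against your Batfish version; the next-hop column is a parameter for that reason.

```python
def next_hops(rows, hop_column="Next_Hop_IP"):
    """Index route rows as {(node, vrf, prefix): set of next hops}."""
    table = {}
    for row in rows:
        key = (row["Node"], row["VRF"], row["Network"])
        table.setdefault(key, set()).add(str(row[hop_column]))
    return table


def next_hop_changes(before_rows, after_rows, hop_column="Next_Hop_IP"):
    """List (key, old_hops, new_hops) for every route whose next hops differ."""
    before = next_hops(before_rows, hop_column)
    after = next_hops(after_rows, hop_column)
    changes = []
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key, set()), after.get(key, set())
        if old != new:
            changes.append((key, sorted(old), sorted(new)))
    return changes


def route_rows(bf, snapshot):
    """Fetch the main RIB routes of one snapshot as a list of dicts."""
    return bf.q.routes().answer(snapshot=snapshot).frame().to_dict("records")
```

With both snapshots loaded, `next_hop_changes(route_rows(bf, "prod"), route_rows(bf, "pr"))` returns the routes that moved, appeared, or disappeared. Whether a given change is acceptable is still a judgment the PR author has to make, for example through an allow-list of expected changes.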
CI/CD: a PR gate via GitHub Actions
The flow:
- On the PR branch, generate a snapshot (config render / export)
- Start the Batfish container
- Run the question set
- If it fails, the workflow fails → no merge
A simple workflow skeleton:
name: network-precheck
on:
  pull_request:
jobs:
  batfish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start Batfish
        run: docker run --rm -d --name batfish -p 9997:9997 -p 9996:9996 batfish/allinone
      - name: Run questions
        run: |
          python -m pip install --quiet pybatfish
          python scripts/batfish_questions.py --snapshot network/snapshots/pr
I deliberately don’t include the script example in this repo, because every organization’s question set is different. But the template is the same: snapshot + questions + gate.
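The gate half of that template does not depend on the question set, so it can be sketched generically. In the hypothetical skeleton below, each check is a callable that returns a list of failure messages. The names `run_gate` and `main` are mine, not part of any library.

```python
def run_gate(checks):
    """Run each (name, check) pair; a check returns a list of failure strings."""
    failures = []
    for name, check in checks:
        try:
            failures.extend(f"{name}: {msg}" for msg in check())
        except Exception as exc:
            # A check that crashes must block the merge too, not pass silently.
            failures.append(f"{name}: check crashed: {exc}")
    return failures


def main(checks) -> int:
    """Print every failure and return the process exit code for the CI step."""
    failures = run_gate(checks)
    for line in failures:
        print(f"FAIL {line}")
    return 1 if failures else 0
```

A script such as scripts/batfish_questions.py would end with `sys.exit(main(checks))`, so any failure yields a non-zero exit code and the workflow step fails.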
Operational tips (the things that make a difference in the field)
- Keep the snapshot “close to reality”: use rendered configs (if templating is in play)
- Keep the question set small but pick the critical ones (6–10 questions)
- Format the failure output to drive action: which prefix, which ACL, which device
- Attach the Batfish report to the change ticket (audit trail and trust)
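To make the failure output drive action, it helps to give every finding the same shape. The `Finding` record below is a hypothetical structure, not something Batfish produces: it renders one line for the CI log and a JSON document that can be attached to the change ticket.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Finding:
    """One failed check, with enough context to act on it."""
    check: str
    device: str
    detail: str
    prefix: Optional[str] = None
    acl: Optional[str] = None

    def line(self) -> str:
        parts = [f"device={self.device}"]
        if self.prefix:
            parts.append(f"prefix={self.prefix}")
        if self.acl:
            parts.append(f"acl={self.acl}")
        return f"[{self.check}] {' '.join(parts)}: {self.detail}"


def report_json(findings) -> str:
    """Serialize findings for the change ticket (audit trail)."""
    return json.dumps([asdict(f) for f in findings], indent=2)
```

For example, `Finding("acl", "r1", "tcp/5432 denied", acl="APP-IN").line()` names the device and the ACL in a single log line.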
Conclusion
What you’re really doing with Batfish is this: treating the network as “code” and making the change testable. Especially in large networks, this approach catches reachability and ACL side effects before the merge instead of after it, and reduces the number of times you say “we noticed it in prod.”