A Pre-Validation Pipeline for Network Changes with Batfish

The most expensive class of network mistake is the change that looks correct but creates a reachability or ACL side effect somewhere else. By the time you catch it in production, the answer is “roll back,” yet in some environments rolling back is risky too (multiple dependencies, simultaneous changes, state).

In this post I describe a flow that has saved me repeatedly in the field:

take a config snapshot
run a question set with Batfish
if the result is not “as expected,” block the merge

Goal: the same six questions for every PR

My minimum set:

With this change, which prefixes became unreachable?
Did a new ACL/route-policy cut off any traffic?
Has default route / next-hop behavior shifted?
Are BGP/OSPF adjacencies behaving as expected?
Has any leak between VRFs appeared?
Did the permission you opened “only here” expand somewhere else?
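One way to keep the six questions identical across PRs is to write them down as data. The sketch below is only an assumption about which Batfish questions fit each item: the names are pybatfish questions (invoked as bf.q.<name>()), but the pairing, and the PR_QUESTIONS and questions_needed names, are mine and should be adapted to your own set.

```python
# Hypothetical mapping from the six PR questions to pybatfish question names.
# The pairing is an assumption for illustration, not a prescribed recipe.
PR_QUESTIONS = {
    "prefixes that became unreachable": ["differentialReachability"],
    "traffic cut by a new ACL/route-policy": ["compareFilters", "searchFilters"],
    "default route / next-hop shift": ["routes"],
    "BGP/OSPF adjacencies": ["bgpSessionStatus", "ospfSessionCompatibility"],
    "leaks between VRFs": ["routes"],
    "permission expanded elsewhere": ["compareFilters", "searchFilters"],
}


def questions_needed(mapping):
    """Return the sorted, de-duplicated Batfish question names to run."""
    return sorted({name for names in mapping.values() for name in names})
```

Keeping the list in one place makes it easy to review in a PR when someone wants to add or drop a question.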

Setup: bring up Batfish in a container

The most practical approach is Docker.

docker run --rm -d --name batfish -p 9997:9997 -p 9996:9996 batfish/allinone

Two pieces are involved (details can vary by version):

service: the analysis engine, which runs inside the container
client: pybatfish, a separate Python package that you use to talk to the service
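Because the container is started in the background, the client can try to connect before the service is listening. A minimal sketch of a wait helper follows, assuming that an open TCP port is a good enough readiness signal (it only proves the port accepts connections, not that Batfish has finished initializing). The helper names are hypothetical; Session is the pybatfish client class.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connect to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def connect(host="localhost"):
    """Open a pybatfish session (imported lazily so wait_for_port has no dependencies)."""
    from pybatfish.client.session import Session

    return Session(host=host)
```

In a script this would be used as wait_for_port("localhost", 9996) followed by connect(), where 9996 is one of the ports published in the docker command above.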

Snapshot structure: config + environment

A Batfish snapshot fundamentally contains:

Device configurations (in the vendor format)
(Optional) interface/status/env data

A suggested repo layout:

network/
  snapshots/
    prod/
      configs/
        r1.cfg
        r2.cfg
      hosts/
        host1.json

This way “prod snapshot” stays fixed; the PR change produces a new snapshot.
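A cheap sanity check before uploading anything is to confirm the directory has the shape shown above. This is a sketch under the assumption that your snapshots follow that layout (a configs/ folder, an optional hosts/ folder with .json files); check_snapshot_layout is a hypothetical helper, not part of pybatfish.

```python
from pathlib import Path


def check_snapshot_layout(snapshot_dir):
    """Return a list of problems with the snapshot directory (empty list means OK)."""
    root = Path(snapshot_dir)
    problems = []

    configs = root / "configs"
    if not configs.is_dir():
        problems.append(f"{configs} is missing")
    elif not any(p.is_file() for p in configs.iterdir()):
        problems.append(f"{configs} contains no config files")

    # hosts/ is optional; if present, expect JSON host files as in the layout above.
    hosts = root / "hosts"
    if hosts.is_dir():
        for p in sorted(hosts.iterdir()):
            if p.suffix != ".json":
                problems.append(f"{p} is not a .json host file")

    return problems
```

If the list comes back empty, the directory can be handed to pybatfish with bf.init_snapshot(path, name="pr", overwrite=True), where the snapshot name "pr" is just an example.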

Question set: 3 critical tests that block the PR

1) Reachability: does the expected flow exist?

Sample question: “Does TCP/5432 work from the app VLAN to the DB VLAN?”

source: app subnet
destination: db subnet
protocol/port: tcp/5432
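The app-to-DB question can be sketched with pybatfish's reachability question. The subnets, the start location, and the ExpectedFlow, header_kwargs, and flow_is_reachable names below are assumptions for illustration; HeaderConstraints, PathConstraints, and bf.q.reachability come from pybatfish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedFlow:
    start_location: str  # Batfish location specifier
    src_ips: str
    dst_ips: str
    protocol: str
    dst_port: int


def header_kwargs(flow):
    """Translate an ExpectedFlow into keyword arguments for HeaderConstraints."""
    return {
        "srcIps": flow.src_ips,
        "dstIps": flow.dst_ips,
        "ipProtocols": [flow.protocol.upper()],
        "dstPorts": str(flow.dst_port),
    }


def flow_is_reachable(bf, flow):
    """Return True if Batfish finds at least one matching flow that is delivered."""
    from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints

    answer = bf.q.reachability(
        pathConstraints=PathConstraints(startLocation=flow.start_location),
        headers=HeaderConstraints(**header_kwargs(flow)),
        actions="SUCCESS",
    ).answer()
    return not answer.frame().empty


# Hypothetical values: replace with your own device, interface, and subnets.
APP_TO_DB = ExpectedFlow(
    start_location="@enter(r1[Vlan10])",
    src_ips="10.10.10.0/24",
    dst_ips="10.20.20.0/24",
    protocol="tcp",
    dst_port=5432,
)
```

Note that this only shows that some flow in the range succeeds. To check that no flow in the range is dropped, the same question can be asked with actions="FAILURE" and an empty result expected.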

2) ACL: did an unexpected deny appear?

In “deny by default” policies especially, a rule placed in the wrong order can produce a major incident.
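To make the ordering problem concrete, here is a toy first-match evaluator in plain Python. It is not how Batfish models ACLs; it is only a sketch with made-up rules showing why the same two lines give opposite results when swapped. On real configs, questions such as searchFilters and filterLineReachability are the ones to look at for this class of problem.

```python
import ipaddress


def first_match(acl, src, dst, dst_port):
    """Return the action of the first rule matching the packet, or 'deny' if none match."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for action, src_net, dst_net, port in acl:
        if (
            src_ip in ipaddress.ip_network(src_net)
            and dst_ip in ipaddress.ip_network(dst_net)
            and (port is None or port == dst_port)
        ):
            return action
    return "deny"  # implicit deny at the end of the list


# The same two rules in two different orders (hypothetical subnets).
GOOD = [
    ("permit", "10.10.10.0/24", "10.20.20.0/24", 5432),
    ("deny", "0.0.0.0/0", "10.20.20.0/24", None),
]
BAD = list(reversed(GOOD))
```

With GOOD, app traffic to the DB on tcp/5432 hits the permit first. With BAD, the broad deny shadows the permit, so the same packet is dropped.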

3) Routing: did next-hop change?

When a BGP local-pref, a route-map, or an IGP metric is touched, traffic can end up taking an unexpected hairpin path.
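A next-hop change can be detected by comparing route tables from the reference snapshot and the PR snapshot. The sketch below works on plain rows (dicts) so it stays independent of the client. I am assuming the rows carry Node, VRF, Network, and Next_Hop_IP keys, which is how I recall the columns of the routes question; verify them against your Batfish version. The function names are hypothetical.

```python
def next_hops_by_route(rows):
    """Group next-hop IPs by (node, vrf, network); ECMP routes yield several next hops."""
    table = {}
    for row in rows:
        key = (row["Node"], row["VRF"], row["Network"])
        table.setdefault(key, set()).add(row["Next_Hop_IP"])
    return table


def next_hop_changes(before_rows, after_rows):
    """Return {route_key: (before_set, after_set)} for routes whose next hops differ."""
    before = next_hops_by_route(before_rows)
    after = next_hops_by_route(after_rows)
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, set())
        new = after.get(key, set())
        if old != new:
            changes[key] = (old, new)
    return changes
```

Rows of this shape could be produced from a pybatfish answer with something like bf.q.routes().answer(snapshot=name).frame().to_dict("records"). A route that exists on only one side shows up with an empty set on the other, which also covers added and withdrawn prefixes.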

CI/CD: a PR gate via GitHub Actions

The flow:

On the PR branch, generate a snapshot (config render / export)
Start the Batfish container
Run the question set
If it fails, the workflow fails → no merge

A simple workflow skeleton:

name: network-precheck
on:
  pull_request:
jobs:
  batfish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start Batfish
        run: docker run --rm -d --name batfish -p 9997:9997 -p 9996:9996 batfish/allinone
      - name: Run questions
        run: |
          python -m pip install --quiet pybatfish
          python scripts/batfish_questions.py --snapshot network/snapshots/pr

I deliberately don’t include the script example in this repo, because every organization’s question set is different. But the template is the same: snapshot + questions + gate.
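The gate part of that template is small enough to sketch without committing to any particular question set. In the sketch below, each check is a hypothetical callable that returns a list of failure messages, and the exit code is what makes the workflow step fail. The run_gate and main names are assumptions, not part of pybatfish.

```python
import sys


def run_gate(checks):
    """Run each (name, check) pair and return a process exit code.

    A check is a zero-argument callable returning a list of failure messages;
    an empty list means the check passed.
    """
    failed = False
    for name, check in checks:
        failures = check()
        if failures:
            failed = True
            for message in failures:
                print(f"FAIL [{name}] {message}")
        else:
            print(f"PASS [{name}]")
    return 1 if failed else 0


def main(checks):
    """Exit non-zero when any check fails so the CI step is marked as failed."""
    sys.exit(run_gate(checks))
```

Running every check before exiting, rather than stopping at the first failure, gives the PR author the full picture in one run.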

Operational tips (the things that make a difference in the field)

Keep the snapshot “close to reality”: use rendered configs (if templating is in play)
Keep the question set small but pick the critical ones (6–10 questions)
Format the failure output to drive action: which prefix, which ACL, which device
Attach the Batfish report to the change ticket (audit trail and trust)
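For the action-oriented output, one option on GitHub Actions is to print each finding as an error annotation so it appears directly on the PR. This is a sketch that assumes the "::error title=...::message" workflow command syntax and its escaping rules; the Finding fields and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check: str    # e.g. "acl", "routing", "reachability"
    device: str   # which device
    subject: str  # which ACL or which prefix
    detail: str   # what went wrong


def _escape_data(text):
    """Escape characters that would break a workflow command message."""
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text):
    """Escape a property value (title); ':' and ',' are also special there."""
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def to_annotation(finding):
    """Render a finding as a GitHub Actions error annotation line."""
    title = _escape_property(f"{finding.check} on {finding.device}")
    message = _escape_data(f"{finding.subject}: {finding.detail}")
    return f"::error title={title}::{message}"
```

The same Finding objects can be dumped to a file and attached to the change ticket, so the PR view and the audit trail come from one source.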

Conclusion

What you’re really doing with Batfish is treating the network as “code” and making each change testable. In large networks especially, this approach lowers change risk and reduces the number of times you say “we noticed it in prod.”
