VRRP Failover for the Management Plane with Keepalived

Management plane services are treated as second-class citizens in most organizations, but their failure has outsized impact. When an internal DNS panel, bastion portal, configuration tool, or operations API is tied to a single IP address or a single VM, every maintenance window carries unnecessary risk. Where a full-blown load balancer is not needed but basic continuity is, VRRP failover via Keepalived is a clean and effective option.

Figure: VRRP-based virtual IP failover between two management nodes. The point of VRRP is not load distribution; it is moving the management entry point predictably.

When does it make sense?

This model is particularly useful for services like:

Internal-facing bastion or jump services
Management APIs
Lightweight dashboard or portal components
Tools that aren’t distributed but tolerate little to no downtime

If your application is stateless and scales horizontally well enough to run active-active, you’ll want a different solution, such as a load balancer. Keepalived is better suited to keeping a single entry point available across an active-passive pair.

How is the architecture set up?

The simplest model uses two Linux nodes:

node-a starts as MASTER
node-b is BACKUP
Both nodes sit on the same L2 segment; the MASTER sends VRRP advertisements and the BACKUP listens for them
A shared VIP is assigned to the service

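Under normal conditions, traffic reaches node-a via the VIP. What happens when the health check fails depends on the tracked script's weight: with a negative weight, Keepalived lowers node-a's priority so that node-b wins the election; with no weight set, as in the configuration below, the instance on node-a goes into the FAULT state and releases the VIP. In both cases the VIP moves to node-b.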
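The election logic can be sketched in a few lines of Python. This is a simplified model, not Keepalived's implementation: it assumes a steady state with preemption enabled, and the node addresses and the weight of -60 are hypothetical values chosen for illustration. It shows the two failure modes of a tracked script (FAULT when no weight is set, a priority change when a weight is set) and the VRRP tie-break, where the higher primary IP address wins when priorities are equal.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional


@dataclass
class Node:
    name: str
    address: str           # primary IP on the VRRP interface (hypothetical)
    base_priority: int     # `priority` from the vrrp_instance block
    check_ok: bool = True  # current result of the tracked health check
    weight: int = 0        # `weight` of the tracked script; 0 means not set


def effective_priority(node: Node) -> Optional[int]:
    """Priority the node would advertise, or None if it is in FAULT."""
    if not node.check_ok and node.weight == 0:
        # No weight configured: a failed check means FAULT, not a lower priority.
        return None
    if node.check_ok:
        adjusted = node.base_priority + max(node.weight, 0)  # bonus while healthy
    else:
        adjusted = node.base_priority + min(node.weight, 0)  # penalty while failing
    return max(1, min(254, adjusted))  # keep the result in the range 1-254


def elect_master(nodes: List[Node]) -> Optional[Node]:
    """Highest effective priority wins; ties go to the higher IP address."""
    ranked = [
        (priority, IPv4Address(n.address), n)
        for n in nodes
        if (priority := effective_priority(n)) is not None
    ]
    if not ranked:
        return None  # every node is in FAULT, nobody holds the VIP
    return max(ranked, key=lambda item: (item[0], item[1]))[2]
```

With node-a at priority 150 and node-b at a hypothetical 100, a failed check on node-a hands the VIP to node-b in both modes. With a weight, though, the penalty has to be larger than the priority gap (here more than 50), otherwise node-a stays MASTER despite the failing check.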

A simple configuration example

A minimal configuration for node-a looks like this:

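# Health check tracked by the VRRP instance
vrrp_script chk_mgmt {
  script "/usr/local/bin/check-mgmt.sh"
  interval 2   # run the check every 2 seconds
  fall 2       # 2 consecutive failures mark the check as failed
  rise 2       # 2 consecutive successes mark it as healthy again
}

vrrp_instance VI_MGMT {
  state MASTER            # initial state at startup
  interface eth0
  virtual_router_id 51    # must be unique on the L2 segment
  priority 150            # highest priority wins the election
  advert_int 1            # advertisement interval in seconds
  authentication {
    auth_type PASS
    auth_pass StrongSharedSecret   # only the first 8 characters are used
  }
  virtual_ipaddress {
    10.20.30.50/24
  }
  track_script {
    chk_mgmt              # no weight set: a failed check means FAULT
  }
}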

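On the BACKUP node, the same block is defined with state BACKUP and a lower priority. The critical detail is that the health-check script must validate the service itself, for example by making a real request against it. Checking that the process is running is often not enough, because a process can be alive while it no longer answers requests. Keepalived treats exit code 0 as success and any non-zero exit code as failure.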
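As an illustration of a check that validates the service rather than the process, here is a minimal sketch in Python. The configuration in this post points at a shell script, so treat this as a stand-in that the script could call or that the `script` line could reference instead. The URL, port, and the `ok` marker in the response body are assumptions; substitute whatever your service really exposes.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical endpoint; replace with what the managed service exposes.
HEALTH_URL = "http://127.0.0.1:8080/healthz"

# Ignore proxy environment variables so the local check never leaves the host.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def response_is_healthy(status: int, body: bytes, marker: bytes = b"ok") -> bool:
    """Healthy only if the status is 200 and the body contains the marker."""
    return status == 200 and marker in body


def service_is_healthy(url: str = HEALTH_URL, timeout: float = 1.0) -> bool:
    """Make a real request; any error or timeout counts as unhealthy."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return response_is_healthy(resp.status, resp.read(4096))
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, non-2xx replies, bad responses.
        return False


def main() -> int:
    """Exit code for Keepalived: 0 means healthy, non-zero means failed."""
    return 0 if service_is_healthy() else 1

# When saved as an executable script, end the file with:
#     raise SystemExit(main())
```

The timeout is kept below the 2-second `interval` so that a hung service cannot make check runs pile up. Matching on the body as well as the status code is a design choice: it catches the case where a reverse proxy or error page answers with 200 on behalf of a service that is not ready.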

Which network details get overlooked?

Most problems in VRRP setups come from network behavior, not from the application:

Gratuitous ARP being suppressed by a switch or security layer
MTU or VLAN mismatches
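Another VRRP group on the same segment using the same virtual_router_id
Preemption behavior not matching the requirement: by default a recovered higher-priority node takes the VIP back, which causes a second switchover; nopreempt prevents this but requires the initial state to be BACKUP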

If you want operations without a maintenance window, testing failover behavior in a controlled way before production is mandatory.
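The virtual_router_id collision is the easiest of these to catch before deployment, provided you keep some inventory of VRRP groups. The following Python sketch assumes such an inventory exists as a simple list; the owner names and segment labels are hypothetical. It flags every segment where more than one group uses the same ID.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class VrrpGroup(NamedTuple):
    owner: str    # cluster or team that runs the group (hypothetical labels)
    segment: str  # VLAN or L2 segment the group advertises on
    vrid: int     # virtual_router_id, valid range 1-255


def find_vrid_collisions(
    groups: Iterable[VrrpGroup],
) -> Dict[Tuple[str, int], List[str]]:
    """Map (segment, vrid) to its owners wherever more than one owner exists."""
    owners = defaultdict(set)
    for group in groups:
        if not 1 <= group.vrid <= 255:
            raise ValueError(f"{group.owner}: vrid {group.vrid} is out of range")
        owners[(group.segment, group.vrid)].add(group.owner)
    # The same ID on different segments is fine; only shared segments collide.
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}
```

For example, an inventory with `mgmt-pair` and `edge-fw` both on `vlan30` with ID 51 would be reported, while a third group using ID 51 on `vlan40` would not.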

Which controls should you add for operations?

Once the setup is complete, these signals must be monitored:

VRRP state transitions
Health check failure counts
VIP transition duration
Application response after failover
ARP table convergence latency
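The first signal, VRRP state transitions, can be captured with Keepalived's `notify` directive, which runs a script on every transition and passes it the type (`GROUP` or `INSTANCE`), the name, the new state, and the priority. A minimal Python sketch of such a handler follows; the JSON field names are my own choice, and where the line ends up (a file, syslog, a metrics pipeline) is left open.

```python
import json
import time
from typing import List, Optional


def transition_record(argv: List[str], now: Optional[float] = None) -> str:
    """Turn Keepalived notify arguments into one JSON log line.

    Expected arguments: TYPE NAME STATE [PRIORITY]
    """
    if len(argv) < 3:
        raise ValueError("expected: TYPE NAME STATE [PRIORITY]")
    kind, name, state = argv[0], argv[1], argv[2]
    priority = int(argv[3]) if len(argv) > 3 else None
    record = {
        "ts": round(time.time() if now is None else now, 3),
        "type": kind,       # "GROUP" or "INSTANCE"
        "instance": name,   # e.g. "VI_MGMT"
        "state": state,     # e.g. "MASTER", "BACKUP", "FAULT"
        "priority": priority,
    }
    return json.dumps(record, sort_keys=True)

# As a notify script, the last line would be something like:
#     print(transition_record(sys.argv[1:]))   # after `import sys`
```

Counting these lines per instance and state gives the transition rate. A node that flaps between MASTER and BACKUP shows up immediately, which is usually a sign of a marginal health check or a preemption setting that does not fit.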

Without these metrics, claiming you’ve built a “highly available” setup is premature.
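VIP transition duration is best measured from the client side, by probing the VIP at a fixed interval during a controlled failover and looking at the gap. The sketch below is one way to do that in Python, assuming the service accepts TCP connections on the VIP; the probe interval and port are up to you. The measured gap can only be as precise as the probe interval, so treat it as an upper bound.

```python
import socket
from typing import Iterable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded)


def probe(host: str, port: int, timeout: float = 0.5) -> bool:
    """One TCP connect attempt against the VIP; any error counts as a failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs: first failed probe to next successful probe."""
    windows: List[Tuple[float, float]] = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends at first success
            start = None
    return windows  # an outage still open at the end is not reported


def longest_outage(samples: Iterable[Sample]) -> float:
    """Longest observed gap in seconds, or 0.0 if no probe failed."""
    return max((end - start for start, end in outage_windows(samples)), default=0.0)
```

A typical run collects `(time.monotonic(), probe(vip, port))` in a loop while you stop the service or Keepalived on the MASTER, then feeds the samples to `longest_outage`. Comparing that number with the timestamps of the state transitions shows how much of the gap is VRRP itself and how much is ARP convergence or application startup.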

Conclusion

VRRP failover for the management plane with Keepalived is a pragmatic way to harden the entry point without standing up an expensive, heavyweight high-availability architecture. Combined with the right health checks, an understanding of network behavior, and measurement, it materially reduces risk, particularly for internal operations services.
