High Availability and Split-Brain Runbook with Redis Sentinel

Redis starts in most companies as a “cache”, then quietly begins to carry critical state: rate limit counters, idempotency stores, sessions, feature flags, etc. At that point, “single-node Redis” is no longer a tech debt item; it’s a direct incident trigger.

Sentinel offers a practical HA approach for Redis, but its most dangerous failure mode is split-brain: two nodes acting as master at the same time, or writes landing on the wrong master.

This article doesn’t cover Sentinel installation; it covers the operational runbook of a system that uses Sentinel.

1) Prerequisite: the minimum HA design standard

Minimum safe baseline:

1 master + 1 replica (preferably 2 replicas)
At least 3 Sentinels (across separate failure domains)

On the Sentinel side, the goal is that declaring the master "dead" and choosing a new master never depends on a single node's observation.
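The reason for at least three Sentinels can be sketched as arithmetic. This is a minimal Python sketch, assuming the documented Sentinel behavior that quorum only governs marking the master objectively down, while authorizing the failover itself needs a majority of all known Sentinels. The function name and parameters are illustrative and not part of any Redis API.

```python
def failover_possible(total_sentinels: int, reachable_sentinels: int, quorum: int) -> bool:
    """Return True if the reachable Sentinels could both agree the master is
    down (quorum) and authorize a failover (majority of all Sentinels)."""
    majority = total_sentinels // 2 + 1
    reaches_quorum = reachable_sentinels >= quorum
    reaches_majority = reachable_sentinels >= majority
    return reaches_quorum and reaches_majority


# Three Sentinels, quorum 2: losing one Sentinel still allows a failover.
print(failover_possible(3, 2, 2))  # True
# Two Sentinels, quorum 1: an isolated Sentinel reaches quorum but not majority.
print(failover_possible(2, 1, 1))  # False
```

With only two Sentinels, any partition that isolates one of them blocks failover entirely, which is why three across separate failure domains is the practical floor.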

2) Triage: “Did Redis go down, or was it a failover?”

Symptoms:

Cache miss spike, latency spike
“READONLY You can’t write against a read only replica” error
Application connection errors / timeouts

2.1 Role verification

Quickly verify the master/replica role:

redis-cli -h <redis-host> -p 6379 INFO replication | rg -n -S "role|master_host|master_link_status|connected_slaves"

Expected:

master: role:master
replica: role:slave + master_link_status:up
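The expected values above can be checked mechanically once the INFO replication output is captured. The following is a minimal Python sketch, assuming the output has already been collected as text (for example from the redis-cli command above). The helper names and the sample address 10.0.0.5 are hypothetical.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse INFO output: 'key:value' lines, with '#' section headers skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def role_is_healthy(fields: dict[str, str]) -> bool:
    """A master passes on role alone; a replica also needs its master link up."""
    role = fields.get("role")
    if role == "master":
        return True
    if role == "slave":
        return fields.get("master_link_status") == "up"
    return False


# Hypothetical sample: a replica whose link to the master is down.
sample = "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.5\r\nmaster_link_status:down\r\n"
print(role_is_healthy(parse_info(sample)))  # False
```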

2.2 Status from the Sentinel’s perspective

redis-cli -h <sentinel-host> -p 26379 SENTINEL masters
redis-cli -h <sentinel-host> -p 26379 SENTINEL slaves <master-name>
redis-cli -h <sentinel-host> -p 26379 SENTINEL get-master-addr-by-name <master-name>

Goal:

Does Sentinel give a consistent answer to “who is the master?”
Do different Sentinels show different masters? (suspicion of split-brain)
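Comparing the answers from every Sentinel is easy to automate. This is a minimal Python sketch, assuming the result of SENTINEL get-master-addr-by-name has already been collected from each Sentinel into a dictionary. The Sentinel names and addresses are hypothetical.

```python
from collections import defaultdict


def group_sentinel_views(views: dict[str, tuple[str, int]]) -> dict[tuple[str, int], list[str]]:
    """Group Sentinels by the master address each one reports."""
    groups = defaultdict(list)
    for sentinel, address in views.items():
        groups[address].append(sentinel)
    return dict(groups)


def sentinels_agree(views: dict[str, tuple[str, int]]) -> bool:
    """True only when at least one Sentinel answered and all answers match."""
    return len(group_sentinel_views(views)) == 1


# Hypothetical answers: sentinel-c disagrees with the other two.
views = {
    "sentinel-a": ("10.0.0.5", 6379),
    "sentinel-b": ("10.0.0.5", 6379),
    "sentinel-c": ("10.0.0.6", 6379),
}
print(sentinels_agree(views))  # False
print(group_sentinel_views(views))
```

The grouped output shows which Sentinels are in the minority, which is usually the first hint about where the partition sits.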

3) How is split-brain identified?

Field signals of split-brain:

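Two different nodes report role:master at the same time
Sentinels return different master addresses
Writes fail intermittently with "READONLY" errors depending on which node a client is connected to (MOVED/ASK redirections, by contrast, come from Redis Cluster and do not occur in a Sentinel deployment)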
The cache appears “inconsistent” (some users logged in, some not)
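The first signal in the list is the most direct one to check. The following is a minimal Python sketch, assuming the role reported by each node has already been collected from INFO replication. The node names are hypothetical.

```python
def masters_in(roles: dict[str, str]) -> list[str]:
    """Return the nodes that currently report role:master, sorted by name."""
    return sorted(node for node, role in roles.items() if role == "master")


def split_brain_suspected(roles: dict[str, str]) -> bool:
    """More than one self-declared master is the strongest split-brain signal."""
    return len(masters_in(roles)) > 1


# Hypothetical snapshot taken during an incident.
roles = {"redis-1": "master", "redis-2": "master", "redis-3": "slave"}
print(masters_in(roles))             # ['redis-1', 'redis-2']
print(split_brain_suspected(roles))  # True
```

A single snapshot can catch a failover in progress, so it is worth repeating the check a few seconds apart before declaring split-brain.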

4) Incident response: step by step

4.1 First, stabilize writes (stop the load and damage)

Goal: reduce writes going to the wrong master.

Practical options:

Temporarily restrict the Redis write path in the application (feature flag / degrade)
Pull the cache into “read-only mode” (when possible)
Reduce traffic (rate limit / shed load)

4.2 Decide on the “correct master”

Decision criteria:

Which node holds the most up-to-date data? (replication lag / offset)
Which node has been master longer? (failover timing)
Which node did the application write to? (log/metric)

Operational rule: “least data loss” should be the priority. Even if it’s a cache, some cache data can influence security decisions.
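The "least data loss" rule can be turned into a first-pass ranking. This is a minimal Python sketch under stated assumptions: master_repl_offset is the field from INFO replication, seconds_as_master is a hypothetical value derived from your own failover timeline, and the node names are invented. Offsets on masters that have diverged are not strictly comparable, so treat the ranking as a heuristic that feeds the human decision rather than replacing it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    master_repl_offset: int  # from INFO replication
    seconds_as_master: int   # hypothetical: derived from failover logs


def rank_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Highest replication offset first; ties broken by longer tenure as master."""
    return sorted(
        candidates,
        key=lambda c: (c.master_repl_offset, c.seconds_as_master),
        reverse=True,
    )


# Hypothetical candidates during a split-brain.
ranked = rank_candidates([
    Candidate("redis-1", master_repl_offset=120_500, seconds_as_master=86_400),
    Candidate("redis-2", master_repl_offset=121_900, seconds_as_master=45),
])
print([c.name for c in ranked])  # ['redis-2', 'redis-1']
```

The third criterion, where the application actually wrote, still has to come from application logs and metrics and can overrule this ordering.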

4.3 Block writes on the wrong master

If a node accidentally became master, you may need to demote it back to “replica” behavior.

Example approach (caution: depends on field conditions):

# make the wrong master a replica of the correct master (done in the wrong direction, this can cause data loss)
redis-cli -h <wrong-master> -p 6379 REPLICAOF <correct-master-ip> 6379

Before applying this step:

Reduce traffic
Take a snapshot if possible (RDB/AOF)
Get acceptance for the data loss impact (incident command)

5) Permanent protections: shrink split-brain risk

5.1 Write safety: “min-replicas” protection

On the Redis side, you can make the master refuse writes when it cannot see enough healthy replicas:

min-replicas-to-write 1
min-replicas-max-lag 10

Field interpretation:

If there are no replicas or lag is high, the master refuses writes → produces short-term errors but reduces split-brain damage.
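The rule behind these two settings can be modeled in a few lines. This is a minimal Python sketch of the decision, assuming lag is measured in seconds since each replica's last acknowledgement and that setting either value to 0 disables the protection. It is a model for reasoning about the settings, not Redis source code.

```python
def master_accepts_writes(
    replica_lags_seconds: list[int],
    min_replicas_to_write: int,
    min_replicas_max_lag: int,
) -> bool:
    """Model of the min-replicas check: count replicas whose lag is within
    the allowed maximum and compare against the required minimum."""
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True  # protection disabled
    good_replicas = sum(1 for lag in replica_lags_seconds if lag <= min_replicas_max_lag)
    return good_replicas >= min_replicas_to_write


# With "min-replicas-to-write 1" and "min-replicas-max-lag 10":
print(master_accepts_writes([2], 1, 10))   # True: one replica within 10 s
print(master_accepts_writes([45], 1, 10))  # False: the only replica is lagging
print(master_accepts_writes([], 1, 10))    # False: isolated master, no replicas
```

The last case is the one that matters for split-brain: an old master cut off from its replicas stops taking writes after the lag window instead of silently accumulating data that will be discarded.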

5.2 Tune Sentinel parameters to “reality”

Three critical parameters:

down-after-milliseconds
failover-timeout
quorum

Examples of bad settings:

A down-after-milliseconds value that is too low: unnecessary failovers during short network jitter
Overly aggressive failover timing: flapping and an inconsistent view of the master
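A small lint over the planned values can catch the bad settings above before they reach production. This is a minimal Python sketch. The function name, the longest_benign_stall_ms input (the longest network or GC stall you have observed without a real outage), and the warning wording are all assumptions for illustration.

```python
def sentinel_setting_warnings(
    down_after_ms: int,
    longest_benign_stall_ms: int,
    sentinels: int,
    quorum: int,
) -> list[str]:
    """Return human-readable warnings for risky Sentinel settings."""
    warnings = []
    if down_after_ms <= longest_benign_stall_ms:
        warnings.append(
            "down-after-milliseconds is within observed benign stalls; expect spurious failovers"
        )
    if quorum > sentinels:
        warnings.append(
            "quorum exceeds the number of Sentinels; the master can never be marked objectively down"
        )
    elif quorum < sentinels // 2 + 1:
        warnings.append(
            "quorum is below a majority; a minority of Sentinels can flag the master as down"
        )
    return warnings


# Hypothetical values: 1 s detection against stalls of up to 2 s.
for warning in sentinel_setting_warnings(1000, 2000, sentinels=3, quorum=2):
    print(warning)
```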

5.3 Drills: there’s no confidence without practicing failover

Even one small drill per month makes a difference:

Controlled shutdown of master → measure failover duration
Network partition simulation (staging) → observe quorum behavior
Check whether app-side retries/timeouts “amplify the failover”
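Measuring failover duration during a drill only needs a polling loop. The following is a minimal Python sketch, assuming you supply a get_master callable that returns the current master address as a string (for example by wrapping SENTINEL get-master-addr-by-name). The function name and parameters are hypothetical, and the clock and sleep arguments exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def measure_failover(
    get_master: Callable[[], Optional[str]],
    old_master: str,
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until get_master() returns a non-empty address different from
    old_master. Returns the elapsed seconds, or None if the timeout passes."""
    start = clock()
    while clock() - start < timeout_s:
        current = get_master()
        if current and current != old_master:
            return clock() - start
        sleep(interval_s)
    return None
```

Start the measurement immediately before shutting down the master. The result is only as precise as interval_s, and it reflects Sentinel's view, so compare it with the application's error window to see whether client retries stretched the impact.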

6) Runbook closing: verification

Stability verification:

Is there only a single master?
Are all Sentinels showing the same master?
Has the app error rate returned to normal?

Redis Sentinel, when set up correctly, eases operations; when set up incorrectly, it generates incidents under the name of “automatic failover”. The difference here is operational discipline more than technology: quorum, drills, and a proven runbook.
