Redis starts in most companies as a “cache”, then quietly begins to carry critical state: rate limit counters, idempotency stores, sessions, feature flags, etc. At that point, “single-node Redis” is no longer a tech debt item; it’s a direct incident trigger.
Sentinel offers a practical HA approach for Redis, but its most dangerous failure mode is split-brain: two nodes acting as master at the same time, or writes landing on the wrong master.
This article doesn’t cover Sentinel installation; it covers the operational runbook of a system that uses Sentinel.
1) Prerequisite: the minimum HA design standard
Minimum safe baseline:
- 1 master + 1 replica (preferably 2 replicas)
- At least 3 Sentinels (across separate failure domains)
On the Sentinel side, the goal is not to leave the decision of declaring the master “dead” and choosing a new master to a single node’s observation.
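A quick piece of arithmetic shows why three Sentinels is the floor. In Sentinel, the quorum is the number of Sentinels that must agree the master is unreachable, while the failover itself still has to be authorized by a majority of all Sentinels. The helper names below are hypothetical and only illustrate the majority math:

```shell
# sentinel_majority N: smallest number of Sentinels that forms a strict majority of N
sentinel_majority() {
  echo $(( $1 / 2 + 1 ))
}

# sentinel_tolerated_failures N: how many Sentinels can be lost while a majority remains
sentinel_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Example (not executed here):
#   sentinel_majority 3              # 2
#   sentinel_tolerated_failures 3    # 1
#   sentinel_tolerated_failures 2    # 0, so two Sentinels cannot survive losing one
```

On a live deployment, `redis-cli -h <sentinel-host> -p 26379 SENTINEL ckquorum <master-name>` asks a Sentinel whether the quorum and the majority needed for failover are currently reachable.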
2) Triage: “Did Redis go down, or was it a failover?”
Symptoms:
- Cache miss spike, latency spike
- “READONLY You can’t write against a read only replica” error
- Application connection errors / timeouts
2.1 Role verification
Quickly verify the master/replica role:
redis-cli -h <redis-host> -p 6379 INFO replication | rg -n "role|master_host|master_link_status|connected_slaves" -S
Expected:
- master: role:master
- replica: role:slave + master_link_status:up
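During an incident it helps to reduce that output to a one-line verdict. The sketch below is one possible way to do it; `repl_summary` is a hypothetical helper name, and it only reads the `role`, `master_link_status`, and `connected_slaves` fields from `INFO replication` output on stdin. The `tr` step strips the carriage returns that Redis uses as line endings in `INFO` output.

```shell
# repl_summary: read `INFO replication` output on stdin, print a one-line verdict
repl_summary() {
  tr -d '\r' | awk -F: '
    $1 == "role"               { role = $2 }
    $1 == "master_link_status" { link = $2 }
    $1 == "connected_slaves"   { n = $2 }
    END {
      if (role == "master")     printf "master replicas=%s\n", n
      else if (role == "slave") printf "replica link=%s\n", link
      else                      print "unknown"
    }'
}

# Usage against a live node (host is a placeholder, set it yourself):
#   redis-cli -h "$REDIS_HOST" -p 6379 INFO replication | repl_summary
```

A replica that reports `link=down` is a candidate explanation for stale reads even when no failover has happened.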
2.2 Status from the Sentinel’s perspective
redis-cli -h <sentinel-host> -p 26379 SENTINEL masters
redis-cli -h <sentinel-host> -p 26379 SENTINEL slaves <master-name>
redis-cli -h <sentinel-host> -p 26379 SENTINEL get-master-addr-by-name <master-name>
Goal:
- Does Sentinel give a consistent answer to “who is the master?”
- Do different Sentinels show different masters? (suspicion of split-brain)
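Asking each Sentinel the same question and comparing the answers can be scripted. The following is a minimal sketch, assuming you have already collected one `ip:port` line per Sentinel; `sentinel_consensus` is a hypothetical helper, and the output wording (`consistent`, `DIVERGENT`, `no-answer`) is made up for this example.

```shell
# sentinel_consensus: stdin has one "ip:port" answer per Sentinel.
# Prints "consistent <addr>" when all answers match, "DIVERGENT <addr>..." otherwise.
sentinel_consensus() {
  sort -u | awk '
    NF { addrs[++n] = $0 }
    END {
      if (n == 1)      print "consistent " addrs[1]
      else if (n == 0) print "no-answer"
      else {
        printf "DIVERGENT"
        for (i = 1; i <= n; i++) printf " %s", addrs[i]
        printf "\n"
      }
    }'
}

# Collecting the answers (SENTINELS and MASTER_NAME are placeholders you set):
#   for s in $SENTINELS; do
#     redis-cli -h "$s" -p 26379 SENTINEL get-master-addr-by-name "$MASTER_NAME" \
#       | paste -s -d ':' -
#   done | sentinel_consensus
```

A `DIVERGENT` result is a reason to move to the split-brain checks, not proof of split-brain by itself: a Sentinel that is partitioned away can simply be holding an outdated view.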
3) How is split-brain identified?
Field signals of split-brain:
- Two different nodes can claim role:master simultaneously
- Sentinels return different master addresses
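- Write failures mix with confusing behaviors like “read-only replica” errors on a node the client believes is the master (“MOVED/ASK” redirections, by contrast, come from Redis Cluster; seeing them suggests the client is pointed at a cluster-mode endpoint, not a Sentinel-managed one)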
- The cache appears “inconsistent” (some users logged in, some not)
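The first signal, two nodes claiming the master role, is the easiest one to check mechanically. This is a sketch under the assumption that you have gathered one `<host> <role>` line per data node; `split_brain_suspected` is a hypothetical name.

```shell
# split_brain_suspected: stdin has one "<host> <role>" line per data node.
# Exit status 0 when more than one node reports role "master", 1 otherwise.
split_brain_suspected() {
  awk '$2 == "master" { n++ } END { exit (n > 1 ? 0 : 1) }'
}

# Building the input from live nodes (NODES is a placeholder list of hosts):
#   for h in $NODES; do
#     role=$(redis-cli -h "$h" -p 6379 INFO replication | tr -d '\r' \
#              | awk -F: '$1 == "role" { print $2 }')
#     printf '%s %s\n' "$h" "$role"
#   done | split_brain_suspected && echo "more than one master"
```

Nodes that do not answer produce an empty role and are not counted, so an unreachable second master can hide from this check; treat a clean result as weak evidence only.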
4) Incident response: step by step
4.1 First, stabilize writes (stop the load and damage)
Goal: reduce writes going to the wrong master.
Practical options:
- Temporarily restrict the Redis write path in the application (feature flag / degrade)
- Pull the cache into “read-only mode” (when possible)
- Reduce traffic (rate limit / shed load)
4.2 Decide on the “correct master”
Decision criteria:
- Which replica is the most up-to-date? (replication lag / offset)
- Which node has been master longer? (failover timing)
- Which node did the application write to? (log/metric)
Operational rule: “least data loss” should be the priority. Even if it’s a cache, some cache data can influence security decisions.
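For the "most up-to-date" criterion, the `master_repl_offset` field of `INFO replication` is the usual starting point. The sketch below picks the node with the highest offset from `<host> <offset>` lines; `highest_offset` is a hypothetical helper. Offsets are only directly comparable between nodes that share the same replication history (`master_replid`), so after a real split-brain the result is an input to the decision, not the decision.

```shell
# highest_offset: stdin has one "<host> <master_repl_offset>" line per node.
# Prints "<host> <offset>" for the largest numeric offset; the first node wins a tie.
highest_offset() {
  awk '
    $2 ~ /^[0-9]+$/ && (best == "" || $2 + 0 > max) {
      best = $1; max = $2 + 0; raw = $2
    }
    END { if (best != "") print best, raw }'
}

# Gathering one line per node (h is a placeholder host):
#   off=$(redis-cli -h "$h" -p 6379 INFO replication | tr -d '\r' \
#           | awk -F: '$1 == "master_repl_offset" { print $2 }')
#   printf '%s %s\n' "$h" "$off"
```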
4.3 Block writes on the wrong master
If a node accidentally became master, you may need to demote it back to “replica” behavior.
Example approach (caution: depends on field conditions):
# make the wrong master a replica of the correct master (if done in the wrong direction, data loss can occur)
redis-cli -h <wrong-master> -p 6379 REPLICAOF <correct-master-ip> 6379
Before applying this step:
- Reduce traffic
- Take a snapshot if possible (RDB/AOF)
- Get acceptance for the data loss impact (incident command)
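Because a node that becomes a replica replaces its own dataset with the new master's, the snapshot has to be taken and copied away before `REPLICAOF` runs. One cautious pattern is to print the command sequence for review instead of executing it. The helper below is a hypothetical dry-run sketch; the ordering is an assumption about a reasonable procedure, not an official one.

```shell
# print_demotion_plan <wrong-master> <correct-master-ip>
# Prints the commands in the intended order; it executes nothing.
print_demotion_plan() {
  if [ $# -ne 2 ]; then
    echo "usage: print_demotion_plan <wrong-master> <correct-master-ip>" >&2
    return 2
  fi
  printf 'redis-cli -h %s -p 6379 BGSAVE\n' "$1"
  printf 'redis-cli -h %s -p 6379 INFO persistence   # wait for rdb_bgsave_in_progress:0\n' "$1"
  printf 'redis-cli -h %s -p 6379 CONFIG GET dir     # copy the RDB file out of this directory\n' "$1"
  printf 'redis-cli -h %s -p 6379 REPLICAOF %s 6379\n' "$1" "$2"
}

# Example: print_demotion_plan 10.0.0.2 10.0.0.1
```

Having the plan as text makes it easy to paste into the incident channel and get the data-loss acceptance recorded before anyone runs the last line.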
5) Permanent protections: shrink split-brain risk
5.1 Write safety: “min-replicas” protection
On the Redis side, you can make the master refuse writes unless enough replicas are connected and acknowledging within a lag limit (in seconds):
min-replicas-to-write 1
min-replicas-max-lag 10
Field interpretation:
- If there are no replicas or lag is high, the master refuses writes → produces short-term errors but reduces split-brain damage.
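To predict whether a master would currently accept writes under these settings, you can count the replicas it considers healthy. The sketch below approximates that from the `slaveN:` lines of `INFO replication` (fields `state` and `lag`); `good_replicas` and `writes_allowed` are hypothetical helpers, and the real decision is made inside Redis, so treat this as an estimate.

```shell
# good_replicas <max-lag>: read `INFO replication` of a master on stdin,
# print how many replicas are online with lag <= max-lag seconds.
good_replicas() {
  tr -d '\r' | awk -v maxlag="$1" '
    /^slave[0-9]+:/ {
      state = ""; lag = -1
      sub(/^slave[0-9]+:/, "")
      n = split($0, kv, ",")
      for (i = 1; i <= n; i++) {
        split(kv[i], p, "=")
        if (p[1] == "state") state = p[2]
        if (p[1] == "lag")   lag = p[2] + 0
      }
      if (state == "online" && lag >= 0 && lag <= maxlag + 0) good++
    }
    END { print good + 0 }'
}

# writes_allowed <min-replicas-to-write> <min-replicas-max-lag>: exit 0 if enough good replicas
writes_allowed() {
  [ "$(good_replicas "$2")" -ge "$1" ]
}

# Usage (host is a placeholder):
#   redis-cli -h "$MASTER_HOST" -p 6379 INFO replication | writes_allowed 1 10 \
#     || echo "master would refuse writes"
```

When the condition is not met, clients receive a `NOREPLICAS` error on writes, which is worth alerting on separately from generic connection errors.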
5.2 Tune Sentinel parameters to “reality”
Three critical parameters:
- down-after-milliseconds
- failover-timeout
- quorum
Examples of bad settings:
- A down-after-milliseconds that is too low: unnecessary failover during short network jitter
- Overly aggressive failover: flapping and an inconsistent master
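For orientation, this is where the three parameters live in sentinel.conf. The sketch renders the relevant directives from arguments; `render_sentinel_conf` is a hypothetical helper, port 6379 is assumed, and the numbers in the example are placeholders to be replaced with values derived from your own network measurements.

```shell
# render_sentinel_conf <master-name> <master-ip> <quorum> <down-after-ms> <failover-timeout-ms>
# Prints the sentinel.conf directives that carry the three parameters.
render_sentinel_conf() {
  printf 'sentinel monitor %s %s 6379 %s\n' "$1" "$2" "$3"
  printf 'sentinel down-after-milliseconds %s %s\n' "$1" "$4"
  printf 'sentinel failover-timeout %s %s\n' "$1" "$5"
}

# Example with placeholder values:
#   render_sentinel_conf mymaster 10.0.0.1 2 5000 60000
```

The same settings can be changed on a running Sentinel with `SENTINEL SET <master-name> <option> <value>`; that change has to be applied on every Sentinel individually.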
5.3 Drills: there’s no confidence without practicing failover
Even one small drill per month makes a difference:
- Controlled shutdown of master → measure failover duration
- Network partition simulation (staging) → observe quorum behavior
- Check whether app-side retries/timeouts “amplify the failover”
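Measuring failover duration in a drill can be as simple as polling for the master address until it changes. The following is a rough sketch with one-second resolution; `wait_for_master_change` is a hypothetical helper that takes the lookup command as arguments, so the same function works against Sentinel or against a stub.

```shell
# wait_for_master_change <old-addr> <timeout-seconds> <command...>
# Runs <command...> about once per second until it prints something other than <old-addr>.
# Prints "<new-addr> <elapsed-seconds>" and returns 0; returns 1 on timeout.
wait_for_master_change() {
  old=$1; timeout=$2; shift 2
  start=$(date +%s)
  while :; do
    now=$(date +%s)
    cur=$("$@" 2>/dev/null) || cur=""
    if [ -n "$cur" ] && [ "$cur" != "$old" ]; then
      echo "$cur $((now - start))"
      return 0
    fi
    if [ $((now - start)) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage in a drill (SENTINEL_HOST and MASTER_NAME are placeholders you set):
#   current_master() {
#     redis-cli -h "$SENTINEL_HOST" -p 26379 \
#       SENTINEL get-master-addr-by-name "$MASTER_NAME" | paste -s -d ':' -
#   }
#   old=$(current_master)
#   # ...shut down the master in a controlled way, then:
#   wait_for_master_change "$old" 120 current_master
```

The number this produces is Sentinel's view of the switch. The time until applications actually write to the new master is usually longer and should be measured from app-side metrics.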
6) Runbook closing: verification
Stability verification:
- Is there only a single master?
- Are all Sentinels showing the same master?
- Has the app error rate returned to normal?
Redis Sentinel, when set up correctly, eases operations; when set up incorrectly, it generates incidents under the name of “automatic failover”. The difference here is operational discipline more than technology: quorum, drills, and a proven runbook.