Mustafa Erbay
Tutorials · 8 min read

High Availability and Split-Brain Runbook with Redis Sentinel

A field-ready runbook for operationally managing quorum, failover, and split-brain risk in a Redis Sentinel-based HA setup.


Redis starts in most companies as a “cache”, then quietly begins to carry critical state: rate limit counters, idempotency stores, sessions, feature flags, etc. At that point, “single-node Redis” is no longer a tech debt item; it’s a direct incident trigger.

Sentinel offers a practical HA approach for Redis, but its most dangerous failure mode is split-brain: two different nodes acting as master at the same time, or writes going to the wrong master.

This article doesn’t cover Sentinel installation; it covers the operational runbook of a system that uses Sentinel.

1) Prerequisite: the minimum HA design standard

Minimum safe baseline:

  • 1 master + 1 replica (preferably 2 replicas)
  • At least 3 Sentinels (across separate failure domains)

On the Sentinel side, the goal is that no single node's observation decides that the master is “dead” or which replica gets promoted.

2) Triage: “Did Redis go down, or was it a failover?”

Symptoms:

  • Cache miss spike, latency spike
  • “READONLY You can’t write against a read only replica” error
  • Application connection errors / timeouts

2.1 Role verification

Quickly verify the master/replica role:

redis-cli -h <redis-host> -p 6379 INFO replication | rg -n -S "role|master_host|master_link_status|connected_slaves"

Expected:

  • master: role:master
  • replica: role:slave + master_link_status:up
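The expected states above can be checked mechanically. The following is a minimal sketch, assuming the standard `key:value` layout of INFO replication output; the function name classify_role is hypothetical, not part of redis-cli.

```sh
# Classify a node from `INFO replication` output read on stdin.
# Prints one of: master | replica-ok | replica-broken | unknown
classify_role() {
  # Redis terminates INFO lines with CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "role"               { role = $2 }
    $1 == "master_link_status" { link = $2 }
    END {
      if (role == "master") { print "master" }
      else if (role == "slave" && link == "up") { print "replica-ok" }
      else if (role == "slave") { print "replica-broken" }
      else { print "unknown" }
    }'
}

# Usage (assumes redis-cli is on PATH and the host is reachable):
#   redis-cli -h <redis-host> -p 6379 INFO replication | classify_role
```

A replica-broken result during an incident is worth noting: that node is not receiving writes from anyone at the moment.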

2.2 Status from the Sentinel’s perspective

redis-cli -h <sentinel-host> -p 26379 SENTINEL masters
redis-cli -h <sentinel-host> -p 26379 SENTINEL slaves <master-name>
redis-cli -h <sentinel-host> -p 26379 SENTINEL get-master-addr-by-name <master-name>

Goal:

  • Does Sentinel give a consistent answer to “who is the master?”
  • Do different Sentinels show different masters? (suspicion of split-brain)
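To answer both questions at once, you can ask every Sentinel and compare the replies. This is a sketch under assumptions: the function name check_consensus, the Sentinel host names, and the master name mymaster are placeholders for your own values.

```sh
# Reads one "ip:port" answer per line on stdin (one line per Sentinel queried).
# Prints "consistent <addr>" when all non-empty answers agree; otherwise prints
# "DIVERGENT" followed by the distinct answers, or "no-answer" if none replied.
check_consensus() {
  cc_answers=$(awk 'NF' | sort -u)
  cc_count=$(printf '%s\n' "$cc_answers" | awk 'NF { n++ } END { print n + 0 }')
  if [ "$cc_count" -eq 1 ]; then
    echo "consistent $cc_answers"
  elif [ "$cc_count" -eq 0 ]; then
    echo "no-answer"
    return 1
  else
    echo "DIVERGENT"
    printf '%s\n' "$cc_answers"
    return 1
  fi
}

# Usage (hypothetical hosts; get-master-addr-by-name prints ip and port on
# separate lines, which paste joins into "ip:port"):
#   for s in sentinel-1 sentinel-2 sentinel-3; do
#     redis-cli -h "$s" -p 26379 SENTINEL get-master-addr-by-name mymaster | paste -sd: -
#   done | check_consensus
```

Unreachable Sentinels are simply skipped here, so a "consistent" result from only one responding Sentinel deserves less trust than one from all three.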

3) How is split-brain identified?

Field signals of split-brain:

  • Two different nodes can claim role:master simultaneously
  • Sentinels return different master addresses
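  • Write failures show up as confusing errors such as “READONLY” on a node the client believes is the master (MOVED/ASK redirects are Redis Cluster responses; seeing them in a Sentinel setup usually means a client is pointed at the wrong endpoint)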
  • The cache appears “inconsistent” (some users logged in, some not)
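The first signal, two nodes claiming role:master, is the easiest one to check directly. A minimal sketch, assuming you collect one `host role` pair per data node; master_verdict and the redis-a/redis-b/redis-c host names are hypothetical.

```sh
# Reads "host role" pairs on stdin and reports how many nodes claim to be master.
# Prints: "ok <host>" | "split-brain <host>,<host>,..." | "no-master"
master_verdict() {
  awk '
    $2 == "master" { hosts = hosts (n++ ? "," : "") $1 }
    END {
      if (n == 1) { print "ok " hosts }
      else if (n > 1) { print "split-brain " hosts }
      else { print "no-master" }
    }'
}

# Usage (hypothetical hosts):
#   for h in redis-a redis-b redis-c; do
#     role=$(redis-cli -h "$h" -p 6379 INFO replication | tr -d '\r' | awk -F: '$1 == "role" { print $2 }')
#     echo "$h $role"
#   done | master_verdict
```

A short overlap right after a failover can be legitimate while the old master is being reconfigured, so repeat the check before declaring split-brain.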

4) Incident response: step by step

4.1 First, stabilize writes (stop the load and damage)

Goal: reduce writes going to the wrong master.

Practical options:

  • Temporarily restrict the Redis write path in the application (feature flag / degrade)
  • Switch the cache to “read-only mode” (when possible)
  • Reduce traffic (rate limit / shed load)

4.2 Decide on the “correct master”

Decision criteria:

  • Which replica is the most up-to-date? (replication lag / offset)
  • Which node has been master longer? (failover timing)
  • Which node did the application write to? (log/metric)

Operational rule: “least data loss” should be the priority. Even if it’s a cache, some cache data can influence security decisions.
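One way to compare candidates on the “most up-to-date” criterion is to rank nodes by replication offset. The sketch below assumes you have collected one `host offset` pair per node, for example from the master_repl_offset field of INFO replication; rank_by_offset and the host names are hypothetical. After a real split-brain the two histories have diverged, so a higher offset is a hint, not proof that one dataset contains the other.

```sh
# Reads "host offset" pairs on stdin and prints them ranked, highest offset first.
rank_by_offset() {
  sort -k2,2nr | awk '{ print NR ". " $1 " offset=" $2 }'
}

# Usage (hypothetical hosts):
#   for h in redis-a redis-b redis-c; do
#     off=$(redis-cli -h "$h" -p 6379 INFO replication | tr -d '\r' | awk -F: '$1 == "master_repl_offset" { print $2 }')
#     echo "$h $off"
#   done | rank_by_offset
```

Treat the ranking as one input next to failover timing and application logs, not as the decision itself.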

4.3 Block writes on the wrong master

If a node accidentally became master, you may need to demote it back to “replica” behavior.

Example approach (caution: depends on field conditions):

# make the wrong master a replica of the correct master (if done in the wrong direction, data loss can occur)
redis-cli -h <wrong-master> -p 6379 REPLICAOF <correct-master-ip> 6379

Before applying this step:

  • Reduce traffic
  • Take a snapshot if possible (RDB/AOF)
  • Get acceptance for the data loss impact (incident command)

5) Permanent protections: shrink split-brain risk

5.1 Write safety: “min-replicas” protection

On the Redis side, you can make the master refuse writes when it has too few healthy replicas:

min-replicas-to-write 1
min-replicas-max-lag 10

Field interpretation:

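  • If fewer than min-replicas-to-write replicas are connected with a lag at or under min-replicas-max-lag seconds, the master refuses writes → produces short-term errors but reduces split-brain damage, because an isolated old master stops accepting writes once its replicas fall silent.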
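To see how a given master would behave under these two settings, you can evaluate its replica lines yourself. This is an approximation of the rule, assuming the `slaveN:ip=...,state=...,lag=...` lines that a master prints in INFO replication; writes_allowed is a hypothetical helper, and Redis makes the real decision internally.

```sh
# Usage: writes_allowed <min-replicas-to-write> <min-replicas-max-lag>
# Reads a master's `INFO replication` output on stdin and prints "accept" if
# enough replicas are online within the lag limit, otherwise "refuse".
writes_allowed() {
  wa_min=$1
  wa_maxlag=$2
  tr -d '\r' | awk -v need="$wa_min" -v maxlag="$wa_maxlag" '
    /^slave[0-9]+:/ {
      state = ""; lag = -1
      sub(/^slave[0-9]+:/, "")          # keep only the key=value list
      n = split($0, kv, ",")
      for (i = 1; i <= n; i++) {
        split(kv[i], pair, "=")
        if (pair[1] == "state") state = pair[2]
        if (pair[1] == "lag")   lag = pair[2] + 0
      }
      if (state == "online" && lag >= 0 && lag <= maxlag + 0) good++
    }
    END {
      verdict = (good + 0 >= need + 0) ? "accept" : "refuse"
      print verdict
    }'
}

# Usage (values match the config lines above):
#   redis-cli -h <redis-host> -p 6379 INFO replication | writes_allowed 1 10
```

Running this against production before enabling the setting shows whether your normal replica lag would already trip the limit.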

5.2 Tune Sentinel parameters to “reality”

Three critical parameters:

  • down-after-milliseconds
  • failover-timeout
  • quorum (the last argument of the sentinel monitor line)

Examples of bad settings:

  • down-after-milliseconds set too low: unnecessary failovers during brief network jitter
  • Overly aggressive failover settings: flapping and an inconsistent view of the master
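A point that is easy to miss when tuning quorum: it only governs when the master is flagged as objectively down, while the failover itself must still be authorized by a majority of the Sentinel processes. The arithmetic can be sketched as below; majority and assess_quorum are hypothetical helper names and the messages are illustrative.

```sh
# Majority of N Sentinels, which is what authorizing a failover requires.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Usage: assess_quorum <sentinel-count> <quorum>
assess_quorum() {
  aq_total=$1
  aq_quorum=$2
  aq_maj=$(majority "$aq_total")
  if [ "$aq_quorum" -gt "$aq_total" ]; then
    echo "invalid: quorum exceeds sentinel count"
  elif [ "$aq_total" -lt 3 ]; then
    echo "risky: fewer than 3 sentinels"
  elif [ "$aq_quorum" -lt "$aq_maj" ]; then
    echo "note: ODOWN at $aq_quorum, failover still needs majority $aq_maj"
  else
    echo "ok: quorum $aq_quorum, majority $aq_maj"
  fi
}

# On a live system, Sentinel can check reachability for you:
#   redis-cli -h <sentinel-host> -p 26379 SENTINEL CKQUORUM <master-name>
```

For example, `assess_quorum 5 2` reports that two Sentinels can flag the master as down, but three must still be reachable for the failover to proceed.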

5.3 Drills: there’s no confidence without practicing failover

Even one small drill per month makes a difference:

  • Controlled shutdown of master → measure failover duration
  • Network partition simulation (staging) → observe quorum behavior
  • Check whether app-side retries/timeouts “amplify the failover”
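For the “measure failover duration” drill, a simple poller is often enough. The sketch below is one possible approach, not a standard tool: polls_until_change, get_master, sentinel-1, and mymaster are all hypothetical names, and the measured duration is only as precise as the polling interval.

```sh
# Usage: polls_until_change <initial-value> <interval-seconds> <max-polls> <command...>
# Runs the command repeatedly until its output is non-empty and differs from the
# initial value. Prints the number of polls it took; returns 1 if the limit is hit.
polls_until_change() {
  pc_initial=$1
  pc_interval=$2
  pc_limit=$3
  shift 3
  pc_polls=0
  while [ "$pc_polls" -lt "$pc_limit" ]; do
    pc_polls=$((pc_polls + 1))
    pc_current=$("$@" 2>/dev/null) || pc_current=""
    if [ -n "$pc_current" ] && [ "$pc_current" != "$pc_initial" ]; then
      echo "$pc_polls"
      return 0
    fi
    sleep "$pc_interval"
  done
  return 1
}

# Usage in a staging drill (approximate duration = polls x interval):
#   get_master() { redis-cli -h sentinel-1 -p 26379 SENTINEL get-master-addr-by-name mymaster | paste -sd: -; }
#   before=$(get_master)
#   redis-cli -h sentinel-1 -p 26379 SENTINEL failover mymaster
#   polls_until_change "$before" 1 60 get_master
```

Recording this number on every drill gives you a baseline to compare against when a real failover feels slow.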

6) Runbook closing: verification

Stability verification:

  • Is there only a single master?
  • Are all Sentinels showing the same master?
  • Has the app error rate returned to normal?

Redis Sentinel, when set up correctly, eases operations; when set up incorrectly, it generates incidents under the name of “automatic failover”. The difference here is operational discipline more than technology: quorum, drills, and a proven runbook.


