Mustafa Erbay
Technology · 12 min read

Object Storage with Ceph: Failure Domain and Recovery Design

Beyond installing Ceph: an architectural approach to failure domain, capacity, and recovery behavior so the cluster can actually heal during a fault.

In enterprise environments, the demand for object storage usually starts as “we need an S3-compatible place”: backups, log archives, artifact stores, document retention, raw SIEM data. But the production reality is this: what keeps Ceph alive is not the install command; it is the physics of recovery during a fault. And that physics is shaped most by two decisions:

  1. Failure domain (at which fault boundary do you commit to not losing data?)
  2. Recovery budget (how fast and at what cost will you climb out of that fault?)

Where do I use Ceph, and where do I pump the brakes?

Ceph shines as general-purpose, shared, scalable storage. With RGW (S3), RBD (block), or CephFS (file), you can serve very different needs from a single platform.

But don’t move a workload onto Ceph “just because Ceph exists” in these two cases:

  • The application is extremely sensitive to storage latency (especially small IOs and synchronous writes)
  • Your operations team is not mature enough to handle Ceph events 24/7 (backfill/rebalance/scrub/recovery)

Failure domain: Not how many replicas, but where the replicas live

In Ceph, replica count (e.g. size=3) matters but it is an incomplete sentence. The real question is:

Are the replicas of the same data on the same host, the same rack, the same room, the same “power domain”?

The most practical failure domain layers in on-prem environments:

  • host: failure of a single physical server
  • rack: rack PDU/ToR failure or maintenance impact
  • room / pod: room-level power or cooling events
  • site / DC: loss of the data center (this tier is a separate topic: stretch cluster vs async replication)

On the Ceph side you express this with the CRUSH map. Your goal: replicas should “not die together in the same fault”.
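
The "not die together" goal can be checked mechanically. The following is a minimal sketch, assuming a hand-written topology table: the OSD ids, host names, and rack names are hypothetical, and in a real cluster the hierarchy would come from the CRUSH map (for example via `ceph osd tree`) rather than a dictionary.

```python
from collections import Counter

# Hypothetical topology: OSD id -> its location at each CRUSH level.
TOPOLOGY = {
    0: {"host": "node-a", "rack": "rack-1"},
    1: {"host": "node-a", "rack": "rack-1"},
    2: {"host": "node-b", "rack": "rack-1"},
    3: {"host": "node-c", "rack": "rack-2"},
    4: {"host": "node-d", "rack": "rack-3"},
}


def shared_domains(osd_ids, level, topology=TOPOLOGY):
    """Return the failure domains at `level` that hold more than one replica."""
    counts = Counter(topology[osd][level] for osd in osd_ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Replicas on OSDs 0, 2, 3 sit on distinct hosts, but two of them share rack-1.
print(shared_domains([0, 2, 3], "host"))  # []
print(shared_domains([0, 2, 3], "rack"))  # ['rack-1']
print(shared_domains([2, 3, 4], "rack"))  # []
```

A placement that passes at the host level can still fail at the rack level, which is why the failure domain has to be chosen explicitly. On the cluster itself, the level is the failure-domain type given when a replicated CRUSH rule is created (`ceph osd crush rule create-replicated <name> <root> <failure-domain-type>`).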

Network and physics: The hidden drivers of recovery

The most overlooked truth about Ceph: recovery is not just network traffic; it also consumes disk IO and CPU. The same OSDs serve user IO and recovery, so the two compete for the same resources.

Practical architectural decisions:

  • Public / cluster network split: separate NIC/VLAN if possible, at minimum QoS-based separation
  • Jumbo frames: if the infrastructure supports them end-to-end, they can shorten recovery; with any MTU mismatch on the path, they only create hard-to-diagnose noise
  • Disk classes: with NVMe + HDD mixes, separate pools and use device classes deliberately
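
To make the "physics" concrete, here is a back-of-the-envelope sketch of recovery time. It assumes recovery is limited by the slower of aggregate network and aggregate disk throughput, and that only a fraction of that throughput is given to recovery so user IO survives. The function name, parameters, and all numbers are illustrative assumptions, not measurements.

```python
def recovery_hours(data_tb, net_mb_s, disk_mb_s, recovery_share):
    """Estimate hours needed to re-replicate `data_tb` terabytes.

    net_mb_s / disk_mb_s: aggregate throughput of the surviving nodes (MB/s).
    recovery_share: fraction of the bottleneck granted to recovery (0 < x <= 1).
    """
    if not 0 < recovery_share <= 1:
        raise ValueError("recovery_share must be in (0, 1]")
    bottleneck_mb_s = min(net_mb_s, disk_mb_s)
    usable_mb_s = bottleneck_mb_s * recovery_share
    seconds = data_tb * 1_000_000 / usable_mb_s  # decimal TB -> MB
    return seconds / 3600


# Assumed example: 40 TB to move, 2500 MB/s of network, 30 HDD OSDs at
# roughly 80 MB/s each, and 30% of the bottleneck reserved for recovery.
hours = recovery_hours(data_tb=40, net_mb_s=2500, disk_mb_s=30 * 80,
                       recovery_share=0.3)
print(round(hours, 1))  # 15.4
```

In this assumed example the disks, not the network, are the bottleneck, which is the point of the section: a faster network alone would not shorten the window.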

Capacity planning: “Working under normal conditions” is not enough

A Ceph capacity plan has two critical reserves:

  1. Fault reserve: room to redistribute replicas when an OSD/host/rack is lost
  2. Recovery time reserve: completing that redistribution without breaking the SLO

My simple rule that holds up in the field:

  • For replicated pools (size=3) and long-lived workloads, treat utilization above 65–70% as risky
  • If you’ll use erasure coding, evaluate it on rebuild cost, not only on “raw capacity gain”: rebuilding a lost chunk means reading from several surviving OSDs
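
The arithmetic behind this rule can be sketched as follows. The sketch assumes equally sized failure domains and perfectly even data distribution, which real clusters only approximate; 0.85 is Ceph's default nearfull ratio, while the function names and the example cluster sizes are hypothetical.

```python
NEARFULL_RATIO = 0.85  # Ceph's default nearfull warning threshold


def utilization_after_loss(utilization, domains, lost=1):
    """Average utilization after `lost` of `domains` equal failure domains
    fail and their data is re-replicated onto the survivors."""
    if not 0 < lost < domains:
        raise ValueError("lost must be between 1 and domains - 1")
    return utilization * domains / (domains - lost)


def rebuild_profile(k, m):
    """Raw-to-usable overhead and chunks read to rebuild one lost chunk.

    For replication use k=1, m=size-1: one surviving copy is read.
    """
    return {"raw_overhead": (k + m) / k, "chunks_read": k}


# Five racks at 70% land above nearfull after losing one rack; at 65% they do not.
for start in (0.65, 0.70):
    after = utilization_after_loss(start, domains=5)
    print(start, round(after, 4), after > NEARFULL_RATIO)

print(rebuild_profile(k=1, m=2))  # replicated size=3
print(rebuild_profile(k=4, m=2))  # erasure coded 4+2
```

The second function shows the trade-off in the erasure coding bullet: a 4+2 profile halves the raw overhead compared with size=3, but each rebuilt chunk needs reads from four OSDs instead of one.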

Operational design: What happens when an OSD goes down?

An OSD going down is normal. Don’t panic; follow the procedure.

Minimum runbook questions:

  • Is the OSD “down” or “out”?
  • Did a disk go away, did a host go away, did the network go away?
  • Which PG states is the cluster in: “degraded” / “undersized” / “backfill_wait”?
  • Are the recovery throttle settings loose enough that recovery is starving production IO?
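
The PG state question is easier to answer from a summary than from raw status output. Below is a minimal sketch that counts PGs per watched flag. The input shape is modeled on the `pgs_by_state` list in `ceph status --format json`; treat the exact field names as an assumption to verify on your release, and note that the sample counts are invented.

```python
# Invented sample; field names modeled on `ceph status --format json`.
SAMPLE_PGS_BY_STATE = [
    {"state_name": "active+clean", "count": 480},
    {"state_name": "active+undersized+degraded", "count": 24},
    {"state_name": "active+remapped+backfill_wait", "count": 8},
]

WATCHED = ("degraded", "undersized", "backfill_wait", "stale")


def count_watched_states(pgs_by_state, watched=WATCHED):
    """Count PGs per watched flag; a PG in state 'a+b' counts toward a and b."""
    totals = {flag: 0 for flag in watched}
    for entry in pgs_by_state:
        flags = entry["state_name"].split("+")
        for flag in watched:
            if flag in flags:
                totals[flag] += entry["count"]
    return totals


print(count_watched_states(SAMPLE_PGS_BY_STATE))
# {'degraded': 24, 'undersized': 24, 'backfill_wait': 8, 'stale': 0}
```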

Example approach:

  1. First, isolate the cause: disk failure (SMART), host reboot, NIC link?
  2. For short events, resist the reflex to mark the OSD “out”: an OSD that flaps down and up triggers data movement that turns out to be unnecessary
  3. If the fault is permanent, mark the OSD out and start a controlled recovery
  4. Don’t start a second large change before recovery completes (especially adding/removing OSDs)
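
The steps above can be sketched as a small decision function. This is an illustration under stated assumptions, not a complete runbook: the action labels and function names are hypothetical, and using `noout` for longer planned outages is a common operator choice rather than something prescribed here. The 600-second value is Ceph's default for `mon_osd_down_out_interval`.

```python
# Ceph's default mon_osd_down_out_interval: a down OSD is marked out
# automatically after this many seconds unless the noout flag is set.
DOWN_OUT_INTERVAL_S = 600


def next_step(permanent, expected_downtime_s, grace_s=DOWN_OUT_INTERVAL_S):
    """Map a diagnosed OSD event to one runbook action (labels are hypothetical)."""
    if permanent:
        return "mark_out"         # e.g. `ceph osd out <id>`, then controlled recovery
    if expected_downtime_s <= grace_s:
        return "wait"             # OSD should return before the automatic mark-out
    return "hold_with_noout"      # `ceph osd set noout` for the maintenance window


def may_start_large_change(pgs_not_clean):
    """Step 4: no OSD add/remove while any PG is still recovering."""
    return pgs_not_clean == 0


print(next_step(permanent=False, expected_downtime_s=120))   # wait
print(next_step(permanent=False, expected_downtime_s=1800))  # hold_with_noout
print(next_step(permanent=True, expected_downtime_s=0))      # mark_out
print(may_start_large_change(pgs_not_clean=32))              # False
```

If `noout` is set for a maintenance window, it has to be cleared afterwards with `ceph osd unset noout`; a forgotten flag silently disables automatic recovery.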

Alarms and metrics: “Degraded” alone arrives late

On the alarm side, “HEALTH_WARN” alone is not enough. I track these signals together:

  • PGs degraded / undersized / stale
  • misplaced ratio and its trend
  • Recovery throughput (MB/s) and a rough estimate of time to completion
  • slow ops and osd perf latencies
  • Utilization: gap between the fullest OSD and the emptiest OSD (imbalance)

These metrics signal “the cluster is becoming unrecoverable” before “the cluster is broken”.
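
Two of these signals, the utilization gap and the misplaced-ratio trend, reduce to simple calculations. The sketch below assumes the values have already been collected elsewhere (for example by a metrics exporter); the OSD names, sample values, and function names are hypothetical, and the ETA is a crude linear extrapolation.

```python
def utilization_spread(osd_utilization):
    """Gap in percentage points between the fullest and the emptiest OSD."""
    values = list(osd_utilization.values())
    return max(values) - min(values)


def recovery_eta_hours(samples):
    """Linear ETA from (timestamp_s, misplaced_ratio) samples, oldest first.

    Returns None when the ratio is flat or rising, which is itself an alarm.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be ordered by increasing time")
    rate = (r0 - r1) / (t1 - t0)  # ratio cleared per second
    if rate <= 0:
        return None
    return r1 / rate / 3600


print(utilization_spread({"osd.0": 71.0, "osd.1": 58.5, "osd.2": 83.0}))  # 24.5
print(recovery_eta_hours([(0, 0.12), (1800, 0.10), (3600, 0.08)]))
print(recovery_eta_hours([(0, 0.08), (3600, 0.09)]))  # None
```

With the second call, the misplaced ratio falls from 12% to 8% in an hour, so the remaining 8% would take about two more hours at the same pace. A `None` result means recovery is not converging, which is the "becoming unrecoverable" signal.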

Security: An S3 key is as valuable as the root password

The most common mistake on the Ceph RGW side is treating S3 as an “internal service” and going lax on it. A practical checklist:

  • Segment RGW: don’t expose it to the whole production network as if it were a general-purpose service
  • Standardize TLS: decide where TLS terminates inside the cluster and whether you need mTLS
  • Make IAM-like policies and bucket policies explicit
  • Split “backup buckets” and “application buckets” into separate tenants/pools (blast radius)
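
As an illustration of an explicit bucket policy, here is a minimal sketch that builds a read-only policy document for one user on one bucket. RGW accepts bucket policies in the AWS policy language, but it implements only a subset, so check the actions and principal format against your release. The bucket, tenant, and user names are hypothetical.

```python
import json


def read_only_policy(bucket, tenant, user):
    """Build a bucket policy that lets one user list and read one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [f"arn:aws:iam::{tenant}:user/{user}"]},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # the bucket itself (listing)
                f"arn:aws:s3:::{bucket}/*",  # the objects in it (reads)
            ],
        }],
    }


# Hypothetical names: a restore operator who may read, but never write, backups.
policy = read_only_policy("backup-archive", "backup", "restore-operator")
print(json.dumps(policy, indent=2))
```

Keeping the policy as a reviewed document per bucket makes the blast radius visible: anyone can see that this key reads backups and nothing else.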

Conclusion: Ceph success is only as real as your “fault simulation”

What makes Ceph good is not “it works today”; it is that it remains controllable tomorrow when a rack goes away. Once you describe the failure domain correctly and back the recovery budget with capacity and the network, Ceph genuinely produces value in enterprise environments: less vendor lock-in, better scale, and more consistent operations.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
