Object Storage with Ceph: Failure Domain and Recovery Design

In enterprise environments, the demand for object storage usually starts as “we need an S3-compatible place”: backups, log archives, artifact stores, document retention, raw SIEM data. But the production reality is this: what keeps Ceph alive is not the install command; it is the physics of recovery during a fault. And that physics is shaped most by two decisions:

Failure domain (at which fault boundary do you commit to not losing data?)
Recovery budget (how fast and at what cost will you climb out of that fault?)

Where do I use Ceph, and where do I pump the brakes?

Ceph shines as general-purpose, shared, scalable storage. With RGW (S3), RBD (block), or CephFS (file), you can serve very different needs from a single platform.

But don’t move a workload onto Ceph “just because Ceph exists” in these two cases:

The application is extremely sensitive to storage latency (especially small IOs and synchronous writes)
Your operations team is not mature enough to handle Ceph events 24/7 (backfill/rebalance/scrub/recovery)

Failure domain: Not how many replicas, but where the replicas live

In Ceph, replica count (e.g. size=3) matters but it is an incomplete sentence. The real question is:

Are the replicas of the same data on the same host, the same rack, the same room, the same “power domain”?

The most practical failure domain layers in on-prem environments:

host: failure of a single physical server
rack: rack PDU/ToR failure or maintenance impact
room / pod: room-level power or cooling events
site / DC: loss of the data center (this tier is a separate topic: stretch cluster vs async replication)

On the Ceph side you express this with the CRUSH map. Your goal: replicas should “not die together in the same fault”.
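
To make “not die together in the same fault” concrete, here is a minimal sketch of the check I have in mind. It is a toy model, not CRUSH itself: the topology table, the OSD ids, and the function names are all made up for illustration. It only asks whether a given set of replicas shares a failure domain at a chosen layer.

```python
from collections import Counter

# Hypothetical topology: OSD id -> its location at each failure-domain layer.
TOPOLOGY = {
    0: {"host": "node-a", "rack": "rack-1"},
    1: {"host": "node-a", "rack": "rack-1"},
    2: {"host": "node-b", "rack": "rack-1"},
    3: {"host": "node-c", "rack": "rack-2"},
    4: {"host": "node-d", "rack": "rack-3"},
}


def shared_domains(osds, level, topology=TOPOLOGY):
    """Return the failure domains at `level` that hold more than one replica."""
    counts = Counter(topology[osd][level] for osd in osds)
    return sorted(domain for domain, n in counts.items() if n > 1)


def survives_single_fault(osds, level, topology=TOPOLOGY):
    """True if no single failure at `level` can take out more than one replica."""
    return not shared_domains(osds, level, topology)


# Replicas on OSDs 0, 2, 3 sit on three different hosts,
# but two of those hosts are in rack-1.
print(survives_single_fault([0, 2, 3], "host"))  # host-level separation holds
print(shared_domains([0, 2, 3], "rack"))         # racks that hold 2+ replicas
```

The same placement passes at the host layer and fails at the rack layer, which is exactly why “size=3” alone does not answer the question.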

Network and physics: The hidden drivers of recovery

The most overlooked truth about Ceph: recovery is not just network traffic; it also consumes disk and CPU. On every OSD, user IO and recovery compete for the same resources.

Practical architectural decisions:

Public / cluster network split: separate NIC/VLAN if possible, at minimum QoS-based separation
Jumbo frames: if every hop supports them end to end, they can shorten recovery; if the MTU is inconsistent anywhere on the path, they only create noise
Disk classes: with NVMe + HDD mixes, separate pools and use device classes deliberately

Capacity planning: “Working under normal conditions” is not enough

A Ceph capacity plan has two critical reserves:

Fault reserve: room to redistribute replicas when an OSD/host/rack is lost
Recovery time reserve: completing that redistribution without breaking the SLO

My simple rule that holds up in the field:

For replicated pools (size=3) and long-lived workloads, treat utilization above 65–70% as risky
If you plan to use erasure coding, evaluate it for its rebuild cost, not only for its “raw capacity gain”
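
The arithmetic behind both reserves is short enough to sketch. The functions below are my own illustration, and they assume equal-sized failure domains with evenly spread data, which real clusters only approximate. The numbers in the example (5 racks, 40 TB to move, 500 MB/s sustained recovery) are hypothetical.

```python
def utilization_after_loss(used_fraction, domains_total, domains_lost=1):
    """Utilization once the surviving domains absorb the lost domains' data.

    Simplification: equal-sized domains, data spread evenly across them.
    """
    surviving = domains_total - domains_lost
    if surviving <= 0:
        raise ValueError("no surviving failure domains")
    return used_fraction * domains_total / surviving


def raw_overhead_replicated(size):
    """Raw bytes stored per logical byte in a replicated pool."""
    return float(size)


def raw_overhead_ec(k, m):
    """Raw bytes stored per logical byte with k data and m coding chunks."""
    return (k + m) / k


def recovery_hours(data_to_move_tb, throughput_mb_s):
    """Hours needed to move the data at a sustained recovery throughput."""
    return data_to_move_tb * 1_000_000 / throughput_mb_s / 3600


# Fault reserve: 5 racks at 70% used, one rack lost.
print(round(utilization_after_loss(0.70, 5), 3))   # 0.875

# Raw overhead: size=3 versus a 4+2 erasure-coded layout.
print(raw_overhead_replicated(3), raw_overhead_ec(4, 2))  # 3.0 1.5

# Recovery time reserve: 40 TB to re-create at 500 MB/s.
print(round(recovery_hours(40, 500), 1))           # 22.2
```

Compare the first result with the nearfull, backfillfull, and full ratios configured on your cluster: a cluster that sits at 70% in normal operation lands at roughly 87.5% after losing one of five racks. The overhead comparison shows the capacity gain of erasure coding, but it says nothing about rebuild cost, which is the part worth measuring before you commit.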

Operational design: What happens when an OSD goes down?

An OSD going down is normal. Don’t panic; follow the procedure.

Minimum runbook questions:

Is the OSD “down” or “out”?
Did a disk go away, did a host go away, did the network go away?
Which PG states is the cluster in: “degraded” / “undersized” / “backfill_wait”?
Are recovery throttle settings killing prod IO?

Example approach:

First, isolate the cause: disk failure (SMART), host reboot, NIC link?
For short events, delay the reflex of marking it “out”: a “flap” creates unnecessary data movement
If the fault is permanent, mark the OSD out and start a controlled recovery
Don’t start a second large change before recovery completes (especially adding/removing OSDs)
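
The approach above can be written down as a small decision function, which is handy for a runbook because it forces the questions into a fixed order. This is only a sketch: the `OsdEvent` fields, the cause labels, and the 600-second grace period are assumptions I picked for illustration, not Ceph settings or APIs.

```python
from dataclasses import dataclass

KNOWN_CAUSES = ("disk", "host", "network")  # hypothetical labels


@dataclass
class OsdEvent:
    cause: str             # one of KNOWN_CAUSES, or anything else if unknown
    permanent: bool        # True if the fault will not heal on its own
    down_seconds: int      # how long the OSD has been down
    recovery_active: bool  # True while the cluster is still recovering


def next_action(event, grace_seconds=600):
    """Map an OSD event to the next runbook step, in the order described above."""
    if event.cause not in KNOWN_CAUSES:
        return "isolate the cause first"
    if event.permanent:
        return "mark out and run a controlled recovery"
    if event.down_seconds < grace_seconds:
        return "wait, likely a flap"
    return "re-check the cause, then decide on marking out"


def may_start_large_change(event):
    """A second large change (adding/removing OSDs) waits until recovery is done."""
    return not event.recovery_active


print(next_action(OsdEvent("host", False, 120, False)))
print(next_action(OsdEvent("disk", True, 30, True)))
print(may_start_large_change(OsdEvent("disk", True, 30, True)))
```

The grace period is the knob worth arguing about with your team: too short and every host reboot triggers data movement, too long and a real failure sits degraded for no reason.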

Alarms and metrics: “Degraded” alone arrives late

On the alarm side, “HEALTH_WARN” alone is not enough. I track these signals together:

PGs degraded / undersized / stale
misplaced ratio and its trend
Recovery throughput (MB/s) and a rough ETA
slow ops and osd perf latencies
Utilization: gap between the fullest OSD and the emptiest OSD (imbalance)

These metrics signal “the cluster is becoming unrecoverable” before “the cluster is broken”.
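
Two of these signals, the utilization gap and the misplaced-ratio trend, reduce to simple arithmetic once you have the samples. The sketch below assumes you already collect per-OSD utilization in percent and periodic (timestamp, misplaced ratio) pairs from your monitoring stack; the data shapes, names, and the linear extrapolation are my assumptions, not something Ceph provides in this form.

```python
def utilization_gap(osd_utilization):
    """Gap in percentage points between the fullest and the emptiest OSD."""
    values = list(osd_utilization.values())
    return max(values) - min(values)


def misplaced_trend(samples):
    """Average change of the misplaced ratio per second.

    `samples` is a time-ordered list of (timestamp_seconds, ratio) pairs.
    A negative value means the cluster is converging.
    """
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / (t1 - t0)


def eta_seconds(samples):
    """Rough linear ETA until the misplaced ratio reaches zero, or None."""
    slope = misplaced_trend(samples)
    if slope >= 0:
        return None  # not converging: this is the case to alert on
    return samples[-1][1] / -slope


usage = {"osd.0": 81.0, "osd.1": 62.5, "osd.2": 70.0}
print(utilization_gap(usage))  # 18.5

recovering = [(0, 0.12), (600, 0.09), (1200, 0.06)]
print(round(eta_seconds(recovering)))  # 1200

stalled = [(0, 0.05), (600, 0.07)]
print(eta_seconds(stalled))  # None
```

A linear ETA is crude, since recovery rarely progresses at a constant rate, but a `None` here (misplaced ratio flat or rising) is the early warning that matters more than the number itself.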

Security: An S3 key is as valuable as the root password

The most common mistake on the Ceph RGW side is treating S3 as an “internal service” and going lax on it. A practical checklist:

Segment RGW (don’t treat it as a “general service” that the whole production network can reach)
Standardize TLS (decide where TLS terminates inside the cluster and whether you need mTLS)
Make IAM-like policies and bucket policies explicit
Split “backup buckets” and “application buckets” into separate tenants/pools (blast radius)
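
As an illustration of “explicit”, here is a sketch that builds an S3-style bucket policy as plain JSON for a single user in a single tenant. The tenant, user, and bucket names are hypothetical, and RGW implements only a subset of the S3 policy language, so check the actions and the principal format against the documentation for your release before relying on it.

```python
import json


def bucket_policy(tenant, user, bucket, actions):
    """Build a policy document that allows `actions` for one tenant user only."""
    principal = f"arn:aws:iam::{tenant}:user/{user}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [principal]},
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (listing)
                    f"arn:aws:s3:::{bucket}/*",  # the objects inside it
                ],
            }
        ],
    }


# Hypothetical names: a backup tenant whose writer may upload and list only.
policy = bucket_policy(
    tenant="backup",
    user="backup-writer",
    bucket="nightly-backups",
    actions=["s3:PutObject", "s3:ListBucket"],
)
print(json.dumps(policy, indent=2))
```

Keeping the policy in code and under review is the point: the blast radius of a leaked key is then whatever this document says, not whatever the user happened to be granted over time.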

Conclusion: Ceph success is only as real as your “fault simulation”

What makes Ceph good is not “it works today”; it is that it remains controllable tomorrow when a rack goes away. Once you describe the failure domain correctly and back the recovery budget with capacity and the network, Ceph genuinely produces value in enterprise environments: less vendor lock-in, better scale, and more consistent operations.
