In enterprise environments, the demand for object storage usually starts as “we need an S3-compatible place”: backups, log archives, artifact stores, document retention, raw SIEM data. But the production reality is this: what keeps Ceph alive is not the install command; it is the physics of recovery during a fault. And that physics is shaped most by two decisions:
- Failure domain (at which fault boundary do you commit to not losing data?)
- Recovery budget (how fast and at what cost will you climb out of that fault?)
Where do I use Ceph, and where do I pump the brakes?
Ceph shines as general-purpose, shared, scalable storage. With RGW (S3), RBD (block), or CephFS (file), you can serve very different needs from a single platform.
But don’t move a workload onto Ceph “just because Ceph exists” in these two cases:
- The application is extremely sensitive to storage latency (especially small IOs and synchronous writes)
- Your operations team is not mature enough to handle Ceph events 24/7 (backfill/rebalance/scrub/recovery)
Failure domain: Not how many replicas, but where the replicas live
In Ceph, replica count (e.g. size=3) matters, but on its own it is an incomplete sentence. The real question is:
Are the replicas of the same data on the same host, the same rack, the same room, the same “power domain”?
The most practical failure domain layers in on-prem environments:
- host: failure of the same physical server
- rack: rack PDU/ToR failure or maintenance impact
- room / pod: room-level power or cooling events
- site / DC: loss of the data center (this tier is a separate topic: stretch cluster vs async replication)
On the Ceph side you express this with the CRUSH map. Your goal: replicas should “not die together in the same fault”.
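To make “not die together” concrete, here is a minimal sketch of the check I have in mind. It is not CRUSH itself and it does not talk to a cluster: the topology table, the OSD ids, and the function names are all hypothetical, and in a real cluster the CRUSH rule's failure domain is what enforces this placement.

```python
from collections import Counter

# Hypothetical topology: OSD id -> its location at each failure-domain layer.
OSD_LOCATION = {
    0: {"host": "node-a", "rack": "rack-1"},
    1: {"host": "node-a", "rack": "rack-1"},
    2: {"host": "node-b", "rack": "rack-1"},
    3: {"host": "node-c", "rack": "rack-2"},
    4: {"host": "node-d", "rack": "rack-3"},
}


def shared_domains(osds, level, topology=OSD_LOCATION):
    """Return the failure domains at `level` that hold more than one replica."""
    counts = Counter(topology[osd][level] for osd in osds)
    return sorted(name for name, n in counts.items() if n > 1)


def survives_single_fault(osds, level, topology=OSD_LOCATION):
    """True if no single failure at `level` can take out two replicas."""
    return not shared_domains(osds, level, topology)


# Replicas on OSDs 0, 2, 3 sit on three different hosts,
# but two of them share rack-1.
print(shared_domains([0, 2, 3], "host"))
print(shared_domains([0, 2, 3], "rack"))
print(survives_single_fault([0, 3, 4], "rack"))
```

The point of the example: the same three replicas pass the host-level check and fail the rack-level one, which is exactly the gap that “size=3” alone hides.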
Network and physics: The hidden drivers of recovery
The most overlooked truth about Ceph: recovery is not just network traffic; disk and CPU are also in the picture. On every OSD, user IO and recovery compete for the same resources.
Practical architectural decisions:
- Public / cluster network split: separate NIC/VLAN if possible, at minimum QoS-based separation
- Jumbo frames: if every hop supports the larger MTU end-to-end, they can shorten recovery; if even one hop does not, the MTU mismatch creates noise that is hard to diagnose
- Disk classes: with NVMe + HDD mixes, separate pools and use device classes deliberately
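A rough way to see why recovery is not only a network problem is to estimate the recovery time from whichever resource runs out first. The sketch below is a back-of-envelope model under my own assumptions: the function and parameter names are hypothetical, the network figure is the aggregate bandwidth usable for recovery, `recovery_share` is the fraction of each resource you are willing to give to recovery, and CPU is not modeled at all.

```python
def estimate_recovery_hours(data_to_move_tb, surviving_osds,
                            per_osd_mb_s, aggregate_net_gbit_s,
                            recovery_share=0.5):
    """Rough recovery time, limited by the slower of disks and network.

    recovery_share is the fraction of disk and network throughput
    handed to recovery; the rest is assumed to stay with user IO.
    """
    if not 0 < recovery_share <= 1:
        raise ValueError("recovery_share must be in (0, 1]")

    # Throughput available to recovery, in MB/s.
    disk_mb_s = surviving_osds * per_osd_mb_s * recovery_share
    net_mb_s = aggregate_net_gbit_s * 1000 / 8 * recovery_share

    bottleneck = "disk" if disk_mb_s <= net_mb_s else "network"
    rate_mb_s = min(disk_mb_s, net_mb_s)

    # 1 TB = 1,000,000 MB (decimal units throughout).
    hours = data_to_move_tb * 1_000_000 / rate_mb_s / 3600
    return hours, bottleneck


# Illustrative numbers only: 40 TB to move, 20 surviving OSDs at 100 MB/s each.
print(estimate_recovery_hours(40, 20, 100, aggregate_net_gbit_s=10))
print(estimate_recovery_hours(40, 20, 100, aggregate_net_gbit_s=40))
```

With the first set of numbers the network is the limit; with the second, the disks are. The real value of the exercise is knowing which one you hit before the fault happens.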
Capacity planning: “Working under normal conditions” is not enough
A Ceph capacity plan has two critical reserves:
- Fault reserve: room to redistribute replicas when an OSD/host/rack is lost
- Recovery time reserve: completing that redistribution without breaking the SLO
My simple rule that holds up in the field:
- For replicated pools (size=3) and long-lived workloads, treat utilization above 65–70% as risky
- If you’ll use erasure coding, evaluate it not for “raw capacity gain” but for rebuild cost
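The arithmetic behind both bullets fits in a few lines. This is a sketch under simplifying assumptions: data is spread evenly, everything lost is fully re-protected on the survivors, and erasure coding is a plain k+m scheme where rebuilding a lost chunk reads k surviving chunks (some plugins reduce this). The function names and the example cluster are hypothetical.

```python
def utilization_after_loss(used_tb, total_raw_tb, lost_raw_tb):
    """Raw utilization once the lost capacity's data is rebuilt on survivors.

    The amount of raw data stays the same; only the capacity shrinks.
    """
    remaining_tb = total_raw_tb - lost_raw_tb
    if remaining_tb <= 0:
        raise ValueError("no capacity left after the loss")
    return used_tb / remaining_tb


def rebuild_read_tb(lost_data_tb, scheme, k=None):
    """Data that must be read from survivors to rebuild `lost_data_tb`.

    replicated: one surviving copy is read (1x).
    ec:         k chunks are read to reconstruct each lost chunk (k x).
    """
    if scheme == "replicated":
        return lost_data_tb
    if scheme == "ec":
        if not k or k < 1:
            raise ValueError("k is required for erasure coding")
        return lost_data_tb * k
    raise ValueError(f"unknown scheme: {scheme}")


# Example cluster: 10 hosts x 100 TB raw, 68% full.
print(utilization_after_loss(680, 1000, lost_raw_tb=100))  # one host lost
print(utilization_after_loss(680, 1000, lost_raw_tb=300))  # three hosts lost
print(rebuild_read_tb(10, "replicated"))
print(rebuild_read_tb(10, "ec", k=4))
```

At 68% utilization, losing one host out of ten is uncomfortable but survivable; losing three pushes the survivors to roughly 97%, which is why I treat 65–70% as the edge. The second function shows why the EC decision is about rebuild cost: the same lost data costs several times more reads to bring back.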
Operational design: What happens when an OSD goes down?
An OSD going down is normal. Don’t panic; follow the procedure.
Minimum runbook questions:
- Is the OSD “down” or “out”?
- Did a disk go away, did a host go away, did the network go away?
- Which PG states is the cluster in: “degraded” / “undersized” / “backfill_wait”?
- Are the recovery throttle settings loose enough that recovery is killing prod IO?
Example approach:
- First, isolate the cause: disk failure (SMART), host reboot, NIC link?
- For short events, delay the reflex of marking it “out”: a “flap” creates unnecessary data movement
- If the fault is permanent, mark the OSD out and start a controlled recovery
- Don’t open a second large change before recovery completes (especially adding/removing OSDs)
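The approach above can be written down as a small decision function, which is a useful exercise even if nobody ever runs it. Everything here is a hypothetical sketch: the `OsdEvent` fields, the action names, and the 10-minute flap window are my own placeholders, not Ceph settings.

```python
from dataclasses import dataclass


@dataclass
class OsdEvent:
    osd_id: int
    down_minutes: float
    cause: str        # "disk", "host", "network", or "unknown"
    permanent: bool   # e.g. confirmed disk failure


def next_action(event, flap_window_minutes=10.0):
    """Map an OSD-down event to the next runbook step."""
    if event.cause == "unknown":
        return "isolate-cause"      # SMART, host reboot, NIC link?
    if event.permanent:
        return "mark-out"           # start a controlled recovery
    if event.down_minutes < flap_window_minutes:
        return "wait"               # likely a flap, avoid data movement
    return "reassess"               # transient fault that is not clearing


def may_start_large_change(recovery_in_progress):
    """No second big change (adding/removing OSDs) while recovery runs."""
    return not recovery_in_progress


print(next_action(OsdEvent(7, 3, "host", permanent=False)))
print(next_action(OsdEvent(7, 3, "disk", permanent=True)))
print(may_start_large_change(recovery_in_progress=True))
```

The “reassess” branch is my addition for a transient fault that outlasts the window; how long you wait there is a decision for your own runbook.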
Alarms and metrics: “Degraded” alone arrives late
On the alarm side, “HEALTH_WARN” alone is not enough. I track these signals together:
- PGs degraded / undersized / stale
- misplaced ratio and its trend
- Recovery throughput (MB/s) and a rough ETA
- slow ops and osd perf latencies
- Utilization: gap between the fullest OSD and the emptiest OSD (imbalance)
These metrics signal “the cluster is becoming unrecoverable” before “the cluster is broken”.
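Three of these signals are simple enough to compute yourself from whatever your monitoring already collects. The sketch below assumes you have the raw numbers as plain Python values; the function names, the sample data, and the tolerance are hypothetical, and nothing here queries Ceph directly.

```python
def utilization_spread(osd_utilization):
    """Gap between the fullest and emptiest OSD (fractions of capacity)."""
    values = list(osd_utilization.values())
    return max(values) - min(values)


def trend(samples, tolerance=0.001):
    """Classify a series (e.g. misplaced ratio over time), oldest first."""
    if len(samples) < 2:
        return "flat"
    delta = samples[-1] - samples[0]
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "flat"


def eta_hours(remaining_mb, recovery_mb_s):
    """Rough time left at the current recovery throughput."""
    if recovery_mb_s <= 0:
        return float("inf")  # recovery is stalled
    return remaining_mb / recovery_mb_s / 3600


print(utilization_spread({"osd.0": 0.82, "osd.1": 0.55, "osd.2": 0.61}))
print(trend([0.02, 0.03, 0.05]))
print(eta_hours(3_600_000, 500))
```

A rising misplaced ratio combined with a growing ETA is the pattern I watch for: each signal looks tolerable alone, but together they say recovery is losing ground.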
Security: An S3 key is as valuable as the root password
The most common mistake on the Ceph RGW side is treating S3 as an “internal service” and going lax on it. A practical checklist:
- Segment RGW (don’t let it behave like a “general service” exposed to the production network)
- Standardize TLS (decide where TLS terminates inside the cluster and where mTLS is needed)
- Make IAM-like policies and bucket policies explicit
- Split “backup buckets” and “application buckets” into separate tenants/pools (blast radius)
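As an illustration of what “explicit” means for a bucket policy, the sketch below builds a read-only policy document in the S3 policy language. The bucket, user, and tenant names are hypothetical, the helper function is my own, and the exact set of actions and principal formats RGW accepts should be checked against the documentation for your Ceph release.

```python
import json


def read_only_bucket_policy(bucket, user, tenant=""):
    """Build an S3-style policy granting one user read-only access.

    The principal uses the tenant-qualified user ARN form;
    an empty tenant leaves that field blank.
    """
    principal = f"arn:aws:iam::{tenant}:user/{user}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ExplicitReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": [principal]},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket (for listing)
                    f"arn:aws:s3:::{bucket}/*",    # the objects (for reads)
                ],
            }
        ],
    }


# Hypothetical names: a backup tenant with a dedicated reader account.
policy = read_only_bucket_policy("backup-archive", "backup-reader", tenant="backup")
print(json.dumps(policy, indent=2))
```

The value is less in the JSON than in the habit: every bucket has a written answer to “who can do what”, and the backup reader cannot write or delete.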
Conclusion: Ceph success is only as real as your “fault simulation”
What makes Ceph good is not “it works today”; it is that it remains controllable tomorrow when a rack goes away. Once you describe the failure domain correctly and back the recovery budget with capacity and the network, Ceph genuinely produces value in enterprise environments: less vendor lock-in, better scale, and more consistent operations.