Resilience is something you buy with the work you do “while nothing is broken”. And yet the idea of testing resilience in production triggers a perfectly healthy fear in most teams: “What if we actually break it?” In this piece I want to skip the romanticism around chaos and describe an approach that holds up under operational reality: the experiment plane.
When do I take chaos engineering seriously?
If these three conditions are present, chaos work moves from “nice to have” into the realm of risk reduction:
- Production now depends on many components (DB, cache, queue, 3rd parties, network layer)
- The phrase “we couldn’t have tested this” keeps coming back in postmortems
- Change velocity has gone up but SLO/SLA pressure has not eased
The experiment plane: an architecture that makes chaos safe
The model I recommend has four layers:
- Hypothesis layer: a measurable prediction such as “if this component lags by 2 minutes, user impact will be X.”
- Blast radius layer: apply the experiment to a targeted slice, not to all traffic.
- Guardrail layer: automatic stop / rollback boundaries.
- Evidence layer: prove what actually happened with metrics + logs + traces.
Without this model, chaos engineering tends to slide into a “show of bravery” and burn the team’s trust.
The 6 safest experiment types to start with
The experiments that produce the fewest surprises in production:
- Latency injection (e.g. 200–500 ms added toward a downstream)
- Error rate injection (a controlled share of 5xx responses or timeouts)
- Pod/VM kill (single instance termination)
- Network partition simulation (within a constrained segment)
- Rate limit / quota (gradual tightening)
- Dependency blackhole (canary ring only)
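The first two experiment types can be sketched as a thin wrapper around a downstream call. This is a minimal illustration rather than a production fault injector: `with_faults`, `InjectedFault`, and the rate parameters are hypothetical names invented for this example, and a real setup would more likely inject faults at a proxy or service mesh than in application code.

```python
import random
import time
from typing import Any, Callable, Optional, Tuple


class InjectedFault(Exception):
    """Stands in for a downstream 5xx or timeout (hypothetical name)."""


def with_faults(
    call: Callable[..., Any],
    *,
    latency_ms: Tuple[float, float] = (200.0, 500.0),
    latency_rate: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap `call` so that a share of invocations is delayed or fails."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # Error injection: fail before touching the real dependency.
        if rng.random() < error_rate:
            raise InjectedFault("injected downstream failure")
        # Latency injection: add a delay drawn from the configured range.
        if rng.random() < latency_rate:
            low, high = latency_ms
            sleep(rng.uniform(low, high) / 1000.0)
        return call(*args, **kwargs)

    return wrapped
```

Both rates default to zero, so the wrapper is a no-op until an experiment explicitly turns it on. Passing `rng` and `sleep` as arguments keeps the behavior reproducible in tests.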
Blast radius: target it, don’t broadcast it
Practical ways to shrink blast radius:
- Release ring: limit the experiment to the canary ring
- Tenant/segment: only on a low-risk tenant
- Endpoint: only against background job endpoints
- Time window: a short, fixed window when on-call is ready
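The most important principle is this: the blast radius must never grow beyond what you can roll back quickly. If a rollback takes minutes, the experiment should touch only as much traffic as you can afford to have degraded for those minutes.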
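One way to make the four dimensions above concrete is a single predicate that a request must pass before any fault is applied to it. The sketch below is an assumption-laden illustration: `BlastRadius`, its field names, and the example ring, tenant, and path values are all hypothetical.

```python
import datetime as dt
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class BlastRadius:
    rings: FrozenSet[str]
    tenants: FrozenSet[str]
    endpoint_prefixes: Tuple[str, ...]
    window_start: dt.time
    window_end: dt.time  # same-day window; crossing midnight is not handled

    def includes(self, *, ring: str, tenant: str, endpoint: str, now: dt.datetime) -> bool:
        """True only if the request is inside every dimension of the radius."""
        in_window = self.window_start <= now.time() < self.window_end
        return (
            in_window
            and ring in self.rings
            and tenant in self.tenants
            and endpoint.startswith(self.endpoint_prefixes)
        )


# Hypothetical example: canary ring, one internal tenant, job endpoints, 15 minutes.
radius = BlastRadius(
    rings=frozenset({"canary"}),
    tenants=frozenset({"internal-test"}),
    endpoint_prefixes=("/jobs/",),
    window_start=dt.time(10, 0),
    window_end=dt.time(10, 15),
)
```

The dimensions are combined with `and`, so every added constraint can only shrink the radius. An empty set or empty prefix tuple matches nothing, which makes the default "no traffic" rather than "all traffic".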
Guardrails: how does the experiment stop on its own?
For chaos to be safe, “a human will notice” is not a sufficient strategy. I usually require these guardrails:
- SLO-based stop: experiment shuts off when error budget burn crosses a threshold
- Latency stop: shuts off when p95/p99 crosses a threshold
- Saturation stop: shuts off when queue depth, thread pools, or connection pools fill up
- Secondary signal stop: shuts off when a business metric like “checkout success rate” drops
- Auto rollback: the config flag or traffic split is reverted automatically
Guardrails should be versioned alongside the experiment definition and go through review.
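The guardrails above can be expressed as data plus one evaluation function, which is also what makes them easy to version and review. The following is a sketch under stated assumptions: the metric names, the thresholds, and the `rollback` callback are hypothetical, and a real loop would read the values from your monitoring system on a timer.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Sequence


@dataclass(frozen=True)
class Guardrail:
    name: str
    metric: str
    threshold: float
    breach_when: str = "above"  # "above" or "below" the threshold

    def breached(self, metrics: Mapping[str, float]) -> bool:
        value = metrics.get(self.metric)
        if value is None:
            # Assumption: a missing signal counts as a breach (fail closed).
            return True
        if self.breach_when == "below":
            return value < self.threshold
        return value > self.threshold


def check_guardrails(
    guardrails: Sequence[Guardrail],
    metrics: Mapping[str, float],
    rollback: Callable[[], None],
) -> List[str]:
    """Return the names of breached guardrails; roll back if there are any."""
    breached = [g.name for g in guardrails if g.breached(metrics)]
    if breached:
        rollback()
    return breached


# Hypothetical thresholds; tune them against your own SLOs.
GUARDRAILS = (
    Guardrail("slo", "error_budget_burn_rate", 2.0),
    Guardrail("latency", "p99_latency_ms", 800.0),
    Guardrail("saturation", "queue_depth", 1000.0),
    Guardrail("business", "checkout_success_rate", 0.97, breach_when="below"),
)
```

Treating a missing metric as a breach is a deliberate choice in this sketch: if the signal that is supposed to protect you disappears mid-experiment, stopping is the safer default.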
Experiment format: a one-page discipline
Once you make this format the minimum standard, chaos work stops being “individual heroics”:
- Goal: which risk are we reducing?
- Hypothesis: expected behavior (measurable)
- Preconditions: are alerts, dashboards, runbooks ready?
- Blast radius: which ring/tenant/endpoint?
- Guardrails: which signals trigger an auto-stop?
- Rollback: can it be undone with a single command or a single flag?
- Evidence: where are the outputs being recorded?
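To enforce the one-page format rather than merely recommend it, the seven sections can be modeled as a record that refuses to count as ready while any section is empty. This is only a sketch: `ExperimentPlan` and `missing_sections` are hypothetical names, and the free-text fields stand in for whatever your team actually stores.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class ExperimentPlan:
    goal: str
    hypothesis: str
    preconditions: Tuple[str, ...]
    blast_radius: str
    guardrails: Tuple[str, ...]
    rollback: str
    evidence: str

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, in the order of the template."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


plan = ExperimentPlan(
    goal="Reduce the risk of checkout stalls when the cache is slow",
    hypothesis="With 300 ms added cache latency, checkout p99 stays under 800 ms",
    preconditions=("alerts reviewed", "dashboard linked", "runbook updated"),
    blast_radius="canary ring, internal tenant, /jobs/ endpoints",
    guardrails=(),  # left empty on purpose to show the check
    rollback="",    # left empty on purpose to show the check
    evidence="experiment log and linked dashboard snapshot",
)
```

Here `plan.missing_sections()` returns `["guardrails", "rollback"]`, which a review step or CI check could use to block the experiment from being scheduled.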
Success criterion: not “we didn’t break it” but “we learned”
If at the end of the experiment you cannot answer the following clearly, the experiment was wasted:
- Did the alarms we expected actually fire?
- Did the on-call flow work as designed?
- Which dashboards turned out to be insufficient?
- What was the time to recovery (the number that feeds your MTTR)?
- If the same class of incident hits again, would we recover faster?
This is where the ROI of chaos work shows up: fewer surprises, faster recovery, less operational stress.
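Two of the questions above, whether the expected alarms fired and how long recovery took, can be answered mechanically from the evidence you recorded. The sketch below assumes you can export alarm names and two timestamps from your tooling; `summarize_run` and the alarm names in the usage note are hypothetical.

```python
import datetime as dt
from typing import Dict, Iterable


def summarize_run(
    expected_alarms: Iterable[str],
    fired_alarms: Iterable[str],
    fault_started: dt.datetime,
    recovered: dt.datetime,
) -> Dict[str, object]:
    """Compare expected and fired alarms and compute the time to recovery."""
    if recovered < fault_started:
        raise ValueError("recovery cannot precede the start of the fault")
    expected, fired = set(expected_alarms), set(fired_alarms)
    return {
        # Alarms we predicted but never saw: a detection gap.
        "missing_alarms": sorted(expected - fired),
        # Alarms we did not predict: a gap in our model of the system.
        "unexpected_alarms": sorted(fired - expected),
        "recovery_seconds": (recovered - fault_started).total_seconds(),
    }
```

For example, calling it with expected alarms `["HighLatency", "QueueBacklog"]` and fired alarms `["HighLatency", "DiskFull"]` reports `QueueBacklog` as missing and `DiskFull` as unexpected. Both lists are findings worth a follow-up, even if the experiment itself "passed".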