Most teams sit at one of two extremes:
- “Let’s release continuously” (risk grows)
- “We had an incident; freeze for two weeks” (velocity dies)
The third path, which has worked best for me in practice, is to convert the error budget into an actual control signal. In other words, manage release velocity not by gut feel but by the SLO burn.
What does the error budget actually solve?
When you write an SLO, you also define how much failure the system is allowed: the error budget. If that budget is being consumed faster than planned:
- Slowing down releases makes sense (risk/impact has grown)
- Pulling reliability work forward makes sense (operational debt has grown)
The real point isn’t “let’s stop releasing”; it’s releasing the right thing at the right moment.
Build the release gate as a decision model, not “a metric”
In my model the gate runs on three signals:
- Short-window burn: last 1–2 hours (is something breaking right now?)
- Mid-window burn: last 6–24 hours (is the trend bad?)
- Incident state: any active Sev1/Sev2? (is incident command active?)
Together those three produce three release decisions:
- Go: normal release
- Go (guarded): canary + slower rollout + fast rollback
- No-Go: only emergency security/business-continuity fixes
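To make the "decision model" idea concrete, here is a minimal sketch of how three signals could collapse into one decision. It assumes that each signal produces its own verdict and that the most restrictive verdict wins; it also assumes an active Sev1/Sev2 maps to No-Go. Neither rule is prescribed above, and the names (`Mode`, `GateSignals`, `decide`, `incident_verdict`) are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Mode(IntEnum):
    """Release decisions, ordered from least to most restrictive."""
    GO = 0
    GUARDED = 1
    NO_GO = 2


@dataclass(frozen=True)
class GateSignals:
    short_window: Mode  # verdict from the last 1-2 hours of burn
    mid_window: Mode    # verdict from the last 6-24 hours of burn
    incident: Mode      # verdict from incident state


def incident_verdict(active_severities: set) -> Mode:
    # Assumption: any active Sev1/Sev2 blocks normal releases.
    return Mode.NO_GO if active_severities & {"sev1", "sev2"} else Mode.GO


def decide(signals: GateSignals) -> Mode:
    # Assumption: the most restrictive verdict wins.
    return max(signals.short_window, signals.mid_window, signals.incident)
```

Keeping the verdicts ordered means adding a fourth signal later only touches `GateSignals` and the `max` call, not the decision logic.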
Thresholds: an example that holds up in the field
For a sample service:
- SLO: 99.9% success (monthly)
- Error budget: 0.1% of requests, equivalent to ~43 minutes of full outage in a 30-day month
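The arithmetic behind that figure is simple enough to sketch. The helper names below are mine, and the 30-day period is an assumption about what "monthly" means for this service.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the period."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * period_days * 24 * 60


def budget_left(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        return 1.0  # assumption: no traffic means nothing was spent
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))  # 43.2
```

For a success-rate SLO, the request-based view (`budget_left`) is usually the one the gate should read; the minutes figure is mainly useful for explaining the budget to people.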
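The burn logic I commonly use (burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period):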
- 2-hour window burn rate > 2 → Go (guarded)
- 6-hour window burn rate > 3 → No-Go
- 24-hour window burn rate > 1.5 → Go (guarded)
These numbers aren’t a “universal truth”; the point is:
- catch fast degradation on the short window
- catch the trend on the mid window
- bind the decision to a rollout strategy
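As a rough sketch, the thresholds above can be expressed as a small rule table. The function names and the "most restrictive rule wins" tie-break are my assumptions; the window and threshold values are the ones listed for the sample service.

```python
SLO = 0.999  # sample service from the text

# (window in hours, burn-rate threshold, resulting mode)
RULES = [
    (2, 2.0, "guarded"),
    (6, 3.0, "no-go"),
    (24, 1.5, "guarded"),
]
SEVERITY = {"go": 0, "guarded": 1, "no-go": 2}


def burn_rate(failed: int, total: int, slo: float = SLO) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # assumption: no traffic means no burn
    return (failed / total) / (1.0 - slo)


def mode_from_burn(burn_by_window: dict) -> str:
    """Apply every rule and keep the most restrictive result."""
    mode = "go"
    for hours, threshold, result in RULES:
        exceeded = burn_by_window.get(hours, 0.0) > threshold
        if exceeded and SEVERITY[result] > SEVERITY[mode]:
            mode = result
    return mode
```

For example, 5 failures in 1,000 requests against a 99.9% SLO is a burn rate of about 5, which trips the 2-hour rule if it is sustained over that window.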
The gate’s output: automate the rollout strategy
The gate doesn’t have to mean “build fail.” With most teams I recommend this path:
- The gate result emits a release parameter
- In “guarded” mode: the canary percentage drops, the step interval stretches, automatic rollback gets more aggressive
- In “no-go” mode: only PRs labeled “break-glass” pass
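One way to picture "the gate emits a release parameter" is a lookup from mode to rollout settings. Everything concrete here is hypothetical: the `RolloutParams` fields, the numbers, and the choice to ship break-glass PRs with guarded settings are illustrations, not values recommended above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutParams:
    canary_percent: int         # share of traffic sent to the new version first
    step_interval_minutes: int  # wait between rollout steps
    rollback_error_rate: float  # error rate that triggers automatic rollback


# Hypothetical values; tune per service.
ROLLOUT_BY_MODE = {
    "go": RolloutParams(10, 5, 0.02),
    "guarded": RolloutParams(2, 20, 0.005),
}


def plan_release(mode: str, pr_labels: set) -> Optional[RolloutParams]:
    """Return rollout parameters, or None when the release is blocked."""
    if mode == "no-go":
        # Only break-glass PRs pass; assumption: they roll out with guarded settings.
        return ROLLOUT_BY_MODE["guarded"] if "break-glass" in pr_labels else None
    return ROLLOUT_BY_MODE[mode]
```

Guarded mode shows up as a smaller canary, a longer step interval, and a lower rollback trigger, which is the "more aggressive rollback" described in the list.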
A representative CI/CD check:
name: release-gate
on:
  workflow_call:
    outputs:
      mode:
        description: gate result
        value: ${{ jobs.gate.outputs.mode }}
jobs:
  gate:
    runs-on: ubuntu-latest
    outputs:
      mode: ${{ steps.out.outputs.mode }}
    steps:
      - name: Decide gate mode
        id: out
        run: |
          # pseudo: fetch burn rates from your metrics API
          echo "mode=guarded" >> "$GITHUB_OUTPUT"
In this approach “guarded” mode lowers risk without halting the release entirely.
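The hard-coded `echo` could be replaced with a small script along these lines. This is a sketch under assumptions: `BURN_2H`, `BURN_6H`, and `BURN_24H` are hypothetical environment variables that an earlier step would have to populate from your metrics API, and falling back to "guarded" when they are missing is my choice, not a rule from the text. Appending `name=value` to the file named by `GITHUB_OUTPUT` is how GitHub Actions picks up step outputs.

```python
import os
from typing import Mapping


def gate_mode(env: Mapping[str, str]) -> str:
    """Map burn rates from the environment to a gate mode."""
    try:
        b2, b6, b24 = (float(env[k]) for k in ("BURN_2H", "BURN_6H", "BURN_24H"))
    except (KeyError, ValueError):
        return "guarded"  # assumption: missing data should not yield a plain "go"
    if b6 > 3:
        return "no-go"
    if b2 > 2 or b24 > 1.5:
        return "guarded"
    return "go"


def write_output(name: str, value: str) -> None:
    """Append a step output as name=value to the file in $GITHUB_OUTPUT."""
    path = os.environ.get("GITHUB_OUTPUT")
    if path is None:
        print(f"{name}={value}")  # local run outside Actions
        return
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


if __name__ == "__main__":
    write_output("mode", gate_mode(os.environ))
```

Saved under a name of your choosing and invoked from the `run:` step, the script leaves the rest of the workflow unchanged, since callers still read `jobs.gate.outputs.mode`.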
The operations side: who owns the gate?
The most critical piece of the gate is ownership.
- When you say “SRE approves,” does SRE actually have the capacity and authority?
- When you say “the product team decides,” do they have incident reflexes and metrics literacy?
The split that has worked for me in practice:
- The SLO definition + gate policy is owned by platform/operations leadership
- On a “No-Go” decision, the incident commander has authority
- “Go (guarded)” releases are run by the service owner, but the rollout guardrails come from the platform
Blind spots in the gate (and how to fix them)
1) Batch jobs are wrecking the SLO
Fix: measure the SLO along the “user journey” and give batch workloads their own SLO.
2) An external dependency (vendor) is burning you
Fix: define a “dependency SLO” and add it as a separate signal in the gate’s decision.
3) The gate is too noisy (false positives)
Fix: balance burn calculations across multiple windows and cross-check with incident state.
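For the noise problem, one common way to balance windows is to act only when a long window and a much shorter confirmation window both exceed the same threshold, so a spike that has already recovered does not trip the gate. The sketch below assumes that pairing; the function name and the example values are hypothetical.

```python
def confirmed_burn(long_burn: float, short_burn: float, threshold: float) -> bool:
    """True only when both the long and the short window exceed the threshold."""
    return long_burn > threshold and short_burn > threshold


# A 6-hour burn that is still elevated in the most recent 30 minutes counts.
still_burning = confirmed_burn(long_burn=3.4, short_burn=4.1, threshold=3.0)

# A 6-hour burn inflated by an old spike that has since recovered does not.
already_recovered = confirmed_burn(long_burn=3.4, short_burn=0.2, threshold=3.0)
```

The trade-off is a slightly slower reaction in exchange for fewer false No-Go decisions, which is usually acceptable because incident state still covers the acute case.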
Conclusion
Designing a release gate around the error budget isn’t about “lowering release velocity”; it’s about modulating velocity by current state. With the right signals, the right thresholds, and automation tied to the rollout strategy, you can hold both safety and delivery speed inside one system.