A meaningful chunk of the outages I have seen in enterprise systems are not “we deployed code and broke it.” They are “we changed configuration and broke it.” Configuration tends to be the thing that:
- Has no clear owner
- Is not versioned (nobody can say what changed when)
- Is not tested (production is the test)
- Cannot be rolled back
This post is about treating configuration not as a file but as a control plane — and the governance that comes with that.
1) The three classes of configuration
Step one is to stop putting everything in the same bucket.
- Secrets (passwords, tokens, private keys)
- Parameters (timeouts, thresholds, endpoints, quotas)
- Feature flags (turning behavior on/off, degradation, canary)
Each of those three has a different storage, access, and audit profile. A secret, for example, must never show up in a log line or a diff, while a timeout value can.
2) Parameter store: why and when
A parameter store (or a configuration service) is what gives me:
- Centralized change management
- Audit logs (who changed what, when)
- Per-environment separation (dev/stage/prod)
- Rollback to a prior version
But the store itself is not the answer. The real value is the process you build around it.
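To make the four properties above concrete, here is a minimal in-memory sketch. It is not any particular product's API: `ParameterStore`, `AuditEntry`, and the key names are hypothetical, and a real store would persist this and enforce access control. The point is only that every write is a new version with an author attached, and rollback is just another audited write.

```python
from datetime import datetime, timezone
from typing import Any, NamedTuple


class AuditEntry(NamedTuple):
    env: str
    key: str
    version: int
    value: Any
    changed_by: str
    changed_at: datetime
    note: str


class ParameterStore:
    """Hypothetical append-only store: every write creates a new version."""

    def __init__(self) -> None:
        # History is kept per (environment, key), so dev and prod never mix.
        self._history: dict[tuple[str, str], list[AuditEntry]] = {}

    def set(self, env: str, key: str, value: Any, changed_by: str, note: str = "") -> int:
        versions = self._history.setdefault((env, key), [])
        entry = AuditEntry(env, key, len(versions) + 1, value, changed_by,
                           datetime.now(timezone.utc), note)
        versions.append(entry)
        return entry.version

    def get(self, env: str, key: str) -> Any:
        # The active value is always the most recent version.
        return self._history[(env, key)][-1].value

    def rollback(self, env: str, key: str, changed_by: str) -> int:
        # Re-publish the previous value as a new version, so the history
        # stays append-only and the rollback itself is audited.
        versions = self._history[(env, key)]
        if len(versions) < 2:
            raise ValueError("no prior version to roll back to")
        previous = versions[-2]
        return self.set(env, key, previous.value, changed_by,
                        note=f"rollback to v{previous.version}")

    def audit_log(self, env: str, key: str) -> list[AuditEntry]:
        return list(self._history[(env, key)])
```

Usage looks like `store.set("prod", "http.timeout_ms", 3000, "alice")`, followed later by `store.rollback("prod", "http.timeout_ms", "alice")`. The audit log then shows who made the bad change and who undid it.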
3) Feature flag discipline: the kill-switch is a product feature
Feature flags earn their keep for two reasons:
- You can roll out a risky change in a controlled way (canary)
- You can shut a problem off fast (kill-switch)
Used badly, they become a “flag graveyard” you have to dig out of later.
The rules I keep coming back to:
- Every flag has an owner
- Every flag has a TTL (temporary flags expire and get cleaned up)
- Flag changes are audited
- Flag values are human-readable (no magic numbers)
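Most of these rules can be checked mechanically. The sketch below is one way to do it, assuming flags are declared somewhere a script can read them. `FeatureFlag`, `FlagState`, and `lint_flags` are names I made up for illustration, and auditing is left out because it belongs to the store, not the flag definition.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class FlagState(Enum):
    # Human-readable values instead of magic numbers like 0, 1, 2.
    OFF = "off"
    CANARY = "canary"
    ON = "on"


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    owner: str
    state: FlagState
    # None marks a deliberately permanent flag, such as a kill-switch.
    expires_on: Optional[date]


def lint_flags(flags: list[FeatureFlag], today: date) -> list[str]:
    """Return one message per rule violation; an empty list means clean."""
    problems = []
    for flag in flags:
        if not flag.owner.strip():
            problems.append(f"{flag.name}: no owner")
        if flag.expires_on is not None and flag.expires_on < today:
            problems.append(
                f"{flag.name}: expired on {flag.expires_on.isoformat()}, remove it"
            )
    return problems
```

Run something like this in CI and the flag graveyard shows up as a failing check instead of as archaeology two years later.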
4) Schema and validation: don’t ship a wrong config to prod
The most expensive part of a config change is when a wrong value spreads silently.
The minimum validation set:
- A schema (JSON schema, zod, etc.)
- Range checks (a timeout cannot be 0)
- A “dangerous value” denylist (e.g. things like 0.0.0.0/0)
- A smoke test after the change applies
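The first three checks fit in a few dozen lines. This is a hand-rolled sketch using only the standard library rather than JSON Schema or zod. The keys in `SCHEMA`, their limits, and the `validate` function are assumptions for illustration. The smoke test is not covered here, since it depends on the service.

```python
import ipaddress

# Hypothetical schema: expected type plus optional inclusive bounds.
SCHEMA = {
    "http_timeout_ms": {"type": int, "min": 1, "max": 60_000},
    "max_retries": {"type": int, "min": 0, "max": 10},
    "allowed_cidr": {"type": str},
}

# Values that are syntactically valid but almost never what you meant.
DENYLISTED_NETWORKS = {
    ipaddress.ip_network("0.0.0.0/0"),
    ipaddress.ip_network("::/0"),
}


def validate(config: dict) -> list[str]:
    """Return a list of error messages; an empty list means the config passes."""
    errors = []
    for key in sorted(set(config) - set(SCHEMA)):
        errors.append(f"{key}: unknown key")
    for key, rule in SCHEMA.items():
        if key not in config:
            errors.append(f"{key}: missing")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{key}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{key}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{key}: {value} is above the maximum {rule['max']}")
    cidr = config.get("allowed_cidr")
    if isinstance(cidr, str):
        try:
            network = ipaddress.ip_network(cidr, strict=False)
        except ValueError:
            errors.append(f"allowed_cidr: {cidr!r} is not a valid CIDR")
        else:
            if network in DENYLISTED_NETWORKS:
                errors.append(f"allowed_cidr: {cidr} is on the denylist")
    return errors
```

The design choice that matters is collecting every error rather than failing on the first one: the person making the change sees the whole picture in one pass, and the store can refuse the write if the list is non-empty.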
5) Approval flow and blast radius: the “small change” myth
Two questions for every config change:
- What is the blast radius? (How many services, how many users, how many regions?)
- How fast is the rollback? (One click, or a redeploy?)
The operational defaults I use:
- Production changes need a “two-person” sign-off, at least for the critical parameters
- Canary first — apply to a small user segment before going wide
- Rollback to the previous version is automatic, not a manual scramble
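Two of these defaults are easy to sketch in code. Below, the sign-off rule is read as "the author plus at least one other person", which is my assumption about what two-person means here. `CRITICAL_KEYS`, `ChangeRequest`, and `in_canary` are hypothetical names, and the canary uses a stable hash bucket, which is one common way to pick a small user segment rather than the only one.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical list of parameters that require a second person.
CRITICAL_KEYS = {"db.pool_size", "auth.token_ttl_s"}


@dataclass
class ChangeRequest:
    key: str
    new_value: object
    author: str
    approvers: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        if reviewer == self.author:
            raise PermissionError("authors cannot approve their own change")
        self.approvers.add(reviewer)

    def can_apply(self) -> bool:
        # Assumption: non-critical keys need no approval; critical keys
        # need at least one approver who is not the author.
        if self.key not in CRITICAL_KEYS:
            return True
        return len(self.approvers) >= 1


def in_canary(user_id: str, key: str, percent: int) -> bool:
    """Deterministically place a user in or out of the canary segment."""
    # Hashing key and user together gives each rollout its own segment,
    # so the same users are not the guinea pigs for every change.
    digest = hashlib.sha256(f"{key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket is derived from a hash rather than a random draw, a given user stays in or out of the canary for the lifetime of the rollout, which keeps their experience consistent and makes incidents reproducible.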
6) Observability: publish the config change as an event
When a config changes, emit:
- A “config event” that looks structurally like a deploy event
- The new version / id of the active value set (the id, not the values themselves)
- A marker on dashboards that says “config changed here”
Wire that up and, during an incident, the question “did this latency spike come from a config change?” stops being a half-hour investigation and becomes a five-second answer.
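A config event along those lines might be built like this. The field names and the `build_config_event` helper are assumptions, not a standard; the part worth copying is that the event carries an id derived from the value set and never the values themselves.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def config_version_id(values: dict) -> str:
    """Derive a short, stable id from the active value set."""
    # Canonical JSON (sorted keys, fixed separators) so that key order
    # does not change the id.
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def build_config_event(service: str, env: str, values: dict, changed_by: str,
                       previous_version_id: Optional[str] = None) -> dict:
    """Build an event shaped like a deploy event, without the raw values."""
    return {
        "event_type": "config_change",
        "service": service,
        "environment": env,
        "version_id": config_version_id(values),
        "previous_version_id": previous_version_id,
        "changed_by": changed_by,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Send the resulting dict to whatever already receives your deploy events, and the dashboard marker comes almost for free. One caveat: a hash of a low-entropy value can be guessed by brute force, so if the value set includes secrets, a store-assigned version number is a safer id than a content hash.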
Configuration governance is not a brake on velocity — done well, it is an accelerant. More controlled changes, faster rollbacks, fewer incidents. Feature flags and a parameter store only deliver real value when you put schema, audit, and process around them.