In teams using Terraform, you’ll typically see two extreme stances:
- “Let’s apply on every PR, we move fast.” (risk grows)
- “Apply only once a week, we’re afraid of changes.” (speed dies)
A good CI design separates plan from apply, makes drift visible, and uses policy-as-code to stop “the wrong change” before it even merges. This post walks through a guardrail set that holds up in the field for prod infrastructure changes, framed in tool-agnostic principles.
Goal: “PR = decision,” “Apply = action”
In the PR, the goal is to:
- Understand the change (review)
- See its impact (plan)
- Verify policy compliance (policy check)
In the apply phase, the goal is to:
- Apply an approved change in a controlled way
- Have a rollback plan ready
This separation is foundational for both security and operational maturity.
Minimum guardrail set
My “minimum viable” set of guardrails:
- Format + validate (every PR)
- Plan (every PR; environment-specific)
- Policy-as-code (against the plan output)
- Drift detection (daily/weekly)
- Apply only under protected conditions (manual approval + branch protection)
Plan strategy: which environment do I plan against?
A single “prod plan” approach isn’t right for every repo. Practical choices:
- Small team / few environments: plan dev + prod in the PR
- Large team / many environments: plan dev in the PR, plan + approve prod post-merge
The decision here is about secrets, cost, and attack surface. Handing prod credentials to a PR context is unacceptable in some organizations.
Policy check: reading the plan and being able to say “stop”
The point of a policy check:
- “This change works technically” isn’t enough
- The aim is to automatically answer “Does this change fit the organization?”
Sample policy rules:
- No public S3 buckets
- Restrict security group inbound from 0.0.0.0/0
- KMS encryption is mandatory
- Cap prod DB instance class
- Tagging standard required (cost center, owner)
Tool names may vary; the important thing is to read the resource’s intent from the plan output.
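As one tool-agnostic illustration, a policy check can be a small script that reads the JSON form of a plan (as produced by `terraform show -json` on a saved plan file) and returns a list of violations. The sketch below covers two of the sample rules. The tag keys `owner` and `cost_center`, the function name `check_plan`, and the choice to inspect only inline `ingress` blocks on `aws_security_group` are assumptions made for illustration, not a complete policy engine.

```python
from __future__ import annotations

from typing import Any

# Assumed tag keys for the "tagging standard" rule; adjust to your own standard.
REQUIRED_TAGS = {"owner", "cost_center"}


def check_plan(plan: dict[str, Any]) -> list[str]:
    """Return human-readable policy violations found in a Terraform plan JSON document."""
    violations: list[str] = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = set(change.get("actions", []))
        # Only evaluate resources that will exist after the apply.
        if not actions & {"create", "update"}:
            continue
        after = change.get("after") or {}
        address = rc.get("address", "<unknown>")

        # Rule: no security group inbound from 0.0.0.0/0 (inline ingress blocks only).
        if rc.get("type") == "aws_security_group":
            for rule in after.get("ingress") or []:
                if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                    violations.append(f"{address}: inbound open to 0.0.0.0/0")

        # Rule: required tags, checked only on resources that expose a tags attribute.
        if "tags" in after:
            missing = REQUIRED_TAGS - set((after.get("tags") or {}).keys())
            if missing:
                violations.append(f"{address}: missing tags {sorted(missing)}")
    return violations
```

In CI, the job would load the plan JSON with `json.load`, call `check_plan`, print the violations, and exit non-zero if the list is not empty. Values that are only known after apply do not appear under `after`, so a real check needs a decision on how to treat unknowns.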
Drift: what to do when “real world” diverges from “the repo”?
Drift is the starting point of most IaC accidents:
- A hotfix was applied via the console (and forgotten)
- A bypass was used during an incident
- A system outside the automation made a change
Two practical rules for drift detection:
- Run plan regularly (without applying)
- If there’s drift, open an issue and assign an owner
Critical: manage drift as a “signal,” not a “crime.” Drift is either a process gap or a gap in IaC coverage.
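A scheduled drift job can be sketched around `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 2 when it contains changes, and 1 on error. The function names and the `workdir` parameter below are hypothetical, and the sketch assumes the job runs against the already-applied default branch, so that a non-empty plan points at divergence rather than at a merged change that has not been applied yet. Opening the issue and assigning an owner is left to whatever tracker you use.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a drift status."""
    if code == 0:
        return "clean"   # plan succeeded, no differences
    if code == 2:
        return "drift"   # plan succeeded, differences present
    return "error"       # plan itself failed (exit code 1 or anything unexpected)


def detect_drift(workdir: str) -> str:
    """Run a plan without applying it and report the drift status for one workspace."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Treating "error" as its own status matters: a drift job that silently fails looks exactly like a clean one unless the failure is reported too.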
Apply control: who, when, under what conditions?
The safest model for prod apply in the field:
- Apply only from main (or a release branch)
- Manual approval (at least 2 people)
- The apply job runs on a restricted runner (network + IAM)
- State backend locking and audit logging are enabled
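The conditions above can be expressed as an explicit gate that the apply job evaluates before it runs. This is a minimal sketch: `ApplyRequest`, its fields, and `apply_blockers` are hypothetical names, and not counting the author's own approval is an added assumption rather than something the list requires. Most CI systems enforce these conditions through branch protection and environment approvals, so the code mainly shows the logic being enforced.

```python
from __future__ import annotations

from dataclasses import dataclass

ALLOWED_BRANCHES = {"main"}  # add release branches here if you apply from them
MIN_APPROVALS = 2


@dataclass(frozen=True)
class ApplyRequest:
    branch: str
    author: str
    approvers: frozenset[str]
    runner_is_restricted: bool  # runner has limited network access and IAM scope
    state_lock_enabled: bool


def apply_blockers(req: ApplyRequest) -> list[str]:
    """Return the reasons a prod apply must not run; an empty list means go."""
    blockers: list[str] = []
    if req.branch not in ALLOWED_BRANCHES:
        blockers.append(f"branch '{req.branch}' is not allowed to apply")
    # Assumption: the author's own approval does not count.
    independent = req.approvers - {req.author}
    if len(independent) < MIN_APPROVALS:
        blockers.append(f"needs {MIN_APPROVALS} approvals, has {len(independent)}")
    if not req.runner_is_restricted:
        blockers.append("apply must run on the restricted runner")
    if not req.state_lock_enabled:
        blockers.append("state locking is not enabled")
    return blockers
```

Returning reasons instead of a bare boolean keeps the refusal auditable: the job log shows exactly which condition blocked the apply.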
Additional guardrails:
- “Destructive change” warning (via plan parsing)
- “Maintenance window” check
- Slack/Teams notifications and a change record
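The first two of these guardrails are small enough to sketch. The plan JSON lists planned actions for each resource under `resource_changes[].change.actions`, so any entry containing `delete` (including a replace, which appears as a delete plus a create) can be flagged. The function names and the UTC-based window are assumptions for illustration.

```python
from __future__ import annotations

from datetime import datetime, time, timezone
from typing import Any


def destructive_changes(plan: dict[str, Any]) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    return [
        rc.get("address", "<unknown>")
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def in_maintenance_window(now: datetime, start: time, end: time) -> bool:
    """Check whether a timezone-aware `now` falls in the [start, end) window, in UTC."""
    current = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight, e.g. 22:00 to 02:00.
    return current >= start or current < end
```

A pipeline might require an extra approval when `destructive_changes` returns anything, and refuse to start a prod apply when `in_maintenance_window` is false.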
Many repos / many workspaces: what changes at scale?
At scale, two risks grow:
- Number of workspaces (complexity)
- Scope of authority (blast radius)
Good practices at this point:
- Module versioning (registry or git tags)
- Environment folders (separate state)
- “Per-service” ownership and review
- Staged rollout (staging first, then prod)
Closing: think of the guardrails as a “pipeline”
In Terraform CI the goal isn’t to write a single “perfect YAML”; it’s to catch risky changes early and to keep prod applies controlled.
Small starter steps I recommend:
- fmt + validate + plan in the PR
- Policy check against the plan
- Run the drift plan on a regular cadence
- Limit prod apply to manual approval and a hardened runner