Running GitOps for Cloud Infrastructure: The Crises I Have Lived Through
Cloud and modern infrastructure are now baked into how I build and ship. GitOps has become my default for tying it all together — automated, consistent, auditable. Git is the single source of truth, infrastructure lives as code, and changes flow through commits. The operational lift is real. So is the new failure surface that comes with it.
In this post I want to walk through the operational crises I keep running into when I manage cloud infrastructure with GitOps, and the practices that have helped me avoid the worst of them. The goal is not to talk you out of GitOps — it is to make sure you go in with eyes open.
GitOps Fundamentals and the Operational Risks That Come With Them
GitOps treats the Git repository as the canonical record of infrastructure state. When somebody on my team needs to change something, they make the change as a Git commit. An automation layer — a CI/CD pipeline or a dedicated GitOps controller — picks that commit up and reconciles it against the target environment, whether that is AWS, Azure, or GCP. The result is less manual cleanup, better traceability, and rollback paths that actually work.
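To make the reconcile step concrete, here is a minimal sketch of what a controller does conceptually: compare the desired state from Git with what is actually running and work out the actions needed. The `Action` type, the `plan_reconcile` function, and the example resources are all hypothetical names I made up for illustration; real controllers such as Argo CD or Flux are far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create", "update", or "delete"
    name: str


def plan_reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[Action]:
    """Compare desired state (from Git) with actual state and list the actions needed."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    # Anything running that Git no longer describes gets removed.
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


# Hypothetical example data, not taken from any real environment.
desired = {"web": {"image": "web:1.4.2", "replicas": 3}, "db": {"image": "postgres:16"}}
actual = {"web": {"image": "web:1.4.1", "replicas": 3}, "cache": {"image": "redis:7"}}

for action in plan_reconcile(desired, actual):
    print(action.kind, action.name)
```

Note that the plan includes a delete for anything Git does not describe. That is exactly the property that makes a bad commit dangerous: the loop has no opinion about whether the desired state is sensible.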
The catch is that automation amplifies whatever you put in front of it. Misconfigured pipelines, thin test coverage, and hidden dependencies all turn into operational crises faster than they would in a click-ops world, and the pace of change in cloud environments only sharpens that.
How Quickly a Bad Config Spreads
GitOps’ biggest selling point is that changes ship fast. That same speed is what makes a careless merge dangerous. A wrong CLI flag, a typo in an image tag, a stale value in a values file — once it lands, it is on hundreds or thousands of nodes within minutes. I have watched perfectly innocent-looking diffs cause outages, performance regressions, and in the worst case, data loss.
The mitigation is to put real testing in the path of every change. Unit tests, integration tests, security scans run against production-shaped environments — those are the things that let me catch a bad change before it ships. Code review with teeth helps too; the human pass is still where most “obviously wrong” changes get caught.
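One cheap check that fits in front of every merge is validating image references before they reach the controller. The sketch below assumes a policy that I am inventing for illustration: tags must look like semantic versions, and `latest` or a missing tag is rejected. `validate_image` and `SEMVER_TAG` are hypothetical names, and your own tagging scheme may well differ.

```python
import re

# Assumed policy: tags look like 1.2.3 or v1.2.3.
SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def validate_image(image: str) -> list[str]:
    """Return a list of problems found in an image reference such as 'repo/app:1.2.3'."""
    _name, sep, tag = image.rpartition(":")
    # A "/" after the last colon means the colon belonged to a registry port, not a tag.
    if not sep or "/" in tag:
        return [f"{image}: no tag given"]
    if tag == "latest":
        return [f"{image}: mutable tag 'latest' is not allowed"]
    if not SEMVER_TAG.match(tag):
        return [f"{image}: tag '{tag}' is not a semantic version"]
    return []


for ref in ["repo/app:1.2.3", "repo/app:latest", "registry:5000/app", "repo/app:1.2"]:
    print(ref, validate_image(ref) or "ok")
```

A check like this does not prove the tag exists in the registry. It only catches the obviously malformed cases early, which in my experience is still worth the few lines.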
The Dependency Problem
Cloud platforms are graphs, not trees. Change one node and the effects can ripple into half a dozen others. When I am managing infrastructure with GitOps, ignoring that graph is one of the easier ways to set off an incident. A database upgrade can break an application that depends on a specific client library version; an IAM change can lock something else out without warning.
The way I handle this is to treat the desired state as a real artifact. Tools like Terraform and Pulumi already lean into that: their state files become the contract. I keep those files backed up, check regularly that they still match what is declared in Git, and document the dependencies between modules so nobody has to guess. Walking the dependency graph before a change ships is cheap; finding out about it during the rollback is not.
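Walking the dependency graph can be as simple as a breadth-first search over "who depends on this?" edges. This is a minimal sketch under the assumption that the dependencies between modules have already been written down as a mapping; `impacted_by` and the resource names are hypothetical.

```python
from collections import deque


def impacted_by(change: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Return every resource that directly or transitively depends on `change`."""
    # Invert the edges: for each resource, record who depends on it.
    dependents: dict[str, set[str]] = {}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(resource)

    seen: set[str] = set()
    queue = deque([change])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(change)  # only matters if the graph contains a cycle
    return sorted(seen)


# Hypothetical documented dependencies: resource -> things it depends on.
depends_on = {
    "app": {"database", "iam-role"},
    "worker": {"database"},
    "dashboard": {"app"},
    "database": set(),
    "iam-role": set(),
}

print(impacted_by("database", depends_on))
```

Running this for `database` lists `app`, `dashboard`, and `worker`, which is the review list I would want attached to a database upgrade before anyone approves it.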
The Crises I See Most Often, and How GitOps Helps Solve Them
Below are three of the most common crisis shapes I see in real GitOps environments, plus the patterns that have actually worked when I have had to deal with them. They illustrate the kind of complexity GitOps brings to cloud infrastructure operations.
Scenario 1: Bad Versioning, Bad Rollouts
A developer wants to ship a new version of their app, so they update the image tag in the repo. The tag is wrong, or the new version has a regression that escaped CI. The GitOps controller does its job — it sees the diff and rolls it out — and now the broken version is what is running.
GitOps response:
- Automatic rollbacks: Most GitOps tools can revert to a previous Git revision and reconcile from there. I make sure that path is wired up and tested, so when a release goes sideways the recovery is seconds, not minutes.
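- Canary and Blue/Green: Don’t ship to everyone at once. Roll new versions out to a slice first, watch the signals, then expand. Both patterns can be modeled declaratively in a GitOps workflow, often with a companion progressive-delivery controller such as Argo Rollouts or Flagger.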
- Health checks: Post-deploy health probes have to be the trigger for promote-or-rollback decisions. If they fail, the rollback fires automatically.
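The promote-or-rollback decision described in the list above can be sketched as a small loop: widen traffic step by step and fall back to the stable revision on the first failed health check. Everything here is illustrative; `canary_rollout`, the `probe` function, and the traffic steps are assumptions of mine, and a real rollout would actually shift traffic and wait between steps.

```python
from typing import Callable, Sequence


def canary_rollout(
    stable: str,
    candidate: str,
    traffic_steps: Sequence[int],
    is_healthy: Callable[[str, int], bool],
) -> tuple[str, str]:
    """Widen traffic to the candidate in steps; fall back to stable on the first failed check."""
    for percent in traffic_steps:
        # A real setup would shift traffic here and wait before probing.
        if not is_healthy(candidate, percent):
            return stable, f"rolled back at {percent}%"
    return candidate, "promoted"


# Hypothetical probe: pretend the candidate starts failing once it takes half the traffic.
def probe(revision: str, percent: int) -> bool:
    return percent < 50


print(canary_rollout("app:1.4.1", "app:1.4.2", [5, 25, 50, 100], probe))
```

The point of the shape is that the health check, not a person, decides which revision ends up serving traffic.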
Scenario 2: Access Control and Secret Leaks
The fastest way to turn GitOps into a security incident is to either let the wrong people commit, or to commit something secret. API keys, tokens, passwords — once they hit history, you are doing key rotation under pressure. And the same automation that ships your infra will happily ship that mistake to production.
GitOps response:
- RBAC everywhere: Both the Git repository and the cloud accounts behind it need real role-based access control. Only the people who should be making a particular change should be able to.
- A real secret manager: Secrets do not live in Git. They live in HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or the equivalent, and the GitOps tooling pulls them at apply time.
- Strong code review: Every change gets a review, with reviewers who know what to look for. That is one of the main things that stops a leaked credential from being merged in the first place.
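Reviewers are easier to rely on when a machine has already looked for the obvious cases. Below is a minimal sketch of a scan over the added lines of a unified diff. The patterns are illustrative and far from exhaustive, `scan_added_lines` is a hypothetical helper, and the key in the example is the placeholder value AWS uses in its own documentation. Dedicated secret scanners cover much more ground.

```python
import re

# Illustrative patterns only; a dedicated scanner ships many more.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded-password": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (line number within the diff, pattern name) for each suspicious added line."""
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the diff adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


example_diff = "\n".join([
    "+++ b/values.yaml",
    "+region: eu-west-1",
    "+access_key: AKIAIOSFODNN7EXAMPLE",
    "-password: 'old'",
    "+password: 'hunter2'",
])

for lineno, name in scan_added_lines(example_diff):
    print(f"diff line {lineno}: {name}")
```

Wired into a pre-commit hook or a required CI check, something like this fails the change before the secret ever reaches shared history.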
Scenario 3: Resource Conflicts and Stuck State
When two people or two automations try to change the same resource at the same time, you get conflicts and, in the worst case, state that nobody can fully reason about. In big multi-team environments, this is one of the more common ways operations go off the rails.
GitOps response:
- One repo, clear branching rules: A single repo for infra plus clear branch policies (no direct commits to main, mandatory PRs, required reviewers) cuts down on accidental conflicts.
- State locking: Tools like Terraform store their state somewhere central, such as an S3 bucket or an Azure Blob container, and lock it so two operators cannot apply against the same state at the same time. Set this up properly and most concurrent-apply conflicts disappear.
- A conflict playbook: Some GitOps tools handle conflicts automatically; some do not. Either way, write down what the human procedure is, so when a conflict shows up at 3am the on-call does not have to invent one.
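The idea behind state locking is simple enough to show with a local file. The sketch below is a stand-in, not how Terraform or any backend actually implements its lock: it relies on exclusive file creation so that only one holder can win, and it records who holds the lock so the second operator gets a useful error. `state_lock`, `StateLockedError`, and the lock file name are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class StateLockedError(RuntimeError):
    """Raised when somebody else already holds the lock."""


@contextmanager
def state_lock(lock_path: str, owner: str):
    try:
        # O_EXCL makes creation fail if the file already exists, so only one holder wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as existing:
            holder = existing.read()
        raise StateLockedError(f"state is locked by {holder}") from None
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(owner)
    try:
        yield
    finally:
        os.remove(lock_path)  # release the lock even if the apply fails


lock_path = os.path.join(tempfile.mkdtemp(), "network.tfstate.lock")
with state_lock(lock_path, "alice"):
    try:
        with state_lock(lock_path, "bob"):
            print("bob is applying")
    except StateLockedError as err:
        print(err)
```

The error message naming the current holder is the part that matters for the 3am playbook: the on-call needs to know who to ask before forcing anything.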
Observability and Security: Where I Spend the Rest of My Time
Cutting incident impact is not just about having good rollback. It is about seeing problems early and not letting bad changes through in the first place. For GitOps in cloud infrastructure, that means observability and security as first-class concerns.
Observability
GitOps gives me a baseline of traceability for free — every change is in Git history. To turn that into something I can actually operate on during an incident, I add a few more layers.
- Detailed logging: The GitOps controller and every step of the pipeline log enough to reconstruct what happened. When something breaks, those logs are how I figure out where.
- State tracking: Terraform and Pulumi state files get monitored and reconciled against Git regularly. Drift between desired state and reality is something I want to know about.
- Metrics and alerting: Real metrics on the resources I run, with alerts wired to the patterns that historically precede incidents. Most of the time, I get those alerts before the user does.
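For the drift part of the list above, the useful output is not "something differs" but "which field differs, and how". A minimal sketch, assuming both the desired and the observed configuration are available as nested dictionaries with string keys; `find_drift` and the example attributes are hypothetical.

```python
def find_drift(desired: dict, actual: dict, prefix: str = "") -> dict[str, tuple]:
    """Return {dotted.path: (desired_value, actual_value)} for every field that differs."""
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        path = f"{prefix}{key}"
        # A missing key is reported as None, so "absent" and "explicitly None" look the same.
        want, have = desired.get(key), actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.update(find_drift(want, have, prefix=f"{path}."))
        elif want != have:
            drift[path] = (want, have)
    return drift


# Hypothetical example: what Git declares versus what the cloud API reports.
desired = {"instance_type": "t3.medium", "tags": {"env": "prod", "owner": "platform"}}
actual = {"instance_type": "t3.large", "tags": {"env": "prod"}}

for path, (want, have) in find_drift(desired, actual).items():
    print(f"{path}: desired={want!r} actual={have!r}")
```

Feeding a report like this into alerting turns drift from something I discover during an incident into something I hear about beforehand.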
Security
Automation that ships fast also ships vulnerabilities fast. Security has to be present at every step, not bolted on at the end.
- Vulnerability scanning: Container images and infrastructure code get scanned regularly. The GitOps pipeline triggers the scans, and a critical finding blocks the deploy.
- Static analysis: Static code analysis catches a surprising amount before it ever ships — bad patterns, known anti-patterns, obvious bugs. I run it on every change.
- Penetration tests: Periodic pen-testing against the GitOps-managed environment. The way to find out if my security posture is real is to have someone test it.
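The "critical finding blocks the deploy" rule from the list above comes down to a severity gate. This sketch assumes the scanner's output has already been parsed into dictionaries with a `severity` field; the field names, the severity scale, and the finding IDs are placeholders I chose, not the format of any particular scanner.

```python
# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(findings: list[dict], fail_on: str = "CRITICAL") -> list[dict]:
    """Return the findings at or above the `fail_on` severity."""
    cutoff = SEVERITY_ORDER.index(fail_on)
    return [f for f in findings if SEVERITY_ORDER.index(f["severity"]) >= cutoff]


# Placeholder findings; the IDs are made up and do not refer to real advisories.
findings = [
    {"id": "EXAMPLE-0001", "severity": "LOW"},
    {"id": "EXAMPLE-0002", "severity": "CRITICAL"},
    {"id": "EXAMPLE-0003", "severity": "HIGH"},
]

blockers = blocking_findings(findings)
if blockers:
    print("deploy blocked by:", ", ".join(f["id"] for f in blockers))
else:
    print("deploy allowed")
```

Making the threshold a parameter keeps the argument about "should HIGH block too?" in a reviewed config change rather than in a chat thread.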
Where I Want to Take GitOps Next
Running cloud infrastructure with GitOps keeps evolving, and so does my own thinking about how to do it well. A few directions I am leaning into.
- Policy as code: Tools like Open Policy Agent let me encode the rules — security baselines, compliance constraints, operational standards — as code that the pipeline enforces. That makes “no” a function of policy, not a person.
- AIOps: Using ML to chew through operational data, surface anomalies, and sometimes auto-remediate. Wired up to GitOps, that is a powerful crisis-prevention tool.
- Continuous improvement: Reviewing the GitOps workflow on a cadence, taking the lessons from incidents, and folding them back into the pipeline. The work is never quite done.
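To show what I mean by policy as code in the list above, here is a small stand-in written in Python. Open Policy Agent itself uses its own language, Rego, so this is not OPA syntax; it only illustrates the shape, where rules are plain functions and the pipeline fails when any rule returns a violation. The rule names and resource fields are hypothetical.

```python
from typing import Callable, Optional

# A rule inspects one resource and returns a violation message, or None if it passes.
Rule = Callable[[dict], Optional[str]]


def no_public_buckets(resource: dict) -> Optional[str]:
    if resource.get("type") == "bucket" and resource.get("public", False):
        return f"{resource['name']}: buckets must not be public"
    return None


def require_owner_tag(resource: dict) -> Optional[str]:
    if "owner" not in resource.get("tags", {}):
        return f"{resource['name']}: missing 'owner' tag"
    return None


def evaluate(resources: list[dict], rules: list[Rule]) -> list[str]:
    """Run every rule against every resource and collect the violations."""
    return [msg for res in resources for rule in rules if (msg := rule(res)) is not None]


# Hypothetical planned resources, as a pipeline step might see them.
resources = [
    {"type": "bucket", "name": "logs", "public": True, "tags": {"owner": "platform"}},
    {"type": "vm", "name": "batch-1", "tags": {}},
]

for violation in evaluate(resources, [no_public_buckets, require_owner_tag]):
    print("DENY", violation)
```

Because the rules live in the repo, changing what the pipeline refuses goes through the same review as any other change.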
Closing Thoughts
Running cloud infrastructure with GitOps is, for me, the closest thing to a step-change in operations I have seen. The speed, the consistency, and the automation pay for themselves many times over. But the same properties make discipline non-negotiable.
The crisis patterns and mitigations in this post are the ones I keep coming back to. GitOps is more than a stack of tools — it is a working culture. With continuous learning, real testing, strong observability, and a security-first mindset, you can manage cloud infrastructure with GitOps without burning yourself.
Take that approach and the operational crises stop being roadblocks. They become inputs to the next round of improvement.