Mustafa Erbay
Career · 8 min read

Post-Change Verification Cadence: Smoke, SLO, and Rollback

Assuming the release is done is how you summon an incident. A practical framework for turning post-change verification into a cadence: fast smoke checks…

The most expensive sentence on an enterprise platform is: “Deploy is done.”
Most incidents I’ve seen don’t come from the deploy itself; they come from post-deploy verification staying a “personal reflex” rather than a team habit. Without a verification cadence:

  • Surprises only show up when the change actually meets real user traffic,
  • Failure signals get noticed late,
  • The rollback decision turns into a debate.

This post lays out a framework that pulls post-change verification into a cadence simple enough that teams will actually run it.

The goal: define what “released” means

A release isn’t just an artifact going to production. A release answers three questions:

  1. Is the system serving traffic? (smoke)
  2. Are the SLO metrics close to their normal baseline? (observation)
  3. If it breaks, do we have a clear way back? (rollback)

A 3-layer verification cadence

Layer 1 — 2 to 5 minute smoke checks

Goal: catch the most basic breakages quickly.

  • Are the critical endpoints returning 200/ok?
  • Is the auth/SSO flow working?
  • Are the basic dependencies (DB, cache, queue) healthy?
  • Is traffic distributing across the right nodes via the LB?

Smoke isn’t “deep analysis” — it’s an “anything obviously broken?” check.
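
The checklist above can be sketched as a single smoke pass. This is a minimal sketch, not a reference implementation: the `SmokeCheck` type, the `run_smoke` helper, and any endpoint URLs you feed it are hypothetical names introduced here for illustration. The HTTP call is passed in as a function so the logic can be exercised without a live system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable
import urllib.error
import urllib.request


@dataclass(frozen=True)
class SmokeCheck:
    name: str                   # human-readable label, e.g. "auth flow"
    url: str                    # hypothetical endpoint; replace with your own
    expected_status: int = 200


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions that carry the status code.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def run_smoke(checks: Iterable[SmokeCheck],
              fetch: Callable[[str], int] = http_status) -> list[str]:
    """Run every check once and return the names of the failing ones."""
    failures = []
    for check in checks:
        if fetch(check.url) != check.expected_status:
            failures.append(check.name)
    return failures
```

A deploy job could treat a non-empty result as a failed step and print the returned names, which keeps the "anything obviously broken?" question to a yes/no answer.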

Layer 2 — 15 to 30 minute SLO observation

Goal: measure user-facing impact.

  • Error rate (5xx / app error)
  • p95/p99 latency
  • Saturation (CPU, memory, conntrack, DB pool)
  • Critical queue/backlog metrics

The crucial ingredient at this layer is the baseline: if you don’t know “normal”, every decision turns into an argument.
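
One way to make the baseline explicit is to compare the post-deploy window against it in code. The sketch below assumes you already have aggregated numbers for both windows; the `WindowStats` type and both default margins are illustrative assumptions, not values from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    p99_latency_ms: float


def slo_regressions(baseline: WindowStats, current: WindowStats,
                    max_error_rate_increase: float = 0.005,
                    max_latency_ratio: float = 1.25) -> list[str]:
    """Name the metrics where `current` is worse than `baseline` by more
    than the allowed margin. Both margins are placeholders to tune."""
    regressions = []
    if current.error_rate - baseline.error_rate > max_error_rate_increase:
        regressions.append("error_rate")
    if current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        regressions.append("p95_latency_ms")
    if current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        regressions.append("p99_latency_ms")
    return regressions
```

An empty list means the window is within the agreed margins. Anything else is a named metric to look at rather than a chart to argue over.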

Layer 3 — 24-hour stability window (light touch)

Goal: catch the slow-burn problems.

  • Memory leak / fd leak
  • Cache behavior (stampede, eviction)
  • Bugs that surface only when traffic patterns shift

This stage doesn’t need a full-time watch — but the dashboards and alarm thresholds had better be right.
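
For slow-burn problems such as a memory or fd leak, a trend is often more telling than any single reading. The sketch below fits a least-squares line through samples taken during the stability window and flags sustained growth. The function names and the growth threshold are assumptions for illustration, and a real alert rule would probably live in your monitoring system rather than in a script.

```python
def growth_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of value over time.

    `samples` holds (hours_since_deploy, value) pairs, for example
    resident memory in MB or the number of open file descriptors.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    return covariance / variance


def looks_like_leak(samples: list[tuple[float, float]],
                    max_growth_per_hour: float) -> bool:
    """True when the fitted growth rate exceeds the allowed rate."""
    return growth_per_hour(samples) > max_growth_per_hour
```

A slope alone cannot tell a leak from a legitimate warm-up (a cache filling, for instance), so this is a prompt to look at the dashboard, not a verdict.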

Write the rollback criterion up front

If you make the rollback call mid-incident, team psychology and communication pressure will degrade decision quality. A simple ruleset:

  • If an SLO is clearly degrading within the first 10 minutes → rollback
  • If a critical flow is broken → rollback
  • If there’s any risk to data consistency → rollback, not “fix forward”
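
Writing the ruleset down as a small function is one way to keep the decision out of the incident channel. This is a sketch under assumptions: the `PostDeploySignals` type and its field names are hypothetical, the three booleans are expected to come from your own checks, and the 10-minute limit is read here as "minutes since the deploy finished".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PostDeploySignals:
    minutes_since_deploy: float
    slo_degraded: bool            # e.g. the Layer 2 comparison found a regression
    critical_flow_broken: bool    # e.g. a smoke check on a critical flow failed
    data_consistency_risk: bool


def rollback_reason(signals: PostDeploySignals,
                    slo_window_minutes: float = 10.0) -> Optional[str]:
    """Return the first rule that demands a rollback, or None if none applies."""
    # Data consistency comes first: the ruleset says not to fix forward here.
    if signals.data_consistency_risk:
        return "data_consistency_risk"
    if signals.critical_flow_broken:
        return "critical_flow_broken"
    if signals.slo_degraded and signals.minutes_since_deploy <= slo_window_minutes:
        return "slo_degraded"
    # Degradation after the window is outside this ruleset and needs a human call.
    return None
```

The returned string doubles as the first line of the incident note, so the reason for the rollback is recorded the moment the decision is made.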

Small automations that earn their keep

Practices that keep verification independent of any one person:

  • An automatic smoke job after deploy (CI/CD or a runbook script)
  • A canary dashboard link (one click)
  • A documented “rollback command” (clear steps, even if it isn’t literally one command)
  • Auto-attach a metrics snapshot to the change ticket (just evidence, no narration)

Communication: the single-paragraph format

A short, evidence-focused paragraph after a change works well:

  • Change: what shipped?
  • Verification: which smoke check + which metric window?
  • Result: SLOs normal, any risk?
  • Plan: monitoring duration and rollback criterion
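
If the paragraph is generated by the deploy job rather than typed by hand, a tiny formatter can enforce that all four fields are present. The function name and the output layout below are assumptions for illustration.

```python
def change_summary(change: str, verification: str, result: str, plan: str) -> str:
    """Render the four fields as one paragraph, in the order listed above."""
    fields = [("Change", change), ("Verification", verification),
              ("Result", result), ("Plan", plan)]
    # Refuse to produce a summary with a blank field.
    missing = [label for label, value in fields if not value.strip()]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # Normalize trailing periods so each field ends with exactly one.
    return " ".join(f"{label}: {value.strip().rstrip('.')}." for label, value in fields)
```

Rejecting an empty field is the point of the exercise: a summary with no rollback criterion should not be postable.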

Wrap-up

Post-change verification isn’t a “good intention” — it’s an operational discipline. Once you put the smoke + SLO + rollback-criterion trio on a cadence, MTTR drops, the debates fade, and the release culture gets faster while trust grows alongside it. In production, the goal isn’t only to ship a new feature — it’s to prove it actually works once it’s out there.

Frequently Asked Questions

Common questions readers have about this article.

How do I set up the 2‑5 minute smoke check after a release?
I start by scripting a tiny health-check suite that hits only the critical public endpoints: typically `/health`, the auth redirect, and a couple of high-traffic API calls. I run the script from a CI job that triggers immediately after the deployment pipeline marks the image as pushed. The job polls the load balancer for 2 to 5 minutes, repeating the suite every 10 seconds, and fails fast if any response isn’t 200 or the latency spikes above a preset threshold. I keep a single pass of the suite well under 30 seconds so it never blocks the release pipeline, and I store the results in a shared dashboard for the whole team to see instantly.
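
The polling behaviour described in this answer might look roughly like the loop below. It is a sketch, assuming "retries every 10 seconds" means the suite is repeated on a fixed interval for the whole window. `poll_smoke` and its parameters are hypothetical names, and `run_once` stands in for whatever function runs one pass of your suite. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable


def poll_smoke(run_once: Callable[[], bool],
               window_seconds: float = 180.0,
               interval_seconds: float = 10.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Repeat the smoke pass until the window elapses.

    Returns False on the first failing pass (fail fast) and True if
    every pass inside the window succeeded.
    """
    deadline = clock() + window_seconds
    while True:
        if not run_once():
            return False
        # Stop once the next pass would start after the deadline.
        if clock() + interval_seconds > deadline:
            return True
        sleep(interval_seconds)
```

Using a monotonic clock keeps the window length stable even if the system time is adjusted while the job runs.
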

Which monitoring tools work best for the 15-30 minute SLO observation layer?
In my experience, a combination of Prometheus for raw metrics and Grafana for visual baselines gives the quickest feedback loop. I scrape error rate, p95/p99 latency, CPU, memory, and queue depth every 10 seconds, then set alert rules that compare the current window to the rolling 7-day average. If you prefer a managed solution, Datadog’s SLO module does the heavy lifting, but you still need to define a “normal” baseline yourself. The key is to automate the comparison; don’t rely on eyeballing charts. I also tag the metrics with the release version so you can slice the data in a post-mortem without digging through logs.

What are the common pitfalls when deciding to roll back, and how can I avoid a debate?
I’ve seen teams stall because the rollback path wasn’t codified before the deploy. The most common mistake is assuming the previous version is still runnable; in reality, database migrations or feature flags can lock you out. To avoid the debate, I write a rollback checklist into the definition of done: a one-click script that redeploys the prior artifact, reverts config, and restores feature flags. I also do a dry run in a staging environment during the smoke phase so the team knows exactly what will happen. When the SLO window shows a breach, the decision is binary: run the script or continue monitoring. That removes any subjective discussion.

Is it really necessary to treat verification as part of the definition of done, or is that just hype?
I once let a team ship without a formal verification step and paid the price with a three-hour outage that could have been caught in minutes. Treating verification as part of the definition of done forces the habit: no release counts as done until the smoke suite passes and the SLO window is green. It also gives leadership a concrete metric to enforce reliability culture. The downside is a small increase in cycle time, but the trade-off is worth it: fewer firefights, clearer rollback triggers, and a measurable improvement in mean time to recovery. In practice, the ROI is undeniable; the hype is just the industry catching up to what I’ve been doing for years.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
