“We have a load balancer, we’ll just scale” starts out true, but the story changes once state enters the picture. For applications with WebSockets, long-lived TCP connections, in-memory sessions, or local disk dependencies, sticky sessions (session affinity) look like an easy fix. In production, sticky sessions add stability when used well; used badly, they become an incident accelerator.
In this post I lay out when sticky sessions are reasonable, when they generate technical debt, and, most importantly, a decision matrix viewed through the lens of operational manageability.
Why do sticky sessions appear in the first place?
The most common reasons:
- The application keeps state on the node (memory cache, session store, local file)
- There are long-lived connections like WebSocket/streaming
- Some dependencies are sensitive to connection-level behaviour (e.g. stateful upstream, legacy middleware)
- Local state was chosen “for speed” instead of an external session store
The hidden cost of sticky sessions
Sticky sessions usually bring these costs:
- Imbalanced load: some nodes saturated, others idle
- Canary/blue-green friction: a portion of users get “stuck” on the old node
- Failover shock: when a node dies, every user pinned to it disconnects at once
- Harder debugging: “which user is on which node?” eats time during incidents
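Two of these costs, imbalance and failover shock, can be put into numbers once you have a table of which user is pinned to which node. The following is a minimal sketch under that assumption; the `pins` table, the node names, and the function names are hypothetical and not taken from any particular load balancer.

```python
from collections import Counter


def node_load(pins: dict[str, str]) -> Counter:
    """Count pinned users per node. `pins` maps user id -> node name."""
    return Counter(pins.values())


def imbalance_ratio(pins: dict[str, str], nodes: list[str]) -> float:
    """Load of the busiest node divided by the mean load across all nodes.

    1.0 means perfectly even; 2.0 means the busiest node carries twice its share.
    """
    if not pins or not nodes:
        return 0.0
    load = node_load(pins)
    mean = len(pins) / len(nodes)
    return max(load.get(node, 0) for node in nodes) / mean


def blast_radius(pins: dict[str, str], dead_node: str) -> float:
    """Fraction of all users that lose their session if `dead_node` dies."""
    if not pins:
        return 0.0
    return sum(1 for node in pins.values() if node == dead_node) / len(pins)


# Hypothetical pin table: four of six users ended up on node "a".
pins = {"u1": "a", "u2": "a", "u3": "a", "u4": "b", "u5": "c", "u6": "a"}
nodes = ["a", "b", "c"]

print(imbalance_ratio(pins, nodes))  # busiest node vs. the mean
print(blast_radius(pins, "a"))       # share of users hit if "a" dies
```

Tracking these two numbers over time also gives a quick answer to “which user is on which node?” during an incident.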
Decision matrix: Sticky or stateless?
Practical questions I ask in the field:
1) Can state be moved out?
- Yes -> aim for stateless:
  - move the session to an external store (such as Redis)
  - share the cache or make it reproducible
  - move the file dependency to object storage
- No (in the short term) -> sticky may be a temporary measure, but you need an exit plan
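To illustrate what “moving state out” means in practice, here is a minimal sketch in which two application nodes share one session store, so either node can serve the same user. `InMemoryStore` is only a stand-in that lives in the same process; in a real setup it would be replaced by a client for an external store such as Redis. All class and method names here are assumptions made for the example.

```python
import json
import time
from typing import Optional, Protocol


class SessionStore(Protocol):
    """The only contract the application needs from the store."""

    def get(self, sid: str) -> Optional[dict]: ...

    def put(self, sid: str, data: dict, ttl_s: float) -> None: ...


class InMemoryStore:
    """Stand-in for an external store; it is owned by no application node."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[float, str]] = {}

    def get(self, sid: str) -> Optional[dict]:
        entry = self._data.get(sid)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:
            del self._data[sid]  # expired sessions are dropped lazily
            return None
        return json.loads(payload)

    def put(self, sid: str, data: dict, ttl_s: float) -> None:
        # Serialize on write, as a networked store would force us to do.
        self._data[sid] = (time.monotonic() + ttl_s, json.dumps(data))


class AppNode:
    """A stateless node: everything it knows about a user comes from the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle(self, sid: str) -> dict:
        session = self.store.get(sid) or {"hits": 0}
        session["hits"] += 1
        session["last_node"] = self.name
        self.store.put(sid, session, ttl_s=1800)
        return session


store = InMemoryStore()
node_a, node_b = AppNode("a", store), AppNode("b", store)

node_a.handle("sid-1")           # first request lands on node a
result = node_b.handle("sid-1")  # next request lands on node b, session survives
print(result)
```

The read-modify-write in `handle` is not atomic; with a real shared store, concurrent requests for the same session would need an atomic update or a lock, which this sketch leaves out.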
2) Are connections long-lived?
- WebSocket / streaming -> connection draining + graceful shutdown usually matter more than sticky itself
- Short HTTP requests -> sticky is generally less justified
3) Do you have a failure budget?
- Have you accepted that “a node can die”, and do you limit the user impact when it does?
- If you go sticky, the impact of losing a node is concentrated on the users pinned to it; will you carry that risk?
Scenarios where sticky is “the right call”
Sticky can genuinely be pragmatic:
- The legacy application’s state is deeply tangled and refactoring isn’t realistic in the short term
- The user can tolerate session loss (e.g. an internal admin panel)
- The goal is continuity, not massive scale
- There’s a solid drain + deploy discipline
Alternative patterns (more sustainable than sticky)
1) Stateless + external session store
The most common and sustainable path, with two critical caveats:
- The session store now becomes “critical state” -> HA/latency/backup planning is mandatory
- Don’t skip measuring network latency and serialization cost
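The serialization half of that measurement can be done before any store is deployed. The sketch below times a JSON round trip for a sample session and reports the payload size; the sample session and the function name are hypothetical, and JSON is an assumed encoding. Network latency is not covered here and has to be measured against the real store.

```python
import json
import time


def measure_roundtrip(session: dict, iterations: int = 1000) -> dict:
    """Time `iterations` JSON encode/decode round trips of one session."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    start = time.perf_counter()
    for _ in range(iterations):
        payload = json.dumps(session)
        restored = json.loads(payload)
    elapsed = time.perf_counter() - start
    return {
        "payload_bytes": len(payload.encode("utf-8")),
        "avg_us": elapsed / iterations * 1e6,  # microseconds per round trip
        "lossless": restored == session,       # did the data survive intact?
    }


# Hypothetical session contents; use a realistic production-sized sample.
sample = {"user_id": 42, "roles": ["admin"], "cart": [{"sku": "x1", "qty": 2}]}
stats = measure_roundtrip(sample)
print(stats)
```

The `lossless` flag is worth watching: types such as tuples, sets, or datetimes do not survive a plain JSON round trip unchanged.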
2) Consistent hashing (controlled affinity)
Affinity that is predictable rather than arbitrary: the same key always maps to the same node. Especially useful for cache shards. When the node count changes, a share of the keys is redistributed to other nodes, so plan for that.
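The redistribution behaviour is easier to reason about with a concrete ring in hand. What follows is a minimal sketch of a consistent hash ring with virtual nodes, not a production implementation; the class name, the `vnodes` default, and the choice of SHA-256 are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each node owns several points on a 64-bit ring."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._nodes: set[str] = set()
        self._ring: list[tuple[int, str]] = []  # kept sorted by hash
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node: str) -> None:
        if node in self._nodes:
            return
        self._nodes.add(node)
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._nodes.discard(node)
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        # (h, "") sorts before any (h, node), so this finds the first hash >= h.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["a", "b", "c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("d")  # scale out by one node
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

print(f"{len(moved)} of {len(keys)} keys moved")
```

Going from three to four nodes moves roughly a quarter of the keys in expectation, and every moved key lands on the new node. A naive `hash(key) % node_count` scheme would reshuffle most keys instead, which is the redistribution cost worth planning around.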
3) Read/write split and a state boundary
Not every request has to be stateful. For example:
- read path stateless
- write path more controlled, narrower scope
This split narrows the blast radius to the stateful write path.
4) Graceful shutdown + connection draining
For WebSocket / long connections, sticky alone isn’t enough. Operational success usually depends on:
- when drain begins, refuse new connections and let existing ones finish
- align the LB health check with “can it accept traffic?”
- limit the number of nodes restarting concurrently during deploy
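The three points above can be sketched as a small state model: a node that flips its readiness answer when drain begins, and a helper that splits a deploy into batches of bounded size. This is a simplified, synchronous illustration; `DrainableNode`, `rolling_restart_batches`, and `max_unavailable` are hypothetical names, and a real server would also enforce a drain deadline.

```python
class DrainableNode:
    """Tracks open connections and whether the node is draining."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.draining = False
        self.active: set[str] = set()

    def ready(self) -> bool:
        """Health check answer: can this node accept NEW traffic?"""
        return not self.draining

    def accept(self, conn_id: str) -> bool:
        if self.draining:
            return False  # refuse new connections once drain has begun
        self.active.add(conn_id)
        return True

    def close(self, conn_id: str) -> None:
        self.active.discard(conn_id)

    def begin_drain(self) -> None:
        self.draining = True  # existing connections are left to finish

    def safe_to_stop(self) -> bool:
        return self.draining and not self.active


def rolling_restart_batches(nodes: list[str], max_unavailable: int) -> list[list[str]]:
    """Split nodes into restart batches of at most `max_unavailable` each."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    return [nodes[i:i + max_unavailable] for i in range(0, len(nodes), max_unavailable)]


node = DrainableNode("a")
node.accept("c1")
node.accept("c2")

node.begin_drain()
print(node.ready())         # False: the LB should stop sending new traffic
print(node.accept("c3"))    # False: new connection refused
print(node.safe_to_stop())  # False: c1 and c2 are still open

node.close("c1")
node.close("c2")
print(node.safe_to_stop())  # True: the node can now be restarted

print(rolling_restart_batches(["a", "b", "c", "d", "e"], max_unavailable=2))
```

Keeping `ready()` separate from a plain liveness check is the design point: a draining node is alive and still serving, but it should no longer receive new connections.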
Operational checklist
Sticky or not, the answers to these questions should be written down:
- If a node dies, how does user impact spread?
- Which metrics signal a “state issue”? (disconnect rate, session error, uneven CPU/mem)
- During deploy, how many nodes can leave at once?
- What’s the rollback criterion? (SLO, error rate, disconnect spike)
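Writing the rollback criterion down can be as literal as encoding it. The sketch below checks an error rate and a disconnect spike against thresholds; the threshold values, the `RollbackPolicy` name, and the metric names are placeholders chosen for illustration and should come from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    """Placeholder thresholds; real values should be derived from the SLO."""

    max_error_rate: float = 0.02        # fraction of failed requests
    max_disconnect_spike: float = 3.0   # current disconnects / baseline


def rollback_reasons(
    policy: RollbackPolicy,
    error_rate: float,
    disconnects_per_min: float,
    baseline_disconnects_per_min: float,
) -> list[str]:
    """Return the violated criteria; an empty list means keep the deploy."""
    reasons = []
    if error_rate > policy.max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} above {policy.max_error_rate:.2%}")
    # Assumption: floor the baseline at 1.0 so a near-zero baseline
    # does not turn a handful of disconnects into a huge ratio.
    spike = disconnects_per_min / max(baseline_disconnects_per_min, 1.0)
    if spike > policy.max_disconnect_spike:
        reasons.append(f"disconnects at {spike:.1f}x baseline")
    return reasons


policy = RollbackPolicy()
healthy = rollback_reasons(policy, 0.01, 10, 10)
broken = rollback_reasons(policy, 0.05, 40, 10)
print(healthy)
print(broken)
```

Returning the reasons rather than a bare boolean keeps the decision explainable in the deploy log and in the post-incident review.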
Conclusion
Sticky sessions, used in the right context, can be a good transitional strategy, but most of the time they hide debt in your state design. The sustainable production target is to move state outward, manage deploy risk through connection draining, and limit how far user impact spreads when something fails. “It works” is not enough; how the system behaves when it breaks must sit at the centre of the design.