Intro: The Heart of a Distributed Stream, and Where It Hurts
Apache Kafka is one of the load-bearing pieces in most of the distributed systems I work on. It moves millions of events per second, it scales sideways, and its consumer group model is what makes that scale operational. But somewhere inside that elegant model lives a mechanism that has cost me my fair share of late-night pages: consumer group rebalancing.
Rebalancing is what lets Kafka redistribute work dynamically across consumer groups. The trade-off is that it can stall the data flow, even briefly, while it happens. For low-latency or always-on workloads that pause is not free — it shows up as latency spikes, lag, and on bad days, real user impact. In this post I want to dig into the mechanics, where the pauses come from, and what I do to keep them small.
Kafka Consumer Groups: The Foundation of Distributed Consumption
In Kafka, messages are organized into topics, and each topic is split into partitions. A partition is essentially a log file — messages are appended in order, and consumers read them in that order. The piece that makes consumption distributable is the consumer group.
A consumer group is one or more consumer instances that share the work of reading from a topic. Each consumer in the group owns a subset of the topic’s partitions. That is how Kafka prevents the same message from being processed twice and spreads the workload across the group.
How Consumer Groups Actually Work
For every consumer group, Kafka picks a Group Coordinator broker. The coordinator owns the state of the group and decides how partitions are assigned to consumers. Consumers stay registered by sending heartbeats to the coordinator and committing their offsets.
Each consumer processes the partitions it has been assigned independently. Kafka will not let two consumers in the same group ID read from the same partition — that constraint is what gives the at-least-once (and, with the right setup, exactly-once) guarantee.
What Rebalancing Is and Why It Fires
Rebalancing is the process of redistributing partitions among the consumers in a group. It exists so that work stays evenly and correctly spread. The catch is that during a rebalance, every consumer in the group stops processing — which is exactly what produces the brief outages people remember rebalancing for.
What Triggers a Rebalance
A rebalance can fire from any of these events:
- A new consumer joins: When a fresh consumer instance enters the group, partitions are reshuffled so the new member gets a share. This is normally what happens when I scale a consumer up.
- A consumer leaves or dies: When an instance leaves cleanly (graceful shutdown) or unexpectedly (process kill, network blip), its partitions are reassigned to whatever’s still alive in the group.
- The topic gets more partitions: Adding partitions to a topic forces a rebalance so the new ones get assigned. Reducing partitions is not directly supported in Kafka.
- A broker restart or failure: If the broker the group coordinator lives on goes down or restarts, the coordinator role can move, and that can cascade into a rebalance.
- Missed
session.timeout.ms: If a consumer fails to heartbeat withinsession.timeout.ms, the coordinator marks it dead and kicks it out of the group, triggering a rebalance. - Missed
max.poll.interval.ms: A consumer pulls messages by callingpoll(). If it does not callpoll()again withinmax.poll.interval.ms, the coordinator decides it is stuck, ejects it from the group, and a rebalance fires.
Two Kinds of Rebalance: Eager vs Cooperative
Kafka offers more than one rebalance strategy. The historical default — Eager Rebalancing — has been largely supplanted by the more modern Cooperative Rebalancing approach.
Eager Rebalancing (Stop-the-World)
Eager rebalancing was the default in older Kafka versions, and the name says it all: when a rebalance fires, every consumer in the group revokes everything it owns. The flow:
- A rebalance is triggered (say, a new consumer joins). The coordinator sends
REVOKE_PARTITIONSto every consumer. - Every consumer stops fetching and processing on every assigned partition, and commits offsets up to that point.
- Once everyone has revoked, the coordinator builds a new assignment plan.
- The plan is shipped to consumers as
ASSIGN_PARTITIONS. - Each consumer takes over its new partitions and resumes processing.
The “stop-the-world” property means every consumer is idle for the duration of the rebalance. In groups with lots of partitions, lots of consumers, or frequent rebalances, that idle window adds up to noticeable latency and noticeable lag.
Cooperative Rebalancing (KIP-429)
Starting in Kafka 2.4, Cooperative Rebalancing (introduced via KIP-429) gives me a way around the worst of eager’s downsides. It is incremental — consumers do not all stop at once.
The key idea is that consumers only revoke the partitions that are actually moving. If a consumer’s partition is not changing hands, the consumer keeps processing it. The flow:
- When a rebalance fires, the coordinator builds a preliminary plan and only asks each consumer to revoke the partitions that are about to move (
REVOKE_PARTITIONSis scoped, not blanket). - Consumers stop processing on just those partitions, commit offsets, and keep going on the rest.
- Once the moving partitions are released, the coordinator finalizes the plan and sends out
ASSIGN_PARTITIONS. - Consumers pick up their newly-assigned partitions and start processing.
You can think of cooperative rebalancing as “two-phase”: first you release only what needs to move, then you assign the new owners. The all-stop-at-once moment goes away, and the total disruption shrinks accordingly.
What Rebalancing Pauses Cost in Production
Rebalancing has to exist — that is part of the deal with a dynamic distributed system. But when it fires, it pauses the stream, and that pause has real costs.
1. Latency Goes Up
While a rebalance is in progress, every partition in the group is paused. End-to-end latency spikes for that window. For real-time workloads — payments, sensors, anything that has to react quickly — that spike can break SLAs and trigger downstream incidents.
2. Backlog and Lag
Frequent or long rebalances reduce effective throughput. That shows up as messages piling up in the topic and consumer lag rising. Once the lag is built up, draining it takes time, and sometimes more capacity than you had budgeted for.
3. Wasted Resources
A rebalance is not just a pause — it is also extra coordination traffic between brokers and consumers, plus a window where the consumers are sitting idle on the CPU and memory I am paying for. Both sides cost something.
4. Duplicate Processing
If a consumer was processing messages but had not committed offsets yet when the rebalance fired, those messages can be reprocessed by whoever picks the partition up next. The new owner reads from the last committed offset, which means anything in flight at the moment of the rebalance gets seen again.
5. Application-Level Bugs Surface
I have seen consumers handle the revoke/assign callbacks badly and end up in weird states during a rebalance. If the application’s lifecycle for partition assignment is not correct, rebalances become a place where ugly bugs hide.
How I Keep Rebalancing Under Control
I cannot eliminate rebalancing — it is part of the deal. But I can shrink how often it fires and how much it costs when it does. There are a handful of levers I lean on.
1. Tune the Consumer Configuration
The consumer’s own settings drive a lot of rebalance behaviour. Getting them right cuts the unnecessary churn out.
session.timeout.ms: How long the coordinator waits for a heartbeat before marking the consumer dead. Too low and a transient network blip causes a rebalance; too high and a real failure goes unnoticed for too long. I usually start somewhere in the 10–30 second range.heartbeat.interval.ms: How often the consumer heartbeats. Always less thansession.timeout.ms. A third or a half of the timeout is a good default — ifsession.timeout.msis 10s, a 3s interval is reasonable.max.poll.interval.ms: The longest the consumer is allowed betweenpoll()calls. If processing is slow, you want this larger than your worst-case processing time, but not so large that a stuck consumer goes undetected. If processing is genuinely long, either optimize it or pull fewer records perpoll().max.poll.records: How many records come back from a singlepoll(). Lower values mean less work in flight at the moment of a rebalance, but also throughput cost. Pick the value that fits your processing speed and idempotency posture.
// Sample Kafka consumer configuration
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Rebalance-related parameters
props.put("session.timeout.ms", "15000"); // 15 seconds
props.put("heartbeat.interval.ms", "5000"); // 5 seconds
props.put("max.poll.interval.ms", "300000"); // 5 minutes, for slow processing
props.put("max.poll.records", "500"); // up to 500 records per poll
props.put("enable.auto.commit", "false"); // manual offset management gives better control
// Enable cooperative rebalancing
props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
2. Graceful Shutdown
When you restart or rescale, having consumers shut down gracefully — processing what’s pending, committing offsets, then exiting — makes the rebalance much cleaner than a sudden death does.
// Graceful shutdown example
public class MyKafkaConsumer implements Runnable {
private final KafkaConsumer<String, String> consumer;
private final AtomicBoolean running = new AtomicBoolean(true);
private final String topic;
public MyKafkaConsumer(Properties props, String topic) {
this.consumer = new KafkaConsumer<>(props);
this.topic = topic;
}
@Override
public void run() {
consumer.subscribe(Collections.singletonList(topic));
try {
while (running.get()) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
// Process the message
System.out.printf("Partition: %d, Offset: %d, Key: %s, Value: %s%n",
record.partition(), record.offset(), record.key(), record.value());
}
if (!records.isEmpty()) {
consumer.commitSync(); // commit offsets for processed records
}
}
} catch (WakeupException e) {
// raised when shutdown() is called
if (!running.get()) {
System.out.println("Consumer shutdown initiated.");
} else {
throw e; // unexpected wakeup, rethrow
}
} finally {
consumer.close();
System.out.println("Consumer closed.");
}
}
public void shutdown() {
running.set(false);
consumer.wakeup(); // interrupt the poll() call
}
public static void main(String[] args) throws InterruptedException {
Properties props = new Properties();
// ... (configuration as above)
props.put("enable.auto.commit", "false"); // manual commit
MyKafkaConsumer kafkaConsumer = new MyKafkaConsumer(props, "my-topic");
Thread consumerThread = new Thread(kafkaConsumer);
consumerThread.start();
// Let the application run for a while
Thread.sleep(30000);
// Graceful shutdown
System.out.println("Initiating graceful shutdown...");
kafkaConsumer.shutdown();
consumerThread.join(); // wait for the thread to finish
System.out.println("Application terminated.");
}
}
3. Use Cooperative Rebalancing
As mentioned above, switching partition.assignment.strategy to org.apache.kafka.clients.consumer.CooperativeStickyAssignor cuts the disruption per rebalance significantly. For big or dynamic groups, this is one of the highest-leverage changes I can make.
4. Static Group Membership (KIP-345)
Static Group Membership, introduced in Kafka 2.3, is a particularly nice fit for stateful consumers. With it, when a consumer instance leaves and comes back (think: a rolling restart), it reclaims the same partitions instead of triggering a full rebalance.
Wiring it up:
- Set
group.instance.idto a unique, stable value per consumer instance.
// Example configuration for static group membership
props.put("group.instance.id", "my-unique-consumer-instance-1");
5. Monitoring and Alerting
Watching rebalance behaviour and alerting on anomalies is how I catch problems before they turn into incidents.
- Kafka broker logs: The
GroupCoordinatorlogs rebalance events and reasons. Look forINFO-level entries like “Attempting to rebalance group…” or “Completing rebalance of group…” - JMX metrics: Kafka clients expose JMX metrics for rebalance behaviour. The ones in
consumer-coordinator-metrics—rebalance-total,rebalance-latency-avg,rebalance-latency-max— are the ones I watch. - Consumer lag: Sudden lag spikes are usually rebalance or stuck-consumer signals. Lag monitoring (Prometheus + Grafana, or whatever you prefer) belongs in production from day one.
- Application logs: Logging the partition lifecycle callbacks (
onPartitionsAssigned,onPartitionsRevoked) plus rebalance start/end times in the consumer makes troubleshooting much easier later.
6. Idempotent Design
Building consumers that can absorb a duplicate without harm is the cleanest way to live with the duplicate-processing risk that rebalancing carries. Same message twice, same outcome.
7. Error Handling and Retries
Consumers that handle transient errors gracefully — back off, retry, then fail loudly — do not flap themselves out of the group. A consumer that dies because a downstream call had a hiccup is a consumer that is going to trigger rebalances you did not need.
Closing Thoughts: The Steady Side of a Dynamic Stream
Kafka consumer group rebalancing is a complex but unavoidable piece of distributed streaming. It exists for good reason, and for the most part it does its job. But it can pause the stream, and for systems that depend on tight latency that pause has to be managed actively.
Understanding why rebalances fire, knowing the difference between the strategies (and reaching for cooperative wherever you can), and having a feel for what they cost in production — that is step one. From there, the right consumer configuration, graceful shutdown, static group membership, and serious monitoring will take a lot of the pain out.
There is no perfect system. But systems that are designed thoughtfully and operated carefully are the ones that hold up when something unexpected happens. Treating Kafka rebalancing as something to manage proactively — not just as something that happens to you — is one of the levers that has consistently kept my streams steady.