Consumer Group Rebalance
Kafka consumer group rebalance is the process of redistributing partitions among consumers in the same group. A rebalance is expected when the group membership changes, topic partitions change, or a consumer is considered unavailable. However, frequent or slow rebalances can reduce throughput and make consumer behavior unstable.
This document describes the impact of consumer group rebalance, common causes, and practical ways to reduce rebalance frequency.
TOC
- Impact of Rebalance
- Common Causes
- Optimization Recommendations
  - Optimize Consumer Startup and Shutdown
  - Tune Rebalance-Related Timeouts
  - Avoid Long Partition Processing Time
  - Use Static Membership
  - Monitor Rebalance Activity
- Best Practices
- Troubleshooting Checklist
Impact of Rebalance
Frequent rebalances can affect Kafka workloads in the following ways:
- Duplicate consumption: If a consumer leaves the group before committing offsets, another consumer can be assigned the same partitions during rebalance and reprocess the same records. Even if downstream processing is idempotent, this still wastes compute resources and increases cluster pressure.
- Group instability: A rebalance affects the whole consumer group, not only the consumer that triggered it. One unstable consumer can cause repeated partition movement across the group and delay the time required for the group to become stable again.
- Lower consumption throughput: If the group spends too much time rebalancing, the consumers spend less time processing records. In severe cases, most of the time is lost to repeated reassignment, restart, and duplicate work.
Common Causes
A rebalance is triggered whenever partitions need to be reassigned. Common triggers include:
- Consumers joining or leaving the group
- Consumer process restarts or rolling updates
- Topic partition count changes
- Consumer heartbeat timeout
- Long message processing time that prevents the consumer from polling in time
Version-specific behavior is also important:
- Kafka clients earlier than 0.10.1: Heartbeats are sent only from within the poll() loop. If the application does not call poll() for a long time, for example because processing the previous batch takes too long, the session times out and the group coordinator treats the consumer as failed.
- Kafka clients 0.10.1 and later: The client sends heartbeats from a background thread, but the consumer still leaves the group if the time between two poll() calls exceeds max.poll.interval.ms. The default value is 5 minutes (300000 ms). Slow processing can therefore still trigger a rebalance.
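The poll interval rule can be illustrated with a small model. The following is a minimal sketch in plain Python, not Kafka client code: the function name `first_poll_violation` and the list of poll timestamps are assumptions made for illustration. It only checks whether the gap between two consecutive poll() calls exceeds max.poll.interval.ms.

```python
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # 5 minutes, the default stated above


def first_poll_violation(poll_times_ms, max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS):
    """Return the index of the first poll() call that arrived too late, or None.

    poll_times_ms is a hypothetical list of timestamps (milliseconds, ascending)
    at which the application called poll().
    """
    for i in range(1, len(poll_times_ms)):
        gap_ms = poll_times_ms[i] - poll_times_ms[i - 1]
        if gap_ms > max_poll_interval_ms:
            return i
    return None


# Polls at 0 s, 60 s, and 420 s: the last gap is 360 s, which exceeds 5 minutes.
late_index = first_poll_violation([0, 60_000, 420_000])
```

In this model the third call (index 2) is the one that arrives after the consumer would already have left the group, even though heartbeats were still being sent in the background.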
Optimization Recommendations
Optimize Consumer Startup and Shutdown
Avoid frequent consumer restarts, repeated scaling actions, or unnecessary rolling updates. If consumers are often started and stopped, the group will continuously rebalance.
For planned operations such as upgrades or scaling, use a controlled rollout strategy to minimize simultaneous membership changes in the same group.
Tune Rebalance-Related Timeouts
Adjust the following parameters according to your workload characteristics: max.poll.interval.ms, session.timeout.ms, and heartbeat.interval.ms.
General guidance:
- Increase max.poll.interval.ms if record processing is slow or batch processing takes a long time.
- Set session.timeout.ms high enough to tolerate brief pauses, but not so high that failed consumers stay undetected for too long.
- Keep heartbeat.interval.ms smaller than session.timeout.ms so the consumer can report liveness regularly.
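The guidance above can be turned into a quick sanity check before deployment. The sketch below is an assumption-laden illustration in plain Python: `check_timeouts` and `worst_case_batch_ms` are hypothetical names, and the values in `example` are illustrative, not recommendations. The one-third ratio reflects the Kafka configuration documentation, which suggests keeping heartbeat.interval.ms no higher than one third of session.timeout.ms.

```python
def check_timeouts(config, worst_case_batch_ms):
    """Return a list of warnings for a dict of consumer timeout settings (in ms).

    worst_case_batch_ms is the longest time the application is expected to
    spend between two poll() calls; it must be measured by the application.
    """
    warnings = []
    heartbeat = config["heartbeat.interval.ms"]
    session = config["session.timeout.ms"]
    if heartbeat >= session:
        warnings.append("heartbeat.interval.ms must be smaller than session.timeout.ms")
    elif heartbeat * 3 > session:
        warnings.append("heartbeat.interval.ms is above one third of session.timeout.ms")
    if worst_case_batch_ms >= config["max.poll.interval.ms"]:
        warnings.append("max.poll.interval.ms does not cover the worst-case batch time")
    return warnings


# Illustrative values only; tune them against measured processing time.
example = {
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 3_000,
    "max.poll.interval.ms": 300_000,
}
problems = check_timeouts(example, worst_case_batch_ms=120_000)
```

An empty list means the three settings are consistent with each other and with the measured batch time. It does not mean the values are optimal for the workload.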
Avoid Long Partition Processing Time
If a consumer takes too long to process records from a partition, rebalance becomes more likely.
Consider the following improvements:
- Reduce the number of records processed in a single batch
- Move expensive business logic out of the main polling path
- Use asynchronous processing where appropriate
- Increase concurrency in the application layer when the workload allows it
- Review downstream dependencies such as databases or remote APIs that slow down record handling
The main goal is to make sure the application can continue calling poll() within the expected time window.
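One way to reason about batch size is to work backwards from the poll interval. The following is a rough sketch, assuming the per-record processing time has been measured and that the batch size is controlled through the consumer setting max.poll.records. The function name and the `safety_factor` parameter are hypothetical.

```python
def max_records_within_budget(per_record_ms, max_poll_interval_ms, safety_factor=0.5):
    """Largest batch size whose worst-case processing time stays within
    safety_factor * max.poll.interval.ms.

    per_record_ms should be a pessimistic (for example, p99) measurement.
    """
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * safety_factor
    # Always allow at least one record so the consumer can make progress.
    return max(1, int(budget_ms // per_record_ms))


# 200 ms per record with a 5-minute poll interval and a 50% safety margin.
suggested_max_poll_records = max_records_within_budget(200, 300_000)
```

With these inputs the budget is 150000 ms, so the sketch suggests at most 750 records per poll. If the result is 1 and processing a single record still exceeds the budget, reducing batch size alone is not enough, and the other improvements in the list above apply.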
Use Static Membership
Static membership helps avoid unnecessary rebalances when a consumer instance restarts or reconnects within a short period, specifically before session.timeout.ms expires.
Set group.instance.id for each consumer instance so Kafka can recognize it as a stable member of the group. This reduces partition movement compared with dynamic membership.
Each consumer instance in the same group must use a unique group.instance.id.
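The uniqueness requirement is easy to violate when the ID is hard-coded. The sketch below shows one possible approach, assuming each instance has a stable name that survives restarts (for example, a fixed host name or ordinal). The function `static_member_configs`, the group name `orders`, and the instance names are hypothetical.

```python
def static_member_configs(group_id, instance_names):
    """Build one config dict per consumer instance, each with a unique
    group.instance.id derived from a stable instance name."""
    if len(set(instance_names)) != len(instance_names):
        raise ValueError("instance names must be unique within the group")
    return [
        {"group.id": group_id, "group.instance.id": f"{group_id}-{name}"}
        for name in instance_names
    ]


configs = static_member_configs("orders", ["consumer-0", "consumer-1", "consumer-2"])
instance_ids = [c["group.instance.id"] for c in configs]
```

The important property is that an instance gets the same ID after a restart. A randomly generated ID would make every restart look like a new member and remove the benefit of static membership.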
Monitor Rebalance Activity
Monitor rebalance frequency and duration so that you can detect instability early.
Recommended practices:
- Track consumer lag together with rebalance frequency
- Monitor consumer restart count and pod restart events
- Check application logs for heartbeat timeout, poll interval timeout, and partition revocation messages
- Build dashboards that show rebalance count, rebalance duration, lag, and throughput in the same time window
If rebalance frequency increases together with lag or restart count, investigate the affected consumer group first.
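A simple alert rule for rebalance frequency can be sketched as a sliding-window count. This is an illustration only: it assumes rebalance timestamps have already been collected from logs or metrics, and the function names, the 10-minute window, and the threshold of 3 are hypothetical values to be tuned per group.

```python
from bisect import bisect_right


def rebalances_in_window(event_times_s, now_s, window_s):
    """Count rebalance events with now_s - window_s < t <= now_s.

    event_times_s must be sorted in ascending order (seconds).
    """
    return bisect_right(event_times_s, now_s) - bisect_right(event_times_s, now_s - window_s)


def is_unstable(event_times_s, now_s, window_s=600, threshold=3):
    """Flag the group when the window contains at least `threshold` rebalances."""
    return rebalances_in_window(event_times_s, now_s, window_s) >= threshold


# Hypothetical rebalance timestamps in seconds since the start of observation.
events = [100, 400, 450, 500, 2000]
recent_count = rebalances_in_window(events, now_s=600, window_s=600)
```

Evaluating the same rule alongside lag and restart count for the same window makes it easier to tell whether the rebalances are a cause or a symptom.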
Best Practices
- Avoid frequent scaling or restarts of consumers in the same group.
- Tune max.poll.interval.ms, session.timeout.ms, and heartbeat.interval.ms based on real processing time.
- Keep message processing time shorter than the configured poll interval budget.
- Use static membership with group.instance.id for stable long-running consumers.
- Monitor rebalance behavior continuously instead of only reacting after lag becomes severe.
- Test rebalance-sensitive workloads during upgrades, rolling restarts, and partition expansion.
Troubleshooting Checklist
When rebalance happens frequently, check the following items:
- Whether consumers are restarting, scaling, or being rescheduled frequently
- Whether message processing time exceeds max.poll.interval.ms
- Whether session.timeout.ms and heartbeat.interval.ms are configured appropriately
- Whether the topic partition count changed recently
- Whether downstream dependency latency increased and slowed record processing
- Whether static membership is enabled for long-running consumers
After identifying the trigger, adjust the consumer configuration or processing model before increasing cluster size. Rebalance issues are more often caused by consumer behavior or configuration than by insufficient Kafka capacity.