Kafka consumer stuck in (Re-)joining group - apache-kafka

Whats the default behavior of kafka (version 0.10) consumer if it tries to rejoin the consumer group.
I am using a single consumer for a consumer group but it seems like it got struck at rejoining.
After each 10 min it print following line in consumer logs.
2016-08-11 13:54:53,803 INFO o.a.k.c.c.i.ConsumerCoordinator [pool-5-thread-1] ****Revoking previously assigned partitions**** [] for group image-consumer-group
2016-08-11 13:54:53,803 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] (Re-)joining group image-consumer-group
2016-08-11 14:04:53,992 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] Marking the coordinator dead for group image-consumer-group
2016-08-11 14:04:54,095 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] Discovered coordinator for group image-consumer-group.
2016-08-11 14:04:54,096 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] (Re-)joining group image-consumer-group
Restart consumer application is not helping.

If you're gonna to have only one consumer instance in a group, then use the consumer with manual assignment strategy. (Simple Consumer).
Manual topic assignment does not use the consumer's group management functionality so heart beats are not required.

Related

How to fix kafka streams problem related to group coordinator is unavailable or invalid, will attempt rediscovery

I have a problem when I try to run Kafka Streams application with PROCESSING_GUARANTEE_CONFIG that equals to "exactly once semantic" for other cases as for example at least once semantic it works very well.
I noticed in the logs that something is going wrong and I found some of the recommendation here in order to fix this problem but unfortunately it didn't helped me :(
03:35:28.627 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Discovered group coordinator kafka:9093 (id: 2147483646 rack: null)
03:35:28.627 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Group coordinator kafka:9093 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
03:35:28.628 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=to-transform] Discovered group coordinator kafka:9093 (id: 2147483646 rack: null)
03:35:28.628 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Group coordinator kafka:9093 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
03:35:48.628 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Discovered group coordinator kafka:9093 (id: 2147483646 rack: null)
03:35:48.630 INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Found no committed offset for partition topic-0
03:35:48.631 INFO o.a.k.c.c.KafkaConsumer - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-restore-consumer, groupId=null] Unsubscribed all topics or patterns and assigned partitions
03:35:48.631 INFO o.a.k.s.p.i.StreamThread - stream-thread [transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1] State transition from PARTITIONS_ASSIGNED to RUNNING
03:35:48.631 INFO o.a.k.s.KafkaStreams - stream-client [transform-f8268b2b-4673-49ac-9396-6a2b86d45697] State transition from REBALANCING to RUNNING
03:35:48.632 INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Attempt to heartbeat failed for since member id transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer-6aacbde6-4553-43ee-bc2f-2b5718e55acf is not valid.
03:35:48.632 INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Found no committed offset for partition topic-0
03:35:48.633 INFO o.a.k.c.c.i.SubscriptionState - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Resetting offset for partition topic-0 to offset 0.
03:35:48.634 INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
03:35:48.634 INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1-consumer, groupId=transform] Lost previously assigned partitions topic-0
03:35:48.634 INFO o.a.k.s.p.i.StreamThread - stream-thread [transform-f8268b2b-4673-49ac-9396-6a2b86d45697-StreamThread-1] at state RUNNING: partitions [topic-0] lost due to missed rebalance.
As for example first recommendation if I run just single kafka broker node then I have to set up partitions and replications configs to 1 as well second recommendation was to restart kafka broker that also gave no results
kafka:
image: wurstmeister/kafka:2.12-2.4.1
ports:
- "9092:9092"
- "9093:9093"
depends_on:
- zookeeper
links:
- zookeeper:zk
environment:
KAFKA_BROKER_ID: 1
KAFKA_LISTENERS: OUTSIDE://kafka:9092,INSIDE://kafka:9093
KAFKA_ADVERTISED_LISTENERS: OUTSIDE://localhost:9092,INSIDE://kafka:9093
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT,PLAINTEXT:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
KAFKA_LOG_RETENTION_HOURS: 1
KAFKA_MESSAGE_MAX_BYTES: 1048576
KAFKA_REPLICA_FETCH_MAX_BYTES: 1048576
KAFKA_GROUP_MAX_SESSION_TIMEOUT_MS: 30000
KAFKA_NUM_PARTITIONS: 1
KAFKA_DEFAULT_REPLICATION_FACTOR: 1
KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_TRANSACTION_STATE_LOG_NUM_PARTITIONS: 1
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
KAFKA_DELETE_RETENTION_MS: 86400000
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_CREATE_TOPICS: topic:1:1, transform:1:1
Thanks for any help
kind regards, Victor
There can be many reasons for the observed issue. In general, exaclty-once is more expensive and puts a higher load on the brokers and the KafkaStreams application.
Also note, that if you really want to get exactly-once processing, you should run with at least 3 brokers (and topics should be configured with a replication factor of 3, and min-isr of 2). Otherwise, EOS cannot really be guaranteed.
Increasing the commit.interval.ms might help to mitigate the issue. Note, that for EOS, it might lead to higher processing latency (that is the reason why the default commit interval is reduced to 100ms if EOS is enable). If you can accept a higher latency, you might want to increase it to for example 1 seconds.
Also, there is a heavy investment into EOS and newer versions contain many improvements. If you can, you might want to upgrade to upcoming 2.6 release and test the new "eos_beta" processing mode (requires brokers 2.5 or newer).

Kafka Streams Topology stuck in num.stream.threads=50 and 100 partitions

There is a topology:
kStreamBuilder.stream(kafkaProperties.getInboundTopicName(), consumed)
.filterNot((k,v) -> Objects.isNull(v))
.transform(() -> new CustomTransformer(...))
.transform(() -> new AnotherTransformer(...))
.to(kafkaProperties.getOutTopicName(), resultProduced);
with configured
num.stream.threads: 50
On startup application stuck with constantly logging messages(I'm not 100% sure it stuck but after 20 minutes there are no changes in the state and CPU, network usage is very high) :
State transition from RUNNING to PARTITIONS_REVOKED
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-1-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-2-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-3-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-4-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-5-consumer, groupId=group_id] (Re-)joining group
etc.
Topic has 100 partitions.
What we noticed: every transformer uses it's own persistentStateStore. After replacing it to inMemoryStateStore there still were logs written above but after ~3 minutes topology started successfully.
Kafka streams version 2.1.0.
Broker version 1.1.0

Why does a Kafka consumer take a long time to start consuming?

We start a Kafka consumer, listening on a topic which may not yet be created (topic auto creation is enabled though).
Not long thereafter a producer is publishing messages on that topic.
However, it takes some time for the consumer to notice this: 5 minutes to be exact. At this point the consumer revokes its partitions and rejoins the consumer group. Kafka re-stabilizes the group. Looking at the time-stamps of the consumer vs. kafka logs, this process is initiated at the consumer side.
I suppose this is expected behavior but I would like to understand this. Is this actually a re-balancing going on (from 0 to 1 partition)? If we'd create topics upfront, would this not happen?
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:36:45.693 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
2017-02-01 08:36:45.738 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : Successfully joined group tps-kafka-partitioning with generation 1
2017-02-01 08:36:45.747 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Setting newly assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.749 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions assigned:[]
2017-02-01 08:41:45.540 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
kafka logs
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-partitioning with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-partitioning generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,551] INFO [GroupCoordinator 1001]: Assignment received from leader for group tps-kafka-partitioning for generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-group-id with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-group-id generation 2 (kafka.coordinator.GroupCoordinator)
This is probably due to the default value of the parameter metadata.max.age.ms which controls how often the consumer forces a refresh of metadata for a topic.
What happens when you start the consumer up with a non existing topic is that the brokers autocreate this topic, but this takes a little bit of time with leader election etc., so when your consumer requests metadata for that topic it gets a LEADER_NOT_AVAILABLE warning and can't fetch any messages.
After the timeout mentioned above is reached the consumer refreshes metadata, successfully this time around and starts reading messages. This is not dependent on a producer writing messages to the topic, it is purely a consumer thing.
If you start your consumer with for example 1000ms timeout, you should see a much shorter delay until messages are consumed.
Also, if you create topics up front, or start the producer before the consumer, this behavior should not happen at all.

kafka consumer keeps looping over a bunch of messages after CommitFailedException

I am running a multi threaded kafka 091 consumer [New].
They way i generate a client.id is using a combination of the "hostname the consumer is running on" + "AtomicInt" + "the PID of the process".
I am running into issues when I have to stop the consumer and restart. Consumer keeps trying to process the offsets that were not consumed by the previous run(about 100 of them). But it keeps failing with this message.
2016-10-21 14:22:55,293 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-10-21 14:22:55,295 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group x.cg
2016-10-21 14:22:55,296 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at com.kfc.kafka.consumer.KFCConsumer$KafkaConsumerRunner.run(KFCConsumer.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-21 14:22:55,397 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to join group x.cg failed due to unknown member id, resetting and retrying.
.........
2016-10-21 14:22:58,124 [pool-3-thread-3] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to heart beat failed since the group is rebalancing, try to re-join group.
From the kakfa log, I see a lot of rebalances happening.
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordin
Turns out we were having long, recurring pauses [slow network, problems with external components etc.] w.r.t the external components that our consumer was interacting with.
Solution was to split our consumer into three consumer with different consumer group and Kafka config's (heartbeatinterval.ms, session.timeout.ms, request.timeout.ms, maxPartitionFetchBytes).
Having 3 different consumers with custom config for the above mentioned properties helped us get rid of the above problem.
The general thinking is not to have a lot of external communication within the consumer as this increases the uncertainty in Kafka consumer behavior and when you do have external communication make sure the Kafka Consumer Config's are inline with the SLA's of the external components.

kafka consumer AbstractCoordinator: Discovered coordinator Java client

I have 3 brokers running with broker id s 0 1 and 2. The Consumer (Java Client) picks up broker 0 as group coordinator and starts to consume messages correctly. But when the broker 0 which is the group coordinator is down, the consumer does not do anything and stalls on the poll() method. The process resumes only when that broker 0 is up and running.
How to handle this scenario of group co-ordinator change in the Java client?
I get this error when group coordinator dies:
16/09/22 17:42:45 INFO internals.AbstractCoordinator: Discovered coordinator datascience1.sv2.trulia.com:9092 (id: 2147483647 rack: null) for group group2.
16/09/22 17:42:45 INFO internals.AbstractCoordinator: (Re-)joining group group2
16/09/22 17:42:45 INFO internals.AbstractCoordinator: Marking the coordinator datascience1.sv2.trulia.com:9092 (id: 2147483647 rack: null) dead for group group2
See this answer: https://stackoverflow.com/a/50954843/7321097
and https://stackoverflow.com/a/50595475/7321097
The isue is in the offsets.topic.replication.factor & replication.factor configs.