We start a Kafka consumer listening on a topic that may not yet exist (topic auto-creation is enabled, though).
Not long afterwards, a producer starts publishing messages to that topic.
However, it takes some time for the consumer to notice this: 5 minutes, to be exact. At that point the consumer revokes its partitions and rejoins the consumer group, and Kafka re-stabilizes the group. Judging by the timestamps of the consumer logs vs. the Kafka logs, this process is initiated on the consumer side.
I suppose this is expected behavior, but I would like to understand it. Is this actually a rebalance (from 0 to 1 partitions)? If we created the topics up front, would this not happen?
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:36:45.693 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
2017-02-01 08:36:45.738 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : Successfully joined group tps-kafka-partitioning with generation 1
2017-02-01 08:36:45.747 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Setting newly assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.749 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions assigned:[]
2017-02-01 08:41:45.540 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
Kafka logs:
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-partitioning with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-partitioning generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,551] INFO [GroupCoordinator 1001]: Assignment received from leader for group tps-kafka-partitioning for generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-group-id with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-group-id generation 2 (kafka.coordinator.GroupCoordinator)
This is probably due to the default value of the parameter metadata.max.age.ms (300000 ms, i.e. 5 minutes), which controls how often the consumer forces a refresh of metadata for a topic.
What happens when you start the consumer against a non-existent topic is that the broker auto-creates the topic, but this takes a little time (leader election etc.), so when your consumer requests metadata for that topic it gets a LEADER_NOT_AVAILABLE warning and can't fetch any messages.
Once that interval elapses, the consumer refreshes its metadata, succeeds this time, and starts reading messages. This does not depend on a producer writing messages to the topic; it is purely a consumer-side thing.
If you start your consumer with metadata.max.age.ms set to, for example, 1000 ms, you should see a much shorter delay before messages are consumed (see the sketch below).
Also, if you create topics up front, or start the producer before the consumer, this behavior should not happen at all.
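For illustration, a minimal consumer sketch with a lowered metadata.max.age.ms; the broker address, group id, and topic name are placeholders, not taken from the original setup:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EagerMetadataConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "tps-kafka-partitioning");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Default is 300000 ms (5 minutes), which matches the observed delay;
        // a lower value makes the consumer re-fetch metadata soon after the topic is auto-created.
        props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "1000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic"));  // placeholder topic
            // poll loop omitted
        }
    }
}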
Related
There is a topology:
kStreamBuilder.stream(kafkaProperties.getInboundTopicName(), consumed)
.filterNot((k,v) -> Objects.isNull(v))
.transform(() -> new CustomTransformer(...))
.transform(() -> new AnotherTransformer(...))
.to(kafkaProperties.getOutTopicName(), resultProduced);
configured with
num.stream.threads: 50
On startup, the application gets stuck, constantly logging messages like the following (I'm not 100% sure it is actually stuck, but after 20 minutes there is no change in state, while CPU and network usage are very high):
State transition from RUNNING to PARTITIONS_REVOKED
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-1-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-2-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-3-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-4-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-5-consumer, groupId=group_id] (Re-)joining group
etc.
The topic has 100 partitions.
What we noticed: every transformer uses its own persistent state store. After replacing the persistent stores with in-memory state stores, the same log messages were still written, but after ~3 minutes the topology started successfully (see the sketch below).
Kafka Streams version: 2.1.0
Broker version: 1.1.0
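For reference, a minimal sketch of that store swap using StreamsBuilder, assuming the transformer looks up a key-value store named "custom-store" in its init() method; the store name, serdes, and topic names are illustrative, not the original ones:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// In-memory store instead of Stores.persistentKeyValueStore("custom-store")
StoreBuilder<KeyValueStore<String, String>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("custom-store"),
                Serdes.String(),
                Serdes.String());

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(storeBuilder);

builder.stream("inbound-topic")
       // CustomTransformer is the class from the topology above; it would call
       // context.getStateStore("custom-store") in init()
       .transform(() -> new CustomTransformer(/* ... */), "custom-store")
       .to("out-topic");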
We have 9 microservices that create 32 topics (2 of them created up front, 30 of them internal). After I add a new join, Kafka goes down. Is there a limitation that only 32 topics can be created with Kafka, or how can I solve this?
Thank you for your time.
Started SpringBootCounterMS in 4.652 seconds (JVM running for 7.59)
2018-07-04 10:39:29.513 INFO 14956 --- [-StreamThread-1] o.a.k.s.p.internals.StreamThread : stream-thread [counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7-StreamThread-1] State transition from PARTITIONS_ASSIGNED to RUNNING
2018-07-04 10:39:29.514 INFO 14956 --- [-StreamThread-1] org.apache.kafka.streams.KafkaStreams : stream-client [counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7]State transition from REBALANCING to RUNNING
2018-07-04 10:39:30.579 INFO 14956 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7-StreamThread-1-consumer, groupId=counter-service] Marking the coordinator localhost:9092 (id: 2147483647 rack: null) dead
2018-07-04 10:39:30.599 WARN 14956 --- [-StreamThread-2] o.a.k.s.p.i.InternalTopicManager : stream-thread [main] Could not create internal topics: Empty response for client request. Retry #0
Updating the JVM to a 64-bit one solves this problem; it is an out-of-memory error.
I'm using Kafka 1.1.0. A Kafka Streams application consistently throws this exception (albeit with different messages):
WARN o.a.k.s.p.i.RecordCollectorImpl#onCompletion:166 - task [0_0] Error sending record (key KEY value VALUE timestamp TIMESTAMP) to topic OUTPUT_TOPIC due to Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.; No more records will be sent and no more offsets will be recorded for this task.
WARN o.a.k.s.p.i.AssignedStreamsTasks#closeZombieTask:202 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] stream task 0_0 got migrated to another thread already. Closing it as zombie.
WARN o.a.k.s.p.internals.StreamThread#runLoop:752 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now.
org.apache.kafka.streams.errors.TaskMigratedException: StreamsTask taskId: 0_0
ProcessorTopology:
KSTREAM-SOURCE-0000000000:
topics:
[INPUT_TOPIC]
children: [KSTREAM-PEEK-0000000001]
KSTREAM-PEEK-0000000001:
children: [KSTREAM-MAP-0000000002]
KSTREAM-MAP-0000000002:
children: [KSTREAM-SINK-0000000003]
KSTREAM-SINK-0000000003:
topic:
OUTPUT_TOPIC
Partitions [INPUT_TOPIC-0]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:238)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: task [0_0] Abort sending since producer got fenced with a previous record
I'm not sure what is causing this exception. When I restart the application, it appears to successfully process a few records before failing with the same exception. Strangely enough, the same records are processed successfully several times even though the stream is set to exactly-once processing. Here is the stream configuration:
Properties streamProperties = new Properties();
streamProperties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, service.getName());
streamProperties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
//Should be DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG - but that field is private.
streamProperties.put("default.production.exception.handler", ErrorHandler.class);
streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
streamProperties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
streamProperties.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
streamProperties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
Out of the three servers, only two generate relevant logs when restarting the streams application. Here are logs from the first server:
[2018-05-09 14:42:14,635] INFO [GroupCoordinator 1]: Member INPUT_TOPIC-09dd8ac8-2cd6-4dd1-b963-63ea804c8fcc-StreamThread-1-consumer-3fedb398-91fe-480a-b5ee-1b5879d0956c in group INPUT_TOPIC has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 1 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Group INPUT_TOPIC with generation 2 is now empty (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 2 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Stabilized group INPUT_TOPIC generation 3 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,871] INFO [GroupCoordinator 1]: Assignment received from leader for group INPUT_TOPIC for generation 3 (kafka.coordinator.group.GroupCoordinator)
And from the second server:
[2018-05-09 14:42:16,228] INFO [TransactionCoordinator id=0] Initialized transactionalId INPUT_TOPIC-0_0 with producerId 2010 and producer epoch 37 on partition __transaction_state-37 (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:22,121] INFO [TransactionCoordinator id=0] Completed rollback ongoing transaction of transactionalId: INPUT_TOPIC-0_0 due to timeout (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:42,263] ERROR [ReplicaManager broker=0] Error processing append operation on partition OUTPUT_TOPIC-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 37 (request epoch), 38 (server epoch)
It appears that the first server sees that the consumer has failed and removes it from the consumer group before it is registered with the second server. Any ideas what could be causing the consumer to fail, or how to handle this failure gracefully? It's possible that it is this bug; does anyone know of a possible workaround?
I'm not sure what caused the problem, but reducing max.poll.records to 1 fixed it.
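Relative to the configuration shown in the question, that is a one-line change (the effect on transaction timing is my reading of the logs, not confirmed):

// Lowered from 10 to 1; with exactly-once, fewer records per poll means shorter
// transactions, which may avoid the broker-side transaction timeout seen in the logs.
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);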
I am running a multi-threaded Kafka 0.9.1 consumer (the new consumer API).
The way I generate a client.id is by combining the hostname the consumer is running on, an AtomicInteger, and the PID of the process.
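Roughly, a sketch of that scheme (the class and method names here are hypothetical):

import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientIdFactory {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    public static String nextClientId() throws Exception {
        String hostname = InetAddress.getLocalHost().getHostName();
        // getName() returns "pid@hostname" on most JVMs; the part before '@' is the PID.
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        return hostname + "-" + COUNTER.incrementAndGet() + "-" + pid;
    }
}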
I run into issues when I have to stop the consumer and restart it. The consumer keeps trying to process the offsets that were not consumed by the previous run (about 100 of them), but it keeps failing with this message:
2016-10-21 14:22:55,293 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-10-21 14:22:55,295 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group x.cg
2016-10-21 14:22:55,296 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at com.kfc.kafka.consumer.KFCConsumer$KafkaConsumerRunner.run(KFCConsumer.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-21 14:22:55,397 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to join group x.cg failed due to unknown member id, resetting and retrying.
.........
2016-10-21 14:22:58,124 [pool-3-thread-3] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to heart beat failed since the group is rebalancing, try to re-join group.
From the Kafka log, I see a lot of rebalances happening:
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordin
It turns out we were having long, recurring pauses (slow network, problems with external components, etc.) with respect to the external components our consumer was interacting with.
The solution was to split our consumer into three consumers, each with a different consumer group and its own Kafka configuration (heartbeat.interval.ms, session.timeout.ms, request.timeout.ms, max.partition.fetch.bytes).
Having three consumers with custom values for the above-mentioned properties got rid of the problem (see the sketch below).
The general thinking is not to have a lot of external communication within the consumer, as this increases the uncertainty in consumer behavior; when you do have external communication, make sure the consumer configuration is in line with the SLAs of the external components.
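As an illustration only, one of the three consumer configurations might look like this; the group name and all values are made up, chosen around the longest pause expected from that external component:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties slowExternalProps = new Properties();
slowExternalProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
slowExternalProps.put(ConsumerConfig.GROUP_ID_CONFIG, "x.cg.slow-external");       // a separate group per consumer
slowExternalProps.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);
slowExternalProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);            // sized to cover the worst-case pause,
                                                                                   // capped by the broker's group.max.session.timeout.ms
slowExternalProps.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 40000);            // must exceed session.timeout.ms
slowExternalProps.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);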
What is the default behavior of a Kafka (version 0.10) consumer when it tries to rejoin the consumer group?
I am using a single consumer for the consumer group, but it seems to have gotten stuck rejoining.
Every 10 minutes it prints the following lines in the consumer logs:
2016-08-11 13:54:53,803 INFO o.a.k.c.c.i.ConsumerCoordinator [pool-5-thread-1] ****Revoking previously assigned partitions**** [] for group image-consumer-group
2016-08-11 13:54:53,803 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] (Re-)joining group image-consumer-group
2016-08-11 14:04:53,992 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] Marking the coordinator dead for group image-consumer-group
2016-08-11 14:04:54,095 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] Discovered coordinator for group image-consumer-group.
2016-08-11 14:04:54,096 INFO o.a.k.c.c.i.AbstractCoordinator [pool-5-thread-1] (Re-)joining group image-consumer-group
Restarting the consumer application does not help.
If you are going to have only one consumer instance in a group, then use the consumer with manual partition assignment (like the old simple consumer).
Manual topic assignment does not use the consumer's group management functionality, so heartbeats are not required and there is no group to (re-)join; see the sketch below.
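A minimal sketch of manual assignment, assuming a single-partition topic; the broker address, group id, and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManuallyAssignedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // With assign(), the group.id is only used for committing offsets,
        // not for group membership, so no rebalances or (re-)joins happen.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "image-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Pick the partition(s) explicitly instead of subscribing to the topic.
            consumer.assign(Collections.singletonList(new TopicPartition("image-topic", 0)));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                    System.out.println(record.value());
                }
            }
        }
    }
}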