Kafka out of order messages and kafka down error

Kafka out of order messages and kafka down error - apache-kafka

I was having an error cannot CREATE PARTITIONS on my service logs and I rearranged my partitions in yml file according to partition count given in KafkaDrop and restarted services. But now I'm seeing related error and after some time kafka is going down. I couldn't find solution regarding this issue. Kafka Version : 0.11
WARN Received a PartitionLeaderEpoch assignment for an epoch < latestEpoch. This implies messages have arrived out of order. New: {epoch:12, offset:323051}, Current: {epoch:29, offset208082} for Partition: __consumer_offsets-48 (kafka.server.epoch.LeaderEpochFileCache)

Related

KSQL | Consumer lag | confluent cloud |

I am using kafka confluent cloud as a message queue in the eco-system. There are 2 topics, A and B.
Messages in B arrives a little later after messages of A is being published. ( in a delay of 30 secs )
I am joining these 2 topics using ksql, ksql server is deployed in in-premises and is connected to confluent cloud. In the KSQL i am joining these 2 topics as streams based on the common identifier, say requestId and create a new stream C. C is the joined stream.
At a times, C steam shows it has generated a lag it has not processed messages of A & B.
This lag is visible in the confluent cloud UI. When i login to ksql server i could see following error and after restart of ksql server everything works fine. This happens intermittently in 2 - 3 days.
Here is my configuration in the ksql server which is deployed in in-premises.
# A comma separated list of the Confluent Cloud broker endpoints
bootstrap.servers=${bootstrap_servers}
ksql.internal.topic.replicas=3
ksql.streams.replication.factor=3
ksql.logging.processing.topic.replication.factor=3
listeners=http://0.0.0.0:8088
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="${bootstrap_auth_key}" password="${bootstrap_secret_key}";
# Schema Registry specific settings
ksql.schema.registry.basic.auth.credentials.source=USER_INFO
ksql.schema.registry.basic.auth.user.info=${schema_registry_auth_key}:${schema_registry_secret_key}
ksql.schema.registry.url=${schema_registry_url}
#Additinoal settings
ksql.streams.producer.delivery.timeout.ms=2147483647
ksql.streams.producer.max.block.ms=9223372036854775807
ksql.query.pull.enable.standby.reads=false
#ksql.streams.num.standby.replicas=3 // TODO if we need HA 1+1
#num.standby.replicas=3
# Automatically create the processing log topic if it does not already exist:
ksql.logging.processing.topic.auto.create=true
# Automatically create a stream within KSQL for the processing log:
ksql.logging.processing.stream.auto.create=true
compression.type=snappy
ksql.streams.state.dir=${base_storage_directory}/kafka-streams
Error message in the ksql server logs.
[2020-11-25 14:08:49,785] INFO stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-2] State transition from RUNNING to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:220)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] WARN stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3] Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009)
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1158)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1132)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1107)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
Edit :
During this exception. i have verified the ksql server has enough RAM and CPU

Kafka Consumer left consumer group

I've faced some problem using Kafka. Any help is much appreciated!
I have zookeeper and kafka cluster 3 nodes each in docker swarm. Kafka broker configuration you can see below.
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_NUM_PARTITIONS: 8
KAFKA_REPLICA_SOCKET_TIMEOUT_MS: 30000
KAFKA_REQUEST_TIMEOUT_MS: 30000
KAFKA_COMPRESSION_TYPE: "gzip"
KAFKA_JVM_PERFORMANCE_OPTS: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
KAFKA_HEAP_OPTS: "-Xmx768m -Xms768m -XX:MetaspaceSize=96m"
My case:
20x Producers producing messages to kafka topic constantly
1x Consumer reads and log messages
Kill kafka node (docker container stop) so now cluster has 2 nodes of Kafka broker (3rd will start and join cluster automatically)
And Consumer not consuming messages anymore because it left consumer group due to rebalancing
Does exist any mechanism to tell consumer to join group after rebalancing?
Logs:
INFO 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] Attempt to heartbeat failed since group is rebalancing
WARN 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

#Rostyslav Whenever we make a call by consumer to read a message, it does 2 major calls.
Poll
Commit
Poll is basically fetching records from kafka topic and commit tells kafka to save it as a read message , so that it's not read again. While polling few parameters play major role:
max_poll_records
max_poll_interval_ms
FYI: variable names are per python api.
Hence, whenever we try to read message from Consumer, it makes a poll call every max_poll_interval_ms and the same call is made only after the records fetched (as defined in max_poll_records) are processed. So, whenever, max_poll_records are not processed in max_poll_inetrval_ms, we get the error.
In order to overcome this issue, we need to alter one of the two variable. Altering max_poll_interval_ms ccan be hectic as sometime it may take longer time to process certain records, sometime lesser records. I always advice to play with max_poll_records as a fix to the issue. This works for me.

How to solve a problem with checkpointed invalid __consumer_offsets and producer epoch on partitions of __transaction_state

I have two kinds of log entries in server.log
First kind:
WARN Resetting first dirty offset of __consumer_offsets-6 to log start offset 918 since the checkpointed offset 903 is invalid. (kafka.log.LogCleanerManager$)
Second kind:
INFO [TransactionCoordinator id=3] Initialized transactionalId Source: AppService Kafka consumer -> Not empty string filter -> CDMEvent mapper -> (NonNull CDMEvent filter -> Map -> Sink: Kafka CDMEvent producer, Nullable CDMEvent filter -> Map -> Sink: Kafka Error producer)-bddeaa8b805c6e008c42fc621339b1b9-2 with producerId 78004 and producer epoch 23122 on partition __transaction_state-45 (kafka.coordinator.transaction.TransactionCoordinator)
I have found some suggestion that mentions that removing the checkpoint file might help:
https://medium.com/#anishekagarwal/kafka-log-cleaner-issues-80a05e253b8a
"What we gathered was to:
stop the broker
remove the log cleaner checkpoint file
( cleaner-offset-checkpoint )
start the broker
that solved the problem for us."
Is it safe to try that with all checkpoint files (cleaner-offset-checkpoint, log-start-offset-checkpoint, recovery-point-offset-checkpoint, replication-offset-checkpoint) or is it not recommendable at all with any of them?

I have stopped each broker and moved cleaner-offset-checkpoint to a backup location and started it without that file, brokers neatly started, deleted a lot of excessive segments and they don't log:
WARN Resetting first dirty offset of __consumer_offsets to log start offset since the checkpointed offset is invalid
any more, obviously, this issue/defect https://issues.apache.org/jira/browse/KAFKA-6266 is not solved yet, even in 2.0. 2. However, that didn't compact the consumer offset according to expectations, namely offsets.retention.minutes default is 10080 (7 days), and I tried to set it explicitely to 5040, but it didn't help, still there are messages more than one month old, since log.cleaner.enable is by default true, they should be compacted, but they are not, the only possible try is to set the cleanup.policy to delete again for the __consumer_offsets topic, but that is the action that triggered the problem, so I am a bit reluctant to do that. The problem that I described here No Kafka Consumer Group listed by kafka-consumer-groups.sh is also not resolved by that, obviously there is something preventing kafka-consumer-groups.sh to read the __consumer_offsets topic (when issued with --bootstrap-server option, otherwise it reads it from zookeeper) and display results, that's something that Kafka Tool does without problem, and I believe these two problems are connected.
And the reason why I think that topic is not compacted, is because it has messages with exactly the same key (and even timestamp), older than it should, according to broker settings. Kafka Tool also ignores certain records and doesn't interpret them as Consumer Groups in that display. Why kafka-consumer-groups.sh ignores all, that is probably due to some corruption of these records.

Kafka UNKNOWN_PRODUCER_ID exception

I sometimes find UNKNOWN_PRODUCER_ID exception when using kafka streams.
2018-06-25 10:31:38.329 WARN 1 --- [-1-1_0-producer] o.a.k.clients.producer.internals.Sender : [Producer clientId=default-groupz-7bd94946-3bc0-4400-8e73-7126b9b9c0d4-StreamThread-1-1_0-producer, transactionalId=default-groupz-1_0] Got error produce response with correlation id 1996 on topic-partition default-groupz-mplat-five-minute-stat-urlCount-counts-store-changelog-0, retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID
Referred to official documents:
This exception is raised by the broker if it could not locate the
producer metadata associated with the producerId in question. This
could happen if, for instance, the producer's records were deleted
because their retention time had elapsed. Once the last records of the
producerId are removed, the producer's metadata is removed from the
broker, and future appends by the producer will return this exception.
It says one possibility is that a producer is idle for more than retention time (by default a week) so the producer's metadata will be removed from broker. Are there any other reasons that brokers could not locate producer metadata?

You might be experiencing https://issues.apache.org/jira/browse/KAFKA-7190. As it says in that ticket:
When a streams application has little traffic, then it is possible that consumer purging would delete
even the last message sent by a producer (i.e., all the messages sent by
this producer have been consumed and committed), and as a result, the broker
would delete that producer's ID. The next time when this producer tries to
send, it will get this UNKNOWN_PRODUCER_ID error code, but in this case,
this error is retriable: the producer would just get a new producer id and
retries, and then this time it will succeed.
This issue is also being tracked at https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer

Two reasons might delete your producer's metadata:
The log segments are deleted due to hitting retention time.
The producer state might get expired due to inactivity which is controlled by the setting transactional.id.expiration.ms which defaults to 7 days
So if your Kafka is < 2.4 you can workaround this by increasing the retention time(considering that your system allows that) of your topic's log(e.g 30 days) and to increase the transactional.id.expiration.ms setting( to 24 days) until KIP-360 is released:
log.retention.hours=720
transactional.id.expiration.ms=2073600000
This shall guarantee that for low-traffic topics(messages written rarely than 7 days), your producer's metadata state will remain stored in broker's memory for a longer period, thus decreasing the risk of getting UnknownProducerIdException.

Kafka unrecoverable if broker dies

We have a kafka cluster with three brokers (node ids 0,1,2) and a zookeeper setup with three nodes.
We created a topic "test" on this cluster with 20 partitions and replication factor 2. We are using Java producer API to send messages to this topic. One of the kafka broker intermittently goes down after which it is unrecoverable. To simulate the case, we killed one of the broker manually. As per the kafka arch, it is supposed to self recover, but which is not happening. When I describe the topic on the console, I see the number of ISR's reduced to one for few of the partitions as one of the broker killed. Now, whenever we are trying to push messages via the producer API (either Java client or console producer), we are encountering SocketTimeoutException.. One quick look into the logs says, "Unable to fetch the metadata"
WARN [2015-07-01 22:55:07,590] [ReplicaFetcherThread-0-3][] kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-3],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 23711; ClientId: ReplicaFetcherThread-0-3;
ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [zuluDelta,2] -> PartitionFetchInfo(11409,1048576),[zuluDelta,14] -> PartitionFetchInfo(11483,1048576).
Possible cause: java.nio.channels.ClosedChannelException
[2015-07-01 23:37:40,426] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:1,host:abc-0042.yy.xxx.com,port:9092] failed (kafka.client.ClientUtils$)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Any leads will be appreciated...

From your error Unable to fetch metadata it could mostly be because you could have set the bootstrap.servers in the producer to the broker that has died.
Ideally, you must have more than one broker in the bootstrap.servers list because if one of the broker fails (or is unreachable) then the other could give you the metadata.
FYI: Metadata is the information about a particular topic that tells how many number of partitions it has, their leader brokers, follower brokers etc.
So, when a key is produced to a partition, its corresponding leader broker will be the one to whom the messages will be sent to.
From your question, your ISR set has only one broker. You could try setting the bootstrap.server to this broker.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Kafka out of order messages and kafka down error - apache-kafka

Related

KSQL | Consumer lag | confluent cloud |

Kafka Consumer left consumer group

How to solve a problem with checkpointed invalid __consumer_offsets and producer epoch on partitions of __transaction_state

Kafka UNKNOWN_PRODUCER_ID exception

Kafka unrecoverable if broker dies

Categories

Resources