I have a topology using the Processor API which updates a state store, configured with a replication factor of 3 and acks=all:
Topologies:
   Sub-topology: 0
    Source: products-source (topics: [products])
      --> products-processor
    Processor: products-processor (stores: [products-store])
      --> enriched-products-sink
      <-- products-source
    Sink: enriched-products-sink (topic: enriched.products)
      <-- products-processor
My monitoring shows very little lag for the source topic (< 100 records); however, there is significant lag on the changelog topic backing the store, on the order of millions of records.
I'm trying to figure out the root cause of the lag on this changelog topic, as I'm not making any external requests in this processor. There are calls to RocksDB state stores, but these stores are all local and reads from them should be fast.
My question is: what exactly is the consumer of this changelog topic?
The consumer of the changelog topics is the restore consumer. The restore consumer is a Kafka consumer that is built into Kafka Streams. In contrast to the main consumer that reads records from the source topics, the restore consumer is responsible for restoring the local state stores from the changelog topics in case the local state does not exist or is out of date. Basically, it ensures that the local state stores recover after a failure. The second purpose of the restore consumer is to keep standby tasks up to date.
Each stream thread in a Kafka Streams client has one restore consumer. The restore consumer is not a member of a consumer group, and Kafka Streams assigns changelog partitions to the restore consumer manually. The offsets of restore consumers are not managed in the consumer offsets topic __consumer_offsets like those of the main consumer, but in a checkpoint file in the state directory of the Kafka Streams client.
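To check whether it is really restoration that is reading the changelog, you can register a state restore listener and log the restore progress. A minimal sketch (the listener class name is mine, not part of any API):

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

public class LoggingRestoreListener implements StateRestoreListener {

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("Restoring %s from %s, offsets %d to %d%n",
                storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        System.out.printf("Restored %d records of %s, now at offset %d%n",
                numRestored, storeName, batchEndOffset);
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName, long totalRestored) {
        System.out.printf("Finished restoring %s (%d records)%n", storeName, totalRestored);
    }
}

Register it on the KafkaStreams instance via streams.setGlobalStateRestoreListener(new LoggingRestoreListener()) before calling streams.start(). If the listener stays silent while the reported changelog lag keeps growing, the lag is probably not coming from an active restore.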
Related
I run a system comprising an InfluxDB, a Kafka broker, and data sources (sensors) producing time series data. The purpose of the broker is to protect the database from inbound event overload and to serve as a format-agnostic platform for ingesting data. The data is transferred from Kafka to InfluxDB via Apache Camel routes.
I would like to use Kafka as an intermediate message buffer in case a Camel route crashes or becomes unavailable, which is the most frequent failure in the system. So far I have not managed to configure Kafka so that inbound messages remain available for later consumption.
How do I configure it properly?
Messages are retained in Kafka topics according to the topic's retention policies (you can choose between time and size limits), as described in the Topic Configurations. With
cleanup.policy=delete
retention.ms=-1
the messages in a Kafka topic will never be deleted.
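For example, with the Java AdminClient you could create such a topic like this (a sketch; the topic name, partition and replica counts, and broker address are placeholders):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBufferTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=delete together with retention.ms=-1 keeps records indefinitely;
            // the only remaining limit is disk space (see retention.bytes).
            NewTopic topic = new NewTopic("sensor-data", 3, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", "-1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}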
Your Camel consumer will then be able to re-read all messages (offsets) if you use a new consumer group or reset the offsets of the existing consumer group. Otherwise, your Camel consumer might auto-commit the offsets (check the corresponding consumer configuration), and it will not be possible to re-read those offsets again with the same consumer group.
To limit the consumption rate of the Camel consumer, you may adjust configurations like maxPollRecords or fetchMaxBytes, which are described in the docs.
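If you want to replay the topic with the existing consumer group, one option is to reset its committed offsets while no consumer of that group is running. A sketch using the AdminClient (the group id, topic, and partition count are assumptions; alterConsumerGroupOffsets requires Kafka 2.4+):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetGroupOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Rewind all partitions of "sensor-data" to offset 0.
            // The call fails if the group still has active members.
            Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
            for (int partition = 0; partition < 3; partition++) {
                newOffsets.put(new TopicPartition("sensor-data", partition),
                        new OffsetAndMetadata(0L));
            }
            admin.alterConsumerGroupOffsets("camel-influx-group", newOffsets).all().get();
        }
    }
}

The kafka-consumer-groups.sh tool with --reset-offsets achieves the same from the command line.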
__consumer_offsets stores the offsets of all Kafka topics except internal topics, such as the *-changelog topics in the case of Kafka Streams. Where is this data stored?
The term "internal topic" has two different meanings in Kafka:
Brokers: an internal topic is a topic that the cluster uses (like __consumer_offsets). A client cannot read/write from/to this topic.
Kafka Streams: topics that Kafka Streams creates automatically are also called internal topics.
However, the -changelog and -repartition topics that are "internal" topics from a Kafka Streams point of view are regular topics from a broker point of view. Hence, offsets for both are stored in __consumer_offsets like for any other topic.
Note, though, that Kafka Streams will only commit offsets for -repartition topics. For -changelog topics, no offsets are ever committed (Kafka Streams does some offset tracking on the client side, though, and writes -changelog offsets into a local .checkpoint file).
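You can observe this by asking the broker for the offsets a Streams application has committed; the result contains source and -repartition partitions but no -changelog partitions. A sketch (application id and broker address are placeholders):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ShowCommittedOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // For Kafka Streams the consumer group id equals the application.id.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-streams-app")
                    .partitionsToOffsetAndMetadata()
                    .get();
            // Expect entries for source and -repartition topics only, never for -changelog topics.
            committed.forEach((tp, offset) ->
                    System.out.println(tp + " -> " + offset.offset()));
        }
    }
}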
I have a Kafka Streams application which reads data from a few topics, joins the data and writes it to another topic.
This is the configuration of my Kafka cluster:
5 Kafka brokers
Kafka topics - 15 partitions and replication factor 3.
My Kafka Streams applications are running on the same machines as my Kafka broker.
A few million records are consumed/produced per hour. Whenever I take a broker down, the application goes into a rebalancing state, and after rebalancing many times it starts consuming very old messages.
Note: When the Kafka Streams application was running fine, its consumer lag was almost 0. But after rebalancing, its lag went from 0 to 10 million.
Can this be because of offsets.retention.minutes?
This is the log and offset retention policy configuration of my Kafka broker:
log retention policy : 3 days
offsets.retention.minutes : 1 day
In the link below I read that this could be the cause:
Offset Retention Minutes reference
Any help in this would be appreciated.
Offset retention can have an impact. Cf. this FAQ: https://docs.confluent.io/current/streams/faq.html#why-is-my-application-re-processing-data-from-the-beginning
Also cf. How to commit manually with Kafka Stream? about how commits work.
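The FAQ boils down to this: once the committed offsets have expired (offsets.retention.minutes), the application falls back to the consumer's auto.offset.reset policy, which Kafka Streams defaults to earliest, so it re-reads old records. Increasing offsets.retention.minutes on the brokers is usually the better fix; if you would rather skip old data instead, a sketch of overriding the reset policy (application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsConfigExample {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Only used when no committed offsets are found (e.g. after they expired):
        // "latest" avoids re-processing old records, at the cost of skipping
        // whatever was written while the offsets were missing.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest");
        return props;
    }
}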
After setting up the Kafka broker cluster and creating a few topics, we found that the following two topics are automatically created by Kafka:
__consumer_offsets
_schema
What is the importance and use of these topics?
__consumer_offsets is used to store information about committed offsets for each topic:partition per group of consumers (group.id).
It is a compacted topic, so data will be periodically compacted and only the latest offset information will be available.
_schema is not a default Kafka topic (at least in Kafka 0.8/0.9). It is added by Confluent's Schema Registry. See more: Confluent Schema Registry - github.com/confluentinc/schema-registry (thanks @serejja)
__consumer_offsets: Every consumer group maintains its offsets per topic partition. Since v0.9, the committed offsets for every consumer group are stored in this internal topic (prior to v0.9 this information was stored in ZooKeeper). When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. The offset manager sends a successful offset commit response to the consumer only when all the replicas of the offsets topic have received the offsets.
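For illustration, the same path with a plain Java consumer: each commitSync() below turns into an OffsetCommitRequest that the group coordinator appends to __consumer_offsets (a sketch; topic name, group id, and broker address are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());   // process the record
                }
                // Sends an OffsetCommitRequest; the coordinator writes the offsets to the
                // compacted __consumer_offsets topic and acknowledges once its replicas have them.
                consumer.commitSync();
            }
        }
    }
}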
_schemas: This is an internal topic used by the Schema Registry, which is a distributed storage layer for Avro schemas. All the information relevant to schemas, subjects (with their corresponding versions), metadata, and compatibility configuration is appended to this topic. The Schema Registry in turn produces to this topic (e.g. when a new schema is registered under a subject) and consumes data from it.
When HDFS is not available, is there an approach to make sure the data is not lost? The scenario is: Kafka source, Flume memory channel, HDFS sink. If the Flume service goes down, can it store the offsets of the topic's partitions and consume from the right position after recovery?
Usually (with the default configuration), Kafka stores topic offsets for all consumers. If you start the Flume source with the same group id (one of the consumer properties), Kafka will start sending messages right from your source's committed offset. But messages that have already been read from Kafka and stored in your memory channel will be lost in case of an HDFS sink failure.
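In other words, the Flume Kafka source behaves like any consumer with a fixed group id: after a restart it resumes from the last committed offset, and only the events that were sitting in the non-persistent memory channel are gone. A small sketch that prints where such a group would resume (group id and topic are placeholders for your Flume source settings; consumer.committed(Set) requires a 2.4+ client):

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ResumePositionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "flume");   // group id assumed for the Flume source
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("sensor-data", 0);   // placeholder topic
            OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
            // A restarted consumer in the same group continues from this offset;
            // if it is null, auto.offset.reset decides where to start.
            System.out.println("Would resume at: "
                    + (committed == null ? "auto.offset.reset" : committed.offset()));
        }
    }
}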