Where does kafka store offsets of internal topics? - apache-kafka

__consumer_offsets stores the offsets of all Kafka topics except internal topics, such as the *-changelog topics used by Kafka Streams. Where are the offsets for those internal topics stored?

The term "internal topic" has two different meanings in Kafka:
Brokers: an internal topic is a topic that the cluster itself uses (like __consumer_offsets). Clients normally do not read from or write to this topic directly.
Kafka Streams: topics that Kafka Streams creates automatically are called internal topics, too.
However, the -changelog and -repartition topics that are "internal" from a Kafka Streams point of view are regular topics from a broker point of view. Hence, offsets for both are stored in __consumer_offsets, as for any other topic.
Note that Kafka Streams only commits offsets for -repartition topics, though. For -changelog topics, no offsets are ever committed (Kafka Streams does some offset tracking on the client side, however, and writes -changelog offsets into a local .checkpoint file).
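To make the local .checkpoint file more concrete, here is a minimal Python sketch that parses one. The layout it expects (a version line, an entry count, then one "topic partition offset" line per entry) is an assumption based on my understanding of Kafka's checkpoint format and may differ between versions. The sample content and the topic name in it are made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical checkpoint content; the layout is an assumption:
# line 1 = format version, line 2 = number of entries,
# remaining lines = "<topic> <partition> <offset>".
SAMPLE = """0
2
my-app-products-store-changelog 0 1532
my-app-products-store-changelog 1 977
"""


def parse_checkpoint(text: str) -> Dict[Tuple[str, int], int]:
    """Return {(topic, partition): offset} from checkpoint file content."""
    lines = [line for line in text.splitlines() if line.strip()]
    version = int(lines[0])
    if version != 0:
        raise ValueError(f"unsupported checkpoint version: {version}")
    expected = int(lines[1])

    offsets: Dict[Tuple[str, int], int] = {}
    for line in lines[2:]:
        # Topic names cannot contain spaces, so a plain split is enough.
        topic, partition, offset = line.split()
        offsets[(topic, int(partition))] = int(offset)

    if len(offsets) != expected:
        raise ValueError(f"expected {expected} entries, found {len(offsets)}")
    return offsets


if __name__ == "__main__":
    for (topic, partition), offset in parse_checkpoint(SAMPLE).items():
        print(f"{topic}-{partition}: {offset}")
```

In a real deployment the file would be read from the task directory under the application's state directory rather than from a string.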

Related

What are the consumers for state store changelog topics

I have a topology using the Processor API that updates a state store, configured with a replication factor of 3 and acks=all:
Topologies:
   Sub-topology: 0
    Source: products-source (topics: [products])
      --> products-processor
    Processor: products-processor (stores: [products-store])
      --> enriched-products-sink
      <-- products-source
    Sink: enriched-products-sink (topic: enriched.products)
      <-- products-processor
My monitoring shows very little lag for the source topic (< 100 records). However, there is significant lag on the changelog topic backing the store, on the order of millions of records.
I'm trying to figure out the root cause of the lag on this changelog topic, as I'm not making any external requests in this processor. There are calls to RocksDB state stores, but these stores are all local and should be fast to read from.
My question is: what exactly is the consumer of this changelog topic?
The consumer of the changelog topics is the restore consumer. The restore consumer is a Kafka consumer that is built into Kafka Streams. In contrast to the main consumer, which reads records from the source topic, the restore consumer is responsible for restoring the local state stores from the changelog topics when the local state is missing or out of date. Basically, it ensures that the local state stores recover after a failure. The second purpose of restore consumers is to keep standby tasks up to date.
Each stream thread in a Kafka Streams client has one restore consumer. The restore consumer is not a member of a consumer group, and Kafka Streams assigns changelog topics to it manually. Unlike the offsets of the main consumer, the offsets of restore consumers are not managed in the consumer offsets topic __consumer_offsets but in a file in the state store directory of the Kafka Streams client.

Kafka Topic Distribution among brokers

When creating topics, can we determine which broker will be the leader for the topic? Are topics balanced across brokers in Kafka? (Considering the topics have just one partition)
Kafka manages this internally, and in general you don't need to worry about it: http://kafka.apache.org/documentation/#basic_ops_leader_balancing
If you create a new topic, Kafka picks the brokers for its partitions automatically, aiming to spread partitions and their replicas evenly across the cluster. If a topic has only one partition, it will be hosted on a single broker (plus followers if you have multiple replicas), because a partition cannot be split over multiple brokers in Kafka.
Nevertheless, you can look up which broker hosts which topic, and you can also "move" topics and partitions: http://kafka.apache.org/documentation/#basic_ops_cluster_expansion

What is the use of __consumer_offsets and _schema topics in Kafka?

After setting up the Kafka broker cluster and creating a few topics, we found that the following two topics had been created automatically:
__consumer_offsets
_schema
What is the importance and use of these topics?
__consumer_offsets is used to store information about committed offsets for each topic:partition per consumer group (groupID).
It is a compacted topic, so older records are periodically cleaned up and only the latest offset for each key is retained. Note that compaction is not the same as compression.
_schema is not a default Kafka topic (at least as of Kafka 0.8 and 0.9). It is added by Confluent. See more: Confluent Schema Registry - github.com/confluentinc/schema-registry (thanks #serejja)
__consumer_offsets: Every consumer group maintains its offset per topic partition. Since v0.9, the committed offsets for every consumer group are stored in this internal topic (prior to v0.9 this information was stored in ZooKeeper). When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. The offset manager sends a successful offset commit response to the consumer only after all the replicas of the offsets topic have received the offsets.
_schemas: This is an internal topic used by the Schema Registry, which is a distributed storage layer for Avro schemas. All the information relevant to a schema, its subject (with the corresponding version), metadata, and compatibility configuration is appended to this topic. The Schema Registry, in turn, produces to this topic (e.g. when a new schema is registered under a subject) and consumes data from it.
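As an illustration of what "compacted" means for __consumer_offsets, the following Python sketch keeps only the most recent committed offset per (group, topic, partition) key. It is a simplified model, not the broker's implementation: real compaction runs in the background on log segments and also handles delete markers, and the group and topic names below are invented.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, int]   # (group id, topic, partition)
Record = Tuple[Key, int]     # (key, committed offset)


def compact(log: Iterable[Record]) -> List[Record]:
    """Keep only the latest record for each key, in log order."""
    latest: Dict[Key, int] = {}
    for key, offset in log:
        # Remove and re-insert so the surviving record keeps the
        # position of its most recent write.
        latest.pop(key, None)
        latest[key] = offset
    return list(latest.items())


if __name__ == "__main__":
    commits = [
        (("billing", "orders", 0), 5),
        (("billing", "orders", 1), 3),
        (("billing", "orders", 0), 9),  # supersedes the first commit
    ]
    for key, offset in compact(commits):
        print(key, offset)
```

After compaction, a consumer group that restarts only needs the one surviving record per key to know where to resume.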

Kafka consumer Behaviour

I am writing a Kafka consumer and have a question about consumer processes.
I have a consumer with groupId="testGroupId", and using the same groupId I consume from multiple topics, say "topic1" and "topic2".
Also, assume "topic1" has already been created on the broker whereas "topic2" has not yet been created.
Now, if I start the consumer, I see consumer threads for "topic1" (which already exists) in the ZooKeeper nodes, but I do not see any consumer threads for "topic2".
My question is: will the consumer thread(s) for "topic2" be created only after we create the topic on the broker?
I assume you use a Kafka ConsumerConnector method such as createMessageStreamsByFilter. The consumer listens for Kafka topic events, and when a new matching topic is created it subscribes to that topic automatically.

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, ZooKeeper and Storm.
In our environment we have one Kafka broker connecting to multiple ZooKeeper nodes. Is there an advantage to having the producer send messages to a specific topic and partition on one broker connected to multiple ZooKeeper nodes, versus multiple brokers connected to multiple ZooKeeper nodes?
Yes, there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. Then, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.
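The example with brokers A and B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the leader map is copied from the example above, CRC32 stands in for the hash used by Kafka's real partitioner (which uses murmur2 for keyed records), and the helper names are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Tuple

# Leaders as in the example: A leads partitions 1 and 2, B leads 3.
LEADERS: Dict[int, str] = {1: "A", 2: "A", 3: "B"}
BROKERS: List[str] = ["A", "B"]


def partition_for(key: bytes, partition_ids: Iterable[int]) -> int:
    """Pick a partition for a key (CRC32 is a stand-in hash)."""
    ids = sorted(partition_ids)
    return ids[zlib.crc32(key) % len(ids)]


def route(key: bytes, leaders: Dict[int, str]) -> Tuple[int, str]:
    """Return (partition, leader broker) the client would send to."""
    partition = partition_for(key, leaders)
    return partition, leaders[partition]


def followers(partition: int, leaders: Dict[int, str],
              brokers: List[str], replication_factor: int) -> List[str]:
    """Brokers that hold backup copies of the partition."""
    others = [b for b in brokers if b != leaders[partition]]
    return others[:replication_factor - 1]


if __name__ == "__main__":
    for key in (b"user-1", b"user-2", b"user-3"):
        partition, leader = route(key, LEADERS)
        backups = followers(partition, LEADERS, BROKERS, 2)
        print(key, "-> partition", partition, "leader", leader,
              "replicated to", backups)
```

With a replication factor of 2, each partition led by A is backed up on B and vice versa, which is where the fault tolerance comes from.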