What is the use of __consumer_offsets and _schema topics in Kafka? - apache-kafka

After setting up the Kafka broker cluster and creating a few topics, we found that the following two topics are automatically created by Kafka:
__consumer_offsets
_schema
What is the importance and use of these topics?

__consumer_offsets is used to store information about committed offsets for each topic:partition per consumer group (group.id).
It is a compacted topic, so the log is periodically compacted and only the latest offset information per key is kept.
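If you want to see what is inside it, you can read the topic with the console consumer and the offsets message formatter. This is only a sketch, assuming a broker at localhost:9092; the formatter class name has changed across Kafka versions (around 0.9/0.10 it lived under kafka.coordinator, later under kafka.coordinator.group), so adjust it to your version:

# Read the internal offsets topic (internal topics are excluded by default)
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic __consumer_offsets \
  --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" \
  --consumer-property exclude.internal.topics=false \
  --from-beginning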
_schema is not a default Kafka topic (at least in Kafka 0.8/0.9). It is created by Confluent's Schema Registry. See more: Confluent Schema Registry - github.com/confluentinc/schema-registry (thanks @serejja)

__consumer_offsets: Every consumer group maintains its offsets per topic partition. Since v0.9, the committed offsets for every consumer group are stored in this internal topic (prior to v0.9 this information was stored in ZooKeeper). When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager sends a successful offset commit response to the consumer only when all the replicas of the offsets topic have received the offsets.
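You can describe this topic like any other to see the partitions and replicas those commits are written to. A sketch, assuming a broker at localhost:9092 (on older Kafka versions, pass --zookeeper instead of --bootstrap-server); the partition count and replication factor depend on your broker settings:

# Show partition count, leaders and replicas of the internal offsets topic
kafka-topics --bootstrap-server localhost:9092 --describe --topic __consumer_offsets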
_schemas: This is an internal topic used by the Schema Registry, which is a distributed storage layer for Avro schemas. All the information relevant to schemas, subjects (with their corresponding versions), metadata and compatibility configuration is appended to this topic. The Schema Registry, in turn, produces to this topic (e.g. when a new schema is registered under a subject) and consumes from it.
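As a quick illustration (assuming a Schema Registry instance at localhost:8081 and a hypothetical subject name my-topic-value), the REST calls below are what end up producing records to _schemas behind the scenes:

# Register a new schema version under a subject (this appends a record to _schemas)
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\": \"string\"}"}' \
  http://localhost:8081/subjects/my-topic-value/versions

# List all registered subjects
curl http://localhost:8081/subjects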

Related

What are the consumers for state store changelog topics

I have a topology using the Processor API which updates a state store, configured with a replication factor of 3 and acks=all.
Topologies:
   Sub-topology: 0
    Source: products-source (topics: [products])
      --> products-processor
    Processor: products-processor (stores: [products-store])
      --> enriched-products-sink
      <-- products-source
    Sink: enriched-products-sink (topic: enriched.products)
      <-- products-processor
My monitoring shows very little lag for the source topic (< 100 records); however, there is significant lag on the changelog topic backing the store, on the order of millions of records.
I'm trying to figure out the root cause of the lag on this changelog topic, as I'm not making any external requests in this processor. There are calls to RocksDB state stores, but these stores are all local and should be fast to read from.
My question is: what exactly is the consumer of this changelog topic?
The consumer of the changelog topics is the restore consumer. The restore consumer is a Kafka consumer that is built into Kafka Streams. In contrast to the main consumer that reads records from the source topics, the restore consumer is responsible for restoring the local state stores from the changelog topics in case the local state does not exist or is out of date. Basically, it ensures that the local state stores recover after a failure. The second purpose of the restore consumer is to keep standby tasks up to date.
Each stream thread in a Kafka Streams client has one restore consumer. The restore consumer is not a member of a consumer group, and Kafka Streams assigns the changelog topics to it manually. The offsets of the restore consumer are not managed in the consumer offsets topic __consumer_offsets like the offsets of the main consumer, but in a file in the state directory of the Kafka Streams client.
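For example (a sketch, assuming the default state.dir of /tmp/kafka-streams and a hypothetical application.id of my-app), you can see those client-side checkpoint files on disk:

# Each task directory under the state dir holds the local stores plus a .checkpoint file
ls /tmp/kafka-streams/my-app/0_0/
# The checkpoint file maps changelog topic-partitions to the offsets restored so far
cat /tmp/kafka-streams/my-app/0_0/.checkpoint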

Where does Kafka store offsets of internal topics?

__consumer_offsets stores offsets for all Kafka topics except internal topics such as the *-changelog topics in the case of Streams. Where is this data stored?
The term "internal topic" has two different meanings in Kafka:
Brokers: an internal topic is a topic that the cluster uses (like __consumer_offsets). A client cannot read/write from/to this topic.
Kafka Streams: topics that Kafka Streams creates automatically are called internal topics, too.
However, those -changelog and -repartition topics that are "internal" topics from a Kafka Streams point of view, are regular topics from a broker point of view. Hence, offsets for both are stored in __consumer_offsets like for any other topic.
Note that Kafka Streams only commits offsets for -repartition topics, though. For -changelog topics, no offsets are ever committed (Kafka Streams does some offset tracking on the client side and writes -changelog offsets into a local .checkpoint file).
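You can verify this with the consumer group tool, since the group id of a Kafka Streams application is its application.id. A sketch, assuming a broker at localhost:9092 and a hypothetical application id my-app; the output lists committed offsets for the source and -repartition topics, while no -changelog topics show up:

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group my-app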

Kafka Topic Distribution among brokers

When creating topics, can we determine which broker will be the leader for the topic? Are topics balanced across brokers in Kafka? (Considering the topics have just one partition)
Kafka manages this internally, and in general you don't need to worry about it: http://kafka.apache.org/documentation/#basic_ops_leader_balancing
If you create a new topic, Kafka will select a broker based on load. If a topic has only one partition, it will only be hosted on a single broker (plus followers if you have multiple replicas), because a partition cannot be split over multiple brokers in Kafka.
Nevertheless, you can find out which broker hosts which topic, and you can also "move" topics and partitions: http://kafka.apache.org/documentation/#basic_ops_cluster_expansion
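For example (assuming a broker at localhost:9092 and a hypothetical topic my-topic; on older Kafka versions, pass --zookeeper instead of --bootstrap-server), you can check which broker is the leader of each partition:

# Shows, per partition, the leader broker id, the replica list and the in-sync replicas (ISR)
kafka-topics --bootstrap-server localhost:9092 --describe --topic my-topic

Moving partitions to specific brokers is then done with the kafka-reassign-partitions tool described in the cluster expansion docs linked above.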

Kafka Connect offset.storage.topic not receiving messages (i.e. how to access Kafka Connect offset metadata?)

I am working on setting up a Kafka Connect Distributed Mode application which will be a Kafka to S3 pipeline. I am using Kafka 0.10.1.0-1 and Kafka Connect 3.1.1-1. So far things are going smoothly, but one aspect that is important to the larger system I am working with requires knowing the offset information of the Kafka -> FileSystem pipeline. According to the documentation, the offset.storage.topic configuration will be the location the distributed mode application uses for storing offset information. This makes sense given how Kafka stores consumer offsets in the 'new' Kafka. However, after doing some testing with the FileStreamSinkConnector, nothing is being written to my offset.storage.topic, which has the default value connect-offsets.
To be specific, I am using a Python Kafka producer to push data to a topic and using Kafka Connect with the FileStreamSinkConnector to output the data from the topic to a file. This works and behaves as I expect the connector to behave. Additionally, when I stop the connector and start it again, the application remembers the state in the topic and there is no data duplication. However, when I go to the offset.storage.topic to see what offset metadata is stored, there is nothing in the topic.
This is the command that I use:
kafka-console-consumer --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 --topic connect-offsets --from-beginning
I receive this message after letting this command run for a minute or so:
Processed a total of 0 messages
So to summarize, I have 2 questions:
Why is offset metadata not being written to the topic that should be storing this even though my distributed application is keeping state correctly?
How do I access offset metadata information for a Kafka Connect distributed mode application? This is 100% necessary for my team's Lambda Architecture implementation of our system.
Thanks for the help.
Liju is correct: connect-offsets is used to track offsets for source connectors (which have a producer but not a consumer). Sink connectors have a consumer and track offsets the usual way, in the __consumer_offsets topic.
The best way to look at last committed offsets is with the consumer group tool:
bin/kafka-consumer-groups.sh --group connect-elastic-login-connector --bootstrap-server localhost:9092 --describe
The group name is always "connect-" followed by the connector name (in my case, elastic-login-connector). This will show the latest offset committed by the group, which basically acknowledges that all messages up to this offset were written to Elastic.
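For source connectors, on the other hand, the offset metadata does land in connect-offsets, stored as JSON and keyed by connector. A sketch for inspecting it, assuming the default topic name and the brokers from the question:

kafka-console-consumer --bootstrap-server kafka1:9092 --topic connect-offsets \
  --from-beginning --property print.key=true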
The offsets might be committed to Kafka's default offset commit topic, i.e. __consumer_offsets
The new S3 connector released by Confluent might be of interest to you.
From what you describe, it might significantly simplify your goal of exporting records from Kafka to your S3 buckets.

Flume uses HDFS sink. How to ensure data integrity when HDFS is not available?

When HDFS is not available, is there an approach to ensure data safety? The scenario is: Kafka source, Flume memory channel, HDFS sink. If the Flume service goes down, can it store the offsets of the topic's partitions and consume from the right position after recovery?
Usually (with the default configuration), Kafka stores committed offsets for all consumers. If you start the Flume source with the same group id (one of the consumer properties), Kafka will resume sending messages from your source's last committed offset. But messages that have already been read from Kafka and stored in your memory channel will be lost if the HDFS sink fails.
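To check where the Flume source would resume from, you can inspect its consumer group's committed offsets in Kafka. A sketch, assuming a Flume version that uses the new Kafka consumer (so offsets are committed to Kafka rather than ZooKeeper), a broker at localhost:9092, and a hypothetical group id flume-group:

# CURRENT-OFFSET is where the source will resume; LAG shows how far behind it is
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group flume-group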