Why schema registry internal topic _schemas has only single partition? - apache-kafka

From the Confluent documentation:
Kafka is used as Schema Registry storage backend. The special Kafka topic <kafkastore.topic> (default _schemas), with a single partition, is used as a highly available write ahead log.
The _schemas topic is created with a single partition. What is the design rationale behind this? Surely having more than one partition would improve schema lookups by consumers.

The schemas topic must be ordered, and it uses the default partitioner, so it has a single partition. There is only one consumer anyway (the master registry server), so the topic does not need to scale. The HTTP server can handle thousands of requests perfectly fine; the schemas are all held in memory once the topic has been consumed. Consumers and producers also cache schemas after using them once.
The replication factor of one allows for local development without editing configs. You should increase it for production.
By the way, the server.properties file shipped with Kafka also sets the replication factor of Kafka's own internal topics (consumer offsets and transaction state) to 1. And num.partitions also defaults to 1 for auto-created topics.
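To illustrate why ordering matters here, the following is a minimal sketch, not Schema Registry's actual code. It assumes a toy record format (subject, version, schema) and a hypothetical replay function that rebuilds an in-memory view from the log. With one partition every reader sees the records in the same order; with a hypothetical two-partition layout a reader could see version 2 before version 1.

```python
def replay(records):
    """Rebuild an in-memory view: subject -> (version, schema) of the last record seen."""
    latest = {}
    for subject, version, schema in records:
        latest[subject] = (version, schema)
    return latest

# Single partition: every reader sees version 1, then version 2.
single_partition = [
    ("orders-value", 1, '"string"'),
    ("orders-value", 2, '"long"'),
]
in_order = replay(single_partition)

# Hypothetical two-partition layout: each version lands on a different
# partition, and a reader happens to drain partition 1 first.
partition_0 = [single_partition[0]]
partition_1 = [single_partition[1]]
out_of_order = replay(partition_1 + partition_0)

print(in_order["orders-value"][0])      # 2
print(out_of_order["orders-value"][0])  # 1
```

In the reordered case the rebuilt view ends on the older version, which is the kind of inconsistency a single ordered log avoids.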

Related

What is the correlation in kafka stream/table, globalktable, brokers and partition?

I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am confused about them.
What exactly is a GlobalKTable?
More generally, if I have a topic with N partitions and one Kafka Streams application, how many streams (partitions?) will I have after I send some data to the topic?
I ran some tests and noticed that the mapping is 1:1. But what if the topic is replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data and mostly for looking up records in joins.
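The difference can be sketched with plain dictionaries. This is only a toy model, not the Kafka Streams API; the topic contents and the helper names ktable_view and global_ktable_view are made up for illustration.

```python
# Toy topic: partition number -> records (key -> value) stored in that partition.
topic = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"c": 3},
    3: {"d": 4},
}

def ktable_view(topic, assigned_partitions):
    """KTable-like view: only records from the partitions assigned to this instance."""
    view = {}
    for p in assigned_partitions:
        view.update(topic[p])
    return view

def global_ktable_view(topic):
    """GlobalKTable-like view: records from every partition, on every instance."""
    return ktable_view(topic, topic.keys())

instance_1 = ktable_view(topic, [0, 1])
instance_2 = ktable_view(topic, [2, 3])
print(sorted(instance_1))                 # ['a', 'b']
print(sorted(instance_2))                 # ['c', 'd']
print(sorted(global_ktable_view(topic)))  # ['a', 'b', 'c', 'd']
```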
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you launch two instances of the same Kafka Streams application, then each instance processes records from 2 partitions; the workload is split across all running instances with the same application-id.
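The split can be sketched as follows. This assumes a simple round-robin spread; the real Kafka Streams assignor is more involved, and the function and instance names are hypothetical.

```python
def assign(num_partitions, instances):
    """Spread partition numbers over instances round-robin (illustrative only)."""
    assignment = {name: [] for name in instances}
    for p in range(num_partitions):
        assignment[instances[p % len(instances)]].append(p)
    return assignment

print(assign(4, ["app-1"]))           # {'app-1': [0, 1, 2, 3]}
print(assign(4, ["app-1", "app-2"]))  # {'app-1': [0, 2], 'app-2': [1, 3]}
```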
Topics can be replicated across different brokers, and 3 is a commonly used replication factor (note that the broker setting default.replication.factor itself defaults to 1). A replication factor of 3 means the records for a given partition are stored on the leader broker for that partition and on two follower brokers (which requires a cluster of at least three brokers).
Hope this clears things up some.
-Bill

Manually setting Kafka consumer offset

In our project, there are active Kafka servers (PR) and passive Kafka servers (DR). Both sets of brokers are configured with the same group name, topic name and partitions. When switching from PR to DR, the __consumer_offsets topic is set manually on DR.
My question here is, would the Kafka consumer be able to seamlessly consume the messages from where it was last read?
When replicating messages across 2 clusters, it's not possible to ensure offsets stay in sync.
For example, if a topic exists for a little while on the Active cluster the log start offset for some partitions may not be 0 (some records have been deleted by the retention policies). Hence when replicating this topic, offsets between both clusters will not be the same. This can also happen when messages are lost or duplicated as you can't have exactly once semantics when replicating between 2 clusters.
So you can't just replicate the __consumer_offsets topic; this will not work. Consumer group positions have to be explicitly "translated" between both clusters. While it's possible to reset them "manually" by directly committing, it's not recommended, as finding the new positions is not obvious.
Instead, you should use a replication tool that supports "offset translation" to ensure consumers can seamlessly switch from 1 cluster to the other.
For example, MirrorMaker 2, the official Kafka tool for mirroring clusters, supports offset translation via RemoteClusterUtils. You can find the details in the KIP.
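To make the idea of offset translation concrete, here is a toy sketch. It is not how MirrorMaker 2 implements it; the record layout and the translate helper are assumptions for illustration. The source partition has lost offsets 0 to 99 to retention, so the replicated copy starts at 0 on the target.

```python
# Source partition after retention removed offsets 0..99: (source_offset, value).
source_log = [(100, "e"), (101, "f"), (102, "g"), (103, "h"), (104, "i")]

# Replication appends the records to an empty target partition, so the
# n-th replicated record receives target offset n.
offset_map = {src: tgt for tgt, (src, _value) in enumerate(source_log)}

def translate(committed, offset_map):
    """Target offset of the first replicated record at or after the committed source offset."""
    later = [tgt for src, tgt in offset_map.items() if src >= committed]
    # If the group has consumed everything, resume at the end of the target log.
    return min(later) if later else len(offset_map)

print(translate(103, offset_map))  # 3
print(translate(105, offset_map))  # 5
```

Reusing the source offset 103 unchanged on the target would point past the end of the five-record target log, which is why the committed position has to be translated rather than copied.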
Relying on both clusters having the same offsets is inherently faulty.
An offset is a relative characteristic; it is not part of a message. It is literally a position inside the file. And those files, the Kafka log files, are also rotated and subject to retention. There is no guarantee that those log files are identical at any given point in time, and Kafka doesn't claim to solve this.
Besides, it's tricky to solve from a CAP point of view.
And it's also pointless unless you want strict physical replication.
That's why Kafka multi-cluster tools are usually about logical replication. I have not used MirrorMaker (MM), but I have used Replicator (a more advanced commercial tool by Confluent), and it has a feature for this which, just like the MM one, is called offset translation.
Replicator does the following:
Reads the consumer offset and timestamp information from the __consumer_timestamps topic in the origin cluster to understand a consumer group’s progress.
Translates the committed offsets in the origin datacenter to the corresponding offsets in the destination datacenter.
Writes the translated offsets to the __consumer_offsets topic in the destination cluster, as long as no consumers in that group are connected to the destination cluster.
Note: You do need to add an interceptor to your Kafka Consumers.
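The steps above suggest a timestamp-based lookup. The following is only a sketch under that assumption, not Replicator's implementation: given the timestamp of the last record a group consumed in the origin cluster, it finds the first destination offset with a timestamp at or after it. The data and the function name are hypothetical, and timestamps are assumed to be non-decreasing within the partition.

```python
from bisect import bisect_left

# Destination partition: (offset, timestamp_ms), timestamps assumed non-decreasing.
destination = [(0, 1000), (1, 1010), (2, 1020), (3, 1030)]

def offset_for_timestamp(log, ts):
    """First offset whose timestamp is >= ts, or the log end offset if there is none."""
    timestamps = [t for _offset, t in log]
    i = bisect_left(timestamps, ts)
    return log[i][0] if i < len(log) else log[-1][0] + 1

print(offset_for_timestamp(destination, 1015))  # 2
print(offset_for_timestamp(destination, 2000))  # 4
```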

Kafka Materialized Views TTL

As far as I know, Kafka by default keeps the records in topics for 7 days and then deletes them. But what about Kafka materialized views: how long does Kafka keep the data there (indefinitely or for a limited time)? Also, does Kafka replicate materialized views across the cluster?
Kafka topics can be configured either with a retention time or with log compaction. With log compaction, the latest record for each key is never deleted, while older records with the same key are garbage collected at regular intervals. See https://kafka.apache.org/documentation/#compaction
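The effect of compaction can be sketched on a toy log. This is a simplified model, not the broker's log cleaner: it ignores tombstones and the uncompacted head of the log, and the compact function is hypothetical.

```python
def compact(log):
    """Keep only the last record for each key, preserving offsets and their order."""
    last_offset = {key: offset for offset, (key, _value) in enumerate(log)}
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset
    ]

log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
print(compact(log))
# [(2, 'k1', 'c'), (3, 'k3', 'd'), (4, 'k2', 'e')]
```

Every key still has its latest value after compaction, which is what lets a compacted topic serve as the backing store for a table.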
When Kafka Streams creates a KTable or state store and creates a changelog topic for fault tolerance, it creates this changelog topic with log compaction enabled.
Note: if you read a topic directly as a KTable or GlobalKTable (ie, builder.table(...)), no additional changelog topic will be created but the source topic will be used for this purpose. Thus, the source topic should be configured with log compaction (and not with retention time).
You can configure the desired replication factor with the StreamsConfig parameter replication.factor. You can also manually change the replication factor at any time if you wish, e.g., with a partition reassignment via the bin/kafka-reassign-partitions.sh command.

How to change the number of brokers for a topic in a kafka cluster?

I have a problem with some Kafka topics and couldn't find an answer to it yet.
While adding more partitions to __confluent.support.metrics shouldn't be a problem (I know how to do that), I wonder if it is possible to tell it to use brokers which obviously can not be seen by this topic?
Also I'd love to understand why these topics only inherit some brokers instead of all available 5 brokers in their cluster.
I'd love to fix these topics. But I fear that if I tell it to add (or use) partitions on brokers the topic can't "see", that it might not work or even destroy the topic, which would be rather bad.
How can I instruct these topics, that there are 5 available brokers? Can I do it with one of the Kafka tools?
How could that have happened in the first place?
Why does the __consumer_offsets topic only "see" 4 brokers instead of 5 like all other topics in this cluster do?
FYI: I didn't set up any of this, but I have to clean up/revamp the running clusters and am stuck now. I have never come across this sort of problem before.
The reason this has happened is because you have only one partition and one replica for the __confluent.support.metrics topic. In a 5-node cluster, this means you will only be using 20% of the available brokers in the cluster, which corresponds with the image you've posted. A topic with replication-factor 1 and 1 partition will only ever hold data on one broker.
On the other hand, it is unusual that your __consumer_offsets topic would be using only 4 out of 5 brokers. My guess would be that your 5th broker was not online at the time of creation of __consumer_offsets (this is created when you consume from any topic for the first time) and thus no partitions were created on this broker.
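Both explanations can be checked with a small placement sketch. It assumes a plain round-robin placement over the brokers that are alive at creation time, which is a simplification of Kafka's real assignment logic. The function name is hypothetical, and the 50 partitions used for __consumer_offsets are an assumption, not something stated in the question.

```python
def brokers_used(num_partitions, replication_factor, live_brokers):
    """Place replicas round-robin over live brokers; return the brokers that hold data.

    Assumes replication_factor <= len(live_brokers).
    """
    used = set()
    for p in range(num_partitions):
        for r in range(replication_factor):
            used.add(live_brokers[(p + r) % len(live_brokers)])
    return used

all_brokers = [1, 2, 3, 4, 5]

# One partition with one replica lands on exactly one of the five brokers.
metrics = brokers_used(1, 1, all_brokers)
print(len(metrics) / len(all_brokers))  # 0.2

# Assumed 50 partitions, created while broker 5 was offline.
offsets = brokers_used(50, 1, [1, 2, 3, 4])
print(sorted(offsets))  # [1, 2, 3, 4]
```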
However, this is probably nothing to worry about, as the spread of partitions across the cluster is generally handled by Kafka itself rather than being a user problem. There is no concept of a topic "seeing" a broker per se; rather, the brokers hold the data for the topics, and the topics will know which brokers they reside on. A topic doesn't generally need to concern itself with other brokers.
Both the consumer offsets and Confluent metrics topics have line items in the server properties file that determine which configuration those topics are created with.
To improve the health of those topics, you can attempt to increase the replication factor, which will spread your topic over more brokers and provide fault tolerance. Also see Kafka Tools Wiki

Kafka Replication

We are working on a project where we wish to use Kafka. Based on what we have learned so far, we have a few queries:
Reference URL: https://www.youtube.com/watch?v=BGhlHsFBhLE#t=40m53s
In a multiple-node, multiple-broker architecture, can a consumer read from an in-sync follower?
Are there any Kafka documentation links that walk through such an architecture?
Kafka says that "Producers and Consumers both write to and read from the LEADER replica, and the Follower replica is a High Availability solution and not meant to be read from".
In this case, how can the same TOPIC be read from multiple brokers? Are there any documentation / reference links that explain how this can be achieved?
If the concept of "LEADER / FOLLOWER" is at the partition level and topics reside within a partition, then how can a topic be read from multiple brokers (as the replica on other brokers will be a FOLLOWER replica, from which data cannot be read)?
No. Consumers always read from leaders.
There is plenty of material about Kafka; just search the Internet. Also check out http://docs.confluent.io/3.0.1/
A topic consists of one or more partitions, and partitions are distributed over the brokers (see https://kafka.apache.org/documentation.html#intro_topics). Thus, for a single topic you can use at most as many brokers as there are topic partitions to read/write data from/to this topic.
It is the other way round (it is not correct that "topics reside within a partition"): a topic contains multiple partitions.
Also check out this blog post about partitions and replication in Kafka: http://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
No, consumers must read only from the partition leader. Replication is just for fault tolerance.
A topic is divided into partitions. The partition is the basic unit of replication and distribution. Each partition has its own leader for reads and writes. You can specify how those partitions should be laid out across brokers.
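A small sketch shows how one topic ends up being read from several brokers even though only leaders serve reads. The replica layout below is hypothetical, and treating the first replica in each list as the leader is an assumption for illustration.

```python
# Hypothetical layout: partition -> replica brokers, first replica acts as leader.
layout = {
    0: [1, 2, 3],
    1: [2, 3, 1],
    2: [3, 1, 2],
}

def leaders(layout):
    """Map each partition to the broker that leads it."""
    return {partition: replicas[0] for partition, replicas in layout.items()}

def brokers_serving_reads(layout):
    """Brokers that lead at least one partition of the topic."""
    return sorted(set(leaders(layout).values()))

print(leaders(layout))                # {0: 1, 1: 2, 2: 3}
print(brokers_serving_reads(layout))  # [1, 2, 3]
```

Each partition is read from exactly one broker, but because the leaders of different partitions sit on different brokers, the topic as a whole is served by all three.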
Check out the following short blog post describing the basic concepts.