Kafka Replication - apache-kafka

We are working on a project where we wish to use Kafka. Based on our learning so far, we have a few queries:
Reference URL: https://www.youtube.com/watch?v=BGhlHsFBhLE#t=40m53s
In a multi-node, multi-broker architecture, can a consumer read from an in-sync follower replica?
Are there any Kafka documentation links that give us a walkthrough of such an architecture?
Kafka says that "Producers and consumers both write to and read from the LEADER replica; the follower replica is a high-availability solution and is not meant to be read from."
In this case, how can the same TOPIC be read from multiple brokers? Any documentation / reference links that can help me understand how this can be achieved?
If the concept of "LEADER / FOLLOWER" is at the partition level and topics reside within a partition, then how can a topic be read from multiple brokers (given that the replicas on other brokers are FOLLOWER replicas, from which data cannot be read)?

No. Consumers always read from leaders.
I guess there is a bunch of material about Kafka -- just search the Internet. Also check out http://docs.confluent.io/3.0.1/
A topic consists of one or more partitions, and partitions are distributed over the brokers (see https://kafka.apache.org/documentation.html#intro_topics). Thus, for a single topic, you can use at most as many brokers as the topic has partitions to read/write data.
It is the other way round (it is not correct that "topics reside within a partition"): a topic contains multiple partitions.
Also check out this blog post about partitions and replication in Kafka: http://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
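For illustration, here is a minimal sketch of creating such a multi-partition, replicated topic with the Java AdminClient; the broker address, topic name, and counts are placeholders, not values from the question:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions spread over the brokers, each partition replicated 3 times
                NewTopic topic = new NewTopic("my-topic", 6, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }

With 6 partitions, up to 6 brokers can serve reads/writes for this one topic, which matches the point above.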

No, consumers must read only from the partition leader. Replication is just for fault tolerance.
A topic is divided into partitions. A partition is the basic unit of replication and distribution. Each partition has its own leader for reads and writes. You can specify the layout of how those partitions should be distributed across brokers.
Check out the following short blog post describing the basic concepts.
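To see that each partition has its own leader, here is a small sketch using the Java AdminClient; the broker address and topic name are placeholders:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;
    import java.util.Collections;
    import java.util.Properties;

    public class ShowPartitionLeaders {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                        .all().get().get("my-topic");
                // each partition reports its current leader and in-sync replicas
                for (TopicPartitionInfo p : desc.partitions()) {
                    System.out.printf("partition %d: leader=broker %d, replicas=%s, isr=%s%n",
                            p.partition(), p.leader().id(), p.replicas(), p.isr());
                }
            }
        }
    }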

Related

Why schema registry internal topic _schemas has only single partition?

From the Confluent documentation here:
Kafka is used as the Schema Registry storage backend. The special Kafka topic <kafkastore.topic> (default _schemas), with a single partition, is used as a highly available write-ahead log.
The _schemas topic is created with a single partition. What is the design rationale behind this? Having more than one partition would surely improve searching for schemas by consumers.
The schemas topic must be ordered, and it uses the default partitioner; therefore it has one partition. There is only one consumer anyway (the master registry server), so it doesn't need to scale. The HTTP server can handle thousands of requests perfectly fine; the schemas are all stored in memory after the topic has been consumed. Consumers and producers also cache schemas after using them once.
The replication factor of one allows for local development without editing configs. You should change this.
Kafka's own internal topics (the consumer offsets and transaction topics) default to 1 as well, by the way, and num.partitions also defaults to 1 for auto-created topics.
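The "single-partition write-ahead log replayed into memory" pattern described above can be sketched with a plain consumer. This is only an illustration of the pattern, not Schema Registry's actual code; a real replay would keep polling until it has caught up with the end of the log:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class WalReplaySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Map<String, String> store = new HashMap<>(); // in-memory view of the log

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // the topic has exactly one partition, so reading it back is totally ordered
                TopicPartition wal = new TopicPartition("_schemas", 0);
                consumer.assign(Collections.singletonList(wal));
                consumer.seekToBeginning(Collections.singletonList(wal));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    store.put(record.key(), record.value()); // later records overwrite earlier ones
                }
            }
            System.out.println("replayed " + store.size() + " entries");
        }
    }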

What is the correlation between Kafka streams/tables, GlobalKTable, brokers and partitions?

I am studying Kafka streams, tables, GlobalKTable, etc., and now I am confused about them.
What exactly is a GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I ran some tests and noticed that the mapping is 1:1. But what if I make the topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, but a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it is used for more static data and for looking up records in joins.
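A minimal topology sketch of that difference; the topic names are made up for illustration and serde configuration is omitted:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class TableVsGlobalTable {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // KTable: each application instance only holds the partitions assigned to it
            KTable<String, String> products = builder.table("products");

            // GlobalKTable: every application instance holds all partitions,
            // which makes it a good fit for static lookup data in joins
            GlobalKTable<String, String> productLookup = builder.globalTable("product-lookup");

            KStream<String, String> orders = builder.stream("orders");
            orders.join(productLookup,
                    (orderKey, orderValue) -> orderKey,               // map each order to the lookup key
                    (orderValue, productValue) -> orderValue + " -> " + productValue)
                  .to("enriched-orders");
        }
    }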
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance would process records from 2 partitions; the workload is split across all running instances with the same application-id.
Topics can be replicated across different brokers in Kafka, with 3 being the typical replication factor in production. A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and on two other follower brokers (assuming a three-node broker cluster).
Hope this clears things up some.
-Bill

How to change the number of brokers for a topic in a kafka cluster?

I have a problem with some Kafka topics and haven't found an answer to it yet.
While adding more partitions to __confluent.support.metrics shouldn't be a problem (I know how to do that), I wonder if it is possible to tell it to use brokers which apparently cannot be seen by this topic.
Also, I'd love to understand why these topics only use some brokers instead of all 5 available brokers in their cluster.
I'd love to fix these topics, but I fear that if I tell them to add (or use) partitions on brokers the topic can't "see", it might not work or even destroy the topic, which would be rather bad.
How can I instruct these topics that there are 5 available brokers? Can I do it with one of the Kafka tools?
How could that have happened in the first place?
Why does the __consumer_offsets topic only "see" 4 brokers instead of 5 like all other topics in this cluster do?
FYI: I didn't set up any of this, but I have to clean up/revamp the running clusters and am stuck now; I have never come across this sort of problem before.
The reason this has happened is that you have only one partition and one replica for the __confluent.support.metrics topic. In a 5-node cluster, this means you will only be using 20% of the available brokers in the cluster, which corresponds with the image you've posted. A topic with replication factor 1 and one partition will only ever hold data on one broker.
On the other hand, it is unusual that your __consumer_offsets topic would be using only 4 out of 5 brokers. My guess would be that your 5th broker was not online at the time of creation of __consumer_offsets (this is created when you consume from any topic for the first time) and thus no partitions were created on this broker.
However, this is probably nothing to worry about, as the spread of partitions across the cluster is generally handled by Kafka itself rather than being a user problem. There is no concept of a topic "seeing" a broker per se; rather, the brokers hold the data for the topics, and the topics will know which brokers they reside on. A topic doesn't generally need to concern itself with other brokers.
Both the consumer offsets and Confluent metrics topics have line items in the server properties file that determine what configurations those topics will be created with.
To improve the health of those topics, you can attempt to increase the replication factor, which will spread your topic over more brokers and provide fault tolerance. Also see the Kafka Tools wiki.
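If the cluster runs Kafka 2.4 or newer, such a reassignment can also be submitted programmatically with the Admin API (the classic route is the kafka-reassign-partitions tool with a JSON plan). The broker ids and target topic below are illustrative only:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;

    public class IncreaseReplication {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // place partition 0 of the topic on brokers 1, 2 and 3 (replication factor 3)
                Map<TopicPartition, Optional<NewPartitionReassignment>> plan = new HashMap<>();
                plan.put(new TopicPartition("__confluent.support.metrics", 0),
                        Optional.of(new NewPartitionReassignment(Arrays.asList(1, 2, 3))));
                admin.alterPartitionReassignments(plan).all().get();
            }
        }
    }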

Kafka: a partition log vs. cluster

I have read the documentation here and am a bit unsure about the partition log.
First they say:
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Then they show a picture of the partitioned log.
Also they say
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Do I understand correctly that:
A cluster can have only one partition log of a given topic? In other words, two partitions of the same topic cannot be in the same cluster?
A cluster can have multiple partition logs from different topics?
The picture of a topic should look more like this?
A topic consists of one or many partitions. You specify the number of partitions when creating the topic, and partitions can also be added after creation.
Kafka will spread the partitions on as many brokers as it can in the cluster. If you only have a single broker then they will be all on this broker.
Many partitions from the same topic can live on the same broker. This happens all the time, as most clusters only have a dozen brokers and it's not uncommon to have 50 partitions per topic; hence, several partitions from the same topic will live on the same broker.
What the docs say is that a partition is a unit that cannot be split. It's either on a broker or not. A topic, on the other hand, is just a collection of partitions that have the same name and configuration.
To answer your questions:
For a Kafka cluster of b brokers and a topic with p partitions, each broker will roughly hold p/b partitions as primary copy. They might also hold the replica partitions, but that depends on your replication factor. So, e.g. if you have a 3-node cluster, and a topic test with 6 partitions, each node will have 2 partitions.
Yes, it surely can. Extending the previous point, if you have two topics, test1 and test2, each with 6 partitions, then each broker will hold 4 partitions in total (2 for each topic).
I guess in the diagram you have mislabeled the brokers as a cluster.
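A deliberately simplified sketch of the placement arithmetic above (real Kafka randomizes the starting broker and also spreads replicas, e.g. across racks):

    public class RoundRobinSketch {
        public static void main(String[] args) {
            int brokers = 3;    // b
            int partitions = 6; // p
            for (int p = 0; p < partitions; p++) {
                // simple round-robin: each broker ends up leading p/b = 2 partitions
                System.out.printf("partition %d -> leader broker %d%n", p, p % brokers);
            }
        }
    }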

Apache Kafka Scaling Topics using partitions

We started to use Apache Kafka to persist timeseries data into a timeseries database. We started with just a single topic, a producer writing to this topic, and a single consumer reading from it and dumping the data into the timeseries database.
We had 3 broker instances, and what we noticed in the first try was that the producer was pretty fast at writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was doing just 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved the consumer write speed. Now my questions are:
What happens if I set my topic's partition count to 6 but have only 3 broker instances? Which broker instance would be the leader for partitions 1 to 6?
Is there a formula to determine how many partitions I will need? Since this was our test environment, we could play with it and scale it. We might not be able to do the same in our production environment. So how do we determine the number of partitions?
The partitions get distributed amongst your brokers. It's impossible to know in advance which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer, you have to find partition leaders yourself and respond to new leader elections in your code (instead of having it handled automatically by the API).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60, or multiples thereof, for the number of partitions of large topics -- something that divides easily and cleanly among variable numbers of consumers.
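To make the parallelism point concrete, here is a sketch of a consumer you could start several times with the same group.id; Kafka then splits the topic's partitions among the running instances (broker address, group id, and topic name are placeholders):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "timeseries-writers");       // same id => partitions are shared
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("timeseries"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        // write the record to the timeseries database here
                        System.out.println(record.partition() + ": " + record.value());
                    }
                }
            }
        }
    }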
Also, note that you can change the number of partitions later on, with some caveats. See this answer for how to do it and what the caveats are.
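For reference, a sketch of growing the partition count with the Java AdminClient; the topic name and target count are placeholders, and the main caveat applies: keyed records may hash to different partitions afterwards:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;
    import java.util.Collections;
    import java.util.Properties;

    public class AddPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // grow the topic to 12 partitions; the count can only increase, never decrease
                admin.createPartitions(
                        Collections.singletonMap("timeseries", NewPartitions.increaseTo(12))
                ).all().get();
            }
        }
    }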