Kafka: a partition log VS cluster - apache-kafka

I have read from here and a bit not sure about the partition log.
First they say:
For each topic, the Kafka cluster maintains a partitioned log that
looks like this:
Then they show a picture:
Also they say
The partitions in the log serve several purposes. First, they allow
the log to scale beyond a size that will fit on a single server. Each
individual partition must fit on the servers that host it, but a topic
may have many partitions so it can handle an arbitrary amount of data.
Second they act as the unit of parallelism—more on that in a bit.
Do I understand correctly that :
On a cluster, it can have only one partition log of a topic? In other words, two partition of the same topic cannot be in the same cluster?
A Cluster can have multiple partition log from different topics?
The picture about a topic should be more like this?

A topic consist of 1 or many partitions. You specify the number of partitions when creating the topic, and partitions can also be added after creation.
Kafka will spread the partitions on as many brokers as it can in the cluster. If you only have a single broker then they will be all on this broker.
Many partitions from the same topic can live on the same broker. This happens all the time as most clusters only have a dozen brokers and it's not uncommon to have 50 partitions, hence several partitions from the same topic will live on the same broker.
What the docs say is that a partition is a unit that cannot be split. It's either on a broker or not. Whereas a topic is just a collections of partitions that have the same name and configuration.

To answer your question:
For a Kafka cluster of b brokers and a topic with p partitions, each broker will roughly hold p/b partitions as primary copy. They might also hold the replica partitions, but that depends on your replication factor. So, e.g. if you have a 3-node cluster, and a topic test with 6 partitions, each node will have 2 partitions.
Yes, it surely can. Extending the previous point, if you have two topics test1, and test2, each with 6 partitions, then each broker will hold 4 partitions in total (2 for each topics).
I guess in the diagram you have mislabeled brokers as cluster.

Related

What is the correlation in kafka stream/table, globalktable, borkers and partition?

I am studying kafka streams, table, globalktable etc. Now I am confusing about that.
What exactly is GlobalKTable?
But overall if I have a topic with N-partitions, and one kafka stream, after I send some data on the topic how much stream (partition?) will I have?
I made some tries and I notice that the match is 1:1. But what if I make topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application. But a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it used for more static data and is used more for lookup records in joins.
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions, the workload is split across all running instances with the same application-id.
Topics are replicated across different brokers by default in Kafka, with 3 being the default level of replication. A replication level of 3 means the records for a given partition are stored on the lead broker for that partition and two other follower brokers (assuming a three-node broker cluster).
Hope this clears things up some.
-Bill

Kafka Consumer group - No of partition - No of replication

Trying to Understand the relationship between replication factor and Consumer group . Example : Number of partition = 2 Number of replication = 3 Number consumers in consumer group = 4 . In this case ,
How many consumer will receive the message ?
How This replication will impact the number of consumer to receive .
For your first question, since you have two partitions in your example, only 2 of the 4 consumers will actually get data. The other two consumers will not have any partitions assigned to them, because there aren't any partitions left for that consumer group. If you had a different consumer group, then those consumers would still be assigned partitions.
Additionally, in this case, you mention there's only a single message coming through. Depending on which partition it's assigned to, the message will only be sent to that partition. So in this case, only one of the four consumers will get the message, the one that had that partition assigned to it.
As for your second question, replication factor configuration in Kafka doesn't impact the number of messages consumers receive. Replication, as far as consumers and producers are concerned, is an internal kafka cluster detail that they don't need to worry about. As long as they're producing/consuming to/from the leader of the partition, that's all they need to know. A topic could have replication factor 2, and another one could have replication factor 10, and they would both behave identically to producers and consumers.
There's a few more details in the official Kafka documentation: https://kafka.apache.org/documentation/#theconsumer
To give some additional details on the replication factor, it doesn't have any relation whatsoever to the number of consumers receiving messages from the topic. Replication serves only one major purpose, and that is High Availability. So, let's say you have 3 brokers in a cluster, and for a topic my-topic you've set replication factor as 2. Now, if at-most one broker goes down at some point of time, you'd still be okay, as the messages are replicated in another broker for the topic.

How to change the number of brokers for a topic in a kafka cluster?

I have a problem with some Kafka topics and couldn't find an answer to it yet.
While adding more partitions to __confluent.support.metrics shouldn't be a problem (I know how to do that), I wonder if it is possible to tell it to use brokers which obviously can not be seen by this topic?
Also I'd love to understand why these topics only inherit some brokers instead of all available 5 brokers in their cluster.
I'd love to fix these topics. But I fear that if I tell it to add (or use) partitions on brokers the topic can't "see", that it might not work or even destroy the topic, which would be rather bad.
How can I instruct these topics, that there are 5 available brokers? Can I do it with one of the Kafka tools?
How could that have happened in the first place?
Why does the __consumer_offsets topic only "see" 4 brokers instead of 5 like all other topics in this cluster do?
FYI: I didn't setup any of this, but I have to cleanup/revamp the running clusters and am stuck now, I never came across this sort of problem before
The reason this has happened is because you have only one partition and one replica for the __confluent.support.metrics topic. In a 5-node cluster, this means you will only be using 20% of the available brokers in the cluster, which corresponds with the image you've posted. A topic with replication-factor 1 and 1 partition will only ever hold data on one broker.
On the other hand, it is unusual that your __consumer_offsets topic would be using only 4 out of 5 brokers. My guess would be that your 5th broker was not online at the time of creation of __consumer_offsets (this is created when you consume from any topic for the first time) and thus no partitions were created on this broker.
However, this is probably nothing to worry about, as the spread of partitions across the cluster is generally handled by Kafka itself rather than being a user problem. There is no concept of a topic "seeing" a broker per se; rather, the brokers hold the data for the topics, and the topics will know which brokers they reside on. A topic doesn't generally need to concern itself with other brokers.
Both the consumer offsets and Confluent metrics topics have line items in the server properties file that determines what configurations those topics will be created with.
To improve the health of those topics, you can attempt to increase the replication factor, which will spread your topic over more brokers and provide fault tolerance. Also see Kafka Tools Wiki

Whether kafka broker holds replication set instead of partitions?

From the Kafka documentation and from some other blogs I read, I concluded that one kafka-broker consists of one partition of topics. Here It says one Kafka-broker holds only one partition. I have only one broker in my system, But I can able to create a topic with 3 partitions and 1 replication factor. I also tried to create a topic with 3 partitions and 3 replication factor with only one broker. It throws below error
Error while executing topic command : replication factor: 3 larger than available brokers: 1
[2017-10-21 15:35:25,928] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor: 3 larger than available brokers: 1
(kafka.admin.TopicCommand$).
So I have a query.
Whether Kafka-broker holds replication instead of a partition?
If I create 3 partitions with a single broker, what happens?
In such case of 1 broker, 1 replica and 3 partition , how many partitions of single topic can kafka-broker hold?
Somebody, please explain what happens here.
The post you are reffering to doesn't say that one broker can store only one partitions. It just says that partition is not splittable between brokers (topic is). Actually I manage a brokers with thousands of partitions. So, for your questions:
Kafka brokers hold many partitions. Replication is the way to store multiple copies of partitions across cluster.
If you create topic with 3 partitions on single node cluster, the broker will hold the data for all the partitions. Replication is not possible since it requires more nodes.
All of them.
Summary: The replication factor has to be equal or less in number compared to the number of brokers you have.

Apache Kafka Scaling Topics using partitions

We started to use Apache Kafka to persist Timeseries data into a Timeseries database. What we started with was to just have a single topic, a producer writing to this topic and a single consumer reading from this topic and dumping the data to the Timeseries database.
We had 3 broker instances and what we noticed in the first try was that the producer was pretty fast in writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was just doing 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved on the consumer write speed. Now my questions are:
What happens if I set my topic partition to 6, but I have only 3 broker instances. Which broker instance would be the leader for partition 1 to 6?
Is there a formula to determine how many partitions would I be needing? Since this was our test environment, we could play with it and scale it. We might not be able to do the same on our production environment. So how to determine the partition size?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.