What happens when a Kafka partition is full? - apache-kafka

Does Kafka put the data into another partition that was configured before launching?
What is the benefit of, and the main reason for, having more partitions than the usual 3?
How does it affect read and write performance?

I am having a bit of trouble understanding your question, but I think you may be misinterpreting what a partition actually is. A partition only has a size limit if you specify one in the config; otherwise you can simply think of a partition as a separate stream of data in which each record has an offset. In practice, the limit on the size of a partition (or topic, for that matter) is simply whatever the disk capacity will allow. Usually the data just stays there and gets deleted once a configured retention period or maximum size limit has been reached.
The list of configurations is pretty extensive but you can view it here:
https://kafka.apache.org/documentation/#configuration
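For illustration, here is a minimal sketch, using the Kafka Java AdminClient, of creating a topic whose data is bounded by retention settings rather than by any notion of a partition being "full"; the topic name, partition count, and retention values are made-up examples:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 1; data is trimmed by retention, not by "fullness".
                NewTopic topic = new NewTopic("example-topic", 3, (short) 1)
                        .configs(Map.of(
                                "retention.ms", "604800000",        // delete data older than 7 days
                                "retention.bytes", "1073741824"));  // or once a partition exceeds ~1 GB
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }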
As for your other question about the reasoning for having more partitions, it simply comes down to scaling. If you only have a small amount of data then a small number of partitions will be enough. With Kafka there is no benefit to having more consumers in a group than partitions: if you have three partitions, then three consumers is the optimal number, as each consumer will be assigned exactly one partition. If you have more consumers than partitions, some will sit idle.
So what if we have more data and need to read it faster to avoid lag? We can add more partitions, and by doing so we can also scale up the number of consumers.
Side note: a single consumer can read from multiple partitions, but if you have too few consumers relative to the number of partitions, they may not be able to keep up with high loads, and every partition reassignment costs time during which no processing happens.
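A rough sketch of one consumer in such a group (the group id, topic name, and broker address are placeholders): run three copies of this against a three-partition topic and each instance gets one partition, while a fourth copy would sit idle.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group"); // all instances share this group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));
                while (true) {
                    // Each instance is assigned a subset of the topic's partitions.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }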

Related

In Kafka, if I increase the number of partitions in a topic, will the order of messages be broken? (I used a key to partition)

Recently, I started to study Kafka and have been thinking about how to adopt it in my service. Some of my messages should be processed in strict order, so I chose to use a key for partitioning on the producer. However, even though we only need one partition right now, we might increase the number of partitions in the near future. So, in Kafka, if I increase the number of partitions in a topic, will consumers still get messages in order?
Thanks in advance.
If you increase partitions, there is no guarantee that future messages with an existing key will land in their prior partition, so for a temporary period, bounded by the topic retention, you will have keys spanning more than one partition (by default).
One workaround is to ensure you've consumed all messages, stop all clients interacting with the topic, then empty the topic and increase the count.
Or you can simply start with the increased count from the beginning, so messages with the same key keep landing in the same partition from day one.
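To see why adding partitions moves keys around: Kafka's default partitioner hashes the serialized key (with murmur2) modulo the number of partitions, so changing that number changes the mapping. A tiny illustrative sketch, using String.hashCode() as a stand-in for the real hash:

    public class KeyMappingDemo {
        // Simplified stand-in for Kafka's default keyed partitioner (which uses murmur2 on the key bytes).
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }

        public static void main(String[] args) {
            String key = "order-42"; // hypothetical message key
            System.out.println("with 1 partition:  " + partitionFor(key, 1)); // always 0
            System.out.println("with 3 partitions: " + partitionFor(key, 3)); // may be 1 or 2 instead
            // Old records for this key stay in the partition they were written to,
            // while new records may land elsewhere until retention removes the old ones.
        }
    }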

In Kafka, should you reduce the number of consumers within a group as the overall lag between the partitions drops?

I have a topic with 100 partitions. Initially, with over 1 billion messages, I scaled out to 100 VMs to consume from each partition in parallel. Now the distribution no longer seems uniform, as the number of messages is down to just a few million. My question is: does it make sense to reduce the number of consuming VMs within my consumer group as the lag drops, or should I always keep 100? My reasoning is that a lot of rebalancing might start to occur and therefore lower my overall throughput of output messages to my sink.
Let's ignore financial cost within this decision.
As long as you don't use keyed messages in your producer, the messages should balance evenly across your partitions; this applies for billions of messages, and also for millions or fewer.
If you use dynamic partition assignment for your consumers (which is the default), changing the number of consumers will cause rebalances and add complication to your project structure.
As long as you take cost out of the discussion, stay with a fixed number of consumers/VMs and make sure the number of partitions is a multiple of the number of consumers; e.g. 200/300/400 partitions would also be fine in your case, because each consumer can subscribe to multiple partitions.
Remember that even with a smaller number of messages, you don't lose anything by keeping more consumers as long as you have enough partitions to balance the work; the only downside is unused compute capacity (a financial consideration).
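If you do keep a fixed fleet but worry about the occasional rebalance (for example when a VM restarts), one option, sketched here with placeholder names, is the cooperative-sticky assignment strategy (available since Kafka 2.4), which only pauses the partitions that actually move during a rebalance:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.util.Collections;
    import java.util.Properties;

    public class StickyGroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "sink-writers");            // placeholder group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Cooperative rebalancing: unaffected partitions keep being consumed while others move.
            props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                    CooperativeStickyAssignor.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("big-topic")); // placeholder topic
            // ... poll loop as usual ...
        }
    }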

Handling a Large Kafka topic

I have a very large (in message count) Kafka topic; it might receive more than 20M messages per second, but the message size is small, just plain text, each under 1 KB. I can use several partitions per topic, and several servers can work on one topic, each consuming one of the topic's partitions...
What if I need 100+ servers for a huge topic?
Is it logical to create 100+ partitions on a single topic?
You should define "large" when talking about Kafka topics:
Large in terms of data volume?
Large message size, so that sending a message from the queue to the client for processing takes time?
Intensive writes to that topic? In that case, do you need reads to be processed as fast as possible, or can processing be delayed (say, by about an hour)?
...
In either case, you should think about the consumer side to design your topics and partitions better. For instance:
If processing each message is slow and you need more overall throughput, create more partitions. It is like a load-balancer-and-workers relationship: you create more workers to get the job done.
If processing is slow only for certain message types, consider moving those types to a separate topic. There is a nice article, "Should You Put Several Event Types in the Same Kafka Topic?", that explains this decision.
Is the order of messages important? For example, if message A happens before message B and must be processed first, then you should make sure all related messages go to the same partition (only within a single partition is message order maintained), or move them to a separate topic with a single partition.
...
Once you have a proper design for topics and partitions, the next question is how many partitions each topic should have. Increasing the total number of partitions increases throughput, but at the same time it can affect availability and latency. There are some good discussions here and here that explain carefully how the total number of partitions per topic affects performance. In my opinion, you should benchmark directly on your own system to choose the right value; it depends on many factors, such as the processing power of the broker machines, network capacity, and memory.
As for the last part, you don't need 100 servers for 100 partitions. Kafka spreads the partitions across the available brokers. For example, if you have one topic with 7 partitions running on 3 servers, two servers will store 2 partitions each and one server will store 3 (2*2 + 3*1 = 7). The partition-to-broker mapping is part of the cluster metadata (historically kept in ZooKeeper; newer Kafka versions can run without ZooKeeper in KRaft mode).
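As a concrete sketch (broker addresses, topic name, and replication factor are illustrative), a topic with 100 partitions can be created on a cluster with far fewer brokers; each broker simply hosts several partitions:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateLargeTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "broker1:9092,broker2:9092,broker3:9092"); // placeholder 3-broker cluster
            try (AdminClient admin = AdminClient.create(props)) {
                // 100 partitions spread across however many brokers exist; replication factor 3 as an example.
                NewTopic topic = new NewTopic("huge-topic", 100, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }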
You will get better help if you are more specific and provide some numbers, such as your expected load per second and the size of each message.
In general, Kafka is pretty powerful: behind the scenes it writes data to a buffer and periodically flushes it to disk. As per a benchmark done by Confluent a while back, a Kafka cluster with 6 nodes supports around 0.8 million messages per second.
Our friends were right; I refer you to this book:
Kafka: The Definitive Guide
by Neha Narkhede, Gwen Shapira & Todd Palino
You can find the answer on page 47:
How to Choose the Number of Partitions
There are several factors to consider when choosing the number of partitions:
What is the throughput you expect to achieve for the topic? For example, do you expect to write 100 KB per second or 1 GB per second?
What is the maximum throughput you expect to achieve when consuming from a single partition? You will always have, at most, one consumer reading from a partition, so if you know that your slower consumer writes the data to a database and this database never handles more than 50 MB per second from each thread writing to it, then you know you are limited to 50 MB/s throughput when consuming from a partition.
You can go through the same exercise to estimate the maximum throughput per producer for a single partition, but since producers are typically much faster than consumers, it is usually safe to skip this.
If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput based on your expected future usage, not the current usage.
Consider the number of partitions you will place on each broker and the available disk space and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for leader elections.
With all this in mind, it's clear that you want many partitions but not too many. If you have some estimate regarding the target throughput of the topic and the expected throughput of the consumers, you can divide the target throughput by the expected consumer throughput and derive the number of partitions this way. So if I want to be able to write and read 1 GB/sec from a topic, and I know each consumer can only process 50 MB/s, then I know I need at least 20 partitions. This way, I can have 20 consumers reading from the topic and achieve 1 GB/sec. If you don't have this detailed information, our experience suggests that limiting the size of the partition on the disk to less than 6 GB per day of retention often gives satisfactory results.
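The sizing rule from the excerpt boils down to one division; a throwaway sketch using the book's own example numbers (1 GB/s target, 50 MB/s per consumer):

    public class PartitionSizing {
        public static void main(String[] args) {
            double targetThroughputMBps = 1000; // want to write and read 1 GB/s (treated as 1000 MB/s)
            double perConsumerMBps = 50;        // slowest consumer handles 50 MB/s
            int partitions = (int) Math.ceil(targetThroughputMBps / perConsumerMBps);
            System.out.println("minimum partitions: " + partitions); // 20, matching the book's example
        }
    }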

Decide when to create new topic or increase partition count

Say I have a Kafka topic with 10 partitions. When the data rate increases, I can increase the number of partitions to speed up my processing logic.
But my doubt is whether increasing the partitions is the right move, or whether I should split the topic (that is, based on my application logic, some data would go to topic 1 and some to topic 2, splitting the data rate across two topics).
Does choosing a new topic rather than increasing partitions, or vice versa, have any performance impact on the Kafka cluster?
Which one is the better solution?
It depends!
It is usually recommended to slightly over-partition topics that are likely to grow in throughput, so you don't have to add partitions later when that happens.
The main reason is that if you're using keyed messages, adding partitions will change the key-to-partition mapping. After adding partitions, messages with a given key won't necessarily go to the same partition as before, which is problematic if you need ordering per key.
Adding partitions is usually easier, as consumers and producers won't need updates; you will simply be able to add consumers to scale. You also keep all events together and only have to worry about a single topic. Depending on the size of your cluster, with only 10 partitions you probably still have a lot of leeway to add partitions; from Kafka's point of view, 10 partitions is pretty small, and you can easily have 50 or even more.
On the other hand, when creating new topics, clients will need to be updated to use them. Nevertheless, that could be a solution if over time you start receiving more types of events and want to reorganize them across several topics.
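If you do decide to grow the existing topic, the change itself is a single admin call; a minimal sketch with a placeholder topic name and target count:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;
    import java.util.Map;
    import java.util.Properties;

    public class GrowTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // Raise the partition count from 10 to 20; keyed messages will start
                // hashing to different partitions after this change.
                admin.createPartitions(Map.of("my-topic", NewPartitions.increaseTo(20))).all().get();
            }
        }
    }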

Kafka Consumer being Starved because of imbalance

I am new to Kafka and think I am missing something about how partitions get balanced across consumers on a topic.
We have 5 partitions and 2 consumers on a topic. The messages have a null key, so I assume Kafka picks a partition for each new record in a round-robin fashion.
This would mean one consumer would be reading from 3 partitions and the other from 2. If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
I think you should have a number of partitions evenly divisible by the number of consumers.
Am I missing something?
The unit of parallelism in consuming Kafka messages is the partition. The routine scenario for consuming Kafka messages is a data stream processing engine like Apache Flink, Spark, or Storm, all of which distribute processing across CPU cores. The rule is that the maximum level of parallelism for each consumer group is the number of partitions. Each consumer instance of a consumer group (say, a CPU core) can consume one or more partitions, while each partition can be consumed by at most one consumer instance of each consumer group.
If you have more CPU cores than partitions, some of them will be idle.
If you have fewer CPU cores than partitions, some of them will consume more than one partition.
And the optimal case is when the number of CPU cores and Kafka partitions are equal.
If my assumption is right (that the records get evenly distributed across partitions) the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
Why would one consumer do nothing? It would still process records from those 2 partitions (assuming, of course, that both consumers are in the same group).
I think you should have a number of partitions evenly divisible by the number of consumers.
Yes, that's right. For maximum parallelism you can have as many consumers as partitions, e.g. in your case 5 consumers would give you maximum parallelism.
There is an assumption built into your understanding that each partition has exactly the same throughput. For most applications, though, that may or may not be true. If you set up your keying/partitioning right, then the partitions should hopefully be close to equal, especially with a large and diverse keyspace if you average them out over a large period of time. But in a more practical, realistic sense, you'll probably have some skew at any given time anyway, and your stream processing setup will need to tolerate that. So having one more partition assigned to a particular consumer is probably not going to make a big difference.
Your understanding is correct. Maybe there is data skew. You can check how many records remain in each partition by using an offset checker or another tool.
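One way to see how the remaining records are spread, sketched below with a placeholder group id and broker address, is to compare each partition's committed offset with its end offset using the AdminClient:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class PartitionLagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets for the consumer group, per partition.
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets("my-group").partitionsToOffsetAndMetadata().get();
                // Latest (end) offsets for the same partitions.
                Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                        admin.listOffsets(request).all().get();
                // Lag = end offset minus committed offset; uneven numbers reveal skew.
                committed.forEach((tp, offset) ->
                        System.out.printf("%s lag=%d%n", tp, ends.get(tp).offset() - offset.offset()));
            }
        }
    }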