How does the .poll method work? - apache-kafka

I am using the .poll method in my application. I have a lot of messages in my lag, but why does a single call to .poll return only a few of them? I only have one topic and 5 partitions; as of now, all the data goes to only one partition.
Spring-kafka 1.3.9 release
Kafka -> 1.0

You can increase the maximum number of records fetched by raising max.poll.records, which defaults to 500. This config limits the total number of records returned from a single call to poll().
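A minimal sketch with the plain Java consumer, assuming Kafka 1.0 clients (with spring-kafka the same property can be supplied through the consumer configuration); the broker address, group id and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                     // placeholder group
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);                   // raise the per-poll cap from the default 500

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));                 // placeholder topic
ConsumerRecords<String, String> records = consumer.poll(1000);             // returns at most max.poll.records records

Keep in mind that fetch.max.bytes and max.partition.fetch.bytes also cap how much data a single poll can bring back.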
all the data goes to only one partition
Well, it depends on the key of your message.
If you are not providing a key, your messages will be distributed across the partitions.
If you are providing a key, the key is hashed and all messages with the same key will go to the same partition.
https://kafka.apache.org/documentation/#consumerconfigs
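As a quick, hedged illustration of the keyed and unkeyed cases above with the Java producer (broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// No key: the default partitioner spreads these records over the topic's partitions.
producer.send(new ProducerRecord<>("my-topic", "value-without-key"));

// With a key: every record keyed "user-1" is hashed to the same partition, preserving per-key order.
producer.send(new ProducerRecord<>("my-topic", "user-1", "value-with-key"));

producer.close();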

Related

Kafka topic with multiple sources

Say I have 1 Kafka topic with 1 partition, and multiple sources post to that same partition. What happens if 2 servers try to post to the same partition at the same time? Would the information from both servers get mixed together, or would one of them wait until the other finishes?
The producers will mix the messages in the partition.
In theory, events are guaranteed to be appended in order per partition, per producer. But if we are talking about multiple producers, the behaviour will also depend on the configuration set on the producer side, in particular max.in.flight.requests.per.connection = 1. The reason is that if there are multiple in-flight requests and the first one fails, the second may get appended to the log earlier, thus breaking the ordering.
Have a glance at https://blog.softwaremill.com/does-kafka-really-guarantee-the-order-of-messages-3ca849fd19d2
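A hedged sketch of the producer settings that discussion is about; only max.in.flight.requests.per.connection is mentioned above, the other two are the settings commonly paired with it for ordered, durable delivery (bootstrap servers and serializers are omitted here):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// Retries can reorder batches unless only one request is in flight per broker connection.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.ACKS_CONFIG, "all");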
If the keys somehow happen to be the same for both sources and every record, all of them will be recorded in the same partition (the other partitions will remain empty).
If each source has its own key, different from the other sources, and uses that key for every message it sends, then messages from different sources will usually be recorded in different partitions (provided the partition count is at least the source count and the key hashes do not collide).
If each record has a different key, regardless of its source, Kafka will still mix them across partitions, as far as I know.
In short, keys determine the partition of a message. Records with the same key go to the same partition. If every record has a unique key, the key hashing spreads the incoming messages out so that each partition ends up with almost the same number of records (and with no key at all, the producer falls back to round-robin).

In Kafka, if I increase the number of partitions in a topic, will the order of messages be broken? (I used a key for partitioning)

Recently I started to study Kafka and have been thinking about how to adopt it in my service. Some of my messages must be processed in strict order, so I chose to use a key for partitioning on the producer. However, even though we only need one partition right now, we might increase the number of partitions in the near future. So, in Kafka, if I increase the number of partitions in a topic, will consumers still get the messages in order?
Thanks in advance.
If you increase partitions, there's no guarantee that future messages with the same key will land in their prior partition, so you'll experience a temporary period, based on topic retention, where a key can span more than one partition (by default).
One workaround is to ensure you've consumed all messages, stop all clients interacting with the topic, then empty the topic and increase the count.
Or you can start with the larger partition count to begin with; then equal keys consistently map to the same partition while the key space as a whole is spread over multiple partitions.
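To see why the mapping changes when the partition count changes, here is a small sketch that mirrors the DefaultPartitioner's hashing for keyed records (murmur2 of the key bytes, modulo the partition count); the key used here is made up:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    // Same formula the DefaultPartitioner applies to keyed records.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("order-42", 1)); // always 0 while the topic has a single partition
        System.out.println(partitionFor("order-42", 4)); // may be any of 0..3 once the topic has 4 partitions
    }
}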

Autoscaling with KAFKA and non-transactional databases

Say I have an application that reads a batch of data from Kafka; it uses the keys of the incoming messages to query HBase (reads the current data from HBase for those keys), does some computation and writes data back to HBase for the same set of keys. For example:
{K1, V1}, {K2, V2}, {K3, V3} (incoming messages from Kafka) --> My application reads the current values of K1, K2 and K3 from HBase, uses the incoming values V1, V2 and V3 to do some computation, and writes the new values for K1 (V1+x), K2 (V2+y) and K3 (V3+z) back to HBase after the processing is complete.
Now, let’s say I have one partition for the KAFKA topic and 1 consumer. My application has one consumer thread that is processing the data.
The problem is that, say, HBase goes down, at which point my application stops processing messages and a huge lag builds up in Kafka. Even though I have the ability to increase the number of partitions and, correspondingly, the number of consumers, I cannot increase either because of race conditions in HBase. HBase doesn't support row-level locking, so if I increase the number of partitions, the same key could go to two different partitions and therefore to two different consumers, who may end up in a race condition where whoever writes last wins. I would have to wait until all the messages are processed before I can increase the number of partitions.
For example:
HBase goes down --> initially I have one partition for the topic and there is an unprocessed message {K3, V3} in partition 0 --> now I increase the number of partitions and messages with key K3 can land in, say, partition 1 as well --> then the consumer reading partition 0 and the consumer reading partition 1 end up competing to write to HBase.
Is there a solution to this problem? Of course, having the consumer that processes the message lock key K3 is not a solution, since we are dealing with big data.
When you increase the number of partitions, only new messages go to the newly added partitions; messages that were already written are not moved, and each message lives in exactly one partition and is delivered to one consumer within a group.
A message will appear in one and only one Kafka partition. The producer uses a hash function on the message key, modulo the number of partitions. I believe this guarantee solves your problem.
But bear in mind that if you change the number of partitions, the same message key could be allocated to a different partition. That matters if you care about the ordering of messages, which is only guaranteed per partition. If you care about ordering, repartitioning (e.g. increasing the number of partitions) is not an option.
As Vassilis mentioned, Kafka guarantees that a single key will only go to one partition.
There are different strategies for how keys are distributed over partitions.
When you increase the partition count or change the partitioning strategy, a rebalance can occur, which may affect running consumers. If you stop the consumers for a while, you can avoid the possibility of the same key being processed by two consumers.

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers, and a worker size of 3.
I can see an uneven distribution of messages across the partitions: one partition has a great deal of data while another is nearly empty.
How can I make my producer distribute the load evenly across all the partitions, so that all of them are utilized properly?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
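If neither behaviour suits you, the producer also accepts a custom implementation through partitioner.class. A hedged sketch of one that ignores the key entirely and picks a random partition, so distribution stays even regardless of your keys (at the cost of per-key ordering):

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class RandomPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Ignore the key and spread records uniformly at random over all partitions of the topic.
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}

It would be registered with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RandomPartitioner.class.getName()) on the producer.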
Instead of relying on the default partitioner class, you can set a partition number on the producer record so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, partitionNumber, key, value);
It seems like your problem is uneven consumption of messages rather than uneven production of messages to the Kafka topic. In other words, the number of consuming threads doesn't match the number of partitions you have (they don't need to match 1:1, but each consumer thread should get roughly the same number of partitions to read from).
See short explanation for more details.
You can make use of the key parameter of the producer record. The point is that, for a specific key, the data always goes to the same partition. I don't know the structure of your producer record, but since you said you have 10 partitions, you can simply use n % 10 as your producer record key,
where n runs from 0 to 9. So the key for record 0 will be "0", Kafka will hash that key and put the record in some partition, say partition 0; for record 1 the key will be "1" and the record will go to another partition, and so on.
This way you can apply a round-robin-like scheme to your producer records: the key is independent of the fields in your record, so you can keep a counter n and use n % 10 as the key.
Or you can specify the partition in your producer record explicitly. So you either use the key or the partition field of the producer record.
If you have defined a partitioner based on the record: let's say the Kafka key is a string and the value is a Student POJO.
In the Student POJO, let's say I want to route to a specific partition based on the student's country field. Imagine there are 10 partitions in the topic and, for example, the country in the value is "India", and based on "India" we get partition number 5.
Whenever the country is "India", Kafka will pick partition number 5, and that record will always go to partition 5 (as long as the partition count has not changed).
Now say that lots of the records coming through your pipeline have the country "India"; all of those records will go to partition number 5, and you will see an uneven distribution across the Kafka partitions.
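A hedged sketch of what such a value-based partitioner could look like; the Student class and its getCountry() accessor are assumptions taken from the description above:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical value POJO from the answer above.
class Student {
    private final String country;
    Student(String country) { this.country = country; }
    String getCountry() { return country; }
}

public class CountryPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Derive the partition from the value's country field: every "India" record
        // lands on the same partition, which is exactly the skew described above.
        Student student = (Student) value;
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return Utils.toPositive(student.getCountry().hashCode()) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}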
In my case, I used the default partitioner but still had far more records in one partition than in the others. The problem was that I unexpectedly had many records with the same key. Check your keys!
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over the records I want to produce and do, for example:
for index, message in enumerate(messages):
    topic.send(message, partition=index % num_partitions)
I.e. I bound the index to within the range of partitions I have.
There could still be unevenness - suppose you run this repeatedly but your number of records is less than num_partitions - then your first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random

initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
    topic.send(message, partition=(initial_partition + index) % num_partitions)

Why is data not evenly distributed among partitions when a partitioning key is not specified?

Is this explanation still valid in Kafka 0.10?
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. So, if there are fewer producers than partitions, at a given point of time, some partitions may not receive any data. To alleviate this problem, one can either reduce the metadata refresh interval or specify a message key and a customized random partitioner. For more detail see this thread http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
From here https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified?
The new producer has changed to use a round-robin policy. That is to say, messages will be delivered to all partitions evenly if no key is specified.