Kafka: how to send messages to a specific partition based on a table's field value via Debezium configuration

Is it possible to send messages to a specific partition based on a table field value? For example, I have a column called customer, which has 4 values, say customer1, customer2, customer3 and customer4. I want each customer's messages to go to a corresponding partition.
Is it possible to achieve this in the Debezium configuration?

By default, Debezium writes Kafka records into partitions based on the record key, e.g. the database row's primary key. There's no guarantee that "customer1" goes to "partition 1", or that two customers won't end up in the same partition (e.g. you may have more customers than partitions).
To explicitly map the data to numbered partitions, you'll need to implement the Kafka producer's Partitioner Java interface, add the class to the Connect worker classpath, and set producer.override.partitioner.class in the Debezium connector config.
Or you can just let the producer partition based on the key of the records, as is expected.
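As a rough illustration of the mapping logic such a custom partitioner would need, here is a minimal sketch in Python. A real partitioner would be a Java class implementing org.apache.kafka.clients.producer.Partitioner; the CUSTOMER_PARTITIONS table, the choose_partition name and the crc32 fallback are all assumptions made for this example, and it assumes the customer value has already been extracted from the record.

```python
import zlib

# Hypothetical explicit mapping: the customer names come from the question,
# the partition numbers are illustrative.
CUSTOMER_PARTITIONS = {
    "customer1": 0,
    "customer2": 1,
    "customer3": 2,
    "customer4": 3,
}


def choose_partition(customer: str, num_partitions: int) -> int:
    """Return the pinned partition for a known customer, else hash the name."""
    pinned = CUSTOMER_PARTITIONS.get(customer)
    if pinned is not None and pinned < num_partitions:
        return pinned
    # Fallback for unknown customers: a stable hash. crc32 is only a stand-in,
    # not the murmur2 hash that Kafka's built-in partitioner uses.
    return zlib.crc32(customer.encode("utf-8")) % num_partitions


print(choose_partition("customer1", 4))
print(choose_partition("someone-else", 4))
```

The fallback matters because new customers can appear in the table at any time, and a partitioner must still return a valid partition for them.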

Related

How to capture and send data to two topics with different partition key in Debezium

I want to create a single connector that sends database changes to two different topics with different message keys.
So there should be two topics containing the same data but with different message keys.
The partition keys should be taken from the event data.
The table columns are a and b.
topic-1 should use field a of the event as the message key, and topic-2 should use field b.
How can I ensure this?

Set kafka message key to source database name in Debezium Postgresql

We are trying to collect changes from a number of Postgresql databases using Debezium.
The idea is to create a single topic with a number of partitions equal to the number of databases - each database gets its own partition, because order of events matters.
We managed to reroute events to a single topic using topic routing, but to be able to partition events by databases I need to set message key properly.
Question: Is there a way to set the Kafka message key to the source database name?
My thoughts:
Maybe there is a way to set the message key globally in the connector configuration?
The database name can be found in the message, but it's a nested property, payload.source.name. I didn't find a way to extract a value from a nested property.
Any thoughts?
Thank you in advance!
You'd need to write or find a Connect transform that can extract nested fields and set the message key. Alternatively, if you don't mind duplicating data within Kafka topics, you can use Kafka Streams, ksqlDB, etc. to do the same.
Overall, I don't think one topic + one partition per database is a good design for scalability of consumers. Sure, it'll keep order, but it's not much overhead to simply create one topic per database with only one partition. Then make consumers read all topics using a regex pattern rather than needing to assign to specific/all partitions in one topic.
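To make the "extract a nested field and use it as the key" step concrete, here is a small Python sketch of the logic such a transform (or a stream-processing job) would perform. The nested_get helper and the sample event are hypothetical; only the payload.source.name path comes from the question.

```python
from typing import Any, Mapping


def nested_get(record: Mapping[str, Any], path: str) -> Any:
    """Walk a dotted path such as 'payload.source.name' through nested dicts."""
    current: Any = record
    for part in path.split("."):
        current = current[part]  # raises KeyError if a level is missing
    return current


# Illustrative event shape: only the fields needed here are shown,
# and "inventory_db" is a made-up name.
event = {"payload": {"source": {"name": "inventory_db"}, "op": "c"}}

# The extracted value would become the Kafka message key.
message_key = nested_get(event, "payload.source.name")
print(message_key)
```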

What happens when I partition data by key and then later on add a new partition to the topic in Kafka?

Will there be any change to the existing record? And how will the future data be partitioned?
Partitioning of existing data doesn't change when new partitions are added to a topic. Kafka will not attempt to redistribute existing records; the change only affects new records. Note that by default, Kafka partitions keyed records using hash(key) % noOfPartitions, so records with the same key go to the same partition as long as the partition count stays the same. After partitions are added, a given key may map to a different partition than before, so per-key ordering across old and new records is no longer guaranteed. Records without a key are spread across partitions (round-robin in older clients, sticky batching since Kafka 2.4).
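The effect of changing the partition count on hash(key) % noOfPartitions can be seen with a small Python sketch. It uses zlib.crc32 as a stand-in hash (Kafka's default partitioner uses murmur2), and the key names and partition counts are made up for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # crc32 is a stand-in for Kafka's murmur2; the modulo step is the point here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer{i}" for i in range(100)]

# Partition chosen for each key before and after growing the topic from 4 to 5 partitions.
before = {k: partition_for(k, 4) for k in keys}
after = {k: partition_for(k, 5) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys map to a different partition after the change")
```

Existing records stay where they were written; only new records for the "moved" keys land somewhere else.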

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers, and a worker size of 3.
I can see an uneven distribution of messages across the partitions: one partition holds a lot of data while another is empty.
How can I make my producer distribute the load evenly across all partitions, so that every partition is properly utilized?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
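The three-step strategy quoted from the JavaDoc can be sketched in Python as follows. This is a simplified model, not the actual DefaultPartitioner code: the default_partition name is made up, crc32 stands in for murmur2, and the no-key branch is plain round-robin without the batching optimizations mentioned above.

```python
import itertools
import zlib
from typing import Optional

# Shared counter used for the no-key, no-partition case.
_round_robin = itertools.count()


def default_partition(partition: Optional[int], key: Optional[bytes], num_partitions: int) -> int:
    # 1. An explicitly specified partition always wins.
    if partition is not None:
        return partition
    # 2. Otherwise hash the key (stand-in hash, see the note above).
    if key is not None:
        return zlib.crc32(key) % num_partitions
    # 3. No partition and no key: rotate through the partitions.
    return next(_round_robin) % num_partitions


print(default_partition(7, b"any-key", 10))
print(default_partition(None, b"same-key", 10), default_partition(None, b"same-key", 10))
print(default_partition(None, None, 10), default_partition(None, None, 10))
```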
Instead of relying on the default partitioner class, you can give the producer record an explicit partition number so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<>(topicName, partitionNumber, key, value);
It seems like your problem is uneven consumption of messages rather than uneven production of messages to the Kafka topic. In other words, your number of reading threads doesn't match the number of partitions you have (they do not need to match 1:1, though; each consumer thread only needs to read from the same number of partitions).
See short explanation for more details.
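For the numbers in the question (10 partitions, 4 consumers), an even split is not possible. A minimal sketch, assuming partitions are divided as evenly as possible among the consumers in a group (assign_evenly is a hypothetical helper, not a Kafka API):

```python
from typing import List


def assign_evenly(num_partitions: int, num_consumers: int) -> List[List[int]]:
    """Split partition ids 0..num_partitions-1 into contiguous, near-equal chunks."""
    base, extra = divmod(num_partitions, num_consumers)
    assignment: List[List[int]] = []
    start = 0
    for consumer in range(num_consumers):
        # The first `extra` consumers each take one additional partition.
        size = base + (1 if consumer < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


sizes = [len(parts) for parts in assign_evenly(10, 4)]
print(sizes)  # [3, 3, 2, 2]
```

Two consumers read from three partitions and two read from only two, so some consumers see more traffic even if the producer spreads messages evenly.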
You can make use of the key parameter of the producer record. Records with the same key always go to the same partition. I don't know the structure of your producer record, but since you have 10 partitions you can simply use n % 10 as the record key, where n is a counter that you increment for each record.
For record 0 the key will be 0; Kafka hashes the key and puts the record in some partition. For record 1 the key will be 1, which is hashed to a partition in the same way, and so on. Note that the hash does not guarantee that the 10 keys land on 10 different partitions.
This way you approximate round-robin on your producer records. The key is independent of the fields in your record, so you only need a variable n and a key of n % 10.
Or you can specify the partition in your producer record. So either you use the key or the partition field of the producer record.
Suppose you have defined a custom partitioner that derives the partition from the record, where the Kafka key is a string and the value is a Student POJO.
Say the partition is chosen from the student's country field. Imagine there are 10 partitions in the topic and, for example, the country "India" maps to partition number 5.
Whenever the country is "India", the record always goes to partition number 5 (as long as the number of partitions has not changed).
If your pipeline receives a lot of records with the country "India", all of them will go to partition number 5, and you will see an uneven distribution across the Kafka partitions.
In my case, I used the default partitioner but still had much much more records in one partition than in others. The problem was I unexpectedly had many records with the same key. Check your keys!
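One quick way to "check your keys" is to count how often each key occurs in a sample of records. A minimal sketch (key_distribution and the sample data are made up for illustration):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def key_distribution(keys: Iterable[str], top: int = 3) -> List[Tuple[str, int, float]]:
    """Return the most common keys with their count and share of all records."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count, count / total) for key, count in counts.most_common(top)]


# Illustrative sample: one key dominates, so its partition will too.
sample_keys = ["a"] * 8 + ["b", "c"]
report = key_distribution(sample_keys)
for key, count, share in report:
    print(f"{key}: {count} records ({share:.0%})")
```

If a single key accounts for most of the records, no hashing scheme will spread that load, because all records with that key must go to one partition.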
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over my records to produce and do for example:
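for index, message in enumerate(messages):
    # Faust's send() is a coroutine with keyword-only arguments; run this inside an async function
    await topic.send(value=message, partition=index % num_partitions)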
I.e. bound my index to within the range of partitions I have.
There could still be unevenness - consider you repeatedly run this but your number of records is less than your num_partitions - then your first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random
# Start from a random partition so short batches don't always favor the first partitions
initial_partition = random.randrange(num_partitions)
for index, message in enumerate(messages):
    await topic.send(value=message, partition=(initial_partition + index) % num_partitions)

Is it possible to create a kafka topic with dynamic partition count?

I am using kafka to stream the events of page visits by the website users to an analytics service. Each event will contain the following details for the consumer:
user id
IP address of the user
I need very high throughput, so I decided to partition the topic using userId-ipAddress as the partition key,
i.e.
for userId 1000 and IP address 10.0.0.1, the event will have
the partition key "1000-10.0.0.1".
In this use case the partition key is dynamic, so I cannot specify the number of partitions upfront when creating the topic.
Is it possible to create a topic in Kafka with a dynamic partition count?
Is it good practice to use this kind of partitioning, or is there another way this can be achieved?
It's not possible to create a Kafka topic with a dynamic partition count. When you create a topic you have to specify the number of partitions. You can increase it later manually using Kafka's command-line tools (for example kafka-topics.sh --alter), but you cannot decrease it.
But I don't understand why you need a dynamic partition count in the first place. The partition key is not related to the number of partitions. You can use your partition key with ten partitions or with a thousand partitions. When you send a message to a Kafka topic, Kafka must send it to a specific partition. Every partition is identified by its ID, which is simply a number. Kafka computes something like this:
partition_id = hash(partition_key) % number_of_partition
and it sends the message to partition partition_id. If you have far more users than partitions you should be OK. More suggestions:
Use userId as the partition key. You probably don't need the IP address as part of the partition key. What is it good for? Typically you need all messages from a single user to end up in a single partition. If the IP address is part of the partition key, the messages from a single user could end up in multiple partitions. I don't know your use case, but in general that's not what you want.
Measure how many partitions you need to process all messages. Then create, let's say, ten times more partitions. You can create more partitions than you actually need; Kafka handles moderate over-provisioning well, although each partition does add some overhead (open files, metadata, leader elections), so don't go to extremes. See How to choose the number of topics/partitions in a Kafka cluster?
Right now you should be able to process all messages in your system. If traffic grows, you can add more Kafka brokers and use the Replication tools to change leaders/replicas for partitions. If the traffic grows more than ten times, you must create new partitions.
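The difference between keying by userId alone and keying by userId-ipAddress can be illustrated with the partition_id formula above. This is a sketch only: crc32 stands in for Kafka's actual hash, and the partition count and IP addresses are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Same shape as partition_id = hash(partition_key) % number_of_partition,
    # with crc32 as a stand-in hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 10
user_id = "1000"
ips = [f"10.0.0.{i}" for i in range(1, 21)]  # one user seen from 20 addresses

# Keyed by userId only: every event of this user maps to one partition.
by_user = {partition_for(user_id, NUM_PARTITIONS) for _ in ips}

# Keyed by userId-ipAddress: the same user's events are scattered.
by_user_and_ip = {partition_for(f"{user_id}-{ip}", NUM_PARTITIONS) for ip in ips}

print(f"userId key -> {len(by_user)} partition(s)")
print(f"userId-ip key -> {len(by_user_and_ip)} partition(s)")
```

In both cases the number of partitions is fixed in advance; only the spread of a single user's events changes.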