How to capture and send data to two topics with different partition keys in Debezium

I want to create a single connector that sends database changes to two different topics with different message keys.
That is, there should be two topics containing the same data but keyed differently.
The partition keys should be derived from the event data.
The table columns are a and b.
topic-1 should use field a of the event as the message key, and topic-2 should use field b.
How can I achieve this?

Related

Set kafka message key to source database name in Debezium Postgresql

We are trying to collect changes from a number of Postgresql databases using Debezium.
The idea is to create a single topic with a number of partitions equal to the number of databases - each database gets its own partition, because order of events matters.
We managed to reroute events to a single topic using topic routing, but to partition events by database I need to set the message key properly.
Question: is there a way to set the Kafka message key to be equal to the source database name?
My thoughts:
Maybe there is a way to set the message key globally in the connector configuration?
The database name can be found in the message, but it's a nested property, payload.source.name. I didn't find a way to extract a value from a nested property.
Any thoughts?
Thank you in advance!
You'd need to write/find a Connect transform that can extract nested fields and set the message key (a sketch of such a transform follows below), or if you don't mind duplicating data within Kafka topics, you can use Kafka Streams / KsqlDB, etc. to do the same.
Overall, I don't think one topic + one partition per database is a good design for scalability of consumers. Sure, it'll keep order, but it's not much overhead to simply create one topic per database with only one partition. Then make consumers read all topics using a regex pattern rather than needing to assign to specific/all partitions in one topic.
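For illustration, here is a minimal sketch of such a custom transform in Java. The class name is mine, and it assumes a standard Debezium envelope where the source database name sits at source.name in the value struct:

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class DatabaseNameAsKey<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        if (value == null) {
            return record; // let tombstones pass through unchanged
        }
        // Debezium envelopes carry source metadata in the nested "source" struct.
        Struct source = value.getStruct("source");
        String dbName = source.getString("name");
        // Re-key the record with the database name, keeping everything else.
        return record.newRecord(record.topic(), record.kafkaPartition(),
                Schema.STRING_SCHEMA, dbName,
                record.valueSchema(), record.value(), record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You would package this onto the Connect worker classpath and reference it from the connector configuration via the usual transforms / transforms.<name>.type properties.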

Kafka-how to send messages to specific partition based on a table's field value via Debezium configuration

Is it possible to send messages to a specific partition based on a table field's value? For example, I have a column called customer, which has 4 values, say customer1, customer2, customer3, customer4. I want to send each to its corresponding partition.
Is it possible to achieve this in the Debezium configuration?
By default, Debezium will write Kafka records into partitions based on the record key, e.g. the database row's id. There's no guarantee that "customer1" goes to "partition 1", or that two customers will end up in the same partition (e.g. you may have more customers than partitions).
To explicitly map the data to numbered partitions, you'll need to implement your own Partitioner Java interface, add it to the Connect worker classpath, and set producer.override.partitioner.class in the Debezium config (a sketch follows below).
Or you can just let the producer partition based on the key of the records, as is expected.
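As a sketch of that approach (the class name, the fixed customer-to-partition mapping, and the assumption that the record key is the customer name are all illustrative, not a general implementation):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class CustomerPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Pin each known customer to a fixed partition of a 4-partition topic.
        switch (String.valueOf(key)) {
            case "customer1": return 0;
            case "customer2": return 1;
            case "customer3": return 2;
            case "customer4": return 3;
            default:          return 0; // unknown keys fall back to partition 0
        }
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You would then point producer.override.partitioner.class at this class in the connector config, which requires connector client config overrides to be enabled on the worker.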

How to sync data for a particular user, when reading from kafka?

I have a streaming service using Kafka, where I receive data from multiple users. I want each user's data to be processed synchronously (in order), while different users' data can be processed asynchronously. Is there any standard pattern for such scenarios?
You can achieve so, by using userId as the key while publishing the message to kafka.
Keys are used to ensure that messages published to Kafka with a particular key are ordered, by pushing them all into a single partition.
And since each partition is assigned to exactly one consumer within a consumer group (a partition is never shared among consumers of the same group), that consumer will consume the data from the partition in the sequence it was pushed.
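A minimal producer sketch of this idea (the topic name, broker address, and example payloads are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42";
            // Same key => same partition => strict ordering for this user's events.
            producer.send(new ProducerRecord<>("user-events", userId, "{\"action\":\"login\"}"));
            producer.send(new ProducerRecord<>("user-events", userId, "{\"action\":\"purchase\"}"));
        }
    }
}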

Two different types of partitions in kafka producer

In a Kafka producer, I am sending two different sets of data, and I have two partitions for the topic. The first set is sent with a key and the second one without a key. As far as I know, the key is used to assign data to partitions; if the key is absent, null is sent and partitioning happens via round-robin scheduling.
But what happens if I alternate between sending data with and without a key for some period of time?
Will the round-robin distribution exclude the partition that the keyed messages hash to, or will it span both partitions?
Kafka selects the partition according to the rules below:
If a custom Partitioner is configured, the partition is selected by that Partitioner's logic.
If there is no custom Partitioner, Kafka uses the DefaultPartitioner:
a. If the key is null, the partition is selected round-robin.
b. If the key is non-null, it uses a murmur2 hash of the key, modulo the number of partitions, to pick the partition for the topic (see the sketch below).
So with the DefaultPartitioner and no custom Partitioner defined, messages (whether the key is null or not) can get published to both partitions.
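To make rule (b) concrete, here is a small sketch using Kafka's own Utils helpers, which is essentially what the DefaultPartitioner does for a non-null key (the key "123" and the partition count of 2 are example values):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    public static void main(String[] args) {
        byte[] keyBytes = "123".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 2;
        // murmur2 hash of the key bytes, forced positive, modulo the partition count.
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key 123 -> partition " + partition);
    }
}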
To publish a message to a specific partition, you can use one of the methods below.
Pass partition explicitly while publishing a message
/**
 * Creates a record to be sent to a specified topic and partition
 */
public ProducerRecord(String topic, Integer partition, K key, V value) {
    this(topic, partition, null, key, value, null);
}
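A usage sketch, assuming a configured KafkaProducer<String, String> named producer (topic, key, and value are placeholders):

// Explicitly target partition 1 of my-topic, regardless of the key.
producer.send(new ProducerRecord<>("my-topic", 1, "some-key", "some-value"));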
Alternatively, you can create a custom Partitioner and implement the logic to select the partition (see the interface Javadoc):
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/Partitioner.html
One correction: you said that the key is used to make partitions for the data. The key attached to a message is actually there to guarantee message ordering for a specific field.
If key=null, data is sent round-robin (to different partitions, and thus to different brokers in a distributed environment, but of course within the same topic).
If a key is sent, then all messages for that key will always go to the same partition.
Explanation and example:
A key can be any string, integer, etc. Take an integer employee_id as an example key.
employee_id 123 will then always go to, say, partition 0, and employee_id 345 always to partition 1. This is decided by the key-hashing algorithm, which depends on the number of partitions.
If you don't send any key, the message can go to any partition, chosen in a round-robin fashion.
Kafka has a very organized scenario when it comes to sending and storing records in partitions. As you mentioned, the key is used so that records with the same key go to the same partition, which maintains the chronology of those messages within the topic.
In your case, the two partitions will store the data as follows:
Keyed records: every record carrying a particular key will always hash to the same partition, so all records for that key stay together and in order.
Null-keyed records: these are distributed in a round-robin fashion, so they can land on either partition, including the one that also receives keyed records.

What's the purpose of Kafka's key/value pair-based messaging?

All of the examples of Kafka producers show the ProducerRecord's key/value pair as not only being the same type (all examples show <String, String>), but the same value. For example:
producer.send(new ProducerRecord<String, String>("someTopic", Integer.toString(i), Integer.toString(i)));
But in the Kafka docs, I can't seem to find where the key/value concept (and its underlying purpose/utility) is explained. In traditional messaging (ActiveMQ, RabbitMQ, etc.) I've always fired a message at a particular topic/queue/exchange, but Kafka is the first broker that seems to require key/value pairs instead of just a regular old string message.
So I ask: What is the purpose/usefulness of requiring producers to send KV pairs?
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows the system to scale out.
Keys are used to determine the partition within a log to which a message gets appended, while the value is the actual payload of the message. The examples are actually not very "good" in this regard; usually you would have a complex type as the value (like a tuple type, JSON, or similar) and you would extract one field to use as the key.
See: http://kafka.apache.org/intro#intro_topics and http://kafka.apache.org/intro#intro_producers
In general the key and/or value can be null, too. If the key is null, a partition will be selected effectively at random. If the value is null, it can have special "delete" semantics if you enable the log-compaction cleanup policy instead of log retention for a topic (http://kafka.apache.org/documentation#compaction).
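A short sketch of both points, assuming a configured KafkaProducer<String, String> named producer and a users topic with the compact cleanup policy (all names are placeholders):

// A complex JSON value with one field promoted to the record key.
producer.send(new ProducerRecord<>("users", "user-42", "{\"id\":\"user-42\",\"name\":\"Alice\"}"));
// A null value is a tombstone: under log compaction it marks earlier
// values for key "user-42" for deletion.
producer.send(new ProducerRecord<>("users", "user-42", null));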
Late addition... Specifying a key, so that all messages with the same key go to the same partition, is very important for proper ordering of message processing if you will have multiple consumers in a consumer group on a topic.
Without a key, two messages concerning the same logical entity could go to different partitions and be processed by different consumers in the group out of order.
Another interesting use case
We could use the key attribute of Kafka records to carry user_ids and then plug in a consumer to fetch the streaming events (stored in the value attribute). This would allow you to process the full history of a user's event sequence for building features for your machine-learning models.
I still have to find out whether this is possible or not, and will keep updating my answer with further details.