All. Forgive me I was just learning the Apache Kafka. When I was reading the document of Kafka. It mentioned a phrase named semantic partition function.
As the document says.
Producers publish data to the topics of their choice. The producer is
responsible for choosing which record to assign to which partition
within the topic. This can be done in a round-robin fashion simply to
balance load or it can be done according to some semantic partition
function (say based on some key in the record). More on the use of
partitioning in a second!
What does it mean semantic partition in Kafka? So far I didn't found any more about it in the document. Could someone please help to explain more about it for better understanding? Thanks.
When the producer doesn't specify a key for messages, the round robin fashion is used. When a key is specified, the DefaultPartitioner just process an hash on the key (module the number of partitions). If you want, you can use your own partitioner class. The documentation wants just to say that : that the "semantic" for defining the destination partition is up to you, you can develop the "function" (really a partitioner class). For example, instead of using the Kafka key in the message you could have a payload, let me say a JSON, with some data and you want to use one of this info for processing the right destination partition.
Related
As I read from Debezium's FAQs, it is said that:
Most connectors will record all events for a single database table to a single topic. Additionally, all events within a topic are totally-ordered, meaning that the order of all of those events will be maintained.
How are events for a database organized?
However, AFAIK Apache Kafka only has ordering guarantees within a single partition. So if I expect the events in a topic to be ordered, I have to set that topic having only one partition, which sacrifices the throughput of Kafka, otherwise with other mechanism. But I didn't see any explanation about this in Debezium's documentation.
My question is, how does Debezium implement the ordering guarantees within one topic? Or which module of the source code should I study to find out the detailed implementation of this feature?
to quote the answer here:
... Kafka Connect’s producer will use the default partitioning logic that computes the partition using a consistent hash of the message key, which in Debezium’s case is a struct containing the affected row’s primary/unique key...
So if the concern is that the same row/document should not be read out of order, then the concern is ruled out because the PK will always send the the event to the same partition
I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions of the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
.stream<Any, Any>(inputTopic)
.to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition of the input record?
I checked the Processor API that allows to access the partition of the input record through a ProcessorContext but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (that is one of the KStream::to arguments) has StreamPartitioner as one of its member.
You could try following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In above code only ByteArray Serdes are used so any special serialization or deserialization happens.
Firstly, messages are distributed among partitions based on Key. A message with similar key would always go in the same partition.
So if your messages have keys then you don't need to worry about it at all. As long as you have similar number of partitions as your original topic; it would be taken care of.
Secondly, if you are copying data to another topic as it is then you should consider using the original topic instead. Kafka has notion of consumer-groups.
For example, you have a topic 'transactions' then you can have consumer-groups i.e. 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. Consumer-groups would read the same topic and filter out events that are meaningful to them and process them.
You can also create 3 topics and achieve the same result. Though, it's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.
All of the examples of Kafka | producers show the ProducerRecord's key/value pair as not only being the same type (all examples show <String,String>), but the same value. For example:
producer.send(new ProducerRecord<String, String>("someTopic", Integer.toString(i), Integer.toString(i)));
But in the Kafka docs, I can't seem to find where the key/value concept (and its underlying purpose/utility) is explained. In traditional messaging (ActiveMQ, RabbitMQ, etc.) I've always fired a message at a particular topic/queue/exchange. But Kafka is the first broker that seems to require key/value pairs instead of just a regulare 'ole string message.
So I ask: What is the purpose/usefulness of requiring producers to send KV pairs?
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows to scale-out the system.
Keys are used to determine the partition within a log to which a message get's appended to. While the value is the actual payload of the message. The examples are actually not very "good" with this regard; usually you would have a complex type as value (like a tuple-type or a JSON or similar) and you would extract one field as key.
See: http://kafka.apache.org/intro#intro_topics and http://kafka.apache.org/intro#intro_producers
In general the key and/or value can be null, too. If the key is null a random partition will the selected. If the value is null it can have special "delete" semantics in case you enable log-compaction instead of log-retention policy for a topic (http://kafka.apache.org/documentation#compaction).
Late addition... Specifying the key so that all messages on the same key go to the same partition is very important for proper ordering of message processing if you will have multiple consumers in a consumer group on a topic.
Without a key, two messages on the same key could go to different partitions and be processed by different consumers in the group out of order.
Another interesting use case
We could use the key attribute in Kafka topics for sending user_ids and then can plug in a consumer to fetch streaming events (events stored in value attributes). This could allow you to process any max-history of user event sequences for creating features in your machine learning models.
I still have to find out if this is possible or not. Will keep updating my answer with further details.
I am trying to send the message to KafkaProducer using ProducerRecord.
new ProducerRecord(topicName,messageKey,message)
This uses DefaultPartitioner, DefaultPartitioner will use the hash of the key to ensure that all messages for the same key go to same Partition.
What is the difference between this, and using CustomPartitioner? I hope Custom Partitioner also used to send the message to same partition based on Key.
The default partitioning strategy is
If a partition is specified in the record, use it
If no partition is specified but a key is present choose a partition based on a hash of the key
If no partition or key is present choose a partition in a round-robin fashion
(This is pulled from the DefaultPartitioner source code)
The custom partitioner just lets you set your own strategy. So you could for example assign partitions randomly or if you somehow have prior knowledge of how large the partition will be assign it based off that. The default part of DefaultPartitioner is more about the round robin strategy. I'd imagine in most/all situations option 1 and 2 would be considered the norm.
I need to understand something about kafka:
When I have a single kafka broker on a single host - is there any sense to have it have more than one partition for the topics? I means even if my data can be distinguished with some key (say tenant id) - what is the benefit of doing it on a single kafka broker? does this give any parallelism , if so how?
When a key is used, is this means that each key is mapped to a given partition? Does the number of partitions for a topic must be equal to the number of possible values for the key I specified? OR is this just a hash and so the number of partitions doesnt have to be equal?
From what I read, topics are created due to types of messages to be places in kafka. But in my case, i have 2 topics I have created since I have 2 types of consumption: one for reading one by one message. the second in case of a bulk of messages comes into the queue (application reasons) and then it is being entered into the second topic. Is that a good design although the messages type is the same? any other practice for such a scansion?
Yes, it definitely makes sense to have more than one partition for a topic even when you have a single Kafka broker. A scenario when you can benefit from this is pretty simple:
you need to guarantee in-order processing by tenant id
processing logic for each message is rather complex and takes some time. Especially the case when the Kafka message itself is simple, but the logic behind processing this message takes time (simple example - message is an URL, and the processing logic is downloading the file from there and doing some processing)
Given these 2 conditions you may get into a situation where one consumer is not able to keep up processing all the messages if all the data goes to a single partition. Remember, you can process one partition with exactly one consumer (well, you can use 2 consumers if using different consumer groups, but that's not your case), so you'll start getting behind over time. But if you have more than one partition you'll either be able to use one consumer and process data in parallel (this could help to speed things up in some cases) or just add more consumers.
By default, Kafka uses hash-based partitioning. This is configurable by providing a custom Partitioner, for example you can use random partitioning if you don't care what partition your message ends up in.
It's totally up to you what purposes you have topics for
UPD, answers to questions in the comment:
Adding more consumers is usually done for adding more computing power, not for achieving desired parallelism. To add parallelism add partitions. Most consumer implementations process different partitions on different threads, so if you have enough computing power, you might just have a single consumer processing multiple partitions in parallel. Then, if you start bumping into situations where one consumer is not enough, you just add more consumers.
When you create a topic you just specify the number of partitions (and replication factor for this topic, but that's a different thing). The key and partition to send is completely up to producer. In fact, you could configure your producer to use random partitioner and it won't even care about keys, just pick the partition randomly. There's no direct relation between key -> partition, it's just convenient to benefit from having things setup like this.
Can you elaborate on this one? Not sure I understand this, but I guess your question is whether you can send just a value and Kafka will infer a key somehow itself. If so, then the answer is no - Kafka does not apply any transformation to messages and stores them as is, so if you want your message to contain a key, the producer must explicitly send the key.