I am trying to debug an issue, and to do so I need to prove that each distinct key only goes to one partition when the cluster is not rebalancing.
So I was wondering: for a given topic, is there a way to determine which partition a key is sent to?
As explained here, and also in the source code:
You need the byte[] keyBytes (assuming it isn't null); then, using org.apache.kafka.common.utils.Utils, you can run the following:
Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
For string or JSON keys, the bytes are the UTF-8 encoding, and the Utils class has helper functions to get that.
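For example, a minimal sketch of that computation for a plain String key (the key and the partition count of six are hypothetical), mirroring the default keyed-partitioning logic:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    public static void main(String[] args) {
        String key = "user-42";      // hypothetical key
        int numPartitions = 6;       // the partition count of your topic

        // UTF-8 encode the key, exactly as StringSerializer would
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);

        // Same computation the default partitioner uses for non-null keys
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("Key '" + key + "' maps to partition " + partition);
    }
}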
For Avro, such as Confluent-serialized values, it's a bit more complicated (a magic byte, then a schema ID, then the data). See Wire format.
In the Kafka Streams API, you have a ProcessorContext available in Processor#init, which you can store a reference to and then access in your Processor#process method, e.g. ctx.recordMetadata().get().partition() (recordMetadata() returns an Optional).
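As a minimal sketch (assuming the newer org.apache.kafka.streams.processor.api package and String keys/values), a processor that logs the partition of every input record might look like this:

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class PartitionLoggingProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> ctx;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.ctx = context;   // keep a reference for use in process()
    }

    @Override
    public void process(Record<String, String> record) {
        // recordMetadata() is empty for records created by punctuators, hence the Optional
        ctx.recordMetadata().ifPresent(meta ->
                System.out.printf("key=%s came from partition %d%n", record.key(), meta.partition()));
        ctx.forward(record);  // pass the record downstream unchanged
    }
}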
only goes to 1 partition
This isn't a guarantee. Hashes can collide.
It makes more sense to say that a given key isn't in more than one partition.
if the cluster is not rebalancing
Rebalancing doesn't change which partition a key maps to; records keep their partition.
When you send a message, the partition is determined by the following class:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
If you want to change the logic, implement the org.apache.kafka.clients.producer.Partitioner interface and set the producer's 'partitioner.class' config, as sketched below.
Reference document:
https://kafka.apache.org/documentation/#producerconfigs
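For illustration, a minimal custom partitioner might look like the following (the class name and the rule for unkeyed records are hypothetical; keyed records fall back to the usual murmur2 hash):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class MyPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;   // hypothetical rule: send all unkeyed records to partition 0
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }
}

You would then register it in the producer configuration via partitioner.class, e.g. props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, MyPartitioner.class.getName()).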
I am wondering if it is allowed to produce a message with a JSON key schema and an Avro value schema.
Is there a restriction on mixing and matching schema types in a producer record?
Does the schema registry ban it?
Is it considered a bad practice?
There is no such limitation on mixing formats. Kafka stores bytes; it doesn't care about the serialization. However, a random consumer of the topic might not know how to consume the data, so using Avro (or Protobuf) for both the key and the value would be a good idea.
You can use UTF-8 text for the key and Avro for the value, or you can have both as Avro. When I tried to use Kafka REST with a key that was not in Avro format while the message was, I couldn't use the consumer; it looks like that is still the case, based on this issue. But if you implement your own producer and consumer, you decide what to encode/decode which way.
You have to be careful with keys, because messages are sent to specific partitions based on the key. In most scenarios messages should be processed in time order, and that can only be achieved if they go to the same partition. If you have a userId or any other identifier, you likely want to send all events for that user to the same partition, so use the userId as the key. I wouldn't use JSON as a key unless your key is based on a few fields, and even then you have to be careful not to end up with related messages on different partitions (for example, if the fields are serialized in a different order, the key bytes, and therefore the partition, will differ).
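As a sketch of the mixed setup (a plain String key and an Avro value; the topic name, schema, and Schema Registry URL here are assumptions):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MixedKeyValueProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // UTF-8 String key, Avro value via Confluent's serializer
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"UserEvent\","
                + "\"fields\":[{\"name\":\"action\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("action", "login");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // userId as the key keeps all of this user's events on one partition
            producer.send(new ProducerRecord<>("user-events", "user-42", value));
        }
    }
}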
I am new to Kafka and Micronaut and I do not understand the usage of @KafkaKey. What I found on the internet is:
The Kafka key can be specified by providing a parameter annotated with @KafkaKey. If no such parameter is specified the record is sent with a null key.
So what exactly does it mean? How will it affect me if I do not use it?
The most important effect of Kafka message keys is partitioning. For example, if the key chosen was a user id, then all data for a given user would be sent to the same partition. If you don't specify a key, Kafka uses a round-robin strategy to distribute the messages.
Kafka preserves ordering within a partition. When you specify a key for a message, all messages with that key go to the partition associated with that key. Since the order of messages is preserved within a partition, specifying a key preserves the message order. This is particularly useful if you are working with state machines.
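A minimal Micronaut Kafka sketch (the interface, topic name, and parameter names are hypothetical):

import io.micronaut.configuration.kafka.annotation.KafkaClient;
import io.micronaut.configuration.kafka.annotation.KafkaKey;
import io.micronaut.configuration.kafka.annotation.Topic;

@KafkaClient
public interface UserEventClient {

    // The @KafkaKey parameter becomes the record key, so all events for a given
    // userId land on the same partition and therefore keep their order.
    @Topic("user-events")
    void publish(@KafkaKey String userId, String eventJson);

    // No @KafkaKey parameter: the record is sent with a null key and the
    // messages are spread across partitions.
    @Topic("user-events")
    void publishUnkeyed(String eventJson);
}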
I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another one with a different name. The topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions as the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
.stream<Any, Any>(inputTopic)
.to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition of the input record?
I checked the Processor API, which allows accessing the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records just to recalculate the output partition with my custom partitioner.
Produced (that is, one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In the above code only ByteArray serdes are used, so no extra serialization or deserialization happens; the records are copied as raw bytes.
Firstly, messages are distributed among partitions based on the key. Messages with the same key always go to the same partition.
So if your messages have keys, then you don't need to worry about it at all. As long as you have the same number of partitions as your original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, then you should consider using the original topic instead. Kafka has the notion of consumer groups.
For example, if you have a topic 'transactions', you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. These consumer groups would read the same topic, filter out the events that are meaningful to them, and process them.
You can also create 3 topics and achieve the same result, though it's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.
Hi all. Forgive me, I am just learning Apache Kafka. While reading the Kafka documentation, I came across the phrase semantic partition function.
As the documentation says:
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
What does semantic partition function mean in Kafka? So far I haven't found anything more about it in the documentation. Could someone please explain it in a bit more detail for better understanding? Thanks.
When the producer doesn't specify a key for messages, partitions are assigned in a round-robin fashion. When a key is specified, the DefaultPartitioner simply computes a hash of the key (modulo the number of partitions). If you want, you can use your own partitioner class. The documentation just means that the "semantic" for choosing the destination partition is up to you: you can develop the "function" (really a partitioner class) yourself. For example, instead of using the Kafka key of the message, you could have a payload, say a JSON document, with some data, and you want to use one of its fields to decide the destination partition.
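For example, a minimal sketch of such a "semantic" choice (the topic name, partition count, and payload field are assumptions), routing by a field of the payload via the explicit partition argument of ProducerRecord:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SemanticPartitionExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        String payload = "{\"country\":\"DE\",\"amount\":42}";  // hypothetical JSON payload
        String country = "DE";                                   // field pulled out of the payload
        int numPartitions = 6;                                   // assumed partition count of the topic

        // "Semantic" rule: all orders from the same country go to the same partition,
        // chosen explicitly instead of hashing the record key.
        int partition = (country.hashCode() & 0x7fffffff) % numPartitions;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", partition, null, payload));
        }
    }
}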
All of the examples of Kafka producers show the ProducerRecord's key/value pair as not only being the same type (all examples show <String, String>), but also the same value. For example:
producer.send(new ProducerRecord<String, String>("someTopic", Integer.toString(i), Integer.toString(i)));
But in the Kafka docs, I can't seem to find where the key/value concept (and its underlying purpose/utility) is explained. In traditional messaging (ActiveMQ, RabbitMQ, etc.) I've always fired a message at a particular topic/queue/exchange. But Kafka is the first broker that seems to require key/value pairs instead of just a regular ol' string message.
So I ask: What is the purpose/usefulness of requiring producers to send KV pairs?
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows scaling out the system.
Keys are used to determine the partition within a log to which a message gets appended, while the value is the actual payload of the message. The examples are actually not very "good" in this regard; usually you would have a complex type as the value (like a tuple type, a JSON document, or similar) and you would extract one field as the key.
See: http://kafka.apache.org/intro#intro_topics and http://kafka.apache.org/intro#intro_producers
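For instance, a small sketch (hypothetical topic and field names) where the value is the full JSON payload and one field of it is reused as the key:

import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyFromValueExample {
    public static void main(String[] args) {
        String userId = "user-17";   // one field of the payload, reused as the record key
        String payload = "{\"userId\":\"user-17\",\"action\":\"click\",\"ts\":1700000000}";

        // All events for user-17 hash to the same partition, so they stay ordered there.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("click-events", userId, payload);

        System.out.println("key=" + record.key() + ", value=" + record.value());
    }
}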
In general, the key and/or the value can be null, too. If the key is null, the producer picks a partition itself (round-robin in older clients, sticky partitioning in newer ones). If the value is null, it can have special "delete" (tombstone) semantics if you enable log compaction instead of the log retention policy for a topic (http://kafka.apache.org/documentation#compaction).
Late addition... Specifying the key so that all messages with the same key go to the same partition is very important for proper ordering of message processing if you will have multiple consumers in a consumer group on a topic.
Without a key, two related messages could go to different partitions and be processed by different consumers in the group out of order.
Another interesting use case:
We could use the key attribute of Kafka topics for sending user_ids and then plug in a consumer to fetch streaming events (the events stored in the value attribute). This could allow you to process any max-history of user event sequences for creating features in your machine learning models.
I still have to find out whether this is possible or not. I will keep updating my answer with further details.