Double partition property definition in KSQL - apache-kafka

There is an example in the article https://docs.confluent.io/current/ksql/docs/developer-guide/transform-a-stream-with-ksql.html:
CREATE STREAM pageviews_transformed
  WITH (TIMESTAMP='viewtime',
        PARTITIONS=5,
        VALUE_FORMAT='JSON') AS
  SELECT viewtime,
         userid,
         pageid,
         TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
  FROM pageviews
  PARTITION BY userid
  EMIT CHANGES;
You can see that partitioning is specified twice. In the WITH clause we define the partition count for the brand-new stream (topic). In the PARTITION BY clause we set the key for outgoing messages, which determines the partition each message is sent to.
We created a stream with 5 partitions. Let's imagine that we have messages with 6 unique userids. In this case, how will messages be distributed over those 5 partitions?

PARTITIONS is the number of Kafka topic partitions.
PARTITION BY defines which Kafka message key is used during record production.
Let's imagine that we have messages with 6 unique userids. In this case how will messages be distributed over those 5 partitions?
Via Kafka's DefaultPartitioner class.
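For keyed records the DefaultPartitioner computes a positive murmur2 hash of the serialized key and takes it modulo the partition count, so with 6 unique userids and 5 partitions at least two keys must share a partition. Below is a minimal sketch of that mapping, assuming the kafka-clients library is on the classpath; the userid values are made up for illustration:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 5;
        String[] userIds = {"User_1", "User_2", "User_3", "User_4", "User_5", "User_6"};
        for (String userId : userIds) {
            // same formula the DefaultPartitioner applies to a non-null key:
            // positive murmur2 hash of the serialized key, modulo partition count
            int partition = Utils.toPositive(
                    Utils.murmur2(userId.getBytes(StandardCharsets.UTF_8))) % numPartitions;
            System.out.printf("%s -> partition %d%n", userId, partition);
        }
    }
}

By the pigeonhole principle at least two of the six keys end up sharing a partition; which ones collide depends only on the hash values, not on arrival order.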

Related

KSQL - streams from topics with different partition numbers

I am trying to join messages from 2 different Kafka topics which have different partition counts with ksqlDB.
When I create streams from each topic and try to join them, ksqlDB does not allow it because of the different partition counts in the base topics.
When I do the below steps for each topic:
-> create a stream from the root topic,
-> create another stream from the first stream with a new topic with 1 partition (reducing 4 to 1),
I can't see any data at the final stream which has 1 partition.
Is there any solution to join 2 topics with different partition counts in ksqlDB?

I had the same issue. The problem wasn't the number of partitions.
The problem was that the join was on fields with different data types (bigint and double).

How to ensure that all related data is processed when Kafka Streams listens to topics with multiple partitions?

I would like to know how Kafka Streams are assigned to partitions of topics for reading.
As far as I understand it, each Kafka Streams thread is a Consumer (and there is one Consumer Group for the whole Streams application). So I guess the Consumers are randomly assigned to the partitions.
But how does it work, if I have multiple input topics which I want to join?
Example:
Topic P contains persons. It has two partitions. The key of the message is the person-id so each message which belongs to a person always ends up in the same partition.
Topic O contains orders. It has two partitions. Let's say the key is also the person-id (of the person who ordered something). So here, too, each order message which belongs to a person always ends up in the same partition.
Now I have a stream which reads from both topics, counts all orders per person, and writes the result to another topic (where the message also includes the name of the person).
Data in topic P:
Partition 1: "hans, id=1", "maria, id=3"
Partition 2: "john, id=2"
Data in topic O:
Partition 1: "person-id=2, pizza", "person-id=3, cola"
Partition 2: "person-id=1, lasagne"
And now I start two streams.
Then this could happen:
Stream 1 is assigned to topic P partition 1 and topic O partition 1.
Stream 2 is assigned to topic P partition 2 and topic O partition 2.
This means that the order lasagne for hans would never get counted, because for that a stream would need to consume topic P partition 1 and topic O partition 2.
So how do I handle that problem? I guess it's fairly common that streams need to somehow process data that relates to each other, so it must be ensured that the related data (here: hans and lasagne) is processed by the same stream.
I know this problem does not occur if there is only one stream or if the topics only have one partition. But I want to be able to concurrently process messages.
Thanks
Your use case is a KStream-KTable join, where the KTable stores the user info and the KStream is the stream of orders. The 2 topics have to be co-partitioned, which means they must have the same number of partitions and be partitioned by the same key with the same partitioner. If you're using person-id as the key for the Kafka messages and using the same partitioner, you don't need to worry about this case, because related records land in the same partition number.
Update: as Matthias pointed out, each stream thread has its own Consumer instance.
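Here is a minimal Kafka Streams sketch of that KStream-KTable join, counting orders per person as in the example above. The topic names ("topic-P", "topic-O", "orders-per-person"), application id, and serdes are assumptions for illustration, not taken from the question:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrdersPerPerson {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-per-person-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // both topics are keyed by person-id; co-partitioning guarantees that a
        // person record and its orders are handled by the same stream task
        KTable<String, String> persons = builder.table("topic-P");
        KStream<String, String> orders = builder.stream("topic-O");

        KTable<String, Long> counts = orders
                .join(persons, (order, name) -> name)  // enrich each order with the person's name
                .groupByKey()
                .count();

        counts.toStream().to("orders-per-person",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}

With the same key and the same partitioner on both input topics, key 1 lands in the same partition number of topic P and topic O, so the scenario from the question (hans in P partition 1, lasagne in O partition 2) cannot actually occur.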

Kafka Consumer partition mapping

I have 100 consumers in the same group listening to the same topic, which has 100 partitions. So as per the documentation each consumer should only listen to one partition, since there are 100 consumers and 100 partitions. I produce messages to Kafka using a key, so messages with the same key should go to the same partition and should always be consumed by the same consumer in the group. But in my case, multiple messages with the same key are consumed by multiple consumers, seemingly at random. Is there any way to ensure that all messages from a partition are consumed by only one specific consumer in the group? I do not want to explicitly assign partitions to consumers.
Verify that your message partitioning is working as expected on the producer side.
If you have 100 consumers using the same consumer group id for a 100-partition topic, each consumer will get exactly 1 partition to consume from.
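One way to verify the producer side is to inspect the RecordMetadata returned by send(): the same key should always report the same partition. A minimal sketch, assuming a local broker and a hypothetical 100-partition topic named "events":

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class VerifyPartitioning {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send the same key twice; both sends should report the same partition
            for (int i = 0; i < 2; i++) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("events", "order-42", "payload-" + i))
                        .get();
                System.out.printf("key=order-42 -> partition %d, offset %d%n",
                        meta.partition(), meta.offset());
            }
        }
    }
}

If the reported partitions differ for the same key, look for a custom partitioner or inconsistent key serialization across your producers.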

Partition ID in a Kafka replicated partition

I am implementing a Kafka producer with a single topic with multiple partitions. I choose which partition a message goes to based on a particular value in the message (the feedName property value in the message JSON). I am maintaining an SQL table for the feedName - partitionId mapping. My question is: will the partition ID be the same for the leader as well as the replicas?
If it is different, how can I identify a partition uniquely across all brokers?
Partition IDs are the same across the brokers. If they weren't, things would get really confusing.
Partition IDs are maintained in ZooKeeper, and all brokers have access to ZooKeeper. This is what it's used for -- so all the brokers have the same view of topics and partitions (and brokers, for that matter).
A partition is an immutable message sequence, as the Kafka documentation puts it:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
In your use case you don't need to worry about the mapping of ID and feedName.
Hope this helps!
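To see for yourself that a partition ID identifies the same logical partition on every broker, you can describe the topic: each partition ID appears once, with one leader and a replica list. A minimal sketch using the modern kafka-clients AdminClient, assuming a local broker and a hypothetical topic named "feeds":

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeFeeds {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("feeds"))
                    .all().get().get("feeds");
            // one line per partition ID; leader and replicas are the brokers
            // hosting copies of that same logical partition
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}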

Kafka partitioner class, assign message to partition within topic using key

I am new to Kafka, so apologies if I sound stupid, but what I have understood so far is: a stream of messages can be defined as a topic, like a category. And every topic is divided into one or more partitions (each partition can have multiple replicas), so they act in parallel.
From the Kafka main site they say
The producer is able to choose which message to assign to which partition within the topic.
This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).
Does this mean that while consuming I will be able to choose the message offset from a particular partition?
While running multiple partitions, is it possible to choose from one specific partition, i.e. partition 0?
In Kafka 0.7 quick start they say
Send a message with a partition key. Messages with the same key are sent to the same partition.
And the key can be provided while creating the producer as below
ProducerData<String, String> data = new ProducerData<String, String>("test-topic", "test-key", "test-message");
producer.send(data);
Now how do I consume messages based on this key? What is the actual impact of using this key while producing in Kafka?
While creating a producer in 0.8 beta we can provide the partitioner class attribute through the config file.
The custom partitioner class can be created by implementing the Kafka Partitioner interface.
But I'm a little confused about how exactly it works. The 0.8 docs also don't explain much. Any advice, or am I missing something?
This is what I've found so far...
Define our own custom partitioner class by implementing the Kafka Partitioner interface. The implemented method has two arguments: first the key that we provide from the producer, and second the number of partitions available. So we can define our own logic to decide which key of a message goes to which partition.
Now while creating the producer we can specify our own partitioner class using the "partitioner.class" attribute:
props.put("partitioner.class", "path.to.custom.partitioner.class");
If we don't mention it then Kafka will use its default class and try to distribute messages evenly among the available partitions.
Also inform Kafka how to serialize the key:
props.put("key.serializer.class", "kafka.serializer.StringEncoder");
Now if we send a message using a key in the producer, the message will be delivered to a specific partition (based on the logic written in the custom partitioner class), and at the consumer (SimpleConsumer) level we can specify the partition to retrieve specific messages from.
In case we need to pass a String as a key, the same should be handled in the custom partitioner class (e.g. take the hash value of the key and then take the first two digits).
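A minimal sketch of such a custom partitioner against the old 0.8.x producer API, assuming a String key; the class name is hypothetical, and the no-op constructor taking VerifiableProperties is what Kafka uses to instantiate the class reflectively:

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class StringHashPartitioner implements Partitioner {

    public StringHashPartitioner(VerifiableProperties props) {
        // no custom configuration needed for this sketch
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // positive hash of the string key, mapped onto the available partitions
        return (((String) key).hashCode() & 0x7fffffff) % numPartitions;
    }
}

It would then be registered via the "partitioner.class" property shown above.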
Each topic in Kafka is split into many partitions. Partitioning allows for parallel consumption, increasing throughput.
A producer publishes messages to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker the producer connects to takes care of forwarding the message to the broker which is the leader of that partition, using the partition owner information in ZooKeeper.
Consumers use Kafka's high-level consumer library (which implicitly handles broker leader changes, manages offset info in ZooKeeper, figures out partition owner info, etc.) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1, C2, C3, started in that order) all belonging to the same consumer group, we can have different consumption models that allow read parallelism, as below.
Each consumer uses a single stream.
In this model, when C1 starts, all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams, so each stream is assigned 5 partitions (depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream (say C1 uses 3, C2 uses 3 and C3 uses 4).
In this model, when C1 starts, all 10 partitions are assigned to its 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams, and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceeds the number of partitions, some streams will not get any messages, as they will not be assigned any partitions.
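As a rough sketch of this stream model, here is the old 0.8 high-level consumer API creating multiple streams for one topic; the topic name, group id, and stream count are illustrative assumptions:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "demo-group");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // ask for 3 streams: the topic's partitions are balanced across the
        // streams of all consumers in the group, and rebalanced whenever a
        // consumer joins or leaves
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("test-topic", 3));

        for (KafkaStream<byte[], byte[]> stream : streams.get("test-topic")) {
            // in a real application each stream would be drained by its own thread
            System.out.println("stream assigned: " + stream);
        }
    }
}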
Does this mean that while consuming I will be able to choose the message offset from a particular partition? While running multiple partitions, is it possible to choose from one specific partition, i.e. partition 0?
Yes, you can choose messages from one specific partition in your consumer, but if you want that partition to be identified dynamically then it depends on the logic of how you implemented the Partitioner class in your producer.
Now how do I consume messages based on this key? What is the actual impact of using this key while producing in Kafka?
There are two ways of consuming messages. One is using a ZooKeeper host and the other is a static host. A ZooKeeper host consumes messages from all partitions. However, if you are using a static host then you can provide the broker with the partition number that needs to be consumed.
Please check the below example for Kafka 0.8.
Producer
KeyedMessage<String, String> data = new KeyedMessage<String, String>(<<topicName>>, <<KeyForPartition>>, <<Message>>);
Partition Class
public int partition(Object arg0, int arg1) {
    // arg0 is the key given while producing; arg1 is the number of
    // partitions the topic has
    long organizationId = Long.parseLong((String) arg0);
    // if the given key is a valid partition number, send the message to
    // that partition; else send it to the last partition
    if (organizationId >= arg1) {
        return arg1 - 1;
    }
    return (int) organizationId;
}
So the partitioner class decides where to send each message, based on your logic.
Consumer (Note: I have used the Storm Kafka 0.8 integration)
HostPort hosts = new HostPort("10.**.**.***",9092);
GlobalPartitionInformation gpi = new GlobalPartitionInformation();
gpi.addPartition(0, hosts);
gpi.addPartition(2, hosts);
StaticHosts statHost = new StaticHosts(gpi);
SpoutConfig spoutConf = new SpoutConfig(statHost, <<topicName>>, "/kafkastorm", <<spoutConfigId>>);