Ksql - streams from topics with different partition numbers - apache-kafka

I am trying to join messages from two Kafka topics that have different partition counts using ksqlDB.
When I create a stream from each topic and try to join them, ksqlDB rejects the join because the underlying topics have different partition counts.
When I do the following steps for each topic:
-> create a stream from the root topic,
-> create another stream from the first stream, backed by a new topic with 1 partition (reducing 4 to 1),
I can't see any data in the final stream, which has 1 partition.
Is there any way to join two topics with different partition counts in ksqlDB?

I had the same issue. The problem wasn't the number of partitions.
The problem was that the join was on fields with different data types (BIGINT and DOUBLE).
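
One way to see why a type mismatch can matter: the same numeric value is represented by different bytes depending on its type. The sketch below is only an illustration, assuming keys are serialized the way Kafka's built-in long and double serializers do it (8 bytes, big-endian). The function names are hypothetical, and whether this was the exact cause in the case above is not known.

```python
import struct


def serialize_bigint(value: int) -> bytes:
    # 8-byte big-endian signed integer (assumed layout of a BIGINT key)
    return struct.pack(">q", value)


def serialize_double(value: float) -> bytes:
    # 8-byte big-endian IEEE 754 double (assumed layout of a DOUBLE key)
    return struct.pack(">d", value)


bigint_key = serialize_bigint(42)
double_key = serialize_double(42.0)

# Same number, same length, different bytes: compared as raw keys they do not match.
print(bigint_key.hex())
print(double_key.hex())
print("keys equal:", bigint_key == double_key)
```

Casting both join columns to one common type before joining avoids the mismatch.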

Related

Kafka Streams Co-Partitioning is required while joining two KStreams

Recently I started reading about Kafka Streams for an upcoming project and stumbled upon the concept of co-partitioning, which is required if we want to join two streams. All I was able to understand is that if we have two topics A and B, both must have the same number of partitions, and for a key 'X' the partition number must also be the same in both topics.
Topic A with partitions A0, A1, A2
Topic B with partitions B0, B1, B2
Then a message with key 'X' must be published to A0 and B0 respectively.
Question: why must the partition number be the same in both topics (for key 'X'), and what issues might we face if both topics have the same number of partitions but some partitions are idle, i.e. messages are not distributed evenly across partitions?
Kafka Streams uses Kafka's consumer group mechanism, so your topic partitions are assigned according to Kafka's partition assignment strategies. The default is the range assignor.
To join two streams, both messages with the same key must be available in the same consumer instance. Otherwise your streaming consumer cannot find the other message to join with. To ensure that, the number of partitions must be the same for both topics and the key must be the same.
When the number of partitions is the same for both topics, the range assignor makes sure that the same partition number of each topic is assigned to the same instance.
That is the Kafka side. On the application side, your producer should produce messages using the hash partitioner, which is the default. Then, if both topics have the same number of partitions, hashing makes sure the same key goes to the same partition number in both topics.
Kafka Streams enforces co-partitioning to guarantee these conditions, since a join cannot work correctly on topics that do not meet them.
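
The hashing argument above can be sketched in a few lines. This is only an illustration: `partition_for` is a hypothetical stand-in that uses CRC32, whereas Kafka's default partitioner uses murmur2, but the "hash modulo partition count" idea is the same.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for a hash partitioner: hash(key bytes) mod partition count.
    # CRC32 is used only to keep the sketch stdlib-only; Kafka uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def co_located(keys, partitions_a, partitions_b):
    """Return the keys that land on the same partition number in both topics."""
    return [
        k for k in keys
        if partition_for(k, partitions_a) == partition_for(k, partitions_b)
    ]


keys = [f"key-{i}" for i in range(100)]

equal_counts = co_located(keys, 3, 3)    # topic A: 3 partitions, topic B: 3
unequal_counts = co_located(keys, 3, 4)  # topic A: 3 partitions, topic B: 4

print(len(equal_counts), "of", len(keys), "keys co-located with 3 and 3 partitions")
print(len(unequal_counts), "of", len(keys), "keys co-located with 3 and 4 partitions")
```

With equal partition counts every key is co-located. With unequal counts only some keys happen to match, so an instance reading partition n of both topics would miss join partners.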

How to ensure for Kafka Streams when listening to topics with multiple partitions that all related data is processed?

I would like to know how Kafka Streams are assigned to partitions of topics for reading.
As far as I understand it, each Kafka Stream Thread is a Consumer (and there is one Consumer Group for the Stream). So I guess the Consumers are randomly assigned to the partitions.
But how does it work, if I have multiple input topics which I want to join?
Example:
Topic P contains persons. It has two partitions. The key of the message is the person-id so each message which belongs to a person always ends up in the same partition.
Topic O contains orders. It has two partitions. Let's say the key is also the person-id (of the person who ordered something). So here, too, each order-message which belongs to a person always ends up in the same partition.
Now I have a stream which reads from both topics and counts all orders per person and writes it to another topic (where the message also includes the name of the person).
Data in topic P:
Partition 1: "hans, id=1", "maria, id=3"
Partition 2: "john, id=2"
Data in topic O:
Partition 1: "person-id=2, pizza", "person-id=3, cola"
Partition 2: "person-id=1, lasagne"
And now I start two streams.
Then this could happen:
Stream 1 is assigned to topic P partition 1 and topic O partition 1.
Stream 2 is assigned to topic P partition 2 and topic O partition 2.
This means that the order lasagne for hans would never get counted, because for that a stream would need to consume topic P partition 1 and topic O partition 2.
So how can I handle that problem? I guess it's fairly common that streams need to somehow process data which relates to each other. So it must be ensured that the related data (here: hans and lasagne) is processed by the same stream.
I know this problem does not occur if there is only one stream or if the topics only have one partition. But I want to be able to concurrently process messages.
Thanks
Your use case is a KStream-KTable join, where the KTable stores the info of the users and the KStream is the stream of orders. The two topics have to be co-partitioned, which means they must have the same number of partitions and be partitioned by the same key with the same Partitioner. If you're using person-id as the key for the Kafka messages and the same Partitioner, you don't need to worry about this case, because records with the same key end up on the same partition number.
Update: As Matthias pointed out, each Stream Thread has its own Consumer instance.
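
A minimal simulation of the persons/orders example, assuming both topics are produced with the same partitioner and that partition n of both topics is processed together (Kafka Streams groups them into one task). `partition_for` is a hypothetical CRC32 stand-in for Kafka's murmur2-based default partitioner, so the concrete partition numbers will differ from a real cluster.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2


def partition_for(person_id: int) -> int:
    # Stand-in hash partitioner (Kafka's default uses murmur2, not CRC32).
    return zlib.crc32(str(person_id).encode("utf-8")) % NUM_PARTITIONS


persons = {1: "hans", 2: "john", 3: "maria"}
orders = [(2, "pizza"), (3, "cola"), (1, "lasagne")]

# Produce both topics keyed by person-id with the same partitioner.
topic_p = defaultdict(dict)
for pid, name in persons.items():
    topic_p[partition_for(pid)][pid] = name

topic_o = defaultdict(list)
for pid, item in orders:
    topic_o[partition_for(pid)].append((pid, item))


def run_task(n: int) -> dict:
    """Process partition n of topic P and partition n of topic O together."""
    table = topic_p[n]
    counts = defaultdict(int)
    for pid, _item in topic_o[n]:
        # A KeyError here would mean the person is not co-located with the order.
        counts[table[pid]] += 1
    return dict(counts)


results = {}
for n in range(NUM_PARTITIONS):
    results.update(run_task(n))

print(results)
```

Every order finds its person, because the key alone decides the partition number in both topics. The situation described in the question (hans in partition 1 of P, his order in partition 2 of O) cannot arise under these assumptions.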

Double partition property defining in KSQL

There is an example in the article https://docs.confluent.io/current/ksql/docs/developer-guide/transform-a-stream-with-ksql.html:
CREATE STREAM pageviews_transformed
WITH (TIMESTAMP='viewtime',
PARTITIONS=5,
VALUE_FORMAT='JSON') AS
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews
PARTITION BY userid
EMIT CHANGES;
You can see that partitioning is defined twice. In the WITH clause we define the partition count for the brand new stream (topic). In the PARTITION BY clause we define the key of the outgoing messages, which determines which partition each message is sent to.
We created a stream with 5 partitions. Let's imagine that we have messages with 6 unique userid values. In this case, how will the messages be distributed over those 5 partitions?
PARTITIONS is the number of partitions of the Kafka topic.
PARTITION BY defines which field is used as the Kafka message key when records are produced.
"Let's imagine that we have messages with 6 unique userid. In this case how will messages be distributed over that 5 partitions"
Via Kafka's DefaultPartitioner class.
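
To make the "6 keys, 5 partitions" case concrete, here is a small sketch. The userid values are made up, and `partition_for` is a hypothetical CRC32 stand-in for the murmur2 hash that Kafka's default partitioner applies to the key, so the actual partition numbers on a real topic will differ.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for "hash of the key bytes modulo the partition count".
    return zlib.crc32(key.encode("utf-8")) % num_partitions


user_ids = [f"User_{i}" for i in range(1, 7)]  # 6 unique, made-up userids

assignment = defaultdict(list)
for uid in user_ids:
    assignment[partition_for(uid, 5)].append(uid)

for partition in sorted(assignment):
    print(partition, assignment[partition])
```

Each userid always maps to the same partition, and with 6 keys over 5 partitions at least two keys must share one. Some partitions may stay empty, since the hash does not balance keys evenly.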

KSQL - Join unequal partitions streams

How can I join streams with an unequal number of partitions in KSQL, other than by increasing the partition count?
Example: Stream-1 has 3 partitions and Stream-2 has 2 partitions. In that case we can, of course, increase the number of partitions for Stream-2 to 3 and then join. But I want to know whether there is any other method to join unequally partitioned streams through KSQL.
No, unfortunately Kafka Streams/KSQL doesn't support joins on topics with unequal partition counts.
It is a prerequisite that both topics have the same number of partitions before the join operation is called; otherwise it will fail.
You can read more about the co-partitioning requirement here:
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html#partition-data-to-enable-joins
To ensure co-partitioning, you can use the PARTITION BY clause to create a new stream:
CREATE STREAM topic_rekeyed WITH (PARTITIONS=6) AS SELECT * FROM topic PARTITION BY topic_key;
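
What the repartitioning step achieves can be simulated outside of KSQL. The sketch below is an illustration under stated assumptions: topics are modelled as lists of partitions, all helper names are hypothetical, and `partition_for` uses CRC32 as a stand-in for Kafka's murmur2-based default partitioner.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash partitioner: hash(key bytes) mod partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(records, num_partitions):
    """Distribute (key, value) records over num_partitions lists by key hash."""
    topic = [[] for _ in range(num_partitions)]
    for key, value in records:
        topic[partition_for(key, num_partitions)].append((key, value))
    return topic


def repartition(topic, num_partitions):
    """Read every record of a topic and re-produce it into a new topic."""
    records = [record for partition in topic for record in partition]
    return produce(records, num_partitions)


def join(left, right):
    """Join partition i of left with partition i of right on the record key."""
    joined = []
    for left_part, right_part in zip(left, right):
        lookup = dict(right_part)
        joined.extend((k, v, lookup[k]) for k, v in left_part if k in lookup)
    return joined


keys = [f"k{i}" for i in range(50)]
stream_1 = produce([(k, "left") for k in keys], 3)   # 3 partitions
stream_2 = produce([(k, "right") for k in keys], 2)  # 2 partitions

stream_2_rekeyed = repartition(stream_2, 3)          # now 3 partitions as well
joined = join(stream_1, stream_2_rekeyed)

print(len(joined), "of", len(keys), "keys joined after repartitioning")
```

After repartitioning to the same count, every key of Stream-1 finds its partner in the partition with the same number.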

How can Kafka Streams perform a join when the data to join could be allocated on different machines?

I have two Kafka topics with two partitions each. Their messages are keyed by the same param, id: Integer.
I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.
Now, imagine that the partition holding messages with id=1 from topic A is assigned to Kafka Streams app instance A, and the partition holding messages with id=1 from topic B is assigned to app instance B. How can a join of those two KStreams ever work if the data from the topics may not be collocated (as would happen in this example for id=1)?
There are ways to do it. If storage is not an issue or the frequency of messages is low, you can use a GlobalKTable for one of the topics. It will cost more memory, as all the partitions will be synced on all instances of the Streams app.
https://docs.confluent.io/current/streams/concepts.html#globalktable
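
The GlobalKTable idea can be sketched as follows. This is a plain-Python simulation, not the Kafka Streams API: every instance keeps a full copy of the table-side topic, so a stream record can be joined no matter which instance reads it. The record contents, the instance layout and the CRC32 stand-in hash are all assumptions for illustration.

```python
import zlib
from collections import defaultdict

NUM_INSTANCES = 2


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash partitioner (Kafka's default uses murmur2, not CRC32).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


table_records = {f"id-{i}": f"name-{i}" for i in range(10)}      # table-side topic
stream_records = [(f"id-{i}", f"event-{i}") for i in range(10)]  # stream-side topic

# The stream topic has 3 partitions, spread over the instances.
stream_by_instance = defaultdict(list)
for key, value in stream_records:
    partition = partition_for(key, 3)
    stream_by_instance[partition % NUM_INSTANCES].append((key, value))

joined = []
for instance, records in stream_by_instance.items():
    # Each instance holds a complete local copy of the table (the memory cost).
    local_copy = dict(table_records)
    for key, value in records:
        joined.append((key, value, local_copy[key]))

print(len(joined), "of", len(stream_records), "stream records joined")
```

Because the lookup side is fully replicated, the join does not depend on how the stream partitions are assigned, at the price of storing the whole table on every instance.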
Another way is to use Kafka Streams interactive queries to discover the data on other Streams instances.
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
For KStream joins, you need the same number of partitions for both topics as well as the same partitioning strategy. That way all consumers will read the partitions of both topics in the same way.
A nice reference blog on partition assignment: https://medium.com/#anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab