KSQL - Join streams with unequal partition counts - apache-kafka

How can I join streams that have an unequal number of partitions in KSQL, other than by increasing the partition count?
For example, Stream-1 has 3 partitions and Stream-2 has 2 partitions. In that case we can, of course, increase the number of partitions for Stream-2 to 3 and then join. But I want to know whether there is any other method to join unequally partitioned streams through KSQL.

No, unfortunately KStream/KSQL doesn't support joins on topics with unequal partition counts.
It's a prerequisite that both topics have the same number of partitions before the join operation is called; otherwise the join will fail.
You can read more about the co-partitioning requirement here:
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html#partition-data-to-enable-joins
To ensure co-partitioning, you can use the PARTITION BY clause to create a new stream with the required number of partitions:
CREATE STREAM topic_rekeyed WITH (PARTITIONS=6) AS SELECT * FROM topic PARTITION BY topic_key;
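To see why the partition counts have to match, here is a minimal Python sketch of the idea, not Kafka's actual partitioner. It assumes a record's partition is hash(key) modulo the partition count; `zlib.crc32` is a stand-in for the murmur2 hash that Kafka's default partitioner uses, and `partition_for` and `distribute` are hypothetical helpers invented for this illustration.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical key-hash partitioner (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def distribute(keys, num_partitions):
    """Place every key into the partition chosen by partition_for."""
    partitions = {p: [] for p in range(num_partitions)}
    for k in keys:
        partitions[partition_for(k, num_partitions)].append(k)
    return partitions

keys = [f"user-{i}" for i in range(100)]

stream1 = distribute(keys, 3)  # Stream-1: 3 partitions
stream2 = distribute(keys, 2)  # Stream-2: 2 partitions

# With different partition counts, the same key can sit on different partition numbers.
mismatched = [k for k in keys if partition_for(k, 3) != partition_for(k, 2)]

# Rewriting Stream-2 into 3 partitions (what a repartitioned stream does) lines the keys up again.
stream2_rekeyed = distribute([k for part in stream2.values() for k in part], 3)
co_partitioned = all(sorted(stream1[p]) == sorted(stream2_rekeyed[p]) for p in range(3))
```

Once both sides use the same partition count and the same hashing rule, partition N of one stream holds exactly the keys that partition N of the other stream holds, which is what the join relies on.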

Related

Can't join between stream and stream in ksqldb

I would like to ask about a problem where a stream-stream join in ksqlDB does not join any records.
Situations and Problems
I have created two ksqlDB streams from topics containing events from different databases (postgresql, mssql), and I am joining them on a specific column present in both streams.
To help you understand, I will call the two streams stream1 and stream2, and the join target column target_col.
type              | name    | join target column
stream in ksqldb  | stream1 | target_col
stream in ksqldb  | stream2 | target_col
The problem is that these two streams are not joined by the query below.
select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes
1. Meeting the joining requirements.
According to the official ksqlDB documentation, co-partitioning is essential for a join and consists of the following three requirements. I confirmed that both streams satisfy them.
Co-partitioning Requirements source
1. The input records for the join must have the same key schema.
-> The describe stream1 and describe stream2 commands confirmed that the join key of both stream1 and stream2 has the same schema, string.
2. The input records must have the same number of partitions on both sides.
-> The same partition count was specified for both streams in the statement (CREATE STREAM ~ WITH(PARTITIONS=1, ~ )) when the streams were declared. The source topic that each stream subscribes to also has exactly one partition.
3. Both sides of the join must have the same partitioning strategy.
-> The source topic of each stream has only one partition, so all records end up in that single partition and a difference in partitioning strategies would not matter.
2. Time difference between records.
The timestamp and partition number of each record were verified through the pseudocolumns.
The queries used are as follows:
select target_col, rowtime, rowpartition from stream1 emit changes
select target_col, rowtime, rowpartition from stream2 emit changes
When the join key column has the same value, the partition number is the same (e.g. 0) and the record timestamps are no more than 2 seconds apart.
Therefore, I think the time interval (1 minutes) of the query in question (select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes) is not the problem.
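The timing argument above can be checked with a small Python sketch. It assumes the WITHIN window is symmetric (a record matches partners up to the window size before or after it) and treats the bounds as inclusive; the timestamps are invented for illustration and `within_join_window` is a hypothetical helper, not a ksqlDB function.

```python
def within_join_window(left_ts_ms: int, right_ts_ms: int, window_ms: int) -> bool:
    """True when two record timestamps are close enough to be join candidates."""
    return abs(left_ts_ms - right_ts_ms) <= window_ms

WINDOW_MS = 60_000  # corresponds to WITHIN 1 MINUTES

# Two records with the same join key whose timestamps are 2 seconds apart (made-up values).
left = {"target_col": "A", "rowtime": 1_700_000_000_000}
right = {"target_col": "A", "rowtime": 1_700_000_002_000}

matches = (
    left["target_col"] == right["target_col"]
    and within_join_window(left["rowtime"], right["rowtime"], WINDOW_MS)
)
```

Under these assumptions a 2-second gap is far inside a 1-minute window, which supports the conclusion that the window size itself is not what prevents the join.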
3. Data Acquisition
Here is how the data reaches the topics that the two streams subscribe to.
postgresql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream1
mssql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream2
Because I use data from different databases, I installed the appropriate JDBC driver jar for each database (mssql-jdbc-7.2.1.jre8.jar, postgresql-42.3.1.jar) on the same connector.
I built the Kafka ecosystem using the official Confluent Docker images (zookeeper, broker, connect, ksqldb-server, ksqldb-cli).
Given this situation, please advise if there is any way to solve the join problem.
Thank you.

Ksql - streams from topics with different partition numbers

I am trying to join messages from 2 different Kafka topics, which have different partition counts, with ksqlDB.
When I create a stream from each topic and try to join them, ksqlDB does not allow it because the base topics have different partition counts.
When I do the steps below for each topic:
->create a stream from the root topic,
->create another stream from the first stream, backed by a new topic with 1 partition (reducing 4 to 1)
I can't see any data in the final stream, which has 1 partition.
Is there any solution to join 2 topics with different partition counts in ksqlDB?
I had the same issue. The problem wasn't the number of partitions.
The problem was that the join was on fields with different data types (bigint and double).
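One way to picture why a bigint/double mismatch can break a join is to look at serialized bytes. The Python sketch below is only an illustration under an assumption the answer does not state: that keys are partitioned and compared by their serialized form, with a bigint written as an 8-byte big-endian integer and a double as an 8-byte IEEE 754 value. The value 42 is arbitrary.

```python
import struct

# The same numeric value, serialized as two different types.
bigint_key = struct.pack(">q", 42)    # 8-byte big-endian signed integer
double_key = struct.pack(">d", 42.0)  # 8-byte big-endian IEEE 754 double

same_value = (42 == 42.0)                 # numerically equal
same_bytes = (bigint_key == double_key)   # but not byte-for-byte equal
```

If the two sides really are handled this way, numerically equal keys would neither hash to the same partition nor compare equal, so aligning the column types on both sides (for example with a cast) is the fix the answer points towards.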

Does Kafka Streams GlobalKTable topic require the same number of partitions as KStream topic which it will be joining with?

We want to use a GlobalKTable in a Kafka Streams application. The input topics (KTable/KStream) have N partitions, and a GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (which are the sources of the KTable/KStream)?
As I understand it, the answer is NO (there is no such restriction, and the topic may have M partitions where N > M), because a GlobalKTable is fully loaded in each instance of the stream application and co-partitioning is not required for the KStream join operation. But I need confirmation from the experts!
Thank you!
No. The number of partitions of the topics behind the KStream and the GlobalKTable being joined can differ.
From Kafka Streams developer guide
At a high-level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements
More accurately:
Why is data co-partitioning required? Because KStream-KStream,
KTable-KTable, and KStream-KTable joins are performed based on the
keys of records (e.g., leftRecord.key == rightRecord.key), it is
required that the input streams/tables of a join are co-partitioned by
key.
The only exception are KStream-GlobalKTable joins. Here,
co-partitioning is not required because all partitions of the
GlobalKTable's underlying changelog stream are made available to each
KafkaStreams instance, i.e. each instance has a full copy of the
changelog stream. Further, a KeyValueMapper allows for non-key based
joins from the KStream to the GlobalKTable.
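The behaviour described in the quote can be sketched in plain Python, without the Kafka Streams API. The sketch assumes every application instance holds the whole table as a dictionary; the record data and the `key_value_mapper` and `join_with_global_table` names are hypothetical and exist only for this illustration.

```python
# Full copy of the dictionary table, as each application instance would hold it.
global_table = {"DE": "Germany", "FR": "France", "JP": "Japan"}

# Stream records as (partition, key, value); the lookup key lives in the value, not the record key.
stream_records = [
    (0, "order-1", {"country": "DE", "amount": 10}),
    (1, "order-2", {"country": "JP", "amount": 25}),
    (2, "order-3", {"country": "FR", "amount": 7}),
]

def key_value_mapper(key, value):
    """Plays the role of the KeyValueMapper: derive the table lookup key from the stream record."""
    return value["country"]

def join_with_global_table(records, table):
    """Inner join: the stream record's partition is irrelevant because the table is complete."""
    joined = []
    for _partition, key, value in records:
        lookup_key = key_value_mapper(key, value)
        if lookup_key in table:  # records without a match are dropped
            joined.append((key, {**value, "country_name": table[lookup_key]}))
    return joined

result = join_with_global_table(stream_records, global_table)
```

Because the lookup never depends on which partition a stream record came from, the partition count of the table's topic has no bearing on whether the join finds a match.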

How can Kafka Streams perform a join when the data to join could be allocated on different machines?

I have two Kafka topics with two partitions each. Their messages are keyed by the same param id: Integer.
I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.
Now, imagine that the partition holding the messages with id = 1 from topic A is assigned to KStreams app instance A, and the partition holding the messages with id = 1 from topic B is assigned to app instance B. How can a join of those two KStreams ever work if the data from the topics may not be collocated (as would happen in this example for keys/ids = 1)?
There are ways to do it. If storage is not an issue, or if the message frequency is low, you can use a GlobalKTable for one of the topics. It will cost more memory, as all partitions will be synced to every instance of the Streams app.
https://docs.confluent.io/current/streams/concepts.html#globalktable
Another way is to use Kafka Streams interactive queries to discover the data on other Streams instances.
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
For KStream joins, you need the same number of partitions for both topics as well as the same partitioning strategy. That way the same partition number of both topics is read by the same consumer, so records with the same key end up on the same instance.
A nice reference blog post on partitioning: https://medium.com/#anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab
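The co-location argument can be sketched in Python. This is a simplified model, not the real Kafka Streams assignor: it assumes one task per partition number that reads that partition from both topics, assigns tasks to instances round-robin, and uses `zlib.crc32` as a stand-in for Kafka's murmur2 key hash. All names are hypothetical.

```python
import zlib

NUM_PARTITIONS = 2
INSTANCES = ["instance-A", "instance-B"]

def partition_for(key: int, num_partitions: int) -> int:
    """Same hashing rule for both topics (crc32 stands in for murmur2)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

# Task p owns partition p of topic A and partition p of topic B.
tasks = {p: [("topic-A", p), ("topic-B", p)] for p in range(NUM_PARTITIONS)}

# Assumed round-robin placement of tasks on instances.
assignment = {}
for task_id, topic_partitions in tasks.items():
    instance = INSTANCES[task_id % len(INSTANCES)]
    for topic_partition in topic_partitions:
        assignment[topic_partition] = instance

def owner(topic: str, key: int) -> str:
    """Instance that processes records with this key from the given topic."""
    return assignment[(topic, partition_for(key, NUM_PARTITIONS))]

# Every id is handled by the same instance for both topics.
colocated = all(owner("topic-A", k) == owner("topic-B", k) for k in range(1, 1001))
```

In this model the scenario from the question cannot occur: id = 1 maps to the same partition number in both topics, and that partition number belongs to a single task on a single instance.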

KStream-KStream Join vs KStream-KTable Join Performance

I have a Kafka Streams application which takes data from a few topics, joins the data, and writes the result to another topic.
Kafka Configuration:
5 kafka brokers
Kafka topics - 15 partitions and a replication factor of 3.
A few million records are consumed/produced every hour.
I am currently using a KStream-KStream join, which creates 2 internal topics,
whereas a KStream-KTable join would create 1 internal topic + 1 table.
Which is better in terms of performance and other factors?
The choice is not a question of performance but a question of semantics: what should the join result be? The two joins compute quite different results, so you should pick the semantics that meet your application's needs.
The different semantics are documented in the CP docs and the AK wiki:
https://docs.confluent.io/current/streams/developer-guide.html#joining
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
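To make the difference in semantics concrete, here is a deliberately simplified Python model of the two joins. It is a sketch under assumptions, not the Kafka Streams implementation: it ignores retention, grace periods, and out-of-order data, treats the window bounds as inclusive, and uses invented records and hypothetical function names.

```python
def stream_stream_join(left, right, window_ms):
    """Windowed inner join: pair every left/right record with equal key and close timestamps."""
    return [
        (lk, lv, rv)
        for lts, lk, lv in left
        for rts, rk, rv in right
        if lk == rk and abs(lts - rts) <= window_ms
    ]

def stream_table_join(stream, table_updates):
    """Inner join of each stream record against the latest table value at or before its timestamp."""
    out = []
    for ts, key, value in stream:
        current = None
        for uts, ukey, uvalue in table_updates:  # updates are assumed to be time-ordered
            if ukey == key and uts <= ts:
                current = uvalue
        if current is not None:
            out.append((key, value, current))
    return out

# Records as (timestamp_ms, key, value); all values are made up.
clicks = [(1_000, "u1", "click-1"), (4_500, "u1", "click-2")]
profiles = [(0, "u1", "profile-v1"), (4_000, "u1", "profile-v2")]

ss = stream_stream_join(clicks, profiles, window_ms=5_000)  # every pair inside the window
st = stream_table_join(clicks, profiles)                    # one result per click, latest profile only
```

With the same input, the windowed stream-stream model produces four results (each click paired with each profile event), while the stream-table model produces two (each click paired with the profile current at that time), which is why the decision should start from the result you need.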