I would like to ask about a stream-stream join in ksqlDB that does not produce any joined rows.
Situations and Problems
I created two ksqlDB streams from Kafka topics containing events from different databases (PostgreSQL, MSSQL), and I am joining them on a specific column in each stream.
To make this easier to follow, I will call the streams stream1 and stream2, and the join column target_col.
type             | name    | join target column
stream in ksqldb | stream1 | target_col
stream in ksqldb | stream2 | target_col
The problem is that the query below returns no joined rows.
select * from stream1 join stream2 within 1 minutes on stream1.target_col = stream2.target_col emit changes;
1. Meeting the joining requirements.
According to the official ksqlDB documentation, a join requires co-partitioning, which comes down to the following three conditions. I confirmed that both streams satisfy them.
Co-partitioning Requirements source
1. The input records for the join must have the same key schema.
-> The describe stream1 and describe stream2 commands confirmed that the join key of both streams has the same schema (string).
2. The input records must have the same number of partitions on both sides.
-> Both streams were declared with the same partition count in the CREATE STREAM statement (CREATE STREAM ~ WITH(PARTITIONS=1, ~ )). The source topic that each stream reads from also has one partition.
3. Both sides of the join must have the same partitioning strategy.
-> Each source topic has a single partition, so every record lands in that partition whatever the partitioning strategy is. A difference in partitioning strategy therefore should not matter.
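The single-partition argument can be checked with a small model. This is only a sketch in plain Java, not the Kafka client: the class name, the sample keys, and the hash function are assumptions (Kafka's default partitioner uses a murmur2 hash), but any "hash modulo partition count" scheme behaves the same way when the count is 1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PartitionSketch {
    // Stand-in for a key-based partitioner: hash the serialized key, then take it
    // modulo the partition count. The hash used here is illustrative only.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {"a-100", "b-200", "c-300"}; // hypothetical join key values
        for (String key : keys) {
            // With one partition the result is always 0, whatever hash is used.
            System.out.println(key + " -> partition " + partitionFor(key, 1));
        }
    }
}
```

Note that this only covers where records land. It says nothing about whether the serialized keys of the two streams are byte-for-byte comparable.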
2. Time difference between records.
I verified the timestamp and partition number of each record through the pseudocolumns.
The queries used are as follows:
select target_col, rowtime, rowpartition from stream1 emit changes;
select target_col, rowtime, rowpartition from stream2 emit changes;
For records with the same join key value, the partition number is the same (e.g. 0) and the record timestamps are no more than 2 seconds apart.
Therefore, I do not think the 1-minute window in the query in question (select * from stream1 join stream2 within 1 minutes on stream1.target_col = stream2.target_col emit changes) is the problem.
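The window reasoning can be written down as a one-line predicate. This is a minimal sketch in plain Java, assuming the window is symmetric and that the boundary is inclusive; the class name and the timestamp values are hypothetical.

```java
import java.time.Duration;

class WindowSketch {
    // True when two record timestamps (epoch millis) are no further apart than the window.
    // Treating the boundary as inclusive is an assumption of this sketch.
    static boolean withinWindow(long leftTs, long rightTs, Duration window) {
        return Math.abs(leftTs - rightTs) <= window.toMillis();
    }

    public static void main(String[] args) {
        long stream1Rowtime = 1_700_000_000_000L;      // hypothetical ROWTIME from stream1
        long stream2Rowtime = stream1Rowtime + 2_000L; // matching record 2 seconds later
        // A 2-second gap is well inside a 1-minute window, so this prints true.
        System.out.println(withinWindow(stream1Rowtime, stream2Rowtime, Duration.ofMinutes(1)));
    }
}
```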
3. Data Acquisition
Here is how the data reaches the topics that the two streams read from.
postgresql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream1
mssql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream2
Because I use data from different databases, I installed the appropriate JDBC driver jar for each database (mssql-jdbc-7.2.1.jre8.jar, postgresql-42.3.1.jar) for the same connector.
I built the Kafka ecosystem using the official Confluent Docker images (zookeeper, broker, connect, ksqldb-server, ksqldb-cli).
Given this setup, please advise on how to resolve the join problem.
Thank you.
Related
How can I join a stream with another topic with different keys?
All topics for all tables/streams below are being sourced by DB tables using Kafka connect.
My app is a Spring Cloud Stream with Kafka written as a single consumer in a group, so that I can consume all partitions per topic.
I have a stream like this:
Stream S1 from topic S has 3 partitions:
key: keyval|someval, val1: "keyval", val2: "someval"
key: keyval|someval1, val1: "keyval", val2: "someval1"
and another topic like this
T1 w/ 6 partitions:
key: tabval|keyval, val1: "tabval", val2: "keyval", val3: "someval"
key: tabval1|keyval1, val1: "tabval1", val2: "keyval", val3: "someval"
Here tabval|keyval is the key, built from two foreign keys to other tables on the actual DB table this topic is loaded from.
I tried reading T1 as a GlobalKTable and doing a leftJoin() with stream S1, but that does not work because S1 lacks part of T1's key, i.e. tabval.
If I instead key T1 by keyval|someval, I will only get the latest write for that key when I consume T1 as a GlobalKTable. I guess the same would happen if I used a plain KTable?
But I need all records from T1 that match keyval|someval combo from the topic S and not just the last update.
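The difference between "latest write per key" and "all records per key" can be shown with a small model. This is a sketch in plain Java, not Kafka Streams code: the class and method names are hypothetical, and the rows are the two sample T1 records above written as {val1, val2, val3}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RekeySketch {
    // The two sample T1 rows from the question, as {val1, val2, val3}.
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"tabval", "keyval", "someval"},
            new String[] {"tabval1", "keyval", "someval"});
    }

    // New key in the shape S1 uses: keyval|someval.
    static String newKey(String[] row) {
        return row[1] + "|" + row[2];
    }

    // Table semantics: a later row with the same key overwrites the earlier one.
    static Map<String, String> latestPerKey(List<String[]> rows) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] row : rows) {
            table.put(newKey(row), row[0]);
        }
        return table;
    }

    // Stream-like semantics: every row is kept, grouped under its new key.
    static Map<String, List<String>> allPerKey(List<String[]> rows) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(newKey(row), k -> new ArrayList<>()).add(row[0]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(latestPerKey(sampleRows())); // {keyval|someval=tabval1}
        System.out.println(allPerKey(sampleRows()));    // {keyval|someval=[tabval, tabval1]}
    }
}
```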
Should I consume T1 as a table, convert it to a stream, and then re-key and merge based on a window()?
I have to consume all partitions from both topics in this consumer so that the lookup works and, in the end, I have all the data from T1.
How can I achieve this?
Thank you
Here is a gist of how joins work in Kafka between the different data abstractions (KStream/KTable/GlobalKTable), which you need to understand before you implement one.
Join co-partitioning requirements.
Input data must be co-partitioned when joining.
This ensures that input records with the same key, from both sides of the join, are delivered to the same stream task during processing.
It is the responsibility of the user to ensure data co-partitioning when joining.
Consider using global tables (GlobalKTable) for joining because they do not require data co-partitioning.
Why is data co-partitioning required?
KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key).
It is required that the input streams/tables of a join are co-partitioned by key.
KStream-GlobalKTable joins are exempt: co-partitioning is not required because all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. (Ref : https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/src/main/java/io/confluent/examples/streams/GlobalKTablesExample.java)
A KeyValueMapper also allows for non-key based joins from the KStream to the GlobalKTable.
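To illustrate the idea of a non-key based lookup, here is a minimal model in plain Java. It is not the Kafka Streams API: the fully replicated table is stood in for by a Map, the KeyValueMapper by a Function, and the class name and sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class GlobalLookupSketch {
    // Left-joins each stream value against a fully replicated lookup map.
    // The keyMapper derives the lookup key from the stream value, so the
    // stream record's own key plays no role in the join.
    static List<String> leftJoin(List<String> streamValues,
                                 Map<String, String> globalTable,
                                 Function<String, String> keyMapper) {
        List<String> joined = new ArrayList<>();
        for (String value : streamValues) {
            String match = globalTable.get(keyMapper.apply(value)); // null when there is no entry
            joined.add(value + " -> " + match);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> globalTable = Map.of("keyval", "tableRow"); // hypothetical table content
        List<String> streamValues = List.of("keyval|someval", "other|x");
        // Look up by the part of the value before the '|' separator.
        System.out.println(leftJoin(streamValues, globalTable, v -> v.split("\\|")[0]));
        // prints [keyval|someval -> tableRow, other|x -> null]
    }
}
```

Because every instance holds the whole map, the lookup works no matter which partition the stream record came from.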
Ensuring data co-partitioning
If the inputs of a join are not co-partitioned yet, we must ensure it manually by following the below procedure.
Identify the input KStream/KTable in the join whose underlying Kafka topic has the smaller number of partitions. Let’s call this stream/table “SMALLER”, and the other side of the join “LARGER”.
Pre-create a new Kafka topic for “SMALLER” that has the same number of partitions as “LARGER”. Let’s call this new topic “repartitioned-topic-for-smaller”.
Within your application, re-write the data of “SMALLER” into the new Kafka topic. You must ensure that, when writing the data with to or through, the same partitioner is used as for “LARGER”.
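The effect of the procedure above can be sketched in plain Java. This is a model only, not Kafka code: the partitioner is a stand-in based on String.hashCode(), and the class name, keys, and partition counts (3 and 6) are assumptions chosen to match the example in the question.

```java
import java.util.ArrayList;
import java.util.List;

class RepartitionSketch {
    // Stand-in partitioner shared by both sides of the join (illustrative hash only).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Redistributes keys into targetPartitions buckets using the shared partitioner,
    // which is what rewriting SMALLER into "repartitioned-topic-for-smaller" achieves.
    static List<List<String>> repartition(List<String> keys, int targetPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < targetPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, targetPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("keyval", "keyval1", "keyval2"); // hypothetical keys
        int smaller = 3;
        int larger = 6;
        for (String key : keys) {
            // Before repartitioning, the same key can map to different partition indexes.
            System.out.println(key + ": " + partitionFor(key, smaller) + " of " + smaller
                + ", " + partitionFor(key, larger) + " of " + larger);
        }
        // After repartitioning, each key sits at the index LARGER uses for it.
        System.out.println(repartition(keys, larger));
    }
}
```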
Once you understand these prerequisites, check whether they match your requirement. If they do, you can find appropriate samples here: https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/README.md (just search for 'join')
We want to use GlobalKTable in Kafka streams application. Input topics(KTable/KStream) have N partitions and a GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (which are the sources of the KTable/KStream)?
As I understand it, the answer is no (there is no such restriction, and the topic may have M partitions where N > M), because the GlobalKTable is fully loaded in each instance of the stream application, so co-partitioning is not required for the KStream join operation. But I would like confirmation from the experts!
Thank you!
No. The number of partitions of the topics behind a KStream and a GlobalKTable that are joined can differ.
From the Kafka Streams developer guide:
At a high-level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements
More accurately:
Why is data co-partitioning required? Because KStream-KStream,
KTable-KTable, and KStream-KTable joins are performed based on the
keys of records (e.g., leftRecord.key == rightRecord.key), it is
required that the input streams/tables of a join are co-partitioned by
key.
The only exception are KStream-GlobalKTable joins. Here,
co-partitioning is not required because all partitions of the
GlobalKTable's underlying changelog stream are made available to each
KafkaStreams instance, i.e. each instance has a full copy of the
changelog stream. Further, a KeyValueMapper allows for non-key based
joins from the KStream to the GlobalKTable.
I have an inner KStream-KTable join with the following sequence of messages:
table_evt_at_t1 --> stream_evt_at_t2 --> table_evt_at_t3 --> stream_evt_at_t4
the join triggers:
(stream_evt_at_t2, table_evt_at_t1) + (stream_evt_at_t4, table_evt_at_t3)
So far, everything ok.
The unexpected result comes up when I reset the stream application (with kafka-streams-application-reset.sh) and replay all the events:
(stream_evt_at_t2, table_evt_at_t3) + (stream_evt_at_t4, table_evt_at_t3)
It seems that Kafka Streams doesn't take the timestamps into account when processing the events. It populates the KTable first and then processes the KStream, so both KStream events get the last value of the KTable (table_evt_at_t3).
Note that I am using Kafka Streams 2.3.1, a custom TimestampExtractor, and the property max.task.idle.ms = 10 * 1000L, as KIP-353 suggests.
Is this the expected behaviour?
The first join result is expected behavior, since KStream-KTable joins are not windowed but timestamped.
The result after a reset/replay is also expected behavior: a KTable keeps only the latest value for a given key, and table_evt_at_t3 is that latest value (table_evt_at_t1 has already been overwritten).
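Both outcomes can be reproduced with a small model of a stream-table join. This is a sketch in plain Java, not Kafka Streams: the class name is hypothetical, all events are assumed to share one key, and the only thing that differs between the two runs is the order in which events are processed. It does not explain why a given run of the real application ends up with one order or the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StreamTableReplaySketch {
    // One input event; all events are assumed to share the same key.
    static final class Event {
        final boolean isTable;
        final String label;

        Event(boolean isTable, String label) {
            this.isTable = isTable;
            this.label = label;
        }
    }

    // Inner stream-table join: a table event replaces the current table value,
    // a stream event is paired with whatever table value is current at that moment.
    static List<String> join(List<Event> processingOrder) {
        String latestTableValue = null;
        List<String> results = new ArrayList<>();
        for (Event e : processingOrder) {
            if (e.isTable) {
                latestTableValue = e.label;
            } else if (latestTableValue != null) {
                results.add("(" + e.label + ", " + latestTableValue + ")");
            }
        }
        return results;
    }

    // Events interleaved by timestamp, as in the original run.
    static List<Event> timestampOrder() {
        return Arrays.asList(
            new Event(true, "table_evt_at_t1"), new Event(false, "stream_evt_at_t2"),
            new Event(true, "table_evt_at_t3"), new Event(false, "stream_evt_at_t4"));
    }

    // Table fully loaded before any stream event, as observed after the reset.
    static List<Event> tableFirstOrder() {
        return Arrays.asList(
            new Event(true, "table_evt_at_t1"), new Event(true, "table_evt_at_t3"),
            new Event(false, "stream_evt_at_t2"), new Event(false, "stream_evt_at_t4"));
    }

    public static void main(String[] args) {
        System.out.println(join(timestampOrder()));
        // [(stream_evt_at_t2, table_evt_at_t1), (stream_evt_at_t4, table_evt_at_t3)]
        System.out.println(join(tableFirstOrder()));
        // [(stream_evt_at_t2, table_evt_at_t3), (stream_evt_at_t4, table_evt_at_t3)]
    }
}
```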
Problem: I have a table in an external database containing the Kafka events I polled from the Kafka bus the last time. For every event, the table stores the composite primary key PK(topic, partition, offset).
So for every topic and partition I can easily determine the latest event.
Now I would love to do a select like this:
SELECT event
FROM topic
WHERE event.partition = partition0 AND event.offset > partition0.offset
OR event.partition = partition1 AND event.offset > partition1.offset
...
And of course I would love that the statement returns immediately with all events currently in the queue, writing the result into an HDFS-File.
How would I do that with KSQL?
N.B.: Of course I would love to put all partitions with their corresponding offsets as pairs into an array and use that in the where clause ... that would be a premium solution.
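The intended filter, with the (partition, offset) pairs passed in as one collection, can be modelled in plain Java. This is not KSQL and makes no claim about what KSQL supports: the class name, the Event type, and the sample data are hypothetical. As in the WHERE clause above, partitions without a stored offset are not matched at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class OffsetFilterSketch {
    // Minimal stand-in for a consumed record.
    static final class Event {
        final int partition;
        final long offset;
        final String payload;

        Event(int partition, long offset, String payload) {
            this.partition = partition;
            this.offset = offset;
            this.payload = payload;
        }
    }

    // Hypothetical events currently in the topic.
    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event(0, 0, "p0-o0"), new Event(0, 1, "p0-o1"), new Event(0, 2, "p0-o2"),
            new Event(1, 0, "p1-o0"), new Event(1, 1, "p1-o1"),
            new Event(2, 0, "p2-o0"));
    }

    // Returns payloads of events whose offset is greater than the last stored
    // offset of their partition. Partitions absent from lastOffsets never match.
    static List<String> newerThan(List<Event> events, Map<Integer, Long> lastOffsets) {
        List<String> result = new ArrayList<>();
        for (Event e : events) {
            Long last = lastOffsets.get(e.partition);
            if (last != null && e.offset > last) {
                result.add(e.payload);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Last stored offsets: partition 0 -> 1, partition 1 -> 0.
        System.out.println(newerThan(sampleEvents(), Map.of(0, 1L, 1, 0L))); // [p0-o2, p1-o1]
    }
}
```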
I am working on a requirement to process data ingested from a SQL Data store to Kafka Broker in 4 different topics corresponding to 4 different tables in the SQL Data Store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics and aggregate them and write them back to another topic. This topic will in turn be subscribed by a consumer to populate a NOSQL Data store which will be used to render the UI.
I know Kafka Streams can be used to join topics.
My query is, the data being ingested from SQL Data store tables may not always have data for all the 4 tables. Only 2 of the tables will have regular updates. One will get updated but not in the same frequency as the other 2. The remaining one is a static (sort of master table).
So, I am not sure how we can actually join them with Kafka Streams when the record counts will mismatch in topics.
Has anyone faced a similar issue? If so, can you please share your thoughts or code snippets?
The number of rows doesn't matter at all. Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
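KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
// A KTable is regrouped with groupBy(...), not groupByKey(); the aggregate is itself
// a KTable, so convert it to a stream before writing it to the output topic.
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");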