Join data from 4 topics in the broker using Kafka Streams when updates are not the same in each topic - left join

I am working on a requirement to process data ingested from a SQL data store into a Kafka broker, in 4 different topics corresponding to 4 different tables in the SQL data store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics, aggregate it, and write it back to another topic. That topic will in turn be subscribed to by a consumer that populates a NoSQL data store used to render the UI.
I know Kafka Streams can be used to join topics.
My question is: the data ingested from the SQL data store tables may not always have data for all 4 tables. Only 2 of the tables receive regular updates. One is updated, but not at the same frequency as the other 2. The remaining one is static (a sort of master table).
So I am not sure how we can actually join them with Kafka Streams when the record counts in the topics don't match.
Has anyone faced a similar issue? If so, please share your thoughts/code snippets.

The number of rows doesn't matter at all... why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");
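For reference, a fuller, self-contained sketch of the same pipeline is below. It assumes String keys and values, illustrative topic names (topic1..topic4, result-topic), a broker at localhost:9092, and placeholder joiner/aggregator lambdas; swap in your own serdes and business logic.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class FourTableJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> consumed = Consumed.with(Serdes.String(), Serdes.String());

        // One KTable per source topic: each table always holds the latest value per key,
        // so it does not matter that the topics are updated at different frequencies.
        KTable<String, String> t1 = builder.table("topic1", consumed);
        KTable<String, String> t2 = builder.table("topic2", consumed);
        KTable<String, String> t3 = builder.table("topic3", consumed);
        KTable<String, String> t4 = builder.table("topic4", consumed);

        // Chain the joins; the join result is re-emitted whenever any side changes.
        KTable<String, String> joined = t1
                .join(t2, (v1, v2) -> v1 + "|" + v2)
                .join(t3, (v12, v3) -> v12 + "|" + v3)
                .join(t4, (v123, v4) -> v123 + "|" + v4);

        // Re-group (here simply by the existing key) and aggregate; a KGroupedTable
        // aggregation needs both an adder and a subtractor.
        joined.groupBy((key, value) -> KeyValue.pair(key, value),
                        Grouped.with(Serdes.String(), Serdes.String()))
              .aggregate(() -> "",
                         (key, value, agg) -> value,  // adder (placeholder logic)
                         (key, value, agg) -> agg,    // subtractor (placeholder logic)
                         Materialized.with(Serdes.String(), Serdes.String()))
              .toStream()
              .to("result-topic", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "four-table-join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
Because each KTable always reflects the latest value per key, the join output simply updates whenever any of the four tables changes, regardless of how often each topic is written to.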

Related

Can't join between stream and stream in ksqldb

I would like to ask about a problem where a stream-stream join in ksqlDB produces no results.
Situations and Problems
I have created two ksqlDB streams from topics containing events from different databases (PostgreSQL, MSSQL) and am joining them on a specific column in both streams.
To keep things simple, I will call the two streams stream1 and stream2, and the join target column target_col.
type             | name    | join target column
stream in ksqldb | stream1 | target_col
stream in ksqldb | stream2 | target_col
The problem is that these two streams produce no results when joined with the query below.
select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes
1. Meeting the joining requirements.
According to the official ksqlDB documentation, the essential conditions for a join are the three co-partitioning requirements below, and I have confirmed that both streams satisfy them.
Co-partitioning Requirements source
1. The input records for the join must have the same key schema.
-> The describe stream1 and describe stream2 commands confirmed that the join key schema of stream1 and stream2 is the same (STRING).
2. The input records must have the same number of partitions on both sides.
-> The partition count for both streams was set to the same value in the CREATE STREAM ~ WITH(PARTITIONS=1, ~ ) statements when the streams were declared. The source topics that the streams subscribe to also each have one partition.
3. Both sides of the join must have the same partitioning strategy.
-> The source topics that the streams subscribe to each have a single partition, so even if the partitioning strategies differed, all records would end up in the same partition; the partitioning strategy should therefore not matter.
2. Time difference between records.
The timestamp and partition number of the records were verified through the pseudocolumns.
The queries used are as follows:
select target_col, rowtime, rowpartition from stream1 emit changes
select target_col, rowtime, rowpartition from stream2 emit changes
When the join key column has the same value, the partition number is the same (e.g. 0), and the record timestamps are no more than 2 seconds apart.
Therefore, I don't think the time window (1 minute) in the query in question (select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes) is the problem.
3. Data Acquisition
Here is how the data for the topics that the two streams subscribe to is obtained.
postgresql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream1
mssql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream2
Because the data comes from different databases, I used the appropriate JDBC driver JAR (mssql-jdbc-7.2.1.jre8.jar, postgresql-42.3.1.jar) for each database with the same Confluent JDBC source connector.
I built the Kafka ecosystem using the official Confluent Docker images (zookeeper, broker, connect, ksqldb-server, ksqldb-cli).
In this situation, please advise on how to solve the join problem.
Thank you.

What is the best practice to enrich data when synchronizing data with Kafka Connect

I am thinking about solutions to enrich data from Kafka.
Currently I am using the MongoDB Kafka Connector to sync all changes to Kafka. The connector uses change streams to watch the oplog and publish changes to Kafka. The relationship between a Mongo collection and a Kafka topic is 1:1.
On the consumer side, when it pulls data, it gets a reference id that we need to join against another collection to get the full data.
To join data between collections, I see the 2 solutions below.
When the consumer pulls data, it goes back to the Mongo database to fetch the data or join the collections using the reference key.
With this approach, I am concerned about the number of connections/round trips back to the Mongo database.
Use Kafka Streams to join data across topics.
For the second solution, I would like to know how to keep that master data in the topics forever, and how to maintain records in topics like DB tables, so that each row has a unique index and, when a change arrives in the topic, we can update the record.
If you have any other solutions, please let me know.
Your consumer can do whatever it wants. You may need to increase various Kafka timeout configs depending on your database lookups, though.
Kafka topics can be retained infinitely with retention.ms=-1, or by compaction. When you use compaction, the topic acts similarly to a KV store (but as a log). To get an actual lookup store, you can build a KTable, then join a topic's stream against it.
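As a minimal sketch, here is how such a "master data" topic could be created with the Java AdminClient; the topic name, partition count, replication factor, and broker address are illustrative assumptions. Once the topic is compacted, reading it with builder.table(...) in Kafka Streams gives you the KTable lookup store mentioned above (a concrete KStream-KTable join sketch appears further down this page).
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class MasterTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key (KV-store-like);
            // alternatively, retention.ms=-1 keeps every record forever on a delete-policy topic.
            NewTopic masterTopic = new NewTopic("master-data", 3, (short) 1)
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(masterTopic)).all().get();
        }
    }
}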
This page covers various join patterns in Kafka Streams - https://developer.confluent.io/learn-kafka/kafka-streams/joins/
You can also use ksqlDB

Enrich a Kafka stream from different topics using different keys

How can I join a stream with another topic that has different keys?
All topics for all tables/streams below are sourced from DB tables using Kafka Connect.
My app is a Spring Cloud Stream application with Kafka, written as a single consumer in a group so that I can consume all partitions of each topic.
I have a stream like this:
Stream S1 from topic S has 3 partitions:
key: keyval|someval, val1: "keyval", val2: "someval"
key: keyval|someval1, val1: "keyval", val2: "someval1"
and another topic like this
T1 w/ 6 partitions:
key: tabval|keyval, val1: "tabval", val2: "keyval", val3: "someval"
key: tabval1|keyval1, val1: "tabval1", val2: "keyval", val3: "someval"
Here tabval|keyval is the key, built from two foreign keys (from 2 other tables) on the actual DB table this topic is loaded from.
I have tried reading T1 as a GlobalKTable and doing a leftJoin() with stream S1, but I cannot do that because S1 doesn't have the tabval part of T1's key.
If I instead key T1 by keyval|someval, I will only get the latest write for that key when I make T1 a GlobalKTable and consume it; I guess this would also happen if I just use it as a KTable?
But I need all records from T1 that match the keyval|someval combination from topic S, not just the last update.
Should I consume T1 as a table, convert it to a stream, and then re-key and merge based on a window()?
I have to consume all partitions from both topics in this consumer so the lookup works and, in the end, I have all the data from T1.
How can I achieve this?
Thank you
Here is a gist of how joins work in Kafka between the different data abstractions (KStream/KTable/GlobalKTable), which you need to understand before implementing this.
Join co-partitioning requirements.
Input data must be co-partitioned when joining.
This ensures that input records with the same key, from both sides of the join, are delivered to the same stream task during processing.
It is the responsibility of the user to ensure data co-partitioning when joining.
Consider using global tables (GlobalKTable) for joining because they do not require data co-partitioning.
Why is data co-partitioning required?
KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key).
It is required that the input streams/tables of a join are co-partitioned by key.
KStream-GlobalKTable joins are exempt, as co-partitioning is not required: all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. (Ref: https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/src/main/java/io/confluent/examples/streams/GlobalKTablesExample.java)
A KeyValueMapper also allows for non-key based joins from the KStream to the GlobalKTable.
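A minimal sketch of such a non-key KStream-GlobalKTable join is below; the topic names (orders, customers, orders-enriched), the comma-delimited String value format, and the broker address are hypothetical, chosen only to show the KeyValueMapper mechanism.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GlobalKTableNonKeyJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Stream of orders; the value is assumed to be "customerId,amount" (illustrative format).
        KStream<String, String> orders = builder.stream("orders",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Global table of customers keyed by customerId; every instance holds a full copy,
        // so no co-partitioning with the orders stream is required.
        GlobalKTable<String, String> customers = builder.globalTable("customers",
                Consumed.with(Serdes.String(), Serdes.String()));

        // The KeyValueMapper (2nd argument) derives the GlobalKTable lookup key from the
        // stream record's value, which is what enables a non-key join.
        KStream<String, String> enriched = orders.join(
                customers,
                (orderKey, orderValue) -> orderValue.split(",")[0],   // customerId from the value
                (orderValue, customerValue) -> orderValue + "|" + customerValue);

        enriched.to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "globalktable-nonkey-join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
Note that the KeyValueMapper must still be able to derive the GlobalKTable's full key from the stream record (its key or value); if that key simply isn't present on the stream side, the lookup cannot be performed.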
Ensuring data co-partitioning
If the inputs of a join are not co-partitioned yet, we must ensure it manually by following the below procedure.
Identify the input KStream/KTable in the join whose underlying Kafka topic has the smaller number of partitions. Let’s call this stream/table “SMALLER”, and the other side of the join “LARGER”.
Pre-create a new Kafka topic for “SMALLER” that has the same number of partitions as “LARGER”. Let’s call this new topic “repartitioned-topic-for-smaller”.
Within your application, re-write the data of “SMALLER” into the new Kafka topic. You must ensure that, when writing the data with to or through, the same partitioner is used as for “LARGER”.
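A minimal sketch of this manual co-partitioning procedure is below, assuming String keys/values, hypothetical topic names, a pre-created repartitioned-topic-for-smaller with the same partition count as the LARGER topic, and Kafka Streams 3.0+ for JoinWindows.ofTimeDifferenceWithNoGrace.
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class CoPartitioningSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> consumed = Consumed.with(Serdes.String(), Serdes.String());

        // "SMALLER" has fewer partitions than "LARGER"; write it into a pre-created topic
        // that has the same partition count as "LARGER", using the default partitioner
        // on both sides so the data ends up co-partitioned.
        KStream<String, String> smaller = builder.stream("smaller-topic", consumed);
        smaller.to("repartitioned-topic-for-smaller",
                Produced.with(Serdes.String(), Serdes.String()));

        // Read the repartitioned data back; now both sides can be joined by key.
        KStream<String, String> smallerRepartitioned =
                builder.stream("repartitioned-topic-for-smaller", consumed);
        KStream<String, String> larger = builder.stream("larger-topic", consumed);

        KStream<String, String> joined = smallerRepartitioned.join(
                larger,
                (smallValue, largeValue) -> smallValue + "|" + largeValue,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        joined.to("join-output", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "co-partitioning-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
Newer Kafka Streams versions also offer KStream#repartition(Repartitioned), which creates and manages the repartition topic for you instead of the manual write/read-back in steps 2 and 3.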
Once you understand these prerequisites clearly and have checked them against your requirement, you can find appropriate samples here: https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/README.md (just search for 'join').

Kafka Joins - How can we achieve it?

I have the below topics in Kafka (10 partitions each) and want to join them as in the example below. Could someone please advise how this can be done?
Topic:"orders"
Topic: "referencedata"
Joined Topic:"merged_ref_orders"
Basically, we are loading all of the orders and reference data into a hash map and then joining them. This causes a lot of performance issues when there are many orders. I came across KTable/GlobalKTable but am not sure how they operate internally.
Would a KTable/GlobalKTable load all the data into heap memory? If so, it won't help keep the Java memory usage low.
Could someone please advise how this scenario can be handled?
I suggest turning the referencedata topic into a KTable keyed on tickerString. This means you will have a materialized view that is auto-updated when new reference data arrives in the topic, and that you can query the KTable by key (i.e. tickerString). Then you can create a KStream consuming from the orders topic, join this stream to the referencedata KTable, and push the resulting join to a new topic (i.e. merged_ref_orders).
Key both the referencedata KTable and the orders topic with the same key (i.e. tickerString), so that you do not have to use a GlobalKTable; a regular KTable (much more scalable) will be enough. A KTable backed by RocksDB keeps your memory usage very low, since it works on the partitioned data and stores it locally to your JVM in RocksDB.
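A minimal sketch of that topology, assuming String keys (tickerString) and String values, plus a broker at localhost:9092; the value formats and the joiner are placeholders to adapt to your actual order and reference-data records.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrdersEnrichmentSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> consumed = Consumed.with(Serdes.String(), Serdes.String());

        // Reference data as a KTable keyed on tickerString, backed by RocksDB:
        // only the partitions assigned to this instance are kept locally.
        KTable<String, String> referenceData = builder.table("referencedata", consumed);

        // Orders as a KStream, also keyed on tickerString (co-partitioned with the table).
        KStream<String, String> orders = builder.stream("orders", consumed);

        // Enrich each order with the latest reference data for its ticker and write it out.
        orders.join(referenceData, (order, reference) -> order + "|" + reference)
              .to("merged_ref_orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enrichment-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}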
Two official Confluent examples of how to join/enrich a KStream with a KTable:
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java

Can't consume messages from topic partitions in ClickHouse

I am getting started with Kafka and I want to know how I can consume messages from the partitions of a topic into ClickHouse tables.
When I had 3 topics, it was easy to attach a table to each topic:
ENGINE = Kafka SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'topic1',
kafka_group_name = 'kafka_group',
kafka_format = 'JSONEachRow'
But I don't know how to consume messages from the partitions of a single topic into separate tables. Please help.
There are multiple ways you can do that.
Keep an identifier in your message, as below. In your consumer you can read the table attribute and decide which table to save the data to.
{
  "table": "Table1"
}
Kafka's default partitioner routes messages by key, so you can use the key to control which partition a message lands in. For example, use keys with three values: 1, 2, 3. When a message is produced for Table1, use key 1. That way the message will go to only one partition, and the consumer for that partition can save the data into Table1.
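A minimal producer sketch of that key-based routing, assuming a hypothetical topic name (topic1), a JSON String payload, and a broker at localhost:9092:
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedRoutingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records destined for Table1 use key "1"; the default partitioner hashes
            // the key, so they all land in the same partition, and the consumer attached
            // to that partition writes them into Table1.
            producer.send(new ProducerRecord<>("topic1", "1", "{\"id\": 42, \"name\": \"example\"}"));
            producer.flush();
        }
    }
}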
Personally I'd prefer method 1, as it doesn't couple Kafka partitioning with your business logic.