Enrich Kafka stream from different topics using different keys - apache-kafka

How can I join a stream with another topic with different keys?
All topics for all tables/streams below are being sourced by DB tables using Kafka connect.
My app is a Spring Cloud Stream app with Kafka, written as a single consumer in a group so that I can consume all partitions per topic.
I have a stream like this:
Stream S1 from topic S has 3 partitions:
key: keyval|someval, val1: "keyval", val2: "someval"
key: keyval|someval1, val1: "keyval", val2: "someval1"
and another topic like this:
Topic T1 with 6 partitions:
key: tabval|keyval, val1: "tabval", val2: "keyval", val3: "someval"
key: tabval1|keyval1, val1: "tabval1", val2: "keyval", val3: "someval"
Here tabval|keyval is the key, built from two foreign keys (referencing two other tables) on the actual DB table this topic is loaded from.
I have tried reading T1 as a GlobalKTable and doing a leftJoin() with stream S1, but I cannot, as S1 doesn't have the tabval part of T1's key.
Now if I rekey T1 using keyval|someval, I will only get the latest write for that key when I make T1 a GlobalKTable and consume it; I guess this would happen even if I just use it as a KTable?
But I need all records from T1 that match the keyval|someval combo from topic S, not just the last update.
Should I consume T1 as a table, convert it to a stream, and then rekey and merge based on a window()?
I have to consume all partitions from both topics in this consumer so the lookup works and, in the end, I have all the data from T1.
How can I achieve this, please?
Thank you

Here is a gist of how joins work in Kafka between the different data abstractions (KStream/KTable/GlobalKTable), which you need to understand before implementing one.
Join co-partitioning requirements
Input data must be co-partitioned when joining.
This ensures that input records with the same key, from both sides of the join, are delivered to the same stream task during processing.
It is the responsibility of the user to ensure data co-partitioning when joining.
Consider using global tables (GlobalKTable) for joining because they do not require data co-partitioning.
Why is data co-partitioning required?
KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key).
It is required that the input streams/tables of a join are co-partitioned by key.
KStream-GlobalKTable joins are exempt: co-partitioning is not required because all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. (Ref: https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/src/main/java/io/confluent/examples/streams/GlobalKTablesExample.java)
A KeyValueMapper also allows for non-key based joins from the KStream to the GlobalKTable.
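For illustration, a minimal sketch of such a non-key KStream-GlobalKTable join, with the topic names, serdes, and the way the lookup key is derived from the stream value all assumed:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("stream-topic");
GlobalKTable<String, String> globalTable = builder.globalTable("table-topic");

// The KeyValueMapper (2nd argument) derives the GlobalKTable lookup key from
// the stream record, so the stream's own key need not match the table's key.
KStream<String, String> enriched = stream.leftJoin(
    globalTable,
    (streamKey, streamValue) -> streamValue,                     // assumed: the value holds the lookup key
    (streamValue, tableValue) -> streamValue + "|" + tableValue  // tableValue may be null on a left join
);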
Ensuring data co-partitioning
If the inputs of a join are not yet co-partitioned, you must ensure it manually with the following procedure (a sketch follows the steps).
1. Identify the input KStream/KTable in the join whose underlying Kafka topic has the smaller number of partitions. Let's call this stream/table "SMALLER", and the other side of the join "LARGER".
2. Pre-create a new Kafka topic for "SMALLER" that has the same number of partitions as "LARGER". Let's call this new topic "repartitioned-topic-for-smaller".
3. Within your application, re-write the data of "SMALLER" into the new Kafka topic. You must ensure that, when writing the data with to or through, the same partitioner is used as for "LARGER".
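A minimal sketch of this procedure, where the topic names and the join-key derivation are assumptions for illustration ("repartitioned-topic-for-smaller" must be pre-created with the same partition count as LARGER's topic):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> smaller = builder.stream("smaller-topic");

smaller
    .selectKey((key, value) -> value.split("\\|")[0]) // assumed: the join key is the first field of the value
    .to("repartitioned-topic-for-smaller");           // uses the default partitioner, same as for LARGER

// Read the rekeyed data back; it is now co-partitioned with LARGER and can be joined.
KStream<String, String> smallerCopartitioned = builder.stream("repartitioned-topic-for-smaller");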
Once you understand these prerequisites clearly and have matched them against your requirement, you can find appropriate samples here: https://github.com/confluentinc/kafka-streams-examples/blob/ce9be56c214914cc9b1342c32fd59c06141e500a/README.md (just search for 'join').

Related

Kafka Joins - How can we achieve it?

I have the below topics in Kafka (10 partitions each) and want to join them as given in the example below. Could someone please advise how this can be done?
Topic:"orders"
Topic: "referencedata"
Joined Topic:"merged_ref_orders"
Basically, we are loading all of the orders and reference data into a hash map and then joining them. This is causing lots of performance issues when there are many orders. I came across KTable/GlobalKTable but am not sure how they operate internally.
Would a KTable/GlobalKTable load all the data into heap memory? If so, it won't help keep the Java memory low.
Could someone please advise how this scenario can be handled?
I suggest turning the referencedata topic into a KTable keyed on tickerString. This gives you a materialised view that is auto-updated when new referencedata arrives in the topic. It also means that you can query the KTable using the key (i.e. tickerString). Then you can create a KStream consuming from the orders topic, join this stream to the referencedata KTable, and push the resulting join to a new topic (i.e. merged_ref_orders), as in the sketch below.
Key both the referencedata KTable and the orders topic using the same key (i.e. tickerString), so that you do not have to use a GlobalKTable. A regular KTable (much more scalable) will be enough. Using a KTable backed by RocksDB will keep your memory usage very low, as it works on the partitioned data and stores its local state in RocksDB rather than on the JVM heap.
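A minimal sketch of this suggestion, assuming both topics are already keyed by tickerString and use String serdes (the joiner is illustrative only):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Materialised, auto-updated view of the reference data, keyed by tickerString
KTable<String, String> referenceData = builder.table("referencedata");

// Stream of orders, also keyed by tickerString (co-partitioned with the table)
KStream<String, String> orders = builder.stream("orders");

// Each order is enriched with the latest reference data for its ticker
orders
    .join(referenceData, (order, refData) -> order + "|" + refData)
    .to("merged_ref_orders");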
Two official Confluent examples of how to join/enrich a KStream with a KTable:
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java

Topics generated when we do a foreign key join between two KTables in Kafka Streams

Which topics get generated when we do a foreign-key join between two KTables in Kafka Streams? And what data do they contain?
There are 2 types of intermediate topics which you can observe for Kafka Streams applications: -repartition and -changelog.
-repartition topics are required when your Kafka Streams application invokes an operation that changes the key.
-changelog topics represent a changelog state (when you create KTables) where each data record in the stream captures a state change of the table. Please, see duality-of-streams-and-tables for more information.
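For reference, a minimal sketch of a foreign-key KTable-KTable join (available since Kafka Streams 2.4); the topic names and the CSV layout of the left-hand value are assumptions. Running it and listing the cluster's topics will show the internal -repartition (subscription) and -changelog topics described above:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> orders = builder.table("orders");       // keyed by orderId (assumed)
KTable<String, String> customers = builder.table("customers"); // keyed by customerId (assumed)

KTable<String, String> enriched = orders.join(
    customers,
    orderValue -> orderValue.split(",")[0],       // foreign-key extractor: customerId field (assumed)
    (order, customer) -> order + "|" + customer); // value joiner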

Does Kafka Streams GlobalKTable topic require the same number of partitions as KStream topic which it will be joining with?

We want to use a GlobalKTable in a Kafka Streams application. The input topics (KTable/KStream) have N partitions, and a GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (which are the sources of the KTable/KStream)?
As I understand it, the answer is NO (it is not limited, and the topic may also have M partitions where N > M), because the GlobalKTable is fully loaded in each instance of the stream application and co-partitioning is not required during the KStream join operation. But I need confirmation from the experts!
Thank you!
No. The number of partitions for the KStream and GlobalKTable topics (that are joined) can differ.
From the Kafka Streams developer guide:
At a high level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements
More accurately:
Why is data co-partitioning required? Because KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key), it is required that the input streams/tables of a join are co-partitioned by key.
The only exception are KStream-GlobalKTable joins. Here, co-partitioning is not required because all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. Further, a KeyValueMapper allows for non-key based joins from the KStream to the GlobalKTable.

Can't consume messages from topic partitions in ClickHouse

I am getting started with Kafka and I want to know how I can consume messages from the partitions of a topic into ClickHouse tables, like this:
In the case where I have 3 topics, it is easy to connect a table to each topic:
ENGINE = Kafka SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'topic1',
kafka_group_name = 'kafka_group',
kafka_format = 'JSONEachRow'
But I don't know how to consume messages from the partitions of one topic into separate tables. Please help.
There are multiple ways you can do that:
Keep an identifier in your message, as below. In your consumer you can read the table attribute and decide which table the data should be saved in.
{
  "table": "Table1"
}
Kafka's default partitioner routes messages by key, so you can use the key for that. Let's make the key take one of three values: 1, 2, 3. When a message is produced for Table1, use key 1. That way the message will go to only one partition, and the consumer for that partition can save the data in Table1, as in the sketch below.
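A minimal producer sketch for this key-based routing, with the broker address, topic name, and payload assumed; the default partitioner hashes the key, so every record with key "1" lands in the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Key "1" => one fixed partition => its consumer writes to Table1
    producer.send(new ProducerRecord<>("topic1", "1", "{\"col\":\"val\"}"));
}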
Personally, I'd prefer method 1, as it doesn't couple Kafka processing with your business logic.

Join data from 4 topics in broker using Kafka Streams when updates are not same in each of the topics

I am working on a requirement to process data ingested from a SQL Data store to Kafka Broker in 4 different topics corresponding to 4 different tables in the SQL Data Store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics and aggregate them and write them back to another topic. This topic will in turn be subscribed by a consumer to populate a NOSQL Data store which will be used to render the UI.
I know Kafka Streams can be used to join topics.
My query is: the data being ingested from the SQL data store tables may not always have data for all 4 tables. Only 2 of the tables get regular updates. One gets updated, but not at the same frequency as the other 2. The remaining one is static (a sort of master table).
So, I am not sure how we can actually join them with Kafka Streams when the record counts mismatch across the topics.
Has anyone faced a similar issue? If so, can you please share your thoughts/code snippets on the same.
The number of rows doesn't matter at all... Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
// KTable has groupBy() (not groupByKey()); convert to a stream before writing out
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");
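For completeness, a minimal concrete sketch of the same pipeline, assuming all four topics share the same key, String values, and default String serdes; the join and aggregation logic (simple concatenation) is purely illustrative:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> t1 = builder.table("topic1");
KTable<String, String> t2 = builder.table("topic2");
KTable<String, String> t3 = builder.table("topic3");
KTable<String, String> t4 = builder.table("topic4");

KTable<String, String> joinResult = t1
    .join(t2, (v1, v2) -> v1 + "|" + v2)
    .join(t3, (v12, v3) -> v12 + "|" + v3)
    .join(t4, (v123, v4) -> v123 + "|" + v4);

joinResult
    // regroup by the original key; KGroupedTable aggregations need an adder
    // and a subtractor because table updates can retract old values
    .groupBy((key, value) -> KeyValue.pair(key, value))
    .aggregate(
        () -> "",                                    // initializer
        (key, value, agg) -> agg + value,            // adder
        (key, value, agg) -> agg.replace(value, "")) // subtractor (illustrative only)
    .toStream()
    .to("result-topic");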