Kafka Joins - How can we achieve it?

I have the topics below in Kafka (10 partitions each) and want to join them as shown in the example below. Could someone please advise how this can be done?
Topic:"orders"
Topic: "referencedata"
Joined Topic:"merged_ref_orders"
Currently, we load all of the orders and reference data into a hash map and then join them. This causes serious performance issues when there are many orders. I came across KTable/GlobalKTable, but I am not sure how they operate internally.
Would KTable/GlobalKTable load all the data into heap memory? If so, it won't help keep the Java memory usage low.
Could someone please advise how this scenario can be handled?

I suggest turning the referencedata topic into a KTable keyed on tickerString. This gives you a materialised view that is updated automatically whenever new reference data arrives in the topic, and you can query the KTable by key (i.e. tickerString). Then create a KStream that consumes from the orders topic, join this stream to the referencedata KTable, and write the join result to a new topic (i.e. merged_ref_orders).
Key both the referencedata KTable and the orders topic on the same key (i.e. tickerString), so that you do not have to use a GlobalKTable. A regular KTable (which scales much better) will be enough. A KTable backed by RocksDB keeps heap usage low, because each application instance holds only the partitions assigned to it and stores that state locally in RocksDB rather than on the JVM heap.
Two official Confluent examples of joining/enriching a KStream with a KTable:
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java
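To make the join behaviour described above concrete, here is a minimal plain-Java model of stream-table join semantics. It is only a sketch and does not use the Kafka Streams API: the class and method names are hypothetical, and the HashMap merely stands in for the KTable's state (which in Kafka Streams is partitioned and, by default, kept in RocksDB).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of KStream-KTable inner-join semantics (not the Kafka Streams API).
final class StreamTableJoinSketch {
    // Latest reference value per tickerString; a stand-in for the KTable's state.
    private final Map<String, String> referenceTable = new HashMap<>();
    // Stand-in for the merged_ref_orders output topic.
    final List<String> merged = new ArrayList<>();

    // A referencedata record only upserts the table; it emits no join result.
    void onReferenceData(String ticker, String reference) {
        referenceTable.put(ticker, reference);
    }

    // An order triggers a lookup by key. With inner-join semantics, an order
    // that finds no reference data is dropped.
    void onOrder(String ticker, String order) {
        String reference = referenceTable.get(ticker);
        if (reference != null) {
            merged.add(ticker + ":" + order + "|" + reference);
        }
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onReferenceData("AAPL", "Apple Inc.");
        join.onOrder("AAPL", "buy 100");
        join.onOrder("MSFT", "sell 5"); // no reference data for MSFT: dropped
        join.onReferenceData("AAPL", "Apple Inc. (NASDAQ)");
        join.onOrder("AAPL", "sell 20"); // sees the updated reference value
        join.merged.forEach(System.out::println);
    }
}
```

In the Kafka Streams DSL, the same shape would roughly correspond to joining the orders KStream with the referencedata KTable via KStream#join and writing the result with KStream#to.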

Related

What is the best practice to enrich data when synchronizing data with Kafka Connect

I am thinking about solutions to enrich data from Kafka.
I am currently using Mongo Kafka Connect to sync all changes to Kafka. The connector uses change streams to watch the oplog and publish changes to Kafka. The relationship between Mongo collections and Kafka topics is 1:1.
On the consumer side, when it pulls data, it gets a reference ID that we need to join against another collection to get the data.
To join data between collections, I see two solutions:
1. When consumers pull data, they go back to the Mongo database to fetch the data or join collections by the reference key. With this approach, I am concerned about the number of connections needed to go back to the Mongo database.
2. Use Kafka Streams to join data among topics.
For the second solution, I would like to know how to keep the master data in the topics forever and how to maintain records in topics like database tables, so that each row has a unique index and, when data changes arrive in the topic, the records are updated.
If you have any other solutions, please let me know.

Your consumer can do whatever it wants. You may need to increase various Kafka timeout configs depending on your database lookups, though.
Kafka topics can be retained indefinitely with retention.ms=-1, or by compaction. With compaction, the topic acts similarly to a KV store (but as a log), keeping at least the latest value for each key. To get an actual lookup store, you can build a KTable, then join a topic stream against it.
This page covers various join patterns in Kafka Streams - https://developer.confluent.io/learn-kafka/kafka-streams/joins/
You can also use ksqlDB
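As a rough illustration of why a compacted topic behaves like a table, the following plain-Java sketch replays a keyed log into a map. It models the semantics only; the record layout (a two-element String array of key and value) and the class name are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of reading a keyed changelog as a table (not the Kafka API).
final class CompactedLogSketch {
    // Replays {key, value} records in offset order. A later record for a key
    // overwrites the earlier one; a null value (a tombstone) deletes the key.
    static Map<String, String> materialize(String[][] log) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0];
            String value = record[1];
            if (value == null) {
                table.remove(key);
            } else {
                table.put(key, value);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] log = {
            {"user-1", "Alice"},
            {"user-2", "Bob"},
            {"user-1", "Alice Smith"}, // update: replaces the first record
            {"user-2", null},          // tombstone: removes user-2
        };
        System.out.println(materialize(log)); // prints {user-1=Alice Smith}
    }
}
```

This is the "unique index per row" behaviour asked about: the record key plays the role of the primary key, and the newest record per key is the current row.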

Does Kafka Streams GlobalKTable topic require the same number of partitions as KStream topic which it will be joining with?

We want to use GlobalKTable in Kafka streams application. Input topics(KTable/KStream) have N partitions and a GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (which are the sources of the KTable/KStream)?
As I understand it, the answer is no (there is no such restriction, and the topic may have M partitions where N > M), because a GlobalKTable is fully loaded in each instance of the stream application and co-partitioning is not required for a KStream-GlobalKTable join. But I need confirmation from the experts!
Thank you!

No. The number of partitions of the topics behind the KStream and the GlobalKTable being joined can differ.
From the Kafka Streams developer guide:
At a high-level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements

More accurately:
Why is data co-partitioning required? Because KStream-KStream,
KTable-KTable, and KStream-KTable joins are performed based on the
keys of records (e.g., leftRecord.key == rightRecord.key), it is
required that the input streams/tables of a join are co-partitioned by
key.
The only exception are KStream-GlobalKTable joins. Here,
co-partitioning is not required because all partitions of the
GlobalKTable's underlying changelog stream are made available to each
KafkaStreams instance, i.e. each instance has a full copy of the
changelog stream. Further, a KeyValueMapper allows for non-key based
joins from the KStream to the GlobalKTable.
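A small sketch may help show why the partition count matters for key-based joins but not for a GlobalKTable. It assumes a simplified partitioner (String.hashCode modulo the partition count); Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, so the concrete partition numbers here are illustrative only.

```java
// Plain-Java illustration of co-partitioning; the hash function is a simplification.
final class CoPartitioningSketch {
    // Simplified stand-in for a partitioner: key hash modulo partition count.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // True when both topics place the key in the same partition number, so the
    // task that owns that partition sees both sides of the join for this key.
    static boolean sameTask(String key, int leftPartitions, int rightPartitions) {
        return partitionFor(key, leftPartitions) == partitionFor(key, rightPartitions);
    }

    public static void main(String[] args) {
        String key = "a"; // "a".hashCode() is 97
        System.out.println(partitionFor(key, 10)); // prints 7
        System.out.println(partitionFor(key, 4));  // prints 1
        System.out.println(sameTask(key, 10, 10)); // prints true
        System.out.println(sameTask(key, 10, 4));  // prints false
    }
}
```

With 10 and 4 partitions the same key lands in different partition numbers, so a partitioned join would miss matches. A GlobalKTable sidesteps this because every instance reads all of its partitions.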

Reading an already partitioned topic in Kafka Streams DSL

Repartitioning a high-volume topic in Kafka Streams could be very expensive. One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in Streams app.
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Let me clarify my question. Suppose I have a simple aggregation like that (details omitted for brevity):
builder
.stream("messages")
.groupBy((key, msg) -> msg.field)
.count();
Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartitioning topic, this time partitioned by msg.field as the key.
One simple way to make this round-trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about how the messages topic is partitioned, and I've found no way to tell it without causing a real repartition.
Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.
What I'm looking for is basically something like this:
builder
.stream("messages")
.assumeGroupedBy((key, msg) -> msg.field)
.count();
where assumeGroupedBy would mark stream as already partitioned by msg.field. I understand this solution is kind of fragile and would break on partitioning key mismatch, but it solves one of the problems when processing really large volumes of data.

Update after question was updated: If your data is already partitioned as needed, and you simply want to aggregate the data without incurring a repartitioning operation (both are true for your use case), then all you need is to use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed as per the existing message key. In your example, groupByKey() would work if key == msg.field.
Original answer below:
"Repartitioning a high-volume topic in Kafka Streams could be very expensive."
Yes, that's right: it could be very expensive (e.g., when high volume means millions of events per second).
"Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?"
Kafka Streams does not repartition the data unless you instruct it to, e.g., with the KStream#groupBy() function. Hence there is no need to tell it "not to partition" as you say in your question.
"One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in Streams app."
Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?
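To illustrate why groupByKey() is safe only when the record key already equals the grouping field, here is a plain-Java sketch that checks in how many partitions each msg.field value ends up. The Message class and the hashCode-based partitioner are assumptions for this example, not part of Kafka Streams.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration of pre-partitioned input; not the Kafka Streams API.
final class PrePartitionedSketch {
    // Hypothetical message: the record key chosen by the producer, plus msg.field.
    static final class Message {
        final String key;
        final String field;

        Message(String key, String field) {
            this.key = key;
            this.field = field;
        }
    }

    // Simplified stand-in for a partitioner: key hash modulo partition count.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // For every msg.field value, collects the partitions in which it occurs.
    static Map<String, Set<Integer>> partitionsPerField(List<Message> messages, int numPartitions) {
        Map<String, Set<Integer>> result = new HashMap<>();
        for (Message m : messages) {
            result.computeIfAbsent(m.field, f -> new TreeSet<>())
                  .add(partitionFor(m.key, numPartitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Producer keys by msg.field: each field value lives in exactly one partition,
        // so counting per partition already yields complete counts.
        List<Message> keyedByField = Arrays.asList(
            new Message("x", "x"), new Message("y", "y"), new Message("x", "x"));
        System.out.println(partitionsPerField(keyedByField, 3));

        // Producer keys by something else: field "x" is spread over two partitions,
        // so a repartition by msg.field is needed before counting.
        List<Message> keyedByOther = Arrays.asList(
            new Message("a", "x"), new Message("b", "x"));
        System.out.println(partitionsPerField(keyedByOther, 3));
    }
}
```

The first case is the situation in which groupByKey() gives correct results without a repartition topic; the second is the one in which groupBy() and its repartitioning step are required.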

Combining data coming from multiple Kafka topics into a single Kafka topic

I have N Kafka topics, each carrying data and a timestamp. I need to combine them into a single topic in timestamp order, with the data sorted within each partition. I have found one way to do that.
Combine all the Kafka topic data in Cassandra (because of its fast writes) with the clustering order set to DESCENDING. This combines everything, but the limitation is that data arriving late, after the accumulation time window has passed, won't be sorted.
Is there a more appropriate way to do this? If not, is there any way to improve my solution?
Thanks

It is not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time, for each batch of messages.
For example, create a Kafka Streams process that reads from all topics. Create a GlobalKTable and enable Interactive Queries.
When you query, you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it), for example Couchbase or CockroachDB.
Then, when you query those later, run an ORDER BY.
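The client-side sort mentioned above could look roughly like this. It is a minimal sketch in plain Java: the Event class and the idea of one list per source topic are assumptions for illustration, and fetching the batches from Kafka or a state store is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of sorting by timestamp at consumption/query time.
final class ClientSideSortSketch {
    // Hypothetical record shape: an event timestamp plus an opaque payload.
    static final class Event {
        final long timestamp;
        final String data;

        Event(long timestamp, String data) {
            this.timestamp = timestamp;
            this.data = data;
        }
    }

    // Merges the batches read from several topics and orders them by timestamp.
    // List.sort is stable, so events with equal timestamps keep their input order.
    static List<Event> mergeByTimestamp(List<List<Event>> batches) {
        List<Event> all = new ArrayList<>();
        for (List<Event> batch : batches) {
            all.addAll(batch);
        }
        all.sort(Comparator.comparingLong(e -> e.timestamp));
        return all;
    }

    public static void main(String[] args) {
        List<Event> topicA = Arrays.asList(new Event(10L, "a1"), new Event(30L, "a2"));
        // The second event arrived late: it is older than the one before it.
        List<Event> topicB = Arrays.asList(new Event(20L, "b1"), new Event(5L, "b-late"));
        for (Event e : mergeByTimestamp(Arrays.asList(topicA, topicB))) {
            System.out.println(e.timestamp + " " + e.data);
        }
    }
}
```

Because the sort happens on whatever is read at query time, a late event simply takes its place in the next result, which avoids the late-arrival problem of pre-sorted storage.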

Join data from 4 topics in broker using Kafka Streams when updates are not same in each of the topics

I am working on a requirement to process data ingested from a SQL Data store to Kafka Broker in 4 different topics corresponding to 4 different tables in the SQL Data Store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics, aggregate it, and write it back to another topic. A consumer will in turn subscribe to this topic to populate a NoSQL data store that is used to render the UI.
I know Kafka Streams can be used to join topics.
My question is this: the data ingested from the SQL data store tables may not always include data for all 4 tables. Only 2 of the tables have regular updates. One gets updated, but not as frequently as the other 2. The remaining one is static (a sort of master table).
So I am not sure how we can actually join them with Kafka Streams when the record counts in the topics do not match.
Has anyone faced a similar issue? If so, could you please share your thoughts or code snippets?

The number of rows doesn't matter at all. Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");
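To see why differing update rates are not a problem, here is a plain-Java model of an inner table-table join. It is a sketch of the semantics only (class and method names are hypothetical, and deletes are ignored): each side keeps its latest value per key, and any update on either side re-emits the join result for that key once both sides have a value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of KTable-KTable inner-join semantics (not the Kafka Streams API).
final class TableTableJoinSketch {
    private final Map<String, String> left = new HashMap<>();
    private final Map<String, String> right = new HashMap<>();
    // Stand-in for the changelog of the join result.
    final List<String> results = new ArrayList<>();

    void updateLeft(String key, String value) {
        left.put(key, value);
        emit(key);
    }

    void updateRight(String key, String value) {
        right.put(key, value);
        emit(key);
    }

    // Emits a result only when both tables currently hold a value for the key.
    private void emit(String key) {
        String l = left.get(key);
        String r = right.get(key);
        if (l != null && r != null) {
            results.add(key + "=" + l + "+" + r);
        }
    }

    public static void main(String[] args) {
        TableTableJoinSketch join = new TableTableJoinSketch();
        join.updateRight("k", "master"); // static master table: written once
        join.updateLeft("k", "v1");      // frequently updated table
        join.updateLeft("k", "v2");
        join.results.forEach(System.out::println);
    }
}
```

The static side is written once and then reused for every later update of the busy side, so the number of records per topic has no bearing on whether the join produces output.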