Does a Kafka Streams GlobalKTable topic require the same number of partitions as the KStream topic it will be joined with? - apache-kafka

We want to use a GlobalKTable in a Kafka Streams application. The input topics (KTable/KStream) have N partitions, and a GlobalKTable will be used as a dictionary in the streams application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (the sources of the KTable/KStream)?
As I understand it, the answer is NO (it is not limited, and the topic may also have M partitions where N > M), because the GlobalKTable is fully loaded into each instance of the streams application, and co-partitioning is not required for a KStream-GlobalKTable join. But I need confirmation from the experts!
Thank you!

No. The number of partitions may differ between the topics backing the KStream and the GlobalKTable that are joined.
From the Kafka Streams developer guide:
At a high level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements

More accurately:
Why is data co-partitioning required? Because KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key), it is required that the input streams/tables of a join are co-partitioned by key.
The only exception are KStream-GlobalKTable joins. Here, co-partitioning is not required because all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. Further, a KeyValueMapper allows for non-key based joins from the KStream to the GlobalKTable.
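To make the shape of such a join concrete, here is a minimal sketch of a KStream-GlobalKTable join (topic names and the String types are hypothetical); note that the partition count of the global table's topic is independent of the stream's:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// Stream topic with N partitions (hypothetical topic names)
KStream<String, String> orders = builder.stream("orders");

// Dictionary topic may have M partitions, M != N, because every
// application instance materializes a full copy of the global table
GlobalKTable<String, String> dictionary = builder.globalTable("dictionary");

KStream<String, String> enriched = orders.join(
    dictionary,
    // KeyValueMapper: derive the lookup key from the stream record;
    // it does not have to be the record key (non-key join)
    (orderKey, orderValue) -> orderValue,
    // ValueJoiner: combine the order with the dictionary entry
    (orderValue, dictEntry) -> orderValue + " -> " + dictEntry);

enriched.to("enriched-orders");
```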

Related

Kafka Streams: Increasing topic partitions for an application performing a KTable-KTable foreign key join

Most of the information I find relates to primary key joins. I understand foreign key joins are a relatively new feature for Kafka Streams. I'm interested in how this will scale. I understand that Kafka Streams parallelism is capped by the number of partitions on each topic, however I have a few questions around what it means to increase the input topic partitions.
Does the foreign key join have the same requirement to co-partition input topics? That is, do both topics need to have the same number of partitions?
How does one add partitions later, after the application has been running in production for months or years? The changelog topics backing each KTable store data from certain input topic partitions. If one is to increase the partitions in the input topics, how does this impact our KTables' state stores and changelogs? Presumably, we cannot just start over and lose that data, since it has accumulated over months and years and is essential to performing the join. It may not be quickly replaced by upstream data. Do we need to blow away our state stores, create new input topics, and re-send all KTable changelog topic data to them?
How about the other internal "subscription" topics?
Does the foreign key join have the same requirement to co-partition input topics? That is, do both topics need to have the same number of partitions?
No. For more details check out https://www.confluent.io/blog/data-enrichment-with-kafka-streams-foreign-key-joins/
How does one add partitions later, after the application has been running in production for months or years?
You cannot really do this, even if you don't use Kafka Streams. The issue is that your input data is partitioned by key, and if you add a partition, the partitioning in your input topic breaks. The recommended pattern is to create a new topic with a different number of partitions.
The changelog topics backing each KTable store data from certain input topic partitions. If one is to increase the partitions in the input topics, how does this impact our KTables' state stores and changelogs?
It would break the application. In fact, Kafka Streams checks for this and raises an exception if it detects that the number of input topic partitions does not match the number of changelog topic partitions.
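For illustration, a minimal sketch of a KTable-KTable foreign-key join (available since Apache Kafka 2.4; topic names and the String value layout are hypothetical). The foreign-key extractor is what removes the co-partitioning requirement, since Kafka Streams routes records between the two sides through internal subscription topics:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Hypothetical topics: orders keyed by orderId (value holds the customerId),
// customers keyed by customerId; partition counts may differ
KTable<String, String> orders = builder.table("orders");
KTable<String, String> customers = builder.table("customers");

KTable<String, String> enriched = orders.join(
    customers,
    // foreign-key extractor: derive the customers key from the order value
    orderValue -> orderValue,
    // ValueJoiner: combine the order with the matching customer
    (order, customer) -> order + "|" + customer);
```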

Kafka Streams API join between KStream and KTable

I have an application with a KStream reading from one multi-partition topic and a KTable reading from another multi-partition topic. Will the KStream and the KTable refer to the same partition number in the two different topics, or to any partition in the two different topics?
Firstly, the two topics have to be co-partitioned (have the same number of partitions). Kafka Streams checks this at startup and throws an exception if they are not.
Secondly, you should make sure that producers send records with the same key to the same partition across these input topics (of the KStream and the KTable), so that the consumer (the Kafka Streams application that uses the join operator) will actually see matching records and the join succeeds.
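As a sketch of the setup this implies (topic names are hypothetical): both topics below must have the same partition count and be written with the same keying, so that matching records land in same-numbered partitions:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Hypothetical topics; both must have the same number of partitions,
// and producers must key records the same way (e.g. by userId)
KStream<String, String> clicks = builder.stream("clicks");
KTable<String, String> profiles = builder.table("profiles");

// Records join only when they share a key, and co-partitioning
// guarantees records with the same key sit in the same partition
KStream<String, String> enriched = clicks.join(
    profiles,
    (click, profile) -> click + " by " + profile);

enriched.to("enriched-clicks");
```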

Kafka Joins - How can we achieve it?

I have the below topics in Kafka (10 partitions each) and want to join them as given in the example below. Could someone please advise how this can be done?
Topic:"orders"
Topic: "referencedata"
Joined Topic:"merged_ref_orders"
Basically, we are loading all of the orders and reference data into a hash map and then joining them. This causes lots of performance issues when there are many orders. I came across KTable/GlobalKTable but am not sure how they operate internally.
Would a KTable/GlobalKTable load all the data into heap memory? If yes, then it won't help keep the Java memory usage low.
Could someone please advise how this scenario can be handled?
I suggest turning the referencedata topic into a KTable keyed on tickerString. This means you will have a materialised view that is auto-updated when new referencedata arrives in the topic. It also means you can query the KTable by key (i.e. tickerString). Then you can create a KStream consuming from the orders topic, join this stream to the referencedata KTable, and push the resulting join to a new topic (i.e. merged_ref_orders).
Key both the referencedata KTable and the orders topic using the same key (i.e. tickerString), so that you do not have to use a GlobalKTable. A regular KTable (much more scalable) will be enough. A KTable backed by RocksDB will keep your memory usage very low, because it works on the partitioned data and stores it locally on disk in RocksDB rather than in the JVM heap; see the sketch below.
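A minimal sketch of that approach, assuming both topics are already keyed by tickerString and have the same partition count (the value formats are hypothetical):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Materialised, auto-updating view of the reference data,
// keyed on tickerString and backed by RocksDB (not heap)
KTable<String, String> referenceData = builder.table("referencedata");

// Orders keyed on tickerString as well (same partition count)
KStream<String, String> orders = builder.stream("orders");

// Enrich each order with its reference data and emit to the output topic
KStream<String, String> merged = orders.join(
    referenceData,
    (order, refData) -> order + "|" + refData);

merged.to("merged_ref_orders");
```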
Two official Confluent examples of how to join/enrich a KStream with a KTable:
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java

Topics generated when we do a foreign key join between two KTables in kafka streams

Which topics get generated when we do a foreign key join between two KTables in Kafka Streams? And what data do they contain?
There are two types of internal topics you can observe for Kafka Streams applications: -repartition and -changelog.
-repartition topics are required when your Kafka Streams application invokes an operation that changes the key.
-changelog topics back the state of KTables: each data record in the stream captures a state change of the table. Please see duality-of-streams-and-tables for more information.
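In addition (an assumption worth verifying against your broker), foreign-key joins create internal subscription topics that route records between the two sides. A sketch of how to inspect a topology's internal topic names by printing its description (topic names here are hypothetical):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Hypothetical foreign-key join between two tables
KTable<String, String> left = builder.table("left-topic");
KTable<String, String> right = builder.table("right-topic");
left.join(right, leftValue -> leftValue, (l, r) -> l + r);

// describe() lists sources, processors, sinks, and state stores;
// internal topic names (e.g. subscription topics) appear in the output,
// while -changelog topics are created at runtime per state store
Topology topology = builder.build();
System.out.println(topology.describe());
```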

How can Kafka Streams perform a join when the data to join could be allocated on different machines?

I have two Kafka topics with two partitions each. Their messages are keyed by the same parameter, id: Integer.
I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.
Now, imagine that the partition holding messages with id=1 from topic A is assigned to KStreams app instance A, and the partition holding messages with id=1 from topic B is assigned to app instance B. How could a join of those two KStreams ever work if the data from the topics may not be collocated (as would happen in this example for keys/ids=1)?
There are ways to do it... if storage is not an issue, or the message frequency is low, then you can use a GlobalKTable for one of the topics. It will cost more memory, as all of its partitions will be synced to all instances of the Streams app.
https://docs.confluent.io/current/streams/concepts.html#globalktable
Other way is to use the Kafka streams interactive queries to discover the data on other stream instances.
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
For KStream joins, you need the same number of partitions for both topics, as well as the same partitioning strategy. That way, Kafka Streams assigns the same-numbered partition of both topics to the same task, so records with the same key are always read together and the scenario described above cannot happen.
A nice reference blog for partitioning: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab
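If the partition counts do differ, one option (a sketch using the repartition() operator, available since Apache Kafka 2.6; topic names and counts are hypothetical) is to repartition one side into an internal topic before joining:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

StreamsBuilder builder = new StreamsBuilder();

KStream<Integer, String> topicA = builder.stream("topic-a"); // e.g. 2 partitions
KStream<Integer, String> topicB = builder.stream("topic-b"); // e.g. 4 partitions

// Rewrite topic-b's records into an internal topic with 2 partitions,
// using Kafka Streams' default partitioner, so both join inputs
// are co-partitioned on the same key
KStream<Integer, String> topicBRepartitioned =
    topicB.repartition(Repartitioned.numberOfPartitions(2));

// topicA.join(topicBRepartitioned, ...) can now satisfy the
// co-partitioning check
```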