KStream-KStream Join vs KStream-KTable Join Performance - apache-kafka

I have a Kafka Streams application that reads data from a few topics, joins it, and writes the result to another topic.
Kafka configuration:
5 Kafka brokers
Kafka topics: 15 partitions, replication factor 3.
A few million records are consumed/produced every hour.
I am currently doing a KStream-KStream join, which creates 2 internal topics,
whereas a KStream-KTable join would create 1 internal topic + 1 table.
Which is better in terms of performance and other factors?

The choice is not a question of performance but of semantics: what should the join result be? The two joins compute quite different results, so you should pick the semantics that meet your application's needs.
The different semantics are documented in the Confluent Platform docs and the Apache Kafka wiki:
https://docs.confluent.io/current/streams/developer-guide.html#joining
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
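To make the difference in results concrete, here is a minimal plain-Java sketch. It is a simplified model, not the Kafka Streams API: the class `JoinSemanticsSketch`, its `Event` type, and the `sampleArrivals` helper are hypothetical names, records are processed strictly in arrival order, and details such as grace periods, left/outer variants, and timestamp synchronization are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the two join types; not the Kafka Streams API.
class JoinSemanticsSketch {

    // One input record: side 'L' (left stream) or 'R' (right stream or table).
    static final class Event {
        final char side;
        final String key;
        final String value;
        final long ts;

        Event(char side, String key, String value, long ts) {
            this.side = side;
            this.key = key;
            this.value = value;
            this.ts = ts;
        }
    }

    // Stream-stream inner join: every left/right pair with the same key and
    // timestamps at most windowMs apart yields one result.
    static List<String> streamStreamJoin(List<Event> arrivals, long windowMs) {
        List<String> out = new ArrayList<>();
        List<Event> seen = new ArrayList<>();
        for (Event e : arrivals) {
            for (Event other : seen) {
                boolean matches = other.side != e.side
                        && other.key.equals(e.key)
                        && Math.abs(other.ts - e.ts) <= windowMs;
                if (matches) {
                    Event left = (e.side == 'L') ? e : other;
                    Event right = (e.side == 'L') ? other : e;
                    out.add(left.value + "+" + right.value);
                }
            }
            seen.add(e);
        }
        return out;
    }

    // Stream-table inner join: 'R' records only update the table; each 'L'
    // record is joined with the table's current value for its key, if any.
    static List<String> streamTableJoin(List<Event> arrivals) {
        List<String> out = new ArrayList<>();
        Map<String, String> table = new HashMap<>();
        for (Event e : arrivals) {
            if (e.side == 'R') {
                table.put(e.key, e.value);
            } else if (table.containsKey(e.key)) {
                out.add(e.value + "+" + table.get(e.key));
            }
        }
        return out;
    }

    // Sample input for key "k", in arrival order: L:a, R:x, R:y, L:b.
    static List<Event> sampleArrivals() {
        List<Event> in = new ArrayList<>();
        in.add(new Event('L', "k", "a", 0L));
        in.add(new Event('R', "k", "x", 10L));
        in.add(new Event('R', "k", "y", 20L));
        in.add(new Event('L', "k", "b", 30L));
        return in;
    }
}
```

With the sample input and a 100 ms window, the stream-stream model produces four results (a+x, a+y, b+x, b+y), while the stream-table model produces only b+y: table-side updates never trigger output, and a stream record that arrives before the table has a value for its key is dropped.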

Related

Consumer side behavior on using CoGroupByKey in Apache Beam

I have a Beam job that reads data from 2 Kafka producers and joins the two streams on a common key. I am not using Kafka's partition key for the join. So essentially Kafka partitions the data by some key in both streams; my consumer (the Beam job) consumes both streams, extracts the actual join key from each record into a PCollection, and then runs CoGroupByKey.
I see the join happen for many events, but when I query for specific events, I do not see them joined. I have applied the same windowing to both streams. This makes me question whether a consumer is getting the right data from the two streams to perform this join. Let's say that consumer 0 consumes from partition 0 of both streams. Is there a chance that, because Kafka partitions the data by some key x, consumer 0 does not receive the right data to join across the streams? I was told that CoGroupByKey ensures that the right data lands in each consumer, but I am not able to visualize this. How can using CoGroupByKey affect the input-side behavior?
CoGroupByKey will join data across all input partitions. I suspect the issue is windowing: are the unjoined items in the same window? CoGroupByKey does not join across windows, so items that land in separate windows do not get joined. You could look at using session windows if fixed windows don't work.
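To see how two nearby events can end up unjoined, here is a small plain-Java sketch of fixed-window assignment. It assumes windows aligned to the epoch with no offset; the class and method names are hypothetical and this is not the Beam API.

```java
// Hypothetical sketch of fixed-window assignment (epoch-aligned, no offset).
class WindowAssignmentSketch {

    // Start of the fixed window that contains the given timestamp.
    static long fixedWindowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Two elements can only be grouped together if they fall into the same window.
    static boolean sameFixedWindow(long tsA, long tsB, long windowSizeMs) {
        return fixedWindowStart(tsA, windowSizeMs) == fixedWindowStart(tsB, windowSizeMs);
    }
}
```

With 60-second windows, events at 59 s and 61 s are only 2 seconds apart but fall into different windows, so they would not be grouped, whereas events at 61 s and 119 s share a window. Checking the event timestamps of the unjoined records against the window boundaries is a quick way to confirm or rule this out.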

Does Kafka Streams GlobalKTable topic require the same number of partitions as KStream topic which it will be joining with?

We want to use a GlobalKTable in a Kafka Streams application. The input topics (KTable/KStream) have N partitions, and a GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (the sources of the KTable/KStream)?
As I understand it, the answer is NO (there is no such restriction, and the topic may have M partitions where N > M), because a GlobalKTable is fully loaded in each instance of the stream application, so co-partitioning is not required for the KStream join. But I need confirmation from the experts!
Thank you!
No, the number of partitions of the topics behind the KStream and the GlobalKTable being joined can differ.
From the Kafka Streams developer guide:
At a high-level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements
More accurately:
Why is data co-partitioning required? Because KStream-KStream,
KTable-KTable, and KStream-KTable joins are performed based on the
keys of records (e.g., leftRecord.key == rightRecord.key), it is
required that the input streams/tables of a join are co-partitioned by
key.
The only exception are KStream-GlobalKTable joins. Here,
co-partitioning is not required because all partitions of the
GlobalKTable's underlying changelog stream are made available to each
KafkaStreams instance, i.e. each instance has a full copy of the
changelog stream. Further, a KeyValueMapper allows for non-key based
joins from the KStream to the GlobalKTable.
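As a rough illustration of why partition counts do not matter here, the following plain-Java sketch models a lookup against a full local copy of the table. It is not the Kafka Streams API: `GlobalTableLookupSketch` and its parameters are hypothetical names, and the `BiFunction` only plays the role that a KeyValueMapper has in the quoted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical model of a stream-to-global-table inner join; not the Kafka Streams API.
class GlobalTableLookupSketch {

    // Joins each {key, value} stream record against a full local copy of the table.
    // tableKeySelector derives the table key from the stream record's key and value,
    // so the stream does not need to be keyed (or partitioned) like the table.
    static List<String> join(List<String[]> streamRecords,
                             Map<String, String> fullTableCopy,
                             BiFunction<String, String, String> tableKeySelector) {
        List<String> out = new ArrayList<>();
        for (String[] rec : streamRecords) {
            String tableKey = tableKeySelector.apply(rec[0], rec[1]);
            String tableValue = fullTableCopy.get(tableKey);
            if (tableValue != null) { // inner join: drop records without a match
                out.add(rec[1] + " -> " + tableValue);
            }
        }
        return out;
    }
}
```

Because every instance holds the whole table, the lookup succeeds no matter which stream partition the record came from; the cost is that each instance stores and keeps up to date a full copy.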

How can Kafka Streams perform a join when the data to join could be allocated on different machines?

I have two Kafka topics with two partitions each. Their messages are keyed by the same parameter, id: Integer.
I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.
Now, imagine that the partition holding messages with id = 1 from topic A is assigned to Kafka Streams app instance A, and the partition holding messages with id = 1 from topic B is assigned to app instance B. How can a join of those two KStreams ever work if the data from the topics may not be collocated (as would happen in this example for id = 1)?
There are ways to do it. If storage is not an issue or the message rate is low, you can use a GlobalKTable for one of the topics. It will cost more memory, as all partitions will be replicated to all instances of the Streams app.
https://docs.confluent.io/current/streams/concepts.html#globalktable
Another way is to use Kafka Streams interactive queries to discover the data on other Streams instances.
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
For KStream joins, both topics need the same number of partitions and the same partitioning strategy. That way, the same-numbered partitions of both topics hold the same keys and are read by the same instance.
A nice reference blog post on partitioning: https://medium.com/#anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab
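The co-partitioning requirement can be sketched in a few lines of plain Java. This is only an illustration: `partitionFor` is a hypothetical stand-in that uses `hashCode()`, whereas Kafka's default partitioner hashes the serialized key bytes (murmur2), so actual partition numbers will differ.

```java
// Hypothetical sketch of key-based partitioning; not Kafka's actual partitioner.
class CoPartitioningSketch {

    // Stand-in partitioner: same key and same partition count give the same partition number.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // True if the key lands on the same partition number in both topics, which is
    // what lets a single task see both sides of the join for that key.
    static boolean landsOnSamePartitionNumber(Object key, int partitionsA, int partitionsB) {
        return partitionFor(key, partitionsA) == partitionFor(key, partitionsB);
    }
}
```

With two partitions on both topics, id = 1 maps to the same partition number in topic A and topic B, and Kafka Streams assigns same-numbered partitions of the joined topics to the same task, so the situation described in the question does not arise. If the partition counts differ (say 2 and 3), a key such as 4 lands on different partition numbers and the join would miss it.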

Any advantages splitting up Kafka Topics

I am working on an application/Kafka cluster that will be producing/consuming messages (around 100k a second) to a topic. The message format is identical, so my initial thought was to have a single topic for all messages.
However, are there any benefits in Kafka to splitting the messages into multiple topics? There is a logical separation that could be applied to split the topic into multiple (10ish) topics.
Apart from the producer/consumer side of things, does Kafka itself have any preferences around performance, redundancy, stability, management, etc. for 1 large topic versus multiple smaller topics?
Topic partitions are the usual means of parallelizing Kafka; however, you could opt to split the data into multiple topics as well if you wanted. But I would first look into the partition aspect of things. Here is a good Confluent article on how to pick the right number of partitions. Especially note that if you are partitioning on keys, then adding partitions after the fact can result in split data, so think it through up front as best you can.
Parallelism in Kafka depends on the number of partitions in a topic. Throughput increases with the number of partitions as long as that number is reasonable (an unnecessarily large number of partitions creates overhead). By increasing the number of consumers, you can read messages from the partitions in parallel.
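The "split data" warning above can be illustrated with a small plain-Java sketch. It assumes integer keys and a simple hash-modulo partitioner, which is a hypothetical stand-in for Kafka's default partitioner, so the exact numbers are only indicative.

```java
// Hypothetical sketch: how many keys change partition when partitions are added.
class RepartitionSketch {

    // Stand-in partitioner based on hashCode(); Kafka's default hashes the serialized key instead.
    static int partitionFor(int key, int numPartitions) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    // Counts keys in [0, keyCount) whose partition differs between the two partition counts.
    static int countMovedKeys(int keyCount, int oldPartitions, int newPartitions) {
        int moved = 0;
        for (int key = 0; key < keyCount; key++) {
            if (partitionFor(key, oldPartitions) != partitionFor(key, newPartitions)) {
                moved++;
            }
        }
        return moved;
    }
}
```

In this model, growing a topic from 10 to 12 partitions moves 80 of the keys 0..99 to a different partition. Records written before the change stay where they were, so the history of such a key ends up spread across two partitions and per-key ordering across the change is lost.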

Join data from 4 topics in broker using Kafka Streams when updates are not same in each of the topics

I am working on a requirement to process data ingested from a SQL data store into a Kafka broker, in 4 different topics corresponding to 4 different tables in the SQL data store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics, aggregate it, and write it back to another topic. This topic will in turn be consumed by a consumer that populates a NoSQL data store, which will be used to render the UI.
I know Kafka Streams can be used to join topics.
My question is this: the data ingested from the SQL data store may not always include data for all 4 tables. Only 2 of the tables will have regular updates. One will get updated, but not at the same frequency as the other 2. The remaining one is static (a sort of master table).
So, I am not sure how we can actually join them with Kafka Streams when the record counts in the topics will not match.
Has anyone faced a similar issue? If so, can you please share your thoughts/code snippets?
The number of rows doesn't matter at all. Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
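// Read each topic as a table: the latest value per key is the current row.
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
// Chain the joins; the result is updated whenever any input table is updated.
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
// A KTable is grouped with groupBy(...), and the aggregate is written out via toStream().
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");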
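To see why mismatched update rates are not a problem, here is a minimal plain-Java model of an inner table-table join between two tables only. It is a simplification and not the Kafka Streams API; `TableJoinSketch` and its methods are hypothetical names, and deletes (tombstones) are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a two-table inner join on the latest value per key.
class TableJoinSketch {
    private final Map<String, String> left = new HashMap<>();
    private final Map<String, String> right = new HashMap<>();

    // Applies an update to the left table and returns the new join result for
    // that key, or null if the right table has no row for it yet.
    String updateLeft(String key, String value) {
        left.put(key, value);
        return joined(key);
    }

    // Same as updateLeft, but for the right table.
    String updateRight(String key, String value) {
        right.put(key, value);
        return joined(key);
    }

    // The join only ever looks at the current row on each side.
    private String joined(String key) {
        String l = left.get(key);
        String r = right.get(key);
        return (l == null || r == null) ? null : l + "|" + r;
    }
}
```

If the right side is the static master table and receives a single row, every later update on the left side is joined with that same row; the join result simply follows whichever side changes, regardless of how many records each topic has seen.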