Kafka Streams API join between KStream and KTable - apache-kafka

I have an application with a KStream reading from one topic with multiple partitions and a KTable reading from another topic with multiple partitions. Will the KStream and KTable refer to the same partition number across the two topics, or to arbitrary partitions in the two topics?

Firstly, the two topics have to be co-partitioned (the same number of partitions). Kafka Streams checks this and throws an exception at startup if they don't match.
Secondly, you should make sure that producers send records with the same key to the same partition number across these input topics (the KStream's and the KTable's), so that the consumer (the Kafka Streams application running the join operator) ever sees a match and the join succeeds.
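For illustration, here is a minimal sketch of such a KStream-KTable join; the topic names ("orders", "customers"), serdes, and broker address are placeholders for this example, not part of the original question:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class StreamTableJoinExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Both input topics must be co-partitioned: same partition count,
        // and producers must key records consistently across them.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> customers =
                builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // The join is computed per partition: records from partition N of "orders"
        // are joined against the table state built from partition N of "customers".
        KStream<String, String> enriched =
                orders.join(customers, (order, customer) -> order + " / " + customer);

        enriched.to("enriched-orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }
}
```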

Related

Are partitions on different Kafka topics co-located within same consumer (k8s pod)

I have a requirement where I want to be able to read data from partition 1 of topic A and partition 1 of topic B from the same consumer. I have a group of consumers running in different Kubernetes pods. Both topics have 5 partitions and both use a key-based partitioning strategy.
So, assuming partition 1 of topic A and partition 1 of topic B are keyed with the same key values, would they both be co-located on the same consumer or pod? If that's the case, I can cross-reference data from one topic using the key of the other topic's message.
Keys are only relevant to the producer's partitioner.
There is no guarantee that a consumer will be assigned the same partitions across two topics. The ConsumerPartitionAssignor works per-topic. You might get lucky and have consumers assigned partitions with the same keys across topics, but after a rebalance that will no longer hold.
If you must consume the same partition of multiple topics, you can assign() those partitions to a consumer instance rather than subscribe()-ing to the whole topics.
However, if you want to join data across topics, the more appropriate way to do this is to use Kafka Streams / KSQL joins.
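A rough sketch of the assign() approach mentioned above; the topic names, broker address, and partition numbers are made up for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class ManualAssignmentExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No group.id is set, so auto-commit must be disabled.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() pins this consumer to partition 1 of both topics,
            // bypassing group management, so the pairing never changes on a rebalance.
            consumer.assign(Arrays.asList(
                    new TopicPartition("topic-a", 1),
                    new TopicPartition("topic-b", 1)));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d key=%s value=%s%n",
                            record.topic(), record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```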
Yes, if you configure routing by key for both topics, the same key will be sent to the same partition. Have a look at the documentation here: https://kafka.apache.org/documentation/#design_loadbalancing
"For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers."

Kafka Streams Co-Partitioning is required while joining two KStreams

Recently I started reading about Kafka Streams for an upcoming project and stumbled upon the concept that co-partitioning is required if we want to join two streams. All I was able to understand is that if we have two topics A and B, both must have the same number of partitions, and for a key 'X' the partition number must also be the same for both topics.
Topic A with partitions A0, A1, A2
Topic B with partitions B0, B1, B2
Then a message with key 'X' must be published to A0 and B0 respectively.
Question: why must the partition number be the same for both topics (for key 'X'), and what issues might we face if the two topics have the same number of partitions but some partitions are idle, i.e. messages are not distributed evenly across partitions?
When you use Kafka Streams, a Kafka consumer group is used under the hood, so your topic partitions are assigned according to Kafka's partition assignment strategies. The default is the range assignor (read here for more).
To join two streams, both messages with the same key must be available on the same consumer instance; otherwise your streaming consumer cannot find the other message to join with. To make sure of that, the partition number must be the same for both topics and the key must be the same.
When the partition count is the same for both topics, the range assignor makes sure the same partition numbers are assigned to the same instance.
That is from the Kafka perspective. From the application side, your producer should produce messages using the hash partitioner (the default). Then, if both topics have the same number of partitions, hashing ensures the same key goes to the same partition number in both topics; a sketch of that formula follows below.
Kafka Streams' co-partitioning check exists to make sure your topics actually satisfy these requirements.
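As referenced above, here is a sketch of the formula the default Java producer partitioner applies to keyed records (murmur2 hash of the serialized key, modulo the partition count). The Utils helper comes from kafka-clients; the partition counts are just example values:

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class SamePartitionCheck {
    // Same formula the default partitioner applies to keyed records:
    // murmur2 hash of the serialized key, modulo the topic's partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        int partitionsTopicA = 3;
        int partitionsTopicB = 3;
        // Only the key bytes and the partition count go into the formula,
        // so key "X" maps to the same partition number in both topics.
        System.out.println(partitionFor("X", partitionsTopicA));
        System.out.println(partitionFor("X", partitionsTopicB)); // same number
    }
}
```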

Topics generated when we do a foreign key join between two KTables in kafka streams

What topics get generated when we do a foreign key join between two KTables in Kafka Streams? And what data do they contain?
There are 2 types of internal topics you can observe for Kafka Streams applications: -repartition and -changelog.
-repartition topics are required when your Kafka Streams application invokes an operation that changes the key.
-changelog topics back the changelog state (created when you build KTables), where each data record in the stream captures a state change of the table. Please see duality-of-streams-and-tables for more information.
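For context, a minimal sketch of a foreign-key join between two KTables (available since Kafka 2.4); the topic names and the "customerId|amount" value layout are assumptions made for the example. Printing the topology description shows the internal topics Kafka Streams will create for this join:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ForeignKeyJoinExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // orders keyed by orderId, value assumed to be "customerId|amount"
        KTable<String, String> orders =
                builder.table("orders", Consumed.with(Serdes.String(), Serdes.String()));
        // customers keyed by customerId
        KTable<String, String> customers =
                builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // Foreign-key join: the extractor pulls the customerId out of the order value,
        // and Kafka Streams routes updates between the tables via internal topics.
        KTable<String, String> enriched = orders.join(
                customers,
                orderValue -> orderValue.split("\\|")[0],      // FK extractor
                (order, customer) -> order + " -> " + customer);

        enriched.toStream().to("orders-enriched",
                Produced.with(Serdes.String(), Serdes.String()));

        // The description lists the internal (repartition/changelog-style) topics.
        System.out.println(builder.build().describe());
    }
}
```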

Kafka partitioner question, two topics same partition key

I have two Kafka topics on the same brokers. Both topics use the same UUID as the partitioning key, and the UUID determines which consumer the records get sent to. If the same UUIDs are used across both topics, does that guarantee the records for both topics arrive at the same consumers? I assume not.
If the topics have the same number of partitions, then the partitioner logic will map records with the same key to the same partition number.
If you're simply subscribing consumers to the topics rather than using specific partition assignments, then there are no guarantees about which consumer reads which partitions.

How can Kafka Streams perform a join when the data to join could be allocated on different machines?

I have two Kafka topics with two partitions each. Their messages are keyed by the same param, id: Integer.
I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.
Now, imagine that the partition holding messages with id=1 from topic A is assigned to KStreams app instance A and the partition holding messages with id=1 from topic B is assigned to app instance B. How could a join of those two KStreams ever work if the data from the topics may not be co-located (as would happen in this example for keys/ids=1)?
There are ways to do it. If storage is not an issue, or the message frequency is low, you can use a GlobalKTable for one of the topics. It will cost more memory, since all of that topic's partitions are replicated to every instance of the Streams app.
https://docs.confluent.io/current/streams/concepts.html#globalktable
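A minimal sketch of the GlobalKTable variant, assuming hypothetical topics "topic-a" and "topic-b" with Integer keys; since every instance materializes the full global table, no co-partitioning is required for this join:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GlobalTableJoinExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<Integer, String> topicA =
                builder.stream("topic-a", Consumed.with(Serdes.Integer(), Serdes.String()));

        // Every instance keeps a full copy of topic-b, so partition placement doesn't matter.
        GlobalKTable<Integer, String> topicB =
                builder.globalTable("topic-b", Consumed.with(Serdes.Integer(), Serdes.String()));

        KStream<Integer, String> joined = topicA.join(
                topicB,
                (key, value) -> key,                        // map the stream record to the table's key
                (aValue, bValue) -> aValue + "|" + bValue); // combine the two values

        joined.to("joined-output", Produced.with(Serdes.Integer(), Serdes.String()));
    }
}
```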
The other way is to use Kafka Streams interactive queries to discover the data on other stream instances.
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
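A rough sketch of the interactive-queries route, assuming a recent Kafka Streams version (2.7+, where queryMetadataForKey and activeHost() are available) and a hypothetical store name "customers-store":

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.state.HostInfo;

public class WhoHoldsTheKey {
    // 'streams' is a running KafkaStreams instance and "customers-store" is the
    // name under which the KTable was materialized (both assumptions here).
    static void locate(KafkaStreams streams, String key) {
        KeyQueryMetadata metadata =
                streams.queryMetadataForKey("customers-store", key, Serdes.String().serializer());
        HostInfo active = metadata.activeHost();
        // The instance at this host:port owns the partition holding the key,
        // so a cross-instance lookup (e.g. over REST) can be directed there.
        System.out.printf("Key %s lives on %s:%d%n", key, active.host(), active.port());
    }
}
```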
For KStream joins you need the same number of partitions for both topics, as well as the same partitioning strategy. That way all consumers read the partitions of both topics in the same way.
A nice reference blog for partitioning: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab