Kafka Streams Threading Model with more than one Stream on the same instance and JVM

Hi, I am trying to get a better understanding of the Kafka Streams threading model, and I am looking at this example in the Confluent docs: https://docs.confluent.io/current/streams/architecture.html#example
I understand that this example is for a single 'kafka streams app' that, in the first diagram, is deployed on a single machine and allowed to use two threads (configurable). It splits itself across the two threads, leading to 3 separate 'tasks' that, I think, do the same thing as each other; they are just parallelized. That much I think I understand.
My question is: what if I deploy a second, totally different 'kafka streams app' with its own unique client id on that same machine and in the same JVM? Will this second Kafka Streams app be able to share the same two threads as the first, or does the first stream monopolise the threads it is allowed to use?
Another way of asking this might be: is the minimum number of threads necessary equal to the number of separate Kafka Streams apps running on the machine?

Threads are owned by KafkaStreams instances. Thus, if you create and start multiple KafkaStreams instances, each instance has its own threads -- they are not shared.
Btw: the number of tasks is independent of the number of KafkaStreams instances and the number of threads. The number of tasks depends on the number of partitions of your input topics as well as the structure of your topology DAG.
Also, the number of tasks effectively limits the overall parallelism. Each task is executed by exactly one thread. If you have more threads than tasks, some threads will be idle as there is no task that can be assigned to them.
One more thing: from a parallelism point of view, there is no difference between starting one KafkaStreams instance configured with 3 threads and starting three KafkaStreams instances with one thread each. All available tasks will be evenly distributed over all available threads.
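To make that concrete, here is a minimal sketch (the topic names, application ids, and thread counts are illustrative, not from the question) of two independent KafkaStreams instances started in the same JVM. Each instance creates and owns its own stream threads:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

Properties props1 = new Properties();
props1.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-one");           // its own consumer group
props1.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props1.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);               // 2 threads owned by this instance

StreamsBuilder builder1 = new StreamsBuilder();
builder1.stream("input-one").to("output-one");                        // assumes default serdes are configured
KafkaStreams streams1 = new KafkaStreams(builder1.build(), props1);

Properties props2 = new Properties();
props2.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-two");           // a different app in the same JVM
props2.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props2.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);               // 2 more threads, not shared with app-one

StreamsBuilder builder2 = new StreamsBuilder();
builder2.stream("input-two").to("output-two");
KafkaStreams streams2 = new KafkaStreams(builder2.build(), props2);

streams1.start();   // after both starts, 4 stream threads run in this JVM
streams2.start();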

Related

In Kafka Streams, how do you parallelize complex operations (or sub-topologies) using multiple topics and partitions?

I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
Can multiple sub-topologies read from the same partition?
How can you parallelise a complex operation (making up a sub-topology) that uses the processor API and requires reading the entire topic?
Can multiple sub-topologies read from the same topic (such that independent and expensive operations on the same topic can be run in different sub-topologies)?
As developers, we don't have direct control over how a topology is divided into sub-topologies. Kafka Streams divides the Topology into multiple sub-topologies, using topics as a "bridge" where possible. Additionally, multiple stream tasks are created, each reading a subset of data from the input topic, divided by partition. The documentation reads:
Slightly simplified, the maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by maximum number of partitions of the input topic(s) the application is reading from.
Assume there was a sub-topology that reads multiple input topics whose numbers of partitions are not identical. If the above excerpt of the documentation is to be believed, then one or more partitions of the topic that has fewer partitions would need to be assigned to multiple stream tasks (if both topics need to be read for the logic to work). However, this should not be possible because, as I understand it, multiple instances of the streams application (each sharing the same application id) act as one consumer group, where each partition is only assigned once. In such a case, the number of tasks being created for a sub-topology should actually be limited by the minimum number of partitions of its input topics, i.e. a single partition is only assigned to one task.
I am not sure if the initial problem, i.e. a non-co-partitioned sub-topology, would actually occur. If there is an operation that requires reading both input topics, the data would probably need to be co-partitioned (like in joins).
Say there was an expensive operation between two topics (possibly built from multiple custom processors) that requires the data of one topic to always be available in its entirety. You would want to parallelise this operation into multiple tasks.
If the topic had just one partition, and a partition could be read multiple times, this would not be a problem. However, as discussed earlier, I don't believe this to work.
Then there are GlobalKTables. However, there is no way to use GlobalKTables with custom processors (toStream is not available).
Another idea would be to broadcast the data into multiple partitions, essentially duplicating it by the partition count. This way, multiple stream tasks could be created for the topology to read the same data. To do this, a custom partitioner could be specified in the Produced instance given to KStream#to. If this data duplication can be accepted, this seems to be the only way to achieve what I have in mind.
Regarding question number three, because the Streams application is one consumer group, I would also expect this to not be possible. With my current understanding, this would require writing the data into multiple identical topics (again essentially duplicating the data), such that independent sub-topologies can be created. An alternative would be to run separate streaming applications (such that a different consumer group is used).
Without seeing your topology definition, this is a somewhat vague question. You can have repartition and changelog topics. These are duplicated data from the original input topic.
But stateless operators like map, filter, etc. pass data through from the same (assigned) partitions for each thread.
A "sub topology" is still part of only one application.id, thus one consumer group, so no, it cannot read the same topic partitions more than once. For that, you'd need independent streams/tables via branching operations within the whole topology, for example, filtering numbers by even and odd only consumes the topic once; you don't need to "broadcast" records to all partitions, and I'm not sure that's even possible out of the box (to sends one-to-one, and Produced defines serialization, not multiple partitions). If you need to cross reference different operators, then you can use join / statestores / KTables.
None of this is really related to parallelism. You have num.stream.threads, or you can run multiple instances of the same JVM process to scale.
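As a sketch of the branching idea mentioned above (topic names are illustrative, and default serdes for String keys / Long values are assumed): one source KStream feeds two filters, so the topic is consumed only once while two independent result streams are produced:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> numbers = builder.stream("numbers");        // the topic is read once
KStream<String, Long> evens = numbers.filter((k, v) -> v % 2 == 0);
KStream<String, Long> odds  = numbers.filter((k, v) -> v % 2 != 0);
evens.to("even-numbers");   // two downstream branches, one upstream consumption
odds.to("odd-numbers");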

Is there an option to configure num.stream.threads for specific topic(s) instead of all?

In our application we have multiple topics, where some topics will be created with 16 partitions and some with 1 partition. Is there any spring.cloud.stream.kafka.bindings property/option available to achieve this?
Maybe this helps: num.stream.threads creating idle threads
If there is one KafkaStreams instance, it is not possible, because Kafka Streams only has a global config. Hence, to configure each with a different number of threads, you would need multiple applications, i.e., multiple KafkaStreams instances that process different input topics. Following the answer linked above, it seems that spring-cloud-stream can create multiple KafkaStreams clients to support what you want.
However, I am not sure why you would want/need this (but I am also not exactly sure how spring-cloud-stream translates your program). In the end, parallelization is done based on tasks, and thus for a single input topic partition only one of your threads will get the corresponding task assigned. Thus, there is no overhead you need to worry about.
For more details check out: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
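If spring-cloud-stream does create one KafkaStreams client per function, the per-binding consumer concurrency property should let you size each one separately. A hedged sketch with hypothetical function names (I have not verified this mapping; the Kafka Streams binder is documented to translate the input binding's concurrency into num.stream.threads for that client):
spring.cloud.stream.bindings.processLarge-in-0.consumer.concurrency=16
spring.cloud.stream.bindings.processSmall-in-0.consumer.concurrency=1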
There are several partition properties available. For example,
spring.cloud.stream.bindings.func-out-0.producer.partitionKeyExpression=payload.id
spring.cloud.stream.bindings.func-out-0.producer.partition
You can get more information on both producer and consumer configuration properties here

Kafka Streams : Sharing globalStateStore across topologies

I have a Spring Boot application that uses the Processor API to generate a Topology and also adds a global state store to the same topology.
I want to create another topology (and hence another KafkaStreams) for reading from another set of topics and want to share the previously created store in the new topology. By share I mean that the underlying state store should be the same for both topologies. Any data written from one topology should be visible in the other.
Is that possible without writing wrapper endpoints to access the state store e.g. REST calls?
Or does my use case need an external state store, e.g. Redis?
No, you can't share state stores across topologies. Instead, if possible, you can break your logic down into sub-topologies of a single topology, and that will make the store available across all the processors defined.
If that is not possible for you, you can use external storage.
According to Stream Partitions and Tasks:
Sub-topologies (also called sub-graphs): If there are multiple processor topologies specified in a Kafka Streams application, each task only instantiates one of the topologies for processing. In addition, a single processor topology may be decomposed into independent sub-topologies (or sub-graphs). A sub-topology is a set of processors that are all transitively connected as parent/child or via state stores in the topology. Hence, different sub-topologies exchange data via topics and don't share any state stores. Each task may instantiate only one such sub-topology for processing. This further scales out the computational workload to multiple tasks.
This means that sub-topologies (hence topologies too) can't share any state stores.
Solution for your scenario:
create a single KafkaStreams instance whose topology contains everything you would otherwise have put in your 2 distinct topologies. Because both parts use the same store, all of their processors are transitively connected, so they form one single sub-topology instead of several (this is the main drawback: the work cannot be split into independent sub-topologies). This does not mean the topology as a whole must run on a single thread, though: the input topic partitions still determine the number of tasks, and those tasks are distributed over the configured number of threads (num.stream.threads).
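For illustration, a minimal Processor API sketch of that single-topology approach, assuming hypothetical ProcessorA and ProcessorB implementations that both look up "shared-store" from their ProcessorContext; connecting one store to both processors is exactly what fuses them into a single sub-topology:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StoreBuilder<KeyValueStore<String, String>> sharedStore =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("shared-store"),   // the one store both parts use
        Serdes.String(), Serdes.String());

Topology topology = new Topology();
topology.addSource("source-a", "topic-a")                 // hypothetical input topics
        .addSource("source-b", "topic-b")
        .addProcessor("proc-a", ProcessorA::new, "source-a")
        .addProcessor("proc-b", ProcessorB::new, "source-b")
        .addStateStore(sharedStore, "proc-a", "proc-b");  // store connected to both processors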

Kafka Streams number of threads, on repartition

I have a Kafka Streams application, reading from one Kafka topic with 5 partitions.
The data is then aggregated/repartitioned several times.
I tried to find the recommendation for the number of threads in this scenario, but found it difficult to understand. The documentation reads as follows:
You can start as many threads of the application as there are input Kafka topic partitions
Which means in my case 5 threads are the maximum number of effective threads.
But this blog claims that a repartition doubles the maximum number of effective threads:
... This topic is automatically created with the same number of partitions as the source topic, meaning our application was now reading from 16 partitions with 8 threads, thus creating some kind of contention.
Which is also reasonable to me, because Kafka Streams will have to read from the internal topics it creates too.
So, is the maximal number of effective threads 5 (the partition count), or 5 * (the number of repartitions)?
You can start as many threads as you like. However, only a certain amount of threads will be utilized, while all others would be idle.
The maximum number of utilized threads is the number of tasks that are created.
A topology is split into sub-topologies, and the number of input topic partitions of each sub-topology determines the number of tasks created per sub-topology. If you configure standby tasks, you get additional tasks that can utilize threads, too.
In general, it's hard to tell up-front how many tasks Kafka Streams will create. You can get the sub-topologies via Topology#describe(). If all topics have the same number of partitions, the number of tasks would be #numPartitions * #numSubTopologies.
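For example, a quick way to inspect the split (topic name and operations here are illustrative) is to print the description before starting the application; each numbered sub-topology in the output contributes its own set of tasks:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("orders")                       // e.g. a 5-partition input topic
       .groupBy((k, v) -> String.valueOf(v))   // changing the key forces a repartition topic
       .count();

Topology topology = builder.build();
System.out.println(topology.describe());       // lists the sub-topologies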
The docs are simplified on purpose, because the exact number of tasks is hard to determine in advance because there are many dependencies. Also, usually one does not need a thread per task and a rule of thumb is good enough to get started.

Assign different group id to different consumers in the same application

I am aware of the parallelism advantages that kafka streams offer which are leveraged if your parallelism needs are aligned with the partitioning of the topics.
I am considering having an application subscribe many consumers to different consumer groups so that each consumer is consuming a replication of the whole topic.
Specifically I am thinking of having multiple threads consume the same topic to provide different results even though I know that I can express all my computation needs using the "chaining" computation paradigm that KStreams offer.
The reason why I am considering different threads is because I want multiple dynamically created KTable instances of the stream. Each one working on the same stream (not subset) and aggregating different results. Since it's dynamic it can create really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own streams instance (and consumer group) is valid.
Of course I can also add thread parallelism by having multiple threads consuming smaller subsets of the data and individually doing all the computations (e.g. each one maintaining subsets of all the different KTables) which will still provide concurrency.
So, two main points in my question
Are KafkaStreams not generally suited for thread parallelism, meaning is the library not intended to be used that way?
In the case where threads are being used to consume a topic would it be a better idea to make threads follow the general kafka parallelism concept of working on different subsets of the data, therefore making thread parallelism an application-level analogous to scaling up using more instances?
But I am wondering would it be okay to have an application that subscribes many consumers to different consumer groups so that each consumer is consuming a replication of the whole topic.
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
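As a small sketch of that GlobalKTable route (topic names and join logic are illustrative): every application instance materializes the full "reference-data" topic locally, and a stream can be joined against it without co-partitioning:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
GlobalKTable<String, String> refData = builder.globalTable("reference-data");
KStream<String, String> events = builder.stream("events");

events.join(refData,
        (eventKey, eventValue) -> eventKey,                   // map each event to a table key
        (eventValue, refValue) -> eventValue + "|" + refValue)
      .to("enriched-events");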
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
KStream<String, Long> input = builder.stream("input-topic");  // hypothetical topic name
KTable<String, Long> firstChain = input.filter((k, v) -> v > 0).groupByKey().count();
KStream<String, Long> secondChain = input.mapValues(v -> v * 2);  // mapValues on a KStream yields a KStream, not a KTable
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!