Kafka Streams: Sharing a GlobalStateStore across topologies - apache-kafka

I have a Spring Boot application that uses the Processor API to generate a Topology, and it also adds a global state store (addGlobalStateStore) to the same topology.
I want to create another topology (and hence another KafkaStreams) for reading from another set of topics and want to share the previously created store in the new topology. By share I mean that the underlying state store should be the same for both topologies. Any data written from one topology should be visible in the other.
Is that possible without writing wrapper endpoints (e.g., REST calls) to access the state store?
Or does my use case need an external state store such as Redis?

No, you can't share state stores across topologies. Instead, if possible, break your topologies down into sub-topologies of a single topology; the store will then be available to all the processors defined in it.
If that is not possible for you, you can use external storage.

According to Stream Partitions and Tasks:
Sub-topologies (also called sub-graphs): If there are multiple processor topologies specified in a Kafka Streams application, each task only instantiates one of the topologies for processing. In addition, a single processor topology may be decomposed into independent sub-topologies (or sub-graphs). A sub-topology is a set of processors that are all transitively connected as parent/child or via state stores in the topology. Hence, different sub-topologies exchange data via topics and don't share any state stores. Each task may instantiate only one such sub-topology for processing. This further scales out the computational workload to multiple tasks.
This means that sub-topologies (hence topologies too) can't share any state stores.
Solution for your scenario:
Create a single KafkaStreams instance whose topology contains everything you would otherwise put in your two distinct topologies. Because both parts use the same store, they are transitively connected and therefore collapse into a single sub-topology; the shared store prevents them from being split into independent sub-topologies. This is the main drawback: the combined sub-topology cannot be parallelized by splitting it further, although the topology as a whole can still be run by multiple threads, depending on the chosen parallelism (num.stream.threads) and the number of input partitions.
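A minimal sketch of that solution, assuming the current processor API (topic names, processor names, and the store name "shared-store" are placeholders): both formerly separate topologies become processors in one Topology, connected to a single state store. Note that topic-a and topic-b must be co-partitioned (same partition count), since each task owns one shard of the store; a record written by one processor is visible to the other only within the same task/partition.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class SharedStoreTopology {
    public static Topology build() {
        Topology topology = new Topology();
        // Two sources that previously lived in two separate topologies:
        topology.addSource("source-a", "topic-a");
        topology.addSource("source-b", "topic-b");
        topology.addProcessor("proc-a", WritingProcessor::new, "source-a");
        topology.addProcessor("proc-b", WritingProcessor::new, "source-b");
        // One physical store, attached to BOTH processors:
        topology.addStateStore(
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("shared-store"),
                Serdes.String(), Serdes.String()),
            "proc-a", "proc-b");
        return topology;
    }

    static class WritingProcessor implements Processor<String, String, Void, Void> {
        private KeyValueStore<String, String> store;

        @Override
        public void init(ProcessorContext<Void, Void> context) {
            store = context.getStateStore("shared-store");
        }

        @Override
        public void process(Record<String, String> record) {
            // Writes land in this task's shard of the shared store.
            store.put(record.key(), record.value());
        }
    }
}
```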

Related

In Kafka Streams, how do you parallelize complex operations (or sub-topologies) using multiple topics and partitions?

I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
1. Can multiple sub-topologies read from the same partition?
2. How can you parallelise a complex operation (making up a sub-topology) that uses the processor API and requires reading the entire topic?
3. Can multiple sub-topologies read from the same topic (such that independent and expensive operations on the same topic can be run in different sub-topologies)?
As developers, we don't have direct control over how topologies are divided into sub-topologies. Kafka Streams divides the topology into multiple sub-topologies using topics as a "bridge" where possible. Additionally, multiple stream tasks are created, each reading a subset of data from the input topic, divided by partition. The documentation reads:
Slightly simplified, the maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by maximum number of partitions of the input topic(s) the application is reading from.
Assume there was a sub-topology that reads multiple input topics whose numbers of partitions are not identical. If the above excerpt of the documentation is to be believed, then one or more partitions of the topic with fewer partitions would need to be assigned to multiple stream tasks (if both topics need to be read for the logic to work). However, this should not be possible, because, as I understand it, multiple instances of the Streams application (each sharing the same application ID) act as one consumer group, in which each partition is only assigned once. In such a case, the number of tasks created for a sub-topology should actually be limited by the minimum number of partitions of its input topics, i.e. a single partition is only assigned to one task.
I am not sure if the initial problem, i.e. a non-co-partitioned sub-topology would actually occur. If there is an operation that requires to read both input topics, the data would probably need to be co-partitioned (like in Joins).
Say there was an expensive operation between two topics (possibly built from multiple custom processors) that requires the data of one topic to always be available in its entirety. You would want to parallelise this operation into multiple tasks.
If the topic had just one partition, and a partition could be read multiple times, this would not be a problem. However, as discussed earlier, I don't believe this to work.
Then there are GlobalKTables. However, there is no way to use GlobalKTables with custom processors (toStream is not available).
Another idea would be to broadcast the data into multiple partitions, essentially duplicating it by the partition count. This way, multiple stream tasks could be created for the topology to read the same data. To do this, a custom partitioner could be specified in the Produced instance given to KStream#to. If this data duplication can be accepted, this seems to be the only way to achieve what I have in mind.
Regarding question number three, because the Streams application is one consumer group, I would also expect this to not be possible. With my current understanding, this would require writing the data into multiple identical topics (again essentially duplicating the data), such that independent sub-topologies can be created. An alternative would be to run separate streaming applications (such that a different consumer group is used).
Without seeing your topology definition, this is a somewhat vague question. You can have repartition and changelog topics; these contain data duplicated from the original input topic.
But stateless operators like map, filter, etc. pass data through from the same (assigned) partitions for each thread.
A "sub-topology" is still part of only one application.id, and thus one consumer group, so no, it cannot read the same topic partitions more than once. For that you'd need independent streams/tables created via branching operations within the whole topology; for example, filtering numbers into even and odd consumes the topic only once. You don't need to "broadcast" records to all partitions, and I'm not sure that's even possible out of the box (to sends each record to exactly one partition, and Produced defines serialization, not multiple partitions). If you need to cross-reference different operators, you can use joins, state stores, or KTables.
None of this is really related to parallelism. You have num.stream.threads, or you can run multiple instances of the same JVM process to scale.
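The task arithmetic behind these parallelism limits can be sketched as plain code (the partition counts are hypothetical; as far as I understand, Kafka Streams creates one task per partition, up to the maximum partition count over a sub-topology's input topics, and threads beyond the task count sit idle):

```java
// Plain simulation of Kafka Streams task math, not Kafka Streams code.
public class TaskMath {
    // Tasks for a sub-topology = max partition count over its input topics.
    static int tasksFor(int... inputTopicPartitions) {
        int max = 0;
        for (int p : inputTopicPartitions) max = Math.max(max, p);
        return max;
    }

    // A thread can only be busy if there is a task left to assign to it.
    static int busyThreads(int tasks, int instances, int numStreamThreads) {
        return Math.min(tasks, instances * numStreamThreads);
    }

    public static void main(String[] args) {
        int tasks = tasksFor(10, 6);                  // sub-topology reading a 10- and a 6-partition topic
        System.out.println(tasks);                    // 10
        System.out.println(busyThreads(tasks, 2, 5)); // 2 instances x 5 threads: all 10 busy
        System.out.println(busyThreads(tasks, 4, 5)); // 20 threads available, but only 10 tasks
    }
}
```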

Kafka Stateful Stream processor with statestore: Behind the scenes

I am trying to understand stateful stream processors.
As I understand it, this type of stream processor maintains state using a state store.
I came to know that one of the ways to implement a state store is RocksDB. Assume the following topology (with only one processor being stateful):
A -> B -> C, with processor B stateful (local state store, changelog enabled). I am using the low-level API.
Assume the stream processor listens on a single Kafka topic, say topic-1, with 10 partitions.
I observed that when the application is started (2 instances on different physical machines, num.stream.threads = 5), it creates a directory structure for the state store like:
0_0, 0_1, 0_2, ..., 0_9 (each machine has five, so ten task directories in total).
I was going through some online material where it said we should create a StoreBuilder and attach it topology using addStateStore() instead of creating a state store within a processor.
Like:
topology.addStateStore(storeBuilder,"processorName")
Ref also: org.apache.kafka.streams.state.Store
I didn't understand the difference between attaching a StoreBuilder to the topology versus creating a state store within the processor. What is the difference between them?
The second part: for the state store it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (which the stream processor is listening to) and the number of directories that get created for the state store?
I didn't understand the difference between attaching a StoreBuilder to the topology versus creating a state store within the processor. What is the difference between them?
In order to let Kafka Streams manage the store for you (fault-tolerance, migration), Kafka Streams needs to be aware of the store. Thus, you give Kafka Streams a StoreBuilder and Kafka Streams creates and manages the store for you.
If you just create a store inside your processor, Kafka Streams is not aware of the store and the store won't be fault-tolerant.
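A sketch of the managed approach (store, processor, and topic names are placeholders): the StoreBuilder is handed to the topology, so Kafka Streams creates the store per task, writes its changelog topic, and restores or migrates it on rebalance; the processor only looks the store up by name.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class ManagedStoreExample {
    public static Topology build() {
        // Kafka Streams owns this store's lifecycle (logging is on by default).
        StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("counts"),
                Serdes.String(), Serdes.Long());

        Topology topology = new Topology();
        topology.addSource("source", "topic-1");
        topology.addProcessor("B", () -> new Processor<String, String, Void, Void>() {
            private KeyValueStore<String, Long> store;

            @Override
            public void init(ProcessorContext<Void, Void> context) {
                // Fetch the managed store by name instead of constructing one here.
                store = context.getStateStore("counts");
            }

            @Override
            public void process(Record<String, String> record) {
                Long old = store.get(record.key());
                store.put(record.key(), old == null ? 1L : old + 1L);
            }
        }, "source");
        topology.addStateStore(storeBuilder, "B"); // connect the store to processor "B"
        return topology;
    }
}
```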
For the state store it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (which the stream processor is listening to) and the number of directories that get created for the state store?
Yes, there is a mapping. The store is sharded based on the number of input topic partitions. You also get one "task" per partition, and the task directories are named y_z, with y being the sub-topology number and z being the partition number. For your simple topology there is only one sub-topology, so all the directories you see share the 0_ prefix.
Hence, your logical store has 10 physical shards. This sharding allows Kafka Streams to migrate state when the corresponding input topic partition is assigned to a different instance. Overall, you can run up to 10 instances, and each would process one partition and host one shard of your store.
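The y_z directory naming described above can be sketched as a plain simulation (not Kafka Streams code): y is the sub-topology number, z the partition number, so one sub-topology over a 10-partition topic yields 0_0 through 0_9.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskDirs {
    // One task directory per (sub-topology, partition) pair.
    static List<String> taskDirs(int subTopology, int partitions) {
        List<String> dirs = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            dirs.add(subTopology + "_" + p);
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Prints the ten directories 0_0 through 0_9 for sub-topology 0.
        System.out.println(taskDirs(0, 10));
    }
}
```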

Kafka local state store of multiple partitions

I am using the Kafka Processor API, and I create a state store from a topic with 3 partitions (I have 3 brokers); I have 1 instance of the stream application. When I query the local state store, can I get all keys? Why do certain keys work while others don't? Is this normal?
Thank you
The number of application instances does not matter in this case. Because the input topic has 3 partitions, the state store is created with 3 shards, and processing happens with 3 parallel tasks. Each task instantiates its own copy of your topology, processes one input topic partition, and uses one shard.
Compare: https://kafka.apache.org/21/documentation/streams/architecture
If you want to access the other shards, you can use the "Interactive Queries" feature for key/value lookups (and key-range queries) over all shards.
Also, there is the notion of a global state store, which loads data from all partitions into a single store (no sharding). However, it provides different semantics compared to "regular" stores, because store updates are not time-synchronized with the rest of the processing.
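Why a lookup against one local shard misses some keys can be sketched as follows. This uses String.hashCode() as a stand-in for Kafka's murmur2-based default partitioner, so the concrete shard numbers are illustrative only; the point is that each key deterministically lands on exactly one of the 3 shards.

```java
public class ShardLookup {
    // Stand-in for the default partitioner: key -> one shard out of numPartitions.
    static int shardOf(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {"alice", "bob", "carol"};
        int locallyHostedShard = 0; // suppose this task only hosts shard 0
        for (String k : keys) {
            int shard = shardOf(k, 3);
            // Only keys whose shard matches the locally hosted shard are found locally.
            System.out.println(k + " -> shard " + shard
                + (shard == locallyHostedShard ? " (found locally)" : " (on another shard)"));
        }
    }
}
```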

Kafka Streams - all instances local store pointing to the same topic

We have the following problem:
We want to listen on a certain Kafka topic and build its "history": for a specified key, extract some data, add it to an already existing list for that key (or create a new one if it does not exist), and put it to another topic, which has only a single partition and is highly compacted. Another app can then just listen on that topic and update its own history list.
I'm thinking how does it fit with Kafka streams library. We can certainly use aggregation:
msgReceived.map((key, word) -> new KeyValue<>(key, word))
           .groupBy((k, v) -> k, stringSerde, stringSerde)
           .aggregate(String::new,
               (k, v, stockTransactionCollector) -> stockTransactionCollector + "|" + v,
               stringSerde, "summaries2")
           .to(stringSerde, stringSerde, "transaction-summary50");
which creates a local store backed by Kafka and use it as history table.
My concern is that, if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts to consume the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
The question is: if I just set the same applicationId for each running instance, will each instance eventually replay all data from the very same Kafka topic and hold the same local state?
Why would you create multiple apps with different IDs to perform the same job? The way Kafka achieves parallelism is through tasks:
An application’s processor topology is scaled by breaking it into multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application, with each task assigned a list of partitions from the input streams (i.e., Kafka topics). The assignment of partitions to tasks never changes so that each task is a fixed unit of parallelism of the application.
Tasks can then instantiate their own processor topology based on the assigned partitions; they also maintain a buffer for each of its assigned partitions and process messages one-at-a-time from these record buffers. As a result stream tasks can be processed independently and in parallel without manual intervention.
If you need to scale your app, you can start new instances running the same app (same application ID), and some of the already assigned tasks will reassigned to the new instance. The migration of the local state stores will be automatically handled by the library:
When the re-assignment occurs, some partitions – and hence their corresponding tasks including any local state stores – will be “migrated” from the existing threads to the newly added threads. As a result, Kafka Streams has effectively rebalanced the workload among instances of the application at the granularity of Kafka topic partitions.
I recommend you have a look at this guide.
My concern is that, if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts to consume the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
Some assumptions are not correct:
if you run multiple instances of your application to scale it, all of them must have the same application ID (cf. Kafka's consumer group management protocol) -- otherwise, load will not be shared, because each instance will be considered its own application and will get all partitions assigned.
Thus, if all instances use the same application ID, all running application instances will use the same changelog topic name, and what you intend to do should work out of the box.
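The changelog-topic naming convention ${applicationId}-${storeName}-changelog can be sketched as follows; "history-builder" is a hypothetical application ID, while "summaries2" is the store name from the question's aggregation. With one shared application ID, every instance derives the same topic name.

```java
public class ChangelogName {
    // Kafka Streams derives changelog topic names as appId-storeName-changelog.
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    public static void main(String[] args) {
        // Both instances run with application.id = "history-builder" (hypothetical):
        System.out.println(changelogTopic("history-builder", "summaries2"));
        // -> history-builder-summaries2-changelog, identical on every instance
    }
}
```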

Is Kafka Stream StateStore global over all instances or just local?

In the Kafka Streams WordCount example, a StateStore is used to store word counts. If there are multiple instances in the same consumer group, is the StateStore global to the group, or just local to a consumer instance?
Thanks
This depends on your view on a state store.
In Kafka Streams, state is sharded, and each instance holds part of the overall application state. For example, stateful DSL operators use a local RocksDB instance to hold their shard of the state. In this regard, the state is local.
On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster; it consists of multiple partitions and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance in another, still-running instance. Thus, as the changelog is accessible by all application instances, it can be considered global, too.
Keep in mind, that the changelog is the truth of the application state and the local stores are basically caches of shards of the state.
Moreover, in the WordCount example, the record stream (the data stream) gets partitioned by word, such that the count of one word is maintained by a single instance (and different instances maintain the counts for different words).
For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html
Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
It's worth mentioning that there is a GlobalKTable improvement proposal:
GlobalKTable will be fully replicated once per KafkaStreams instance. That is, each KafkaStreams instance will consume all partitions of the corresponding topic.
From the Confluent Platform's mailing list, I've got this information
You could start prototyping using the Kafka 0.10.2 (or trunk) branch... 0.10.2-rc0 already has GlobalKTable!
Here's the actual PR.
And the person that told me that was Matthias J. Sax ;)
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever there is a use case of looking up data from the GlobalStateStore. Use context.forward(key, value, childName) to send data to the downstream nodes; it may be called multiple times within process() and punctuate() to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do it only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore that keeps the state of the store consistent across all running KafkaStreams instances.
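A sketch of such a Processor using the current processor API (the store name "global-store" and the child names are placeholders; note that in the newer API the old context.forward(key, value, childName) has become context.forward(record, childName)):

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class LookupProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> globalStore;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        // Read-only view of the global store registered via addGlobalStore(...);
        // do NOT write to it here -- only the global store's own update
        // processor may do that.
        this.globalStore = context.getStateStore("global-store");
    }

    @Override
    public void process(Record<String, String> record) {
        String enrichment = globalStore.get(record.key());
        // forward(...) may be called multiple times per input record,
        // targeting different named children:
        context.forward(record.withValue(record.value() + "|" + enrichment), "child-a");
        context.forward(record, "child-b");
    }
}
```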