Kafka Streams - all instances' local stores pointing to the same topic - apache-kafka

We have the following problem:
We want to listen on a certain Kafka topic and build its "history": for a specified key, extract some data, add it to the already existing list for that key (or create a new one if it does not exist), and put it to another topic, which has only a single partition and is highly compacted. Another app can then just listen on that topic and update its history list.
I'm wondering how this fits with the Kafka Streams library. We can certainly use aggregation:
msgReceived.map((key, word) -> new KeyValue<>(key, word))        // identity map; records already keyed
    .groupBy((k, v) -> k, stringSerde, stringSerde)              // group records by key
    .aggregate(String::new,                                      // initializer: start with an empty history
        (k, v, stockTransactionCollector) -> stockTransactionCollector + "|" + v,  // append each new value
        stringSerde, "summaries2")                               // name of the backing state store
    .to(stringSerde, stringSerde, "transaction-summary50");      // write to the compacted output topic
which creates a local store backed by Kafka that we can use as the history table.
My concern is, if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts to consume the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic states in their local store, as they get a completely new set of partitions to consume from.
The question is: if I just set the same applicationId for each running instance, will it eventually replay all data from the very same Kafka topic, so that each running instance ends up with the same local state?

Why would you create multiple apps with different IDs to perform the same job? The way Kafka achieves parallelism is through tasks:
An application’s processor topology is scaled by breaking it into multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application, with each task assigned a list of partitions from the input streams (i.e., Kafka topics). The assignment of partitions to tasks never changes so that each task is a fixed unit of parallelism of the application.
Tasks can then instantiate their own processor topology based on the assigned partitions; they also maintain a buffer for each of its assigned partitions and process messages one-at-a-time from these record buffers. As a result stream tasks can be processed independently and in parallel without manual intervention.
If you need to scale your app, you can start new instances running the same app (same application ID), and some of the already assigned tasks will be reassigned to the new instance. The migration of the local state stores is handled automatically by the library:
When the re-assignment occurs, some partitions – and hence their corresponding tasks including any local state stores – will be “migrated” from the existing threads to the newly added threads. As a result, Kafka Streams has effectively rebalanced the workload among instances of the application at the granularity of Kafka topic partitions.
I recommend you have a look at this guide.

My concern is, if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts to consume the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic states in their local store, as they get a completely new set of partitions to consume from.
Some assumptions are not correct:
if you run multiple instances of your application to scale your app, all of them must have the same application ID (cf. Kafka's consumer group management protocol) -- otherwise, load will not be shared, because each instance will be considered its own application and each instance will get all partitions assigned.
Thus, if all instances do use the same application ID, all running application instances will use the same changelog topic name, and thus what you intend to do should work out of the box.
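For illustration, a minimal configuration sketch (the application ID and broker address are placeholders, and the streams topology is assumed to have been built already, e.g. the aggregation above); every instance is started with the very same properties:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Same application.id on every instance -> one consumer group, shared
// changelog topics named <application.id>-<storeName>-changelog.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "history-builder");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

KafkaStreams streams = new KafkaStreams(topology, props); // identical topology everywhere
streams.start();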

Related

Can we share an application level cache between multiple Kafka Streams tasks

Let's say I have an in-memory cache in a Kafka Streams application. The input topic has 2 partitions, so for maximum parallelism I configure 1 Streams application instance with 2 threads.
Within my stream processor, I make a remote call to fetch some data and put it in a Map to cache it.
Since Kafka Streams will assign 1 thread to each task, and both tasks will try to update the cached map in parallel, do I have to take care of making the cached map thread-safe? Is it not advisable to share an application-level cache in an application instance that could be running multiple Kafka Streams tasks?
I believe what you are looking for is a GlobalKTable, which stores data from all the partitions. The way I see it, you would need to make that remote call, push the result into a topic, and then use that topic to create a GlobalKTable within the same app. A GlobalKTable is backed by a RocksDB instance which stores data on your "local" file system and can be queried by key, much like how you would query a Map.
Word of caution: GlobalKTable source topics can get really huge and might impact your startup times if you aren't using a persistent file system, since the GlobalKTable needs to be hydrated with all the data on the "source" topic (this is done by the GlobalStreamThread) before the app actually starts. So you might want to configure compaction on the "source" topic.
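If it helps, here is a rough sketch of building such a GlobalKTable from the "source" topic (the topic and store names are made up for illustration):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Each instance consumes ALL partitions of "lookup-topic" into a local RocksDB store,
// so lookups against the store behave like lookups against a fully populated Map.
GlobalKTable<String, String> lookup = builder.globalTable(
    "lookup-topic",  // the compacted "source" topic (placeholder name)
    Consumed.with(Serdes.String(), Serdes.String()),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("lookup-store"));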

Kafka Stateful Stream processor with statestore: Behind the scenes

I am trying to understand stateful stream processors.
As I understand it, this type of stream processor maintains some sort of state using a State Store.
I came to know that one of the ways to implement a State Store is using RocksDB. Assume the following topology (with only one processor being stateful):
A -> B -> C; processor B is stateful, with a local state store and changelog enabled. I am using the low-level API.
Assume the stream processor listens on a single Kafka topic, say topic-1 with 10 partitions.
I observed that when the application is started (2 instances on different physical machines and num.stream.threads = 5), it creates a directory structure for the state store that looks like:
0_0, 0_1, 0_2, ..., 0_9 (each machine has five, so ten partitions in total).
I was going through some online material which said we should create a StoreBuilder and attach it to the topology using addStateStore(), instead of creating a state store within a processor.
Like:
topology.addStateStore(storeBuilder, "processorName")
Ref also: org.apache.kafka.streams.state.Store
I didn't understand the difference between attaching a StoreBuilder to the topology vs. actually creating a state store within a processor. What are the differences between them?
The second part: for the state store it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (which the stream processor is listening to) and the number of directories that get created for the State Store?
I didn't understand the difference between attaching a StoreBuilder to the topology vs. actually creating a state store within a processor. What are the differences between them?
In order to let Kafka Streams manage the store for you (fault-tolerance, migration), Kafka Streams needs to be aware of the store. Thus, you give Kafka Streams a StoreBuilder and Kafka Streams creates and manages the store for you.
If you just create a store inside your processor, Kafka Streams is not aware of the store and the store won't be fault-tolerant.
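For illustration, a sketch of registering a managed store with the Processor API (the store, topic, and node names are placeholders, and MyProcessor is a hypothetical Processor implementation):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StoreBuilder<KeyValueStore<String, String>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("b-store"),  // RocksDB-backed; changelog enabled by default
        Serdes.String(),
        Serdes.String());

Topology topology = new Topology();
topology.addSource("source", "topic-1");
topology.addProcessor("B", MyProcessor::new, "source");
// Because the store is registered here, Kafka Streams creates it, shards it
// per input partition, and restores it from the changelog after failures.
topology.addStateStore(storeBuilder, "B");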
For the state store it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (which the stream processor is listening to) and the number of directories that get created for the State Store?
Yes, there is a mapping. The store is sharded based on the number of input topic partitions. You also get a "task" per partition, and the task directories are named y_z, with y being the sub-topology number and z being the partition number. For your simple topology there is only one sub-topology, so all the directories you see have the same 0_ prefix.
Hence, your logical store has 10 physical shards. This sharding allows Kafka Streams to migrate state when the corresponding input topic partition is assigned to a different instance. Overall, you can run up to 10 instances, and each would process one partition and host one shard of your store.
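As a concrete illustration (the application ID and store name are placeholders, and the default state.dir of /tmp/kafka-streams is assumed), the on-disk layout for such a 10-partition input would look roughly like:

/tmp/kafka-streams/my-app/0_0/rocksdb/my-store/
/tmp/kafka-streams/my-app/0_1/rocksdb/my-store/
...
/tmp/kafka-streams/my-app/0_9/rocksdb/my-store/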

Kafka Streams : Sharing globalStateStore across topologies

I have a Spring Boot application that uses the Processor API to generate a Topology and also adds a global state store (addGlobalStateStore) to the same topology.
I want to create another topology (and hence another KafkaStreams) for reading from another set of topics and want to share the previously created store in the new topology. By share I mean that the underlying state store should be the same for both topologies. Any data written from one topology should be visible in the other.
Is that possible without writing wrapper endpoints to access the state store, e.g. REST calls?
Or does my use case need an external state store, e.g. Redis?
No, you can't share state stores across topologies. Instead, if possible, you can combine them into a single topology, which will make the store available to all the processors defined in it.
If that is not possible for you, you can use external storage.
According to Stream Partitions and Tasks:
Sub-topologies (also called sub-graphs): If there are multiple processor topologies specified in a Kafka Streams application, each task only instantiates one of the topologies for processing. In addition, a single processor topology may be decomposed into independent sub-topologies (or sub-graphs). A sub-topology is a set of processors that are all transitively connected as parent/child or via state stores in the topology. Hence, different sub-topologies exchange data via topics and don't share any state stores. Each task may instantiate only one such sub-topology for processing. This further scales out the computational workload to multiple tasks.
This means that sub-topologies (hence topologies too) can't share any state stores.
Solution for your scenario:
create a single KafkaStreams instance whose topology contains everything you would otherwise put into your two distinct topologies. Because a single store is used by both initially distinct parts, there will be no sub-topology split, and a single task is created for the entire topology. This also means that the entire topology can be run by a single thread only (this is the main drawback): it can't be split into sub-topologies to be run by multiple threads. It doesn't mean, however, that the application as a whole can't be run by multiple threads, depending on the chosen parallelism (num.stream.threads).
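A rough sketch of that single-topology approach (node, topic, and store names are placeholders; ProcA and ProcB stand for the hypothetical processors of the two original topologies, and storeBuilder for the shared store's StoreBuilder):

import org.apache.kafka.streams.Topology;

Topology topology = new Topology();
topology.addSource("source-a", "topic-a");
topology.addSource("source-b", "topic-b");
topology.addProcessor("proc-a", ProcA::new, "source-a");
topology.addProcessor("proc-b", ProcB::new, "source-b");
// Connecting BOTH processors to the same store fuses everything into one
// sub-topology, so the shared state is visible to both processing paths.
topology.addStateStore(storeBuilder, "proc-a", "proc-b");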

Kafka Streams: Stream tasks moving across app instances

Consider a stream application setup with an input topic of 6 partitions that has a state store. Assume there is a constant inflow of over 5 million records each hour. If the application is run on a single node, the state for all the incoming records remains on the same node. Now, if we add another instance on a different node, I assume it would balance the partitions equally between the two instances (assume we set max threads to 3 in each instance).
I guess my question is: when a rebalance occurs and a partition moves from one instance to another, the state store has to be restored for those partitions on their respective instances, and that takes time. Wouldn't the frequent shuffling of partitions between instances (especially at significant volume) due to rebalancing be a major overhead and impact streaming performance? I am not sure if it is possible to completely prevent rebalancing (which I understand provides the load-balancing benefit), but wouldn't this prevent scaling up with multiple instances for the same topic that uses the store?
Kafka Streams uses its own implementation of PartitionAssignor (not the default one used by the KafkaConsumer) and implements a sticky assignment strategy. During a rebalance, it is known which partitions were assigned to which KafkaStreams instance, and partitions are reassigned to the same instance if possible, to avoid state movement. Load balancing also plays a role, of course, to allow for scaling scenarios.

Is Kafka Stream StateStore global over all instances or just local?

In the Kafka Streams WordCount example, a StateStore is used to store the word counts. If there are multiple instances in the same consumer group, is the StateStore global to the group, or just local to a consumer instance?
Thanks
This depends on your view of a state store.
In Kafka Streams, state is sharded, and thus each instance holds part of the overall application state. For example, stateful DSL operators use a local RocksDB instance to hold their shard of the state. In this regard, the state is local.
On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster; it consists of multiple partitions and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance on another, still-running instance. Thus, as the changelog is accessible by all application instances, it can be considered global, too.
Keep in mind that the changelog is the truth of the application state and the local stores are basically caches of shards of the state.
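As a hedged sketch of this relationship (the topic and store names are invented for illustration), the DSL shows both halves: a local materialized store plus its backing changelog, whose topic-level configuration you can even influence:

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();
// Extra settings for the changelog topic that backs the local store.
Map<String, String> changelogConfig = Collections.singletonMap("cleanup.policy", "compact");

builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store") // local RocksDB shard
        .withLoggingEnabled(changelogConfig)); // changes also go to <appId>-counts-store-changelog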
Moreover, in the WordCount example, a record stream (the data stream) gets partitioned by words, such that the count of one word will be maintained by a single instance (and different instances maintain the counts for different words).
For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html
Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
It is worth mentioning that there is a GlobalKTable improvement proposal:
GlobalKTable will be fully replicated once per KafkaStreams instance. That is, each KafkaStreams instance will consume all partitions of the corresponding topic.
From the Confluent Platform's mailing list, I got this information:
You could start prototyping using the Kafka 0.10.2 (or trunk) branch... 0.10.2-rc0 already has GlobalKTable!
Here's the actual PR.
And the person that told me that was Matthias J. Sax ;)
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic, whenever there is a use case of looking up data from a GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes; it may be called multiple times within process() and punctuate(), so as to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KafkaStreams instances.
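A hedged sketch of wiring such a global store with its update processor (all names are placeholders; globalStoreBuilder and GlobalUpdateProcessor are assumed to be defined elsewhere):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;

Topology topology = new Topology();
// The GlobalStreamThread replays "global-topic" through GlobalUpdateProcessor
// on every instance, which is what keeps the global store consistent everywhere.
topology.addGlobalStore(
    globalStoreBuilder,             // StoreBuilder for the global store
    "global-source",                // source node name
    Serdes.String().deserializer(),
    Serdes.String().deserializer(),
    "global-topic",                 // backing topic (placeholder)
    "global-processor",             // processor node name
    GlobalUpdateProcessor::new);    // processor that puts records into the store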