Can we share an application level cache between multiple Kafka Streams tasks?

Let's say I have an in-memory cache in a Kafka Streams application. The input topic has 2 partitions, so for maximum parallelism I configure one Streams application instance with 2 threads.
Within my stream processor, I make a remote call to fetch some data and put it in a Map to cache it.
Since Kafka Streams will assign one thread to each task, and both tasks will try to update the cached Map in parallel, do I have to take care of making the cached Map thread-safe? Or is it simply not advisable to share an application-level cache within an application instance that may be running multiple Kafka Streams tasks?

I believe what you are looking for is a GlobalKTable, which stores data from all partitions. The way I see it, you would need to make that remote call, push the result into a topic, and then use that topic to create a GlobalKTable within the same app. A GlobalKTable is backed by a RocksDB instance that stores data on your "local" file system and can be queried by key, much like you would query a Map.
Word of caution: GlobalKTable source topics can get really huge and might impact your startup times if you aren't using a persistent file system, since the GlobalKTable needs to be hydrated with all the data on the "source" topic (this is done by the GlobalStreamThread) before the app actually starts. So you might want to configure compaction on the "source" topic.
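For illustration, a minimal sketch of that approach - the topic names ("remote-data", "input-topic") and the String serdes are assumptions, not from the question:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// compacted topic that your app fills with the results of the remote calls;
// every application instance reads all of its partitions
GlobalKTable<String, String> remoteData = builder.globalTable("remote-data",
        Consumed.with(Serdes.String(), Serdes.String()));

KStream<String, String> input = builder.stream("input-topic",
        Consumed.with(Serdes.String(), Serdes.String()));

// look up the cached value by key instead of sharing an in-process Map between tasks
KStream<String, String> enriched = input.join(
        remoteData,
        (key, value) -> key,                       // which global-table key to look up
        (value, cached) -> value + "|" + cached);  // combine the record with the cached data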

Related

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system, and we have a couple of downstream systems that require our data for their needs. Currently we publish events onto Kafka topics whenever there is a change in the source system, for them to consume and apply changes to their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently also access our database directly to do a complete refresh of their tables on demand once in a while, to make sure the data is in sync - as you know, a full data refresh is sometimes needed when data seems out of sync for some reason.
We are planning to stop giving direct access to our database. How can we achieve this? Is there a way for consumers to request the data they need, for example by passing a request to us via some trigger, so that we can publish the stream of data for them to consume on their end, and they either sync their tables or load the bulk data into memory to perform tasks based on their needs?
We have currently written RESTful APIs to provide data based on requests, but we only expose small data volumes that way, since an API is only practical for smaller amounts of data; it won't work when we want to send millions of records to consumers. I believe the only way is to stream the data over Kafka, but with Kafka how can we respond to a consumer's request and pump only that specific data onto Kafka topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.
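As a sketch, the same retention config can also be set programmatically with the Kafka Admin client - the topic name "change-events" is a placeholder, and this assumes brokers recent enough to support incrementalAlterConfigs:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (Admin admin = Admin.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "change-events");
    // retention.ms = -1 keeps messages on this topic forever
    AlterConfigOp keepForever =
            new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
    Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(keepForever));
    admin.incrementalAlterConfigs(updates).all().get();
}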

Kafka streams state store for what?

As I understood from the book, a Kafka Streams state store is an in-memory key/value storage used to hold data going to Kafka, or data after filtering.
I am confused by some theoretical questions.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for state storage in Kafka Streams?
Why is a topic not an alternative to state storage?
Why is a topic not an alternative to state storage?
A topic contains messages in a sequential order that typically represents a log.
Sometimes we want to aggregate these messages, group them and perform an operation, like a sum, for example, and store the result in a place from which we can retrieve it later by key. In this case, an ideal solution is a key-value store rather than a topic, which is a log structure.
What is a real use case for state storage in Kafka Streams?
A simple use case would be word count, where we have a word and a counter of how many times it has occurred. You can see more examples in kafka-streams-examples on GitHub.
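For instance, a minimal word-count topology sketch - the topic and store names ("text-input", "word-counts", "counts-store") are placeholders:

import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// the running count per word lives in the "counts-store" key-value state store
KTable<String, Long> wordCounts = builder
        .stream("text-input", Consumed.with(Serdes.String(), Serdes.String()))
        .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
        .groupBy((key, word) -> word)
        .count(Materialized.as("counts-store"));

wordCounts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));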
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
State can be considered a savepoint from which you can resume your data processing, or it might also contain some useful information needed for further processing (like the previous word count, which we need to increment), so it can be stored using Redis, RocksDB, Postgres, etc.
Redis can be a plugin for Kafka Streams state storage; however, the default persistent state store for Kafka Streams is RocksDB.
Therefore, Redis is not an alternative to Kafka Streams state but an alternative to Kafka Streams' default RocksDB.
Why is a topic not an alternative to state storage?
A topic is the final state store storage under the hood (everything is a topic in Kafka).
If you create a microservice named "myStream" with a state store named "MyState", you'll see a myStream-MyState-changelog topic appear, which holds the history of all changes to the state store.
RocksDB is only a local cache to improve performance, with a first layer of backup on the local disk, but in the end the real high availability and exactly-once processing guarantees are provided by the underlying changelog topic.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for state storage in Kafka Streams?
It's not a storage system as such; it's just local, efficient, guaranteed in-memory state used to handle some business cases in a fully streamed way.
As an example:
For each incoming order (Topic1), I want to find any previous order (Topic2) to the same location in the last 6 hours.
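A sketch of how that order example could look with a windowed stream-stream join, assuming both topics are keyed by location and use String serdes (the serdes and the exact window bounds are illustrative):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> incomingOrders = builder.stream("Topic1",
        Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> previousOrders = builder.stream("Topic2",
        Consumed.with(Serdes.String(), Serdes.String()));

// The join keeps windowed state stores for both sides; each incoming order is
// matched against orders for the same location key seen in the previous 6 hours.
KStream<String, String> matches = incomingOrders.join(
        previousOrders,
        (newOrder, oldOrder) -> newOrder + " matched previous " + oldOrder,
        JoinWindows.of(Duration.ofHours(6)).after(Duration.ZERO),
        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));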

Kafka: topic compaction notification?

I was given the following architecture that I'm trying to improve.
I receive a stream of DB changes which end up in a compacted topic. The stream is basically key/value pairs and the keyspace is large (~4 GB).
The topic is consumed by one Kafka Streams process that stores the data in RocksDB (separate for each consumer/shard). The processor does two different things:
join the data into another stream.
check if a message from the topic is a new key or an update to an existing one. If it is an update it sends the old key/value and the new key/value pair to a different topic (updates are rare).
The construct has a couple of problems:
The two different functionalities of the stream processor belong to different teams and should not be part of the same code base. They were put together to save memory. If we separated them, we would have to duplicate the RocksDB stores.
I would prefer to use a normal KTable join instead of the handcrafted join that's currently in the code.
RocksDB seems to be a bit of overkill if the data is already persisted in a topic. We are currently running into some performance issues, and I assume it would be faster if we just kept everything in memory.
Question 1:
Is there a way to hook into the compaction process of a compacted topic? I would like a notification (to a different topic) for every key that is actually compacted (including the old and new value).
If this is somehow possible I could easily split the code bases apart and simplify the join.
Question 2:
Any other idea on how this can be solved more elegantly?
Your overall design makes sense.
About your join semantics: I guess you need to stick with the Processor API, as a regular KTable join cannot provide what you want. It's also not possible to hook into the compaction process.
However, Kafka Streams also supports in-memory state stores: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html#state-stores
RocksDB is used by default to allow the state to be larger than the available main memory. Spilling to disk with RocksDB also adds reliability -- and it has the further advantage that stores can be recreated more quickly if an instance comes back online on the same machine, as it's not required to re-read the whole changelog topic.
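As a rough sketch with the Processor API's store builders (the store name "updates-store" is made up), an in-memory store with a changelog could be declared like this:

import java.util.Collections;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// keeps all entries on the heap instead of RocksDB; the changelog topic still
// provides fault tolerance, but recovery has to re-read the whole changelog
StoreBuilder<KeyValueStore<String, String>> updatesStore =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("updates-store"),
                Serdes.String(),
                Serdes.String())
        .withLoggingEnabled(Collections.emptyMap());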
Whether you want to split the app into two is your own decision, depending on how many resources you want to provide.

Kafka Streams - all instances local store pointing to the same topic

We have the following problem:
We want to listen on a certain Kafka topic and build its "history" - so for a specified key, extract some data, add it to an already existing list for that key (or create a new one if it does not exist), and put it into another topic, which has only a single partition and is highly compacted. Another app can then just listen on that topic and update its own history list.
I'm thinking about how this fits with the Kafka Streams library. We can certainly use aggregation:
msgReceived.map((key, word) -> new KeyValue<>(key, word))
           .groupBy((k, v) -> k, stringSerde, stringSerde)
           .aggregate(String::new,
                      (k, v, stockTransactionCollector) -> stockTransactionCollector + "|" + v,
                      stringSerde, "summaries2")
           .to(stringSerde, stringSerde, "transaction-summary50");
which creates a local store backed by Kafka and uses it as a history table.
My concern is that if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts consuming the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
The question is: if I just set the same applicationId for each running instance, will it eventually replay all data from the very same Kafka topic, so that each running instance ends up with the same local state?
Why would you create multiple apps with different IDs to perform the same job? The way Kafka achieves parallelism is through tasks:
An application’s processor topology is scaled by breaking it into multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application, with each task assigned a list of partitions from the input streams (i.e., Kafka topics). The assignment of partitions to tasks never changes so that each task is a fixed unit of parallelism of the application.
Tasks can then instantiate their own processor topology based on the assigned partitions; they also maintain a buffer for each of its assigned partitions and process messages one-at-a-time from these record buffers. As a result stream tasks can be processed independently and in parallel without manual intervention.
If you need to scale your app, you can start new instances running the same app (same application ID), and some of the already assigned tasks will be reassigned to the new instance. The migration of the local state stores is handled automatically by the library:
When the re-assignment occurs, some partitions – and hence their corresponding tasks including any local state stores – will be “migrated” from the existing threads to the newly added threads. As a result, Kafka Streams has effectively rebalanced the workload among instances of the application at the granularity of Kafka topic partitions.
I recommend you have a look at this guide.
My concern is that if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts consuming the input topic, gets a different set of keys, and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
Some assumptions are not correct:
if you run multiple instances of your application to scale it, all of them must have the same application ID (cf. Kafka's consumer group management protocol) -- otherwise, the load will not be shared, because each instance will be considered its own application and each instance will get all partitions assigned.
Thus, if all instances use the same application ID, all running application instances will use the same changelog topic name, and what you intend to do should work out of the box.
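For illustration, a minimal configuration sketch - the application ID and bootstrap servers are placeholders; every instance you start must use the exact same application.id:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// all instances share this ID, so they form one consumer group and one logical app
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "history-builder");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// for the "summaries2" store above, the changelog topic would then be named
// history-builder-summaries2-changelog on every instance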

Is Kafka Stream StateStore global over all instances or just local?

In the Kafka Streams WordCount example, a StateStore is used to store the word counts. If there are multiple instances in the same consumer group, is the StateStore global to the group, or just local to a consumer instance?
Thanks
This depends on your view of a state store.
In Kafka Streams the state is sharded, and thus each instance holds part of the overall application state. For example, DSL stateful operators use a local RocksDB instance to hold their shard of the state. In this regard, the state is local.
On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster, consists of multiple partitions, and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance in another, still-running instance. Thus, as the changelog is accessible by all application instances, it can be considered global, too.
Keep in mind that the changelog is the source of truth for the application state, and the local stores are basically caches of shards of the state.
Moreover, in the WordCount example, a record stream (the data stream) gets partitioned by words, such that the count of one word will be maintained by a single instance (and different instances maintain the counts for different words).
For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html
Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
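The interactive queries described in that blog post make the local-shard nature visible: each instance can only answer for the keys it hosts. A rough sketch, assuming a running KafkaStreams instance called streams, a store named "counts-store", and a Kafka Streams version with the StoreQueryParameters API (2.5+); all of these names are placeholders:

import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// query only the shard of the state held by this instance
ReadOnlyKeyValueStore<String, Long> localCounts = streams.store(
        StoreQueryParameters.fromNameAndType("counts-store",
                QueryableStoreTypes.keyValueStore()));

Long count = localCounts.get("kafka");  // null if this key's partition lives on another instance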
It is worth mentioning that there is a GlobalKTable improvement proposal:
GlobalKTable will be fully replicated once per KafkaStreams instance. That is, each KafkaStreams instance will consume all partitions of the corresponding topic.
From the Confluent Platform's mailing list, I've got this information:
You could start prototyping using Kafka 0.10.2 (or trunk) branch... 0.10.2-rc0 already has GlobalKTable!
Here's the actual PR.
And the person that told me that was Matthias J. Sax ;)
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever there is a use case of looking up data from a GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes. context.forward(key, value, childName) may be called multiple times in process() and punctuate(), so as to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KafkaStreams instances.
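A rough sketch of such a processor using the older Processor API this answer refers to - the store name "global-lookup-store" and the child node names are made up, and newer Kafka Streams versions replace the (key, value, childName) overload of forward with To.child(...):

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class EnrichingProcessor extends AbstractProcessor<String, String> {

    private KeyValueStore<String, String> globalStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        // the global store must have been registered via addGlobalStore(..) on the topology
        globalStore = (KeyValueStore<String, String>) context.getStateStore("global-lookup-store");
    }

    @Override
    public void process(String key, String value) {
        String lookup = globalStore.get(key);   // read-only lookup from the global store
        // forward to named child nodes; may be called multiple times per record
        context().forward(key, value + "|" + lookup, "enriched-sink");
        context().forward(key, value, "audit-sink");
    }
}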