What is a Kafka Streams state store for? - apache-kafka

As I understand from the book, a Kafka Streams state store is an in-memory key/value store used to hold data before it is written to Kafka or after filtering.
I am confused by some theoretical questions.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for state storage in Kafka Streams?
Why is a topic not an alternative to state storage?

Why is a topic not an alternative to state storage?
A topic contains messages in sequential order; it essentially represents a log.
Sometimes we want to aggregate these messages, group them, and perform an operation such as a sum, then store the result in a place from which we can retrieve it later using a key. In this case, the ideal solution is a key-value store rather than a topic, which is a log structure.
What is a real use case for state storage in Kafka Streams?
A simple use case would be word count, where we have a word and a counter of how many times it has occurred; a minimal sketch follows. You can see more examples in kafka-streams-examples on GitHub.
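For illustration, here is a small word-count sketch using the Kafka Streams DSL. The topic names and application id are made up for the example; the `count(Materialized.as(...))` call is what materializes the counts into a state store:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");

        // Split each line into words, group by word, and count.
        // The running counts live in a state store named "word-counts".
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count(Materialized.as("word-counts"));

        counts.toStream().to("word-count-output",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```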
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
State can be considered a savepoint from which you can resume your data processing, or it might also contain some useful information needed for further processing (like the previous word count, which we need to increment), so it can be stored using Redis, RocksDB, Postgres, etc.
Redis can be plugged in as Kafka Streams state storage; however, the default persistent state store for Kafka Streams is RocksDB.
Therefore, Redis is not an alternative to Kafka Streams state but an alternative to Kafka Streams' default RocksDB store.
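As a sketch of what "pluggable" means here: the DSL lets you pick the store implementation through a store supplier (a Redis-backed store would be a custom KeyValueBytesStoreSupplier; the two built-in suppliers below are the standard ones). Topic and store names are hypothetical:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// Default-style persistent store: RocksDB on local disk, backed by a changelog topic.
builder.table("orders",
        Materialized.as(Stores.persistentKeyValueStore("orders-rocksdb-store")));

// Pure in-memory store: no local files, still restored from the changelog on restart.
builder.table("payments",
        Materialized.as(Stores.inMemoryKeyValueStore("payments-mem-store")));
```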

Why is a topic not an alternative to state storage?
A topic is the final state store storage under the hood (everything is a topic in Kafka).
If you create a microservice named "myStream" with a state store named "MyState", you'll see a myStream-MyState-changelog topic appear, which has a history of all changes to the state store.
RocksDB is only a local cache to improve performance, with a first layer of local backup on the local disk, but in the end the real high availability and exactly-once processing guarantees are provided by the underlying changelog topic.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for state storage in Kafka Streams?
It is not a storage system; it is a local, efficient, guaranteed memory state for handling some business cases in a fully streamed way.
As an example:
For each incoming order (Topic1), I want to find any previous order (Topic2) to the same location in the last 6 hours.
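One possible shape for that example, assuming both topics are keyed by location id, carry the order payload as a String, and that String serdes are configured as defaults; everything except the two topic names is hypothetical:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> incoming = builder.stream("Topic1");
KStream<String, String> previous = builder.stream("Topic2");

// Match each incoming order with previous orders for the same location
// seen up to 6 hours earlier. The join buffers both sides in windowed
// state stores under the hood, each backed by a changelog topic.
KStream<String, String> matches = incoming.join(
        previous,
        (order, prevOrder) -> order + " | previous: " + prevOrder,
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(6))
                .after(Duration.ZERO)); // only look backwards in time

matches.to("matched-orders");
```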

Related

Can a compacted Kafka topic be used as a key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key-value database instead.
Compacted Kafka topics themselves and the basic Consumer/Producer Kafka APIs are not suitable for a key-value database. They are, however, widely used as a backing store to persist KV database/cache data, e.g. in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it (a sketch follows).
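A minimal sketch of that re-warm step with the plain consumer API; the topic name and String serdes are assumptions:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

Map<String, String> cache = new HashMap<>();
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    String topic = "cache-backing-topic"; // hypothetical compacted topic
    List<TopicPartition> parts = consumer.partitionsFor(topic).stream()
            .map(p -> new TopicPartition(topic, p.partition()))
            .collect(Collectors.toList());
    consumer.assign(parts);
    consumer.seekToBeginning(parts);
    Map<TopicPartition, Long> end = consumer.endOffsets(parts);

    // Replay until caught up: later records win, tombstones delete keys.
    while (parts.stream().anyMatch(tp -> consumer.position(tp) < end.get(tp))) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
            if (rec.value() == null) cache.remove(rec.key());
            else cache.put(rec.key(), rec.value());
        }
    }
}
```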
In the Kafka world, you have the Kafka Streams API, which allows you to expose the state of your application (for your KV use case, e.g. the latest state of an order) by means of queryable state stores. A state store is an abstraction over a KV database and is actually implemented using a fast KV database called RocksDB; in case of disaster it is fully recoverable because its full data is persisted in a Kafka topic, so it is resilient enough to be the source of the data for your use case.
Imagine that this is your Kafka Streams application architecture:
[Image: Kafka Streams application architecture diagram]
To be able to query these Kafka Streams state stores, you need to bundle an HTTP server and REST API into your Kafka Streams application to query its local or remote state stores (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because the Kafka Streams API provides the metadata for you to know on which instance a key resides, you can query any instance and, if the key exists, a response can be returned regardless of the instance where the key lives; a sketch of this lookup is included further below.
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
The images were sourced from a wider article by Robert Schmid, where you can find additional details and a prototype implementing queryable state stores with Kafka Streams.
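For reference, a minimal sketch of the local-or-remote lookup, assuming a running KafkaStreams instance named `streams` and a hypothetical store named "orders-store":

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// `streams` is your running KafkaStreams instance.
// Query the shard of the store held by THIS instance.
ReadOnlyKeyValueStore<String, String> store = streams.store(
        StoreQueryParameters.fromNameAndType(
                "orders-store", QueryableStoreTypes.keyValueStore()));
String order = store.get("order-42");

// If the key is not local, ask Kafka Streams which instance owns it,
// then forward the HTTP request to that host from your REST layer.
KeyQueryMetadata meta = streams.queryMetadataForKey(
        "orders-store", "order-42", Serdes.String().serializer());
String host = meta.activeHost().host();
int port = meta.activeHost().port();
```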
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent, which provides an even higher-level abstraction on top of Kafka Streams, using a simple SQL dialect to achieve the same sort of use case with pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open source, free, and built on top of the Kafka Streams API.

Can we share an application-level cache between multiple Kafka Streams tasks?

Let's say I have an in-memory cache in a Kafka Streams application. The input topic has 2 partitions, so for maximum parallelism I configure 1 streams application instance with 2 threads.
Within my stream processor, I make a remote call to fetch some data and put it in a Map to cache it.
Since Kafka Streams will assign 1 thread to each task, and both tasks will try to update the cached map in parallel, do I have to make the cached map thread-safe? Is it inadvisable to share an application-level cache within an application instance that could be running multiple Kafka Streams tasks?
I believe what you are looking for is a GlobalKTable, which stores data from all partitions. The way I see it, you would make that remote call, push the result into a topic, and then use that topic to create a GlobalKTable within the same app (see the sketch below). A GlobalKTable is backed by a RocksDB instance that stores data on your local file system and can be queried by key, much like you would query a Map.
Word of caution: GlobalKTable source topics can get really huge and might impact your startup times if you aren't using a persistent file system, since the GlobalKTable needs to be hydrated with all the data in the source topic (this is done by the GlobalStreamThread) before the app actually starts. So you might want to configure compaction on the source topic.
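A minimal sketch of that pattern; the topic names and join logic are made up for illustration, and String serdes are assumed:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// Every instance gets a full copy, hydrated on startup by the GlobalStreamThread.
GlobalKTable<String, String> lookup = builder.globalTable(
        "remote-data-topic",                  // hypothetical topic fed by the remote call
        Consumed.with(Serdes.String(), Serdes.String()));

KStream<String, String> input = builder.stream("input-topic");

// Join against the global table instead of sharing an in-process Map;
// reads are local RocksDB lookups, so no thread-safety issues arise.
input.join(lookup,
        (key, value) -> key,                  // extract the lookup key per record
        (value, lookupValue) -> value + "|" + lookupValue)
     .to("enriched-output");
```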

Kafka Streams processor taking long time to consume changelog topics and initialize state stores

I'm working on a stream processor that has a KStream-KStream join and a KStream-KTable join, and that also uses a state store to remove duplicates while doing the join.
We have been performing load tests on this processor, and as the messages in the topic grow, the stream processor takes a long time (~1 hour) to consume the changelog topics and initialize the state stores whenever a restart/redeployment happens.
We have a retention of 7 days for the topics.
There are multiple reasons why this happens:
Your broker performance, i.e. how much data your KStream app can pull from each broker
Your KStream performance
Your serialization format (if you use Avro, the data size will be way smaller)
The solution to avoid expensive restarts is to have a persistent local state store. For example, you can map the default state store folder (/tmp/kafka-streams) to some sort of persistent volume, as sketched below.
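A configuration sketch, assuming the persistent volume is mounted at /var/lib/kafka-streams (the path and application id are hypothetical); the standby-replica setting is a related option that also reduces restore time:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Point the state directory at a persistent volume so a restart reuses
// the local RocksDB files and only replays the changelog delta.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

// Related option: keep a warm standby copy of the state on another
// instance so failover does not require a full changelog restore.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
```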

Join streams based on key - Spark/Kafka

Suppose we have 2 streams from Spark, and one of the streams is not 100% in sync with the other; there might be a difference in when the data arrives. We need to join the streams by key. Is there any way we can do it without any persistence?
I don't think it is possible. Kafka Streams ships with built-in support to interpret the data in a Kafka topic as such a continuously updated table; in the Kafka Streams DSL this is achieved via the so-called KTable.
These KTables are backed by state stores in Kafka Streams. The state stores are local to your application (more precisely, local to the instances of your application, of which there can be one or many), which means that interacting with these state stores does not require talking over the network, so read and write operations are very fast. In case you decide not to persist data, you might start losing information that you might not want to lose. A sketch of such a join is below.
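For illustration, a minimal KStream-KTable join in Kafka Streams (topic names hypothetical, default String serdes assumed):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// The table side is materialized into a local state store (RocksDB by
// default) and can be restored from its changelog topic after a failure.
KTable<String, String> table = builder.table("reference-topic");
KStream<String, String> stream = builder.stream("events-topic");

// Each lookup hits the local store; no network round trip per record.
stream.join(table, (event, ref) -> event + " enriched with " + ref)
      .to("joined-output");
```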

Is a Kafka Streams StateStore global over all instances or just local?

In the Kafka Streams WordCount example, a StateStore is used to store the word counts. If there are multiple instances in the same consumer group, is the StateStore global to the group, or local to each consumer instance?
Thanks
This depends on your view of a state store.
In Kafka Streams, state is sharded and thus each instance holds part of the overall application state. For example, the DSL's stateful operators use a local RocksDB instance to hold their shard of the state. In this regard, the state is local.
On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster; it consists of multiple partitions and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance in another still-running instance. Thus, as the changelog is accessible by all application instances, it can be considered global, too.
Keep in mind that the changelog is the truth of the application state and the local stores are basically caches of shards of the state.
Moreover, in the WordCount example, the record stream (the data stream) gets partitioned by word, such that the count of one word is maintained by a single instance (and different instances maintain the counts for different words).
For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html
Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
It's worth mentioning that there is a GlobalKTable improvement proposal:
GlobalKTable will be fully replicated once per KafkaStreams instance. That is, each KafkaStreams instance will consume all partitions of the corresponding topic.
From the Confluent Platform's mailing list, I've got this information:
You could start prototyping using the Kafka 0.10.2 (or trunk) branch... 0.10.2-rc0 already has GlobalKTable!
Here's the actual PR.
And the person that told me that was Matthias J. Sax ;)
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever there is a use case of looking up data from a GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes; it may be called multiple times within process() and punctuate() to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KStream instances. A sketch of such a global-store updater is below.
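A minimal sketch of registering a global store with its updater processor, using the newer Processor API; the store, source, topic, and processor names are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StoreBuilder<KeyValueStore<String, String>> globalStore =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("global-lookup"),
                Serdes.String(), Serdes.String())
        .withLoggingDisabled(); // global stores must not have a changelog

Topology topology = new Topology();

// The processor passed to addGlobalStore is the ONLY place that may
// write to the global store; the GlobalStreamThread runs it on every
// instance to keep all copies consistent.
topology.addGlobalStore(globalStore, "global-source",
        Serdes.String().deserializer(), Serdes.String().deserializer(),
        "lookup-topic", "global-updater",
        () -> new Processor<String, String, Void, Void>() {
            private KeyValueStore<String, String> store;

            @Override
            public void init(ProcessorContext<Void, Void> context) {
                store = context.getStateStore("global-lookup");
            }

            @Override
            public void process(Record<String, String> record) {
                // Mirror the source topic into the store, key by key.
                store.put(record.key(), record.value());
            }
        });
```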