Kafka join storage - apache-kafka

I use Kafka to join two streams with a 3-day join window:
private final long retentionHours = Duration.ofDays(3).toHours();
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours));
var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde);
stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
With the above implementation, I found that Kafka created a state folder, /tmp/kafka-streams (it looks like RocksDB), and it grows constantly.
Also, the state store in the Kafka cluster grows constantly.
So, I changed the streams join implementation to:
private final long retentionHours = Duration.ofDays(3).toHours();
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours));
var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde);
stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
private WindowBytesStoreSupplier createStoreSupplier(String storeName) {
    // The supplier's constructor takes the retention period and window size in milliseconds.
    long window = Duration.ofHours(retentionHours * 2).toMillis();
    return new InMemoryWindowBytesStoreSupplier(storeName, window, window, true);
}
Now, there is no state folder: /tmp/kafka-streams.
Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all?
If yes, how does it work?
Also, I still see that the state store in the Kafka cluster grows constantly.

Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all? If yes, how does it work?
IIRC, InMemoryWindowBytesStore doesn't use disk at all.
Generally speaking, a logical state store is in fact partitioned into multiple state store 'instances' (think: each stream task has its own, local state store instance). For the InMemoryWindowBytesStore specifically, and by design, these store instances manage all their local data in memory.
Also, I still see that the state store in the Kafka cluster grows constantly.
However, the InMemoryWindowBytesStore is still fault-tolerant. This is often confusing for new Kafka Streams developers because, in most software, "in memory" always implies "data is lost if something happens". This is not the case with Kafka Streams, however. A state store is always 'backed up' durably to its Kafka changelog topic, regardless of whether you use the default state store (with RocksDB) or the in-memory state store. This explains why you see the in-memory state's (changelog) data in the Kafka cluster. The data should not grow forever, btw, as changelog topics are compacted to prevent exactly this scenario.
Note: What can happen, however, when using the in-memory store is that your application instances could run out of memory (OOM), and thus crash. While your state data will never be lost, as explained above, your application will not be running due to the OOM crash / it will run only partially (some app instances run OOM, others do not). This OOM problem doesn't apply to the default store (RocksDB), as it manages its data on disk, and uses memory (RAM) only for caching purposes. But, again, this question of app availability is orthogonal to data safety (your data is safe regardless of whether your app is crashing or not).
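To make the "in memory, yet still fault-tolerant" idea concrete, here is a toy model in plain Java. It is not how Kafka Streams is implemented: ToyChangelogStore and its methods are hypothetical names, a HashMap stands in for the in-memory store, and a List stands in for the changelog topic. It only illustrates that every write is also appended to a durable log, and that replaying the log rebuilds the lost memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a HashMap stands in for the in-memory store and a List
// stands in for the Kafka changelog topic. All names here are hypothetical.
class ToyChangelogStore {
    private final Map<String, String> inMemory = new HashMap<>();
    private final List<String[]> changelog; // each entry is {key, value}

    ToyChangelogStore(List<String[]> changelog) {
        this.changelog = changelog;
    }

    // Every write goes to memory and is also appended to the durable log.
    void put(String key, String value) {
        inMemory.put(key, value);
        changelog.add(new String[] {key, value});
    }

    String get(String key) {
        return inMemory.get(key);
    }

    // After a crash the memory is gone; replaying the log rebuilds it.
    // Later entries overwrite earlier ones, so only the latest value per key survives.
    static ToyChangelogStore restore(List<String[]> changelog) {
        ToyChangelogStore store = new ToyChangelogStore(changelog);
        for (String[] entry : changelog) {
            store.inMemory.put(entry[0], entry[1]);
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        ToyChangelogStore store = new ToyChangelogStore(log);
        store.put("k1", "a");
        store.put("k1", "b");
        store.put("k2", "c");

        // Simulate a crash: drop the store, keep only the log.
        ToyChangelogStore recovered = ToyChangelogStore.restore(log);
        System.out.println(recovered.get("k1")); // prints "b"
        System.out.println(recovered.get("k2")); // prints "c"
    }
}
```

Because only the latest value per key matters for the restore, older log entries for the same key can be discarded, which is the intuition behind compacting a changelog topic.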


Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
I am required to keep data loss to an absolute minimum while keeping recovery quick (avoiding reprocessing messages, because it is expensive).
I realized that if there were some kind of failure, such as my microservice crashing, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to a file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, right after writing to the file is successful.
But then, upon further thinking, I realized that if there were a failure on the file system, I might not find my files; for example, a cloud file service still has a chance of failure even if its marketed availability is above 99%. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.
For example, in your consumer-worker reading loop (pseudocode):
MessageAndOffset = getMsg();
//do your things
saveOffsetInQueueToDB(offset);
saveOffsetInQueueToDB is responsible for adding the offset to a Queue/List, or whatever. This operation is only done once the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when shutdown is captured, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA-backed storage system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
If there's a crash, you could read all messages from the beginning (the easiest way is to change the group.id and set auto.offset.reset to earliest) while discarding those whose offsets are already recorded in the database, which avoids reprocessing them. For example, by adding a condition in your loop (yep, pseudocode again):
MessageAndOffset = getMsg();
if (offset.notIncluded(offsetListFromDB))
//do your things
You could implement a better-performing algorithm than a "not included" check by storing only the last processed offset for each partition in a HashMap, and then checking whether each message's offset is greater than the one stored for its partition. For example, if partition 0's last offset was 558 and partition 1's was 600:
//offsetMap = {[0,558],[1,600]}
MessageAndOffset = getMsg();
//get partition => 0
if (offset > offsetMap.get(partition))
//do your things
This way, you guarantee that only the non-processed messages from each partition will be processed.
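The per-partition check above can be sketched in plain Java as follows. This is only an illustration under assumptions: OffsetFilter and its method names are hypothetical, the map stands in for the offsets reloaded from the external store after a crash, and it assumes messages within a partition are processed in offset order (so a single "highest offset" per partition is enough).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: OffsetFilter and its methods are hypothetical names, and the
// map stands in for offsets loaded from the external store after a crash.
class OffsetFilter {
    // partition -> highest offset already processed
    private final Map<Integer, Long> lastProcessed = new HashMap<>();

    OffsetFilter(Map<Integer, Long> restoredOffsets) {
        lastProcessed.putAll(restoredOffsets);
    }

    // True if the message at (partition, offset) has not been processed yet.
    boolean shouldProcess(int partition, long offset) {
        Long last = lastProcessed.get(partition);
        return last == null || offset > last;
    }

    // Call only after the message has been processed successfully.
    void markProcessed(int partition, long offset) {
        lastProcessed.merge(partition, offset, Long::max);
    }

    public static void main(String[] args) {
        Map<Integer, Long> restored = new HashMap<>();
        restored.put(0, 558L);
        restored.put(1, 600L);
        OffsetFilter filter = new OffsetFilter(restored);

        System.out.println(filter.shouldProcess(0, 558)); // false: already processed
        System.out.println(filter.shouldProcess(0, 559)); // true: new message
        System.out.println(filter.shouldProcess(2, 0));   // true: partition never seen
    }
}
```

Keeping one number per partition makes the lookup constant-time and keeps the persisted state tiny compared with a list of every processed offset.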
Regarding file system failures, that's why Kafka comes as a cluster: fault tolerance in Kafka is achieved by copying the partition data to other brokers, which are known as replicas.
So if you have 5 brokers and a replication factor of 5, for example, all 5 brokers must fail at the same time (I guess the brokers are on separate hosts) for any data to be lost. Even 4 brokers could fail at the same time without losing any data.
With that replication factor, all brokers hold the same data and the same partitions. If a filesystem error occurs in one of the brokers, the others will still hold all the information.

Kafka streams state store distribution

I have a Kafka application that runs on multiple instances, and I want to use a state store for caching a few data fields. With multiple application instances, if one instance goes down, does the local state store of that instance get copied to another instance? What happens when the instance comes back? How are the state stores connected to the data keys for proper redistribution?
if one instance goes down, does the local state store of that instance get copied to another instance?
If you don't have a standby replica, then the task will read the changelog topic from the beginning to rebuild the store, effectively making a copy, yes.
From the docs:
Starting in 2.6, Kafka Streams will guarantee that a task is only ever assigned to an instance with a fully caught-up local copy of the state, if such an instance exists. Standby tasks will increase the likelihood that a caught-up instance exists in the case of a failure
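Standby replicas are off by default (num.standby.replicas defaults to 0), so they have to be enabled in the Streams configuration. A minimal sketch using plain java.util.Properties, where the application id and broker address are placeholder values chosen for illustration:

```java
import java.util.Properties;

// Sketch only: the application id and broker address are placeholder values.
class StandbyConfigExample {
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");    // placeholder
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        // Keep one warm copy of each task's state on another instance.
        props.put("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProperties().getProperty("num.standby.replicas"));
    }
}
```

With at least one standby per task, a failover can usually pick an instance whose local copy is already close to caught up, instead of replaying the whole changelog.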
How are the state stores connected to the data keys for proper redistribution?
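Input partitions are mapped to tasks, each task owns its own local state store instance, and tasks are distributed across stream threads (refer to the same page).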
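As a rough illustration of why a key always finds "its" state store: the key decides the partition, and the partition decides the task. The sketch below is a toy model with hypothetical names; Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas String.hashCode() is used here only to keep the example short.

```java
// Toy model: Kafka's default partitioner hashes the serialized key bytes
// (murmur2); String.hashCode() is used here only to keep the sketch short.
class KeyToTaskExample {
    // Deterministic mapping from a record key to a partition number.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // One task (and one local store instance) per input partition.
    // The label format is made up for this sketch.
    static String taskFor(String key, int numPartitions) {
        return "task-for-partition-" + partitionFor(key, numPartitions);
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition, hence the same task.
        System.out.println(taskFor("user-42", 4));
        System.out.println(taskFor("user-42", 4));
        System.out.println(taskFor("user-7", 4));
    }
}
```

When a task moves to another instance, its partition and its store move together, so the key-to-store relationship is preserved.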

Writing directly to a kafka state store

We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka Streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, but I'm not sure that we are using the APIs appropriately.
Our proof of concept is to use Kafka Streams to keep a running tally of information about a program in an output topic, e.g.
{
  "numberActive": 0,
  "numberInactive": 0,
  "lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy; it is essentially executing a compare-and-swap (CAS) operation based on the input topic and output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformSupplier, which explicitly writes the data to the state store using
Is this an appropriate use of the local state store? Is there another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using the PAPI, not the Streams DSL): you are reading in one stream and calling transform() on it, with a state store associated with the operator. Your update logic seems to be only key-dependent, and hence can be embarrassingly parallelized by the Streams library based on key partitioning.
One thing to note: it seems you are calling "context.commit()" after every single put call, which is not a recommended pattern. This is because commit() is a pretty heavy call that involves flushing the state store, sending a commit offset request to the Kafka broker, etc.; calling it on every single record would result in very low throughput. It is recommended to call commit() only after a bunch of records have been processed, or you can just rely on the Streams config "commit.interval.ms" so that the Streams library calls commit() internally once per time interval. Note that this will not affect your processing semantics upon graceful shutdown, since upon shutdown Streams will always enforce a commit() call.
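The "commit only after a bunch of records" advice can be sketched with a small counter. BatchCommitter is a hypothetical helper written for this illustration, not part of Kafka Streams; the Runnable stands in for whatever triggers the commit in your code (for example a call to context.commit()).

```java
// Sketch only: BatchCommitter is a hypothetical helper; commitAction stands
// in for whatever triggers the commit (for example context.commit()).
class BatchCommitter {
    private final int batchSize;
    private final Runnable commitAction;
    private int uncommitted = 0;

    BatchCommitter(int batchSize, Runnable commitAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commitAction = commitAction;
    }

    // Call once per processed record; commits only every batchSize records.
    void recordProcessed() {
        uncommitted++;
        if (uncommitted >= batchSize) {
            commitAction.run();
            uncommitted = 0;
        }
    }

    // Commit whatever is left, for example on shutdown.
    void flush() {
        if (uncommitted > 0) {
            commitAction.run();
            uncommitted = 0;
        }
    }

    public static void main(String[] args) {
        int[] commits = {0};
        BatchCommitter committer = new BatchCommitter(100, () -> commits[0]++);
        for (int i = 0; i < 1000; i++) {
            committer.recordProcessed();
        }
        System.out.println(commits[0]); // prints 10: ten commits instead of 1000
    }
}
```

In many cases simply leaving commits to commit.interval.ms is enough, and a helper like this is only worth it when you need tighter control over how much work may be replayed after a failure.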

Kafka validate messages in stateful processing

I have an application where multiple users can send REST operations to modify the state of shared objects.
When an object is modified, then multiple actions will happen (DB, audit, logging...).
Not all operations are valid; for example, you cannot Modify an object after it was Deleted.
Using Kafka I was thinking about the following architecture:
1. REST operations are queued in a Kafka topic.
2. Operations on the same object go to the same partition, so all of an object's operations are in sequence and processed by a single consumer.
3. Consumers listen to a partition and validate each operation using an in-memory database.
4. If the operation is valid, it is sent to a "Valid operation topic"; otherwise it is sent to an "Invalid operation topic".
5. Other consumers (DB, log, audit) listen to the "Valid operation topic".
I am not very sure about point number 3.
I don't like the idea of keeping the state of all my objects. (I have billions of objects, and even if an object can be 10 MB in size, what I need to store to validate its state is just a few kilobytes...)
However, is this a common pattern? Otherwise, how can you verify the validity of certain operations?
Also, what would you use as an in-memory database? Surely it has to be highly available, fault-tolerant, and support transactions (read and write).
I believe this is a very valid pattern, and is essentially a variation of an event-sourced CQRS pattern.
For example, Lagom implements its CQRS persistence in a very similar fashion (although based on a completely different toolset).
A few points:
You are right about the need for sequential operations: since all your state mutations need to be based on the result of the previous mutation, there must be a strict order in their execution. This is very often the case for such things, so we like to scale those operations horizontally as much as possible, so that each of those sequences of operations happens in parallel with many other sequences. In your case, we have one such sequence per shared object.
Relying on Kafka partitioning by key is a good way to achieve that (assuming you set max.in.flight.requests.per.connection to 1, or enable the idempotent producer; the default value is 5, which can reorder records when a send is retried without idempotence). Here again, Lagom has a similar approach, with its persistent entities being distributed and single-threaded. I'm not saying Lagom is better, I'm just reassuring you that this approach is used by others :)
A key aspect of your pattern is the transformation of a Command into an Event: in that jargon, a command is seen as a request to impact the state and may be rejected for various reasons. An event is a description of a state update that happened in the past and is irrefutable from the point of view of those who receive it: an event always tells the truth. The process you are describing would be a controller that sits at the boundary between the two: it is responsible for transforming commands into events.
In that sense, the "Valid operation topic" you mention would be an event-sourced description of the state updates of your process. Since it's all backed by Kafka, it would be arbitrarily partitionable and thus scalable, which is awesome :)
Don't worry about the size of the state of all your objects; it must sit somewhere somehow. Since you have this controller that transforms the commands into events, it becomes the primary source of truth related to that object, and it is responsible for storing it: this controller handles the primary storage for your events, so you must allocate space for it. You can use Kafka Streams' key-value stores: those are local to each of your processing instances, though if you make them persistent they have no problem handling data much bigger than the available RAM. Behind the scenes, data is spilled to disk thanks to RocksDB, and even further behind the scenes it's all event-sourced to a Kafka topic, so your state store is replicated and will be transparently re-created on another machine if necessary.
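The controller that turns commands into events can be sketched in plain Java. Everything here is an assumption made for illustration: the class, enum, and topic names are hypothetical, the validation rules (no Modify or Delete unless the object is active, no Create if it already exists) are just one plausible choice, and the HashMap stands in for the per-object state that would live in a fault-tolerant store in practice.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: all names are hypothetical. The map stands in for the
// controller's per-object state (for example a Kafka Streams key-value store).
class CommandValidator {
    enum Op { CREATE, MODIFY, DELETE }
    enum Status { ACTIVE, DELETED }

    // Only a tiny status per object is kept, not the object itself.
    private final Map<String, Status> stateByObject = new HashMap<>();

    // Returns true (and updates the state) if the command is accepted and
    // becomes an event; returns false if it is rejected.
    boolean accept(String objectId, Op op) {
        Status current = stateByObject.get(objectId);
        switch (op) {
            case CREATE:
                if (current != null) {
                    return false;
                }
                stateByObject.put(objectId, Status.ACTIVE);
                return true;
            case MODIFY:
                return current == Status.ACTIVE;
            case DELETE:
                if (current != Status.ACTIVE) {
                    return false;
                }
                stateByObject.put(objectId, Status.DELETED);
                return true;
            default:
                return false;
        }
    }

    // Name of the topic the command would be forwarded to.
    String route(String objectId, Op op) {
        return accept(objectId, op) ? "valid-operations" : "invalid-operations";
    }

    public static void main(String[] args) {
        CommandValidator validator = new CommandValidator();
        System.out.println(validator.route("obj-1", Op.CREATE)); // valid-operations
        System.out.println(validator.route("obj-1", Op.DELETE)); // valid-operations
        System.out.println(validator.route("obj-1", Op.MODIFY)); // invalid-operations
    }
}
```

Because all commands for one object arrive in order on one partition, a single-threaded check like this needs no locking, and the state it keeps per object stays small regardless of how large the object itself is.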
I hope this helps you finalise your design :)

Changing the default persistence mode of MapWithStateDStream from memory only to other modes

I have written a Spark Streaming application using Kafka and the mapWithState function. I have attached a snapshot of my application for the storage level.
As you can see, the Kafka stream is serialized in both memory and disk, but I can't find a way to change the default persistence of the mapWithState internal streams. This is the piece of code I am using:
val messages=KafkaUtils.createDirectStream[String, String, (String,String)](ssc,
(r:org.apache.kafka.clients.consumer.ConsumerRecord[String,String]) =>(r.topic(),r.value()))
val mapped1=messages.map(x=>(x._2.hashCode().toString(),x)).mapWithState(stateSpec1)
In my application, states can become huge, so I need to persist the internal states in memory and disk. I would appreciate any help on this.
mapWithState is a distributed in-memory state store. It saves your state inside an internal structure called OpenHashMapBasedStateMap. What you're currently persisting is the KafkaRDD created by KafkaUtils.createDirectStream. If you're not iterating that same input twice, there's no need to persist it.
Remember that even if your internal state is huge, it should be evenly distributed inside your cluster. This means that you're not putting all your eggs in one basket, but spreading them throughout the cluster. If your state grows, you can always scale out your cluster with an additional node.