Kafka streams state store distribution - apache-kafka

I have a Kafka application that runs on multiple instances, and I want to use a state store to cache a few data fields. With multiple application instances, if one instance goes down, does its local state store get copied to another instance? What happens when the instance comes back? How are the state stores connected to the data keys for proper redistribution?

if one instance goes down, does the local state store of that instance get copied to another instance?
If you don't have a standby replica, the task is reassigned to another instance, which reads the changelog topic from the beginning to rebuild the store, effectively making a copy, yes.
From the docs:
Starting in 2.6, Kafka Streams will guarantee that a task is only ever assigned to an instance with a fully caught-up local copy of the state, if such an instance exists. Standby tasks will increase the likelihood that a caught-up instance exists in the case of a failure
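To make the standby replica mentioned above concrete, here is a minimal configuration sketch. The `num.standby.replicas` key is the Kafka Streams setting for standby tasks; the application id and broker address are placeholder assumptions, and plain string keys are used so that the snippet compiles without the Kafka jars.

```java
import java.util.Properties;

final class StandbyConfigSketch {
    /** Builds Kafka Streams properties that request one standby replica per state store. */
    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("application.id", "state-cache-app");   // hypothetical name
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical address
        // Keep one warm copy of each task's state on another instance.
        props.setProperty("num.standby.replicas", "1");
        return props;
    }
}
```

These properties would then be passed to the `KafkaStreams` constructor together with the topology. With a standby in place, a failover can usually pick up from the warm copy instead of replaying the whole changelog.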
How are the state stores connected to the data keys for proper redistribution?
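Input partitions are mapped to stream tasks, and each task owns the part of the state store that belongs to its partitions, so a key always ends up in the store of the task that processes its partition (refer to the same page).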
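The key-to-store relationship can be sketched roughly as follows. This is an illustration only, not Kafka's code: the hash is a stand-in (Kafka's default partitioner applies murmur2 to the serialized key), and the class and method names are hypothetical. The `<subtopology>_<partition>` form is how Kafka Streams names its tasks.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TaskRoutingSketch {
    /** Stand-in for the producer's partitioner: same key, same partition. */
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        // Assumption: a simple array hash replaces Kafka's murmur2 here.
        return Math.floorMod(Arrays.hashCode(bytes), numPartitions);
    }

    /** Kafka Streams task ids have the form "<subtopologyId>_<partition>". */
    static String taskIdFor(int subtopologyId, int partition) {
        return subtopologyId + "_" + partition;
    }
}
```

Because the task, not the instance, is the unit of assignment, moving a task to another instance moves responsibility for exactly the keys of that partition, and the store is rebuilt from the matching changelog partition.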

Related

Kubernetes - How do I prevent duplication of work when there are multiple replicas of a service with a watcher?

I'm trying to build an event exporter as a toy project. It has a watcher that gets informed by the Kubernetes API every time an event occurs, and as a simple case, let's assume that it wants to store the event in a database or something.
Having just one running instance is probably susceptible to failures, so ideally I'd like two. In this situation, the naive implementation would have both instances trying to store the event in the database, so it'd be duplicated.
What strategies are there to de-duplicate? Do I have to do it at the database level (say, by using some sort of eventId or hash of the event content) and accept the extra database load or is there a way to de-duplicate at the instance level, maybe built into the Kubernetes client code? Or do I need to implement some sort of leader election?
I assume this is a pretty common problem. Is there a more general term for this issue that I can search on to learn more?
I looked at the code for GKE event exporter as a reference but I was unable to find any de-duplication, so I assume that it happens on the receiving end.
You should use both leader election and de-duplication at your watcher level. Either one alone won't be enough.
Why do you need leader election?
If high availability is your main concern, you should have leader election between the watcher instances. Only the leader pod will write the event to the database. If you don't use leader election, the instances will race with each other to write into the database.
You may check whether the event has already been written to the database and only then write it. However, you cannot guarantee that another instance won't write to the database between your check and your write. In that case, a database-level lock / transaction might help.
Why do you need de-duplication?
Leader election alone will not save you. You also need to implement de-duplication. If your leader pod restarts, it will resync all the existing events. So, you should have a check that decides whether to process the event or not.
Furthermore, if a failover happens, how does the new leader know which events were successfully exported by the previous leader?
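One way to avoid the check-then-write race described above is to make the write itself idempotent, keyed by a stable event identifier. The sketch below is only an illustration: the in-memory map stands in for a database table with a unique constraint on the event id, and the class and method names are hypothetical. Kubernetes objects carry a `metadata.uid` that could serve as such a key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentEventSink {
    // Stand-in for a table with a UNIQUE constraint on the event id.
    private final Map<String, String> rowsByEventUid = new ConcurrentHashMap<>();

    /**
     * Stores the event unless one with the same uid already exists.
     * The check and the write happen atomically, so two racing writers
     * cannot both succeed. Returns true only for the first writer.
     */
    boolean store(String eventUid, String payload) {
        return rowsByEventUid.putIfAbsent(eventUid, payload) == null;
    }

    int size() {
        return rowsByEventUid.size();
    }
}
```

With a real database, the same effect would come from an insert that is rejected or ignored on a unique-key conflict, so a resync after a leader restart simply re-sends events that are then dropped as duplicates.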

Kafka Streams: when will StateRestoreListener get called?

I have registered an implementation of the org.apache.kafka.streams.processor.StateRestoreListener interface with my Kafka Streams instance. I have used the Processor API to define my streams, and the topology has a state store. Basically, the job of the streams application is to read data from an input topic and store the data in the state store.
I want to know under what situations this StateRestoreListener gets invoked.
Does it get invoked when I start streams for the first time?
Does it get invoked when I start another instance?
Does it get invoked when I stop certain instance?
and in each case .. what methods get invoked?
Does it get invoked when I start streams for the first time?
No, because when you start for the first time the state store is empty and thus there is nothing to be restored.
Does it get invoked when I start another instance?
That would happen if part of the state is migrated to the newly started instance. The listener would be called on the new instance only, not on the existing instance, because the existing instance does not need to restore anything.
Does it get invoked when I stop certain instance?
Yes, this is a similar case to the one above. The state of the stopped instance would be migrated to another (or multiple) still-running instances, and those would restore the state and thus call the listener.
and in each case .. what methods get invoked?
All three methods would be invoked: onRestoreStart and onRestoreEnd once each for every store partition being restored, and onBatchRestored repeatedly during the restore to report progress.

Kafka join storage

I use Kafka to join two streams with 3 days join window:
...
private final long retentionHours = Duration.ofDays(3).toHours(); // 3 days, expressed in hours
...
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours))
    .grace(Duration.ofMillis(0));
var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde)
    .withStoreName("STORE-1")
    .withName("STORE-2");
stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
With the above implementation, I found that Kafka created a state folder, /tmp/kafka-streams (looks like RocksDB), and it grows constantly.
Also, the state store in the Kafka cluster grows constantly.
So, I changed the streams join implementation to:
...
private final long retentionHours = Duration.ofDays(3).toHours(); // 3 days, expressed in hours
...
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours))
    .grace(Duration.ofMillis(0));
var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde)
    .withStoreName("STORE-1")
    .withName("STORE-2")
    .withThisStoreSupplier(createStoreSupplier("MEM-STORE-1"))
    .withOtherStoreSupplier(createStoreSupplier("MEM-STORE-2"));
stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
...
private WindowBytesStoreSupplier createStoreSupplier(String storeName) {
    // The join window spans the time difference before and after, hence twice the value.
    var window = Duration.ofHours(retentionHours * 2)
        .toMillis();
    // Arguments: name, retention period, window size, retain duplicates (required for joins).
    return new InMemoryWindowBytesStoreSupplier(storeName, window, window, true);
}
Now, there is no state folder: /tmp/kafka-streams.
Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all?
If yes, how does it work?
Also, I still see that state store in Kafka cluster grows constantly.
Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all? If yes, how does it work?
IIRC, InMemoryWindowBytesStore doesn't use disk at all.
Generally speaking, a logical state store is in fact partitioned into multiple state store 'instances' (think: each stream task has its own, local state store instance). For the InMemoryWindowBytesStore specifically, and by design, these store instances manage all their local data in memory.
Also, I still see that state store in Kafka cluster grows constantly.
However, the InMemoryWindowBytesStore is still fault-tolerant. This is often confusing for new Kafka Streams developers because, in most software, "in memory" usually implies "data is lost if something happens". This is not the case with Kafka Streams, however. A state store is always 'backed up' durably to its Kafka changelog topic, regardless of whether you use the default state store (with RocksDB) or the in-memory state store. This explains why you see the in-memory state's (changelog) data in the Kafka cluster. The data should not grow forever, btw, as changelog topics are compacted (and, for windowed stores, also subject to time-based retention) to prevent exactly this scenario.
Note: What can happen, however, when using the in-memory store is that your application instances could run out of memory (OOM), and thus crash. While your state data will never be lost, as explained above, your application will not be running due to the OOM crash / it will run only partially (some app instances run OOM, others do not). This OOM problem doesn't apply to the default store (RocksDB), as it manages its data on disk, and uses memory (RAM) only for caching purposes. But, again, this question of app availability is orthogonal to data safety (your data is safe regardless of whether your app is crashing or not).

AKKA.NET Journals and Snapshot Store

Since I have not seen any example of using AKKA.NET Journals and Snapshot store, I assume I have to use both type of actors to implement an Event Store and CQRS.
Is the Snapshot store expected to be updated every time when the actor state is changed, or should be set on a scheduled update like every 10 seconds?
Should the Snapshot store actors talk to the Journal actors only, so the actors having the state should not talk to Journals and Snapshot at the same time? I'm thinking in the line of SOC.
Assume I have to shut down the server and back up. A user tries to access a product (like computers) through a Web UI. At that time, the product actor does not exist in the actor system. To retrieve the state of the product, shouldn't I go to the snapshot store instead of running all the journals to recreate the state?
In Akka.Persistence, both Journal and SnapshotStore are in fact actors used to abstract your actors from a particular persistence provider. You will almost never have to use them directly: PersistentView and PersistentActor use them automatically under the hood.
Snapshot stores are only a way to speed up actor recovery when your persistent actor has a lot of events to recover from. In a distributed environment, snapshotting without event sourcing is not a means of achieving persistence. A good idea is to keep a counter that produces a snapshot after every X events processed by the persistent actor. Time-based updates make no sense: in many cases the actor probably hasn't changed over the specified time, and performance suffers too (lots of unnecessary cycles).
SnapshotStores and Journals are unaware of each other. Akka.Persistence persistent actors have a built-in recovery mechanism that handles recovery of the actor's state from SnapshotStores and Journals and exposes methods to communicate with them.
As I said, you probably don't want to communicate with snapshot stores and journals directly. This is what persistent actors/persistent views are for. Ofc you could probably just read the actor state directly from the backend storage, but then you would have to check whether there are any events after the latest saved snapshot, etc. Recreating the persistent actor/view on a different working node is IMO a better solution.
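The "snapshot after every X events" idea and the recovery order (latest snapshot first, then only the events journaled after it) can be sketched without any framework. The following is plain Java for illustration, not the Akka.NET API; the class, its fields, and the integer "events" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshottingCounter {
    private final int snapshotEvery;
    private final List<Integer> journal = new ArrayList<>(); // append-only event log
    private long snapshotState = 0; // state captured by the latest snapshot
    private int snapshotSeqNr = 0;  // number of events covered by that snapshot
    private long state = 0;         // live in-memory state

    SnapshottingCounter(int snapshotEvery) {
        this.snapshotEvery = snapshotEvery;
    }

    /** Journals the event, applies it, and snapshots after every N-th event. */
    void persist(int delta) {
        journal.add(delta);
        state += delta;
        if (journal.size() % snapshotEvery == 0) {
            snapshotState = state;
            snapshotSeqNr = journal.size();
        }
    }

    /** Rebuilds the state: start from the snapshot, replay only later events. */
    long recover() {
        long recovered = snapshotState;
        for (int i = snapshotSeqNr; i < journal.size(); i++) {
            recovered += journal.get(i);
        }
        return recovered;
    }

    int eventsReplayedOnRecovery() {
        return journal.size() - snapshotSeqNr;
    }

    long state() {
        return state;
    }
}
```

The journal stays the source of truth; the snapshot only shortens the replay, which is why reading a snapshot alone from the backend can miss events written after it.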

Pattern for a singleton application process using the database

I have a backend process that maintains state in a PostgreSQL database, which needs to be visible to the frontend. I want to:
Properly handle the backend being stopped and started. This alone is as simple as clearing out the backend state tables on startup.
Guard against multiple instances of the backend trampling each other. There should only be one backend process, but if I accidentally start a second instance, I want to make sure either the first instance is killed, or the second instance is blocked until the first instance dies.
Solutions I can think of include:
Exploit the fact that my backend process listens on a port. If a second instance of the process tries to start, it will fail with "Address already in use". I just have to make sure it does the listen step before connecting to the database and wiping out state tables.
Open a secondary connection and run the following:
BEGIN;
LOCK TABLE initech.backend_lock IN EXCLUSIVE MODE;
Note: the reason for IN EXCLUSIVE MODE is that LOCK defaults to the AccessExclusive locking mode. This conflicts with the AccessShare lock acquired by pg_dump.
Don't commit. Leave the table locked until the program dies.
What's a good pattern for maintaining a singleton backend process that maintains state in a PostgreSQL database? Ideally, I would acquire a lock for the duration of the connection, but LOCK TABLE cannot be used outside of a transaction.
Background
Consider an application with a "broker" process which talks to the database, and accepts connections from clients. Any time a client connects, the broker process adds an entry for it to the database. This provides two benefits:
The frontend can query the database to see what clients are connected.
When a row changes in another table called initech.objects, and clients need to know about it, I can create a trigger that generates a list of clients to notify of the change, writes it to a table, then uses NOTIFY to wake up the broker process.
Without the table of connected clients, the application has to figure out what clients to notify. In my case, this turned out to be quite messy: store a copy of the initech.objects table in memory, and any time a row changes, dispatch the old row and new row to handlers that check if the row changed and act if it did. To do it efficiently involves creating "indexes" against both the table-stored-in-memory, and handlers interested in row changes. I'm making a poor replica of SQL's indexing and querying capabilities in the broker program. I'd rather move this work to the database.
In summary, I want the broker process to maintain some of its state in the database. It vastly simplifies dispatching configuration changes to clients, but it requires that only one instance of the broker be connected to the database at a time.
It can be done with advisory locks:
http://www.postgresql.org/docs/9.1/interactive/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
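As a rough sketch of the advisory-lock approach using plain JDBC: `pg_try_advisory_lock(bigint)` is a PostgreSQL function that takes a session-level lock and returns a boolean without blocking. The lock key value and the class name below are arbitrary assumptions, and the caller is assumed to keep the connection open for the lifetime of the process.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class SingletonLock {
    static final String TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(?)";
    static final long BROKER_LOCK_KEY = 4242L; // hypothetical, application-chosen key

    /**
     * Tries to take a session-level advisory lock on the given connection.
     * Returns true if this process now holds the lock, false if another
     * session already holds it. The lock is released when the session ends.
     */
    static boolean tryAcquire(Connection conn, long key) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(TRY_LOCK_SQL)) {
            ps.setLong(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getBoolean(1);
            }
        }
    }
}
```

A backend would call `SingletonLock.tryAcquire(conn, SingletonLock.BROKER_LOCK_KEY)` before wiping its state tables and exit (or retry later) when it returns false. Because the lock is tied to the session rather than a transaction, it lasts for the duration of the connection, which is what the question asks for.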
I solved this today in a way I thought was concise:
CREATE TYPE mutex as ENUM ('active');
CREATE TABLE singleton (status mutex DEFAULT 'active' NOT NULL UNIQUE);
Then your backend process tries to do this:
insert into singleton values ('active');
And quits or waits if it fails to do so.