Stateful Kafka Stream - How to restore state?

If I am running my stream application, appA, on Machine A and then move it to Machine B, will it remember the earlier state?
When I write a simple consumer, it remembers the last offset, which is stored in __consumer_offsets on the broker. So no matter where I start the consumer, it picks up from that place.
Is there such a construct for stateful stream processing applications? If I am calculating the continuous Profit and Loss of my portfolio, I need to start from where the last run left off and then aggregate new transactions onto that earlier P&L number. I cannot afford to process all messages again from the beginning of time. I have had a hard time finding an article that explains how to solve this problem.

No, it won't remember the state locally unless you move the state store as well (the directory set by the state.dir configuration).
Without it, the changelog topic will need to be read from the earliest offset to rebuild the state.
There are presentations about running Kafka Streams in Kubernetes that cover some aspects of this, since Kubernetes can stop and relocate its pods. But Kubernetes also has volume management features that may not be available in your scenario.
It might therefore be best to run your job on both machines from the start; then you get fault tolerance and high availability, with either a warm standby replica or partitioned state.
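To make the changelog idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the Kafka Streams API: the changelog list, the PnlStore class and the portfolio names are all hypothetical stand-ins. It only shows why an instance with an empty disk can still continue from the earlier P&L by replaying the changelog from the earliest offset.

```python
# Hypothetical in-memory stand-in for a changelog topic: an append-only
# list of (key, value) records, each holding the latest aggregate for
# that key at the time it was written.
changelog = []


class PnlStore:
    """Toy state store: running P&L per portfolio, mirrored to a changelog."""

    def __init__(self):
        self.state = {}

    def apply(self, portfolio, amount):
        # Update the local aggregate, then record the new value in the changelog.
        self.state[portfolio] = self.state.get(portfolio, 0.0) + amount
        changelog.append((portfolio, self.state[portfolio]))

    @classmethod
    def restore(cls, records):
        # Rebuild local state by replaying the changelog from the earliest
        # offset; later records for a key overwrite earlier ones.
        store = cls()
        for key, value in records:
            store.state[key] = value
        return store


# "Machine A" processes some transactions.
store_a = PnlStore()
store_a.apply("portfolio-1", 100.0)
store_a.apply("portfolio-1", -30.0)
store_a.apply("portfolio-2", 5.0)

# "Machine B" starts with an empty disk and restores from the changelog,
# then keeps aggregating new transactions onto the restored numbers.
store_b = PnlStore.restore(list(changelog))
store_b.apply("portfolio-1", 10.0)
```

The restore replays aggregates rather than the original transactions, which is why the input topic does not have to be reprocessed from the start.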

Related

How to share data between microservices without sync RPC (use topics as changelogs) and deal with consistency?

I learned about using Kafka topics as a changelog to avoid synchronous RPC, but I don't understand how we deal with consistency, since topics are not persistent (retention policy).
For example, I run an application with 2 microservices:
The User Service is used to update users' data in the system (address, first name, phone number...).
The Shipping Service uses users' data to create a shipping order and send it to the shipping company's system.
Each service has its own DB to persist the users' data. To communicate any changes made to a user's data, the Confluent instructor proposed creating a topic and using it as a changelog. The User Service publishes the changes; other microservices can consume them.
But what if:
User X changed his address 1 year ago
the retention policy of the changelog is 6 months
today we add BillingService to the system.
The BillingService won't know User X's address, so its view is inconsistent. Should I run a one-time "call UserService to copy its full DB" step when a new service enters the system? That seems like an ugly solution.
More tricky and challenging:
We have a changelog with a retention policy of time T
A consumer service is down for longer than T
Therefore, it will potentially miss some changelog entries. How do we deal with that? We can never be confident that the service knows everything it has to know about the users.
I did some research but found nothing. I suspect I don't have enough vocabulary yet to search well, as the problem sounds pretty common. Sorry if there is a source dedicated to this problem that I did not find!

If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to republish "this is the state of this entity" to the topic for every entity they own, at some interval shorter than the retention period. For entities which don't change very much, this is pretty wasteful and duplicative (but for those a very long to infinite retention period is more viable); for entities which do change a lot, this is a rounding error in terms of volume.
That neatly solves the first case and eventually allows the second to be solved. Beyond that, the second case has basically no solution, which means you have to choose the retention period for a topic such that you can guarantee that no consumer of the topic will ever be down (or not deployed) for longer than the retention period. This typically means that a retention period shorter than, say, 7 days should be heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy time for the consumer to get fixed. And if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
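The periodic republish idea can be sketched as a small simulation. This is plain Python with hypothetical numbers and names (a 6 month retention, a republish every 3 months, users "user-x" and "user-y"), not a Kafka client. It compares what a brand-new consumer can learn from the retained part of the topic with and without republishing.

```python
RETENTION = 6        # hypothetical retention period, in months
REPUBLISH_EVERY = 3  # hypothetical republish interval, shorter than retention

# month -> list of (user, address) changes made in that month
events = {0: [("user-x", "1 Old Street")], 10: [("user-y", "2 New Road")]}


def run(months, republish):
    """Produce the topic contents after `months`, in time order."""
    topic, users = [], {}
    for t in range(months + 1):
        for user, address in events.get(t, []):
            users[user] = address
            topic.append((t, user, address))
        if republish and t > 0 and t % REPUBLISH_EVERY == 0:
            # The owning service republishes the current state of every entity.
            for user, address in users.items():
                topic.append((t, user, address))
    return topic


def view_for_new_consumer(topic, now):
    """State a consumer deployed at `now` can build from retained records."""
    view = {}
    for t, user, address in topic:
        if now - t < RETENTION:  # older records have been deleted
            view[user] = address
    return view


view_without = view_for_new_consumer(run(12, republish=False), now=12)
view_with = view_for_new_consumer(run(12, republish=True), now=12)
```

Without republishing, the new consumer never sees user-x, whose only change is older than the retention period; with republishing, every entity reappears inside the retained window.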

This is quite a common issue in replication: a node goes offline for a significant amount of time. For example, a node's hardware failed completely or was lost, and it takes weeks to order and receive a new one.
In that case, in distributed systems, we don't do failure recovery; instead we provision a new node as a replacement. That new node is completely empty, hence it needs some initial state.
If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).
There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
Backing up data is important. Every microservice should take care of saving and restoring its own data. As a result, the original Kafka system won't have to keep data forever.
As a quick summary: in distributed replication these are two different problems - catching up a lagging node and provisioning a new node. And if a node is lagging for a long time, then this becomes a provisioning problem.
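A minimal sketch of "restore from the most recent backup, then reapply newer items", in plain Python. The log contents, the backup layout (a state copy plus the next offset to apply) and the function names are assumptions made for illustration only.

```python
import copy


def apply_event(state, event):
    # Each event sets the latest address for a user.
    user, address = event
    state[user] = address


# Hypothetical event log; list indices play the role of offsets.
log = [("alice", "A1"), ("bob", "B1"), ("alice", "A2"),
       ("carol", "C1"), ("bob", "B2")]

# An existing node applies the first three events and takes a backup that
# records both the state and the next offset to apply.
primary = {}
for event in log[:3]:
    apply_event(primary, event)
backup = {"state": copy.deepcopy(primary), "next_offset": 3}
for event in log[3:]:
    apply_event(primary, event)


def provision(backup, log):
    """Bring up an empty node: load the backup, then replay only newer events."""
    state = copy.deepcopy(backup["state"])
    replayed = 0
    for event in log[backup["next_offset"]:]:
        apply_event(state, event)
        replayed += 1
    return state, replayed


new_node, replayed = provision(backup, log)
```

The new node ends up with the same state as the existing one while replaying only the events written after the backup, so the log only needs to be retained back to the most recent backup.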

Why do we need to use Zookeeper for a Coordination Service instead of just a central database?

Quoting the ZooKeeper docs:
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that ZooKeeper solves apart from being highly fault tolerant compared to a central database. All the guarantees that ZooKeeper provides can be provided by a central database too.
Atomicity -> As it's a single node, all updates are atomic.
Sequential Consistency -> after an update, clients can wait for the ack before sending the next update to maintain the sequence.
Single System Image, Reliability, Timeliness -> guaranteed as it's a single node.
So, avoiding a single point of failure is the only main advantage of using ZooKeeper. Please correct me if I'm wrong.

ZooKeeper (and other consensus-based systems) offers sequential consistency, strong consistency and high availability.
"apart from being highly fault tolerant" - that's actually huge, the fault tolerance.
If you don't care about availability, you totally can use any other linearizable storage - even a directory with files will work.
Consensus-based systems, and systems built on them (e.g. ZooKeeper + your own code), are used to implement state machine replication. All transitions are stored in a distributed log - to make it durable there are many copies. Consensus is about the order of events in the log.
With the log being available, the actual business code can consume events and change its state machine - typical state machine transitions. Since each copy of the log has the same sequence of events, all state machines will get to the same state.
The key thing is timing - all logs will get the same events in the same order, but there is no guarantee when that happens - a node could be disconnected from the network, hence its log will be stale, and by extension its state machine as well.
To see the true latest value, as you would expect with a single source of truth, you have to use a linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. Reads do nothing to the state machines, but the fact that a reader placed something in the log and got it committed signals that the entire log up to that point has been read, so there is no stale data. (Not stale here means that all writes that happened before the read are reflected; new writes could still happen while the read is in progress.)
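The read-through-the-log trick can be sketched as follows. This is a single-process toy in Python, not ZooKeeper's API: the Replica class is hypothetical, the shared list stands in for the replicated log, and "committed" is simplified to "appended" (a real system would go through consensus before the append counts).

```python
class Replica:
    """Toy replica that applies a shared log to a one-value state machine."""

    def __init__(self, log):
        self.log = log       # shared, ordered log of (kind, payload) entries
        self.applied = 0     # number of log entries applied so far
        self.value = None    # the state machine: just the last written value

    def apply_up_to(self, index):
        while self.applied < index:
            kind, payload = self.log[self.applied]
            if kind == "write":
                self.value = payload
            # "read" markers change nothing in the state machine.
            self.applied += 1

    def stale_read(self):
        # Answers from local state without checking how far behind it is.
        return self.value

    def linearizable_read(self):
        # Append a read marker, then apply everything up to and including it.
        self.log.append(("read", None))
        marker = len(self.log)
        self.apply_up_to(marker)
        return self.value


log = [("write", "v1")]
replica = Replica(log)
replica.apply_up_to(1)

# Another client writes while this replica is not applying the log.
log.append(("write", "v2"))

stale = replica.stale_read()
fresh = replica.linearizable_read()
```

The stale read still returns the old value, while the read that goes through the log is forced to observe every write ordered before its marker.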
All of this complexity comes from the availability requirements - a cluster with three nodes can let one node go down without affecting operations.
So, yes, you could use any linearizable storage to do the same, ignoring availability. You could do this by keeping the log of events in a table and having every client track a pointer (or id) of the last applied operation, so every client can advance its own state machine.
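As a rough sketch of that last idea, the example below keeps the event log in a single SQLite table (standing in for the central database) and lets each client track the id of the last operation it applied. The table layout, the "add"/"mul" operations and the Client class are assumptions made for illustration.

```python
import sqlite3

# The central, single-node store: an append-only table of operations.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, op TEXT, arg INTEGER)"
)


def append(op, arg):
    with db:  # commit each appended operation
        db.execute("INSERT INTO log (op, arg) VALUES (?, ?)", (op, arg))


class Client:
    """Each client owns a state machine and a pointer into the log."""

    def __init__(self):
        self.last_applied = 0
        self.value = 0

    def catch_up(self):
        rows = db.execute(
            "SELECT id, op, arg FROM log WHERE id > ? ORDER BY id",
            (self.last_applied,),
        ).fetchall()
        for row_id, op, arg in rows:
            if op == "add":
                self.value += arg
            elif op == "mul":
                self.value *= arg
            self.last_applied = row_id


a, b = Client(), Client()
append("add", 2)
append("mul", 5)
a.catch_up()                 # a has applied both operations
append("add", 1)
b.catch_up()                 # b applies all three in one go
value_a_before = a.value     # a is behind by one operation
a.catch_up()
```

Both clients converge to the same value because they apply the same operations in the same order; the price is that the table's host is a single point of failure.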

What is a better way to compute statistical information over the events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads the last week of events from the cluster and computes some counts. It then actively listens for new events to keep the information up to date. An example of such information is "how many times item x was sold".
Startup of the application takes a lot of time and brings some other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time when reading the last week of events takes too long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure whether I would need the same amount of memory and compute here)
Cache Database
I'm wondering which idea would be better here. Or is there a better idea than either of them?

First, I hope this is a compacted topic that you are reading; otherwise, your "x times" will be misleading as data is deleted from the topic.
Any option you choose will require reading from the beginning of the topic at least once, so the solution will come down to starting a persistent consumer that either:
Stores data on disk in RocksDB (such as a Kafka Streams or KSQL KTable)
Stores it in some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached
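A minimal sketch of such a persistent consumer, using SQLite from the Python standard library as a stand-in for "a database of your choice". The topic is modeled as a plain list and all table and function names are hypothetical. The point it illustrates is storing the counts together with the next offset to read, so a restart resumes instead of re-reading a week of events.

```python
import sqlite3


def open_store(path=":memory:"):
    # With a real file path, the store would survive a process restart.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS sold (item TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS progress (id INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
    db.execute("INSERT OR IGNORE INTO progress (id, next_offset) VALUES (1, 0)")
    db.commit()
    return db


def consume(db, topic):
    """Process only records not seen before; return how many were read."""
    (start,) = db.execute("SELECT next_offset FROM progress WHERE id = 1").fetchone()
    for offset in range(start, len(topic)):
        item = topic[offset]
        with db:  # one transaction: the count and the offset move together
            db.execute("INSERT OR IGNORE INTO sold (item, n) VALUES (?, 0)", (item,))
            db.execute("UPDATE sold SET n = n + 1 WHERE item = ?", (item,))
            db.execute("UPDATE progress SET next_offset = ? WHERE id = 1", (offset + 1,))
    return max(0, len(topic) - start)


def counts(db):
    return dict(db.execute("SELECT item, n FROM sold"))


topic = ["x", "y", "x"]          # hypothetical "item sold" events
db = open_store()
first_run = consume(db, topic)   # initial read of the whole topic
topic.append("x")                # a new event arrives
second_run = consume(db, topic)  # a "restart" only reads the new record
result = counts(db)
```

Updating the count and the offset in the same transaction avoids double counting if the process dies between the two writes.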

Apache Camel Idempotent Repositories that support clustering

I am trying to implement a Camel Spring Boot application that uses the File component to poll a directory. I also want to support clustering, meaning multiple instances of this Camel Spring Boot application could be started and consume from the same directory.
I am trying to implement the IdempotentRepository on the File consumer with KafkaIdempotentRepository. However, when I start two instances at the same time, both of them consume a file arriving in the directory and both instances broadcast action:add for key my_file_name.
The configuration for the file component is the following:
file:incoming?readLock=idempotent&idempotentRepository=#myKafkaRepo&readLockLoggingLevel=WARN&shuffle=true
All the examples of a clustered Idempotent Repository use Hazelcast, which is difficult for me to impose on my users for operational reasons.
My question: does KafkaIdempotentRepository support clustered IdempotentRepository? If not which implementation would you suggest to use?

From the Apache Camel Kafka IdempotentRepository documentation:
On startup, the instance subscribes to the topic and rewinds the offset to the beginning, rebuilding the cache to the latest state. The cache will not be considered warmed up until one poll of pollDurationMs in length returns 0 records. Startup will not be completed until either the cache has warmed up, or 30 seconds go by; if the latter happens the idempotent repository may be in an inconsistent state until its consumer catches up to the end of the topic.
My opinion
It depends how many recently processed records you need to remember and what the retention period of the topic will be.
If you can set the retention time of the topic to be long enough to satisfy your "number of records to remember" requirement, but short enough that the cache warm-up completes in much less than 30 seconds, go for it.
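The trade-off can be illustrated with a toy replay in Python. This is not Camel's implementation: the record format below is an assumption based on the "action:add" broadcast mentioned in the question, and the "remove" action and file names are hypothetical.

```python
def rebuild_cache(records):
    """Replay (action, key) records in order to rebuild the set of seen keys."""
    cache = set()
    for action, key in records:
        if action == "add":
            cache.add(key)
        elif action == "remove":
            cache.discard(key)
    return cache


# Hypothetical topic contents, oldest record first.
records = [
    ("add", "a.csv"),
    ("add", "b.csv"),
    ("remove", "a.csv"),
    ("add", "c.csv"),
]

# Long retention: every record is still available at startup.
full_cache = rebuild_cache(records)

# Short retention: the two oldest records have already been deleted.
truncated_cache = rebuild_cache(records[2:])
```

With the shorter retention the replay is faster, but b.csv is no longer remembered, so an instance starting now would treat it as a new file.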

Curator TreeCache eventual consistency

When using Curator TreeCache, I understand that there is no guarantee that the cache state stays in sync with the leader, and that create/update/delete events can be missed (ZooKeeper can miss events on successive changes).
From what I understand however - TreeCache will be eventually consistent.
Question is: Is there any maximum (guaranteed) time defined in which the change in ZK node gets propagated to the TreeCache instance?

No, there isn't a maximum time. Note: this has nothing to do with TreeCache; it's merely how ZooKeeper works. Internally, all write operations go through the current leader node in your ZK ensemble. The "follower" nodes eventually synchronize with the leader's database. In practice, this will be a matter of seconds at the most but, of course, it depends entirely on the size of your database, your network, the number of operations in flight, etc.
Update: note that you configure your ZK instances with syncLimit, which specifies how long (in ticks) followers are allowed to take to sync with the leader. A follower that falls further behind than syncLimit allows is dropped. See here: https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html
