Sharing state between KStream applications in same consumer group with globalStateStore - apache-kafka

The problem I am trying to solve is about sharing state between multiple applications in the same consumer group; they consume data from the same topic but from different partitions.
So I have an input topic with, say, 3 partitions. I will be running 3 KStream microservices (e.g. MS1, MS2, MS3), one per partition; each microservice will process its records and write the results to an output topic.
Problem: most of the time a microservice can operate independently within its partition, but there are cases where a microservice needs to pull the previous state of an attribute before it can process a record, and that state might have been produced and stored by another microservice.
As an example, suppose I have data about a person walking along a road made up of 3 sections, where each section corresponds to a partition. While he is on section 1, his state is published by the section 1 publisher; once he walks from section 1 to section 2, his state is published by the section 2 publisher instead. If I have one microservice processing his data per section, then when records start arriving on section 2 I need to check his previous state - did he just start walking on my section, or did he come to my section from another one - before I can continue processing his data.
Proposed solution: I have been reading about the GlobalStateStore, and it seems like it might solve my problem. I will write down my thinking and some questions here; I am wondering if you can see any problems with my approach:
Have each microservice read the input topic from its assigned partition.
Have a GlobalStateStore to store the state so that all 3 microservices can read it.
Since you cannot write directly into the GlobalStateStore, I might have to create an intermediate topic to store the state (e.g. <BobLocation, Long>; <BobMood, String>). The global state store will be created from this topic ("global-topic") - is this correct?
Then, every time my microservice gets a message, I will first read the GlobalStateStore to get the latest state and then process the record - do I read it as a GlobalKTable?
Then write the updated state back to the "global-topic".
Are there any implications for the restart process? Since I am keeping the state in the global state store all the time, is there a problem when one app dies and another one takes over?
Thank you so much guys!
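A minimal sketch of what the proposed topology could look like, assuming String keys/values, the hypothetical topic names "input-topic" and "global-topic", and placeholder business logic in process():

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class SectionProcessor {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Global view of the latest known state per person, materialized from the
        // intermediate "global-topic" that every instance writes to.
        GlobalKTable<String, String> globalState = builder.globalTable("global-topic");

        // Each instance reads the records of its assigned partition(s) of the input topic.
        KStream<String, String> input = builder.stream("input-topic");

        input
            // Look up the previous state, possibly written by another instance.
            .leftJoin(globalState,
                      (key, value) -> key,          // record key -> global table key
                      SectionProcessor::process)    // combine event with previous state
            // Publish the new state so it eventually shows up in every instance's global store.
            .to("global-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "section-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder business logic: combine the incoming event with the previous state (may be null).
    private static String process(String event, String previousState) {
        return previousState == null ? event : previousState + "|" + event;
    }
}
```

One caveat worth keeping in mind for the restart/fail-over question: a GlobalKTable is populated asynchronously from its backing topic, so there can be a small lag between writing an update to "global-topic" and every instance seeing it in its local copy of the global store.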

Related

How does KStreams handle state store data when adding additional partitions?

I have one partition of data with one app instance and one local state store. It's been running for a time and has lots of stateful data. I need to update that to 5 partitions with 5 app instances. What happens to the one local state store when the partitions are added and the app is brought back online? Do I have to delete the local state store and start over? Will the state store be shuffled across the additional app instance state stores automatically according to the partitioning strategy?
Do I have to delete the local state store and start over?
That is the recommended way to handle it (cf. https://docs.confluent.io/platform/current/streams/developer-guide/app-reset-tool.html). As a matter of fact, if you change the number of input topic partitions and restart your application, Kafka Streams will fail with an error, because the state store has only one shard, while 5 shards would be expected given that you now have 5 input topic partitions.
Will the state store be shuffled across the additional app instance state stores automatically according to the partitioning strategy?
No. Also note that this applies to the data in your input topic as well. Thus, if you partition your input data by key (i.e., when writing into the input topic upstream), old records would remain in the existing partitions and thus would not be partitioned properly.
In general, it is recommended to over-partition your input topics upfront, so that you do not need to change the number of partitions later on. Thus, you might also consider going up to 10, or even 20, partitions instead of just 5.
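For completeness, a minimal sketch of the client-side part of such a reset (application id and topic names are hypothetical): after running the application reset tool linked above against the cluster, the local state directory can be wiped with KafkaStreams#cleanUp() before restarting, so the stores are rebuilt with the new number of shards.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CleanRestart {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Placeholder topology; in practice this is the application's real (stateful) topology.
        builder.stream("my-input-topic").to("my-output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // After the reset tool has reset offsets and cleared internal topics, wipe the
        // local state directory so state stores are rebuilt from scratch with one shard
        // per (new) input partition.
        streams.cleanUp();
        streams.start();
    }
}
```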

Event Driven Architectures - Topics / Stream Design

This could be some kind of best-practice question. Could someone who has worked on this please clarify with examples, so that all of us can benefit?
For event-driven architectures with Kafka / Redis, when we create topics/streams for events, what are the best practices to be followed?
Let's consider an online order-processing workflow.
I read some blogs saying to create topics/streams like order-created-events, order-deleted-events, etc. But my question is: how is the order of the messages guaranteed when we split this into multiple topics?
For example:
order-created-events could have thousands of events and be processed slowly by a consumer, while order-deleted-events could have only a few records in the queue, assuming only 5-10% of users would cancel their order.
Now, let's assume a user first places an order and then immediately cancels it. This could cause the order-deleted-event to be processed first, since that topic/stream does not have many messages in it, before some consumer processes the order-created-event for the same order. That would cause data inconsistency.
Hopefully my question is clear. So, how should we come up with a topic/stream design?
Kafka ensures sequencing for a particular partition only.
So, to make use of Kafka partitioning and of load balancing across partitions, multiple partitions should be created for a single topic (like order).
Now, use a partitioner class to generate a key for every message, and that key should always map to the same partition.
So, irrespective of whether Order A is getting created, updated, or deleted, its events should always belong to the same partition.
To properly achieve sequencing, this should be the basis for deciding on topics, instead of 2 different topics for different activities.
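A minimal producer sketch of that idea (topic name and payloads are hypothetical): all lifecycle events for an order carry the order id as the message key, so the default partitioner sends them to the same partition and their relative order is preserved.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-A";

            // Same key => same partition => the consumer sees "created" before "deleted".
            producer.send(new ProducerRecord<>("order-events", orderId, "ORDER_CREATED"));
            producer.send(new ProducerRecord<>("order-events", orderId, "ORDER_DELETED"));
        }
    }
}
```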

Best practice at the moment of processing data with dependencies in Kafka?

We are developing an app that takes data from different sources; once the data is available, we process it, put it together, and then move it to a different topic.
In our case we have 3 topics, and each of these topics brings data that is related to data from the other topics. The related entities may or may not be received at the same time (or within a short period of time), and this is where the problem comes in, because these 3 entities need to be joined into one before we can move the result to the output topic.
Our idea was to create a separate topic that contains all the data that has not been processed yet, and then have a separate thread that checks that topic at fixed intervals and also checks whether the dependencies of each entity are available. If they are available, we delete the entity from this separate topic; if not, we keep the entity there until it gets resolved.
After all this explanation, my question is: is it reasonable to do it this way, or are there other good practices or strategies that Kafka provides to solve this kind of scenario?
Kafka messages can get cleaned up after some time based on the retention policy, so you need to store the messages somewhere. I can see the option below, but every problem always has many possible approaches and solutions:
Process all messages and forward the "not yet processed" messages to another topic, say A.
Use the Kafka Processor API to consume messages from topic A and store them into a state store.
Schedule a punctuate() method with a time interval.
Iterate over all messages stored in the state store.
Check the dependencies; if they are available, delete the message from the state store and process it, or publish it back to the original topic to get processed again.
You can refer to the link below for reference; a rough sketch of the Processor API part follows it:
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html
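A rough sketch of that Processor API part, assuming a state store named "pending-store", string keys/values, and a placeholder dependency check; the processor would be wired into the topology with Topology#addProcessor() and a matching addStateStore():

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Consumes the "not yet processed" topic A, parks each message in a state store, and
// periodically re-checks whether its dependencies have arrived.
public class PendingMessageProcessor implements Processor<String, String> {

    private KeyValueStore<String, String> pending;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        pending = (KeyValueStore<String, String>) context.getStateStore("pending-store");

        // Re-check all parked messages once a minute (wall-clock time).
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = pending.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    if (dependenciesAvailable(entry.key)) {
                        // Dependencies arrived: hand the message downstream and drop it from the store.
                        context.forward(entry.key, entry.value);
                        pending.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        // Every incoming message is parked until its dependencies are resolved.
        pending.put(key, value);
    }

    @Override
    public void close() { }

    // Placeholder: look up whether the related entities from the other topics are available.
    private boolean dependenciesAvailable(String key) {
        return false;
    }
}
```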

Concurrent writes for event sourcing on top of Kafka

I've been considering using Apache Kafka as the event store in an event sourcing configuration. The published events will be associated with specific resources, delivered to a topic associated with the resource type, and sharded into partitions by resource id. So, for instance, the creation of a resource of type Folder with id 1 would produce a FolderCreate event that would be delivered to the "folders" topic, in a partition determined by sharding the id 1 across the total number of partitions in the topic. However, I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other, such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events, as explained here, but Kafka does not support such a feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?
I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
Serialise writes per partition, using a single process per partition (or set of partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use acks=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication by including an ID in each event, say, or reinitialize the producer by re-reading (recent messages in) the stream to see whether the write actually worked or not.
Writing multiple events atomically - just publish a composite event containing a list of events.
Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, point 4 is a KTable, which you can have your event producer use to work out whether an event is valid for the current resource state before sending it.
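As an illustration of points 1 and 2 above, a sketch of a single-writer process doing the optimistic check itself before producing (the topic name, the in-memory version map, and rebuilding it from the partition on startup are all assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Single writer for its partition(s): serialises commands for its resources and performs
// the optimistic version check itself, since the broker will not do it.
public class FolderEventWriter {

    private final KafkaProducer<String, String> producer;
    // Last successfully written version per resource id; rebuilt by re-reading the partition on startup.
    private final Map<String, Long> currentVersions = new HashMap<>();

    public FolderEventWriter() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // full acknowledgement before reporting success
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // helps with retries / deduplication
        this.producer = new KafkaProducer<>(props);
    }

    // Returns false if the command was based on a stale version (the optimistic check failed).
    public synchronized boolean append(String folderId, long expectedVersion, String event) throws Exception {
        long current = currentVersions.getOrDefault(folderId, -1L);
        if (current != expectedVersion) {
            return false; // a concurrent command already advanced this resource
        }
        // Block until the broker acks the write before reporting success to the command sender.
        producer.send(new ProducerRecord<>("folders", folderId, event)).get();
        currentVersions.put(folderId, expectedVersion + 1);
        return true;
    }
}
```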

Looking for alternative to dynamically creating Kafka topics

I have a service that fetches a snapshot of some information about entities in our system and holds on to that for later processing. Currently, in the later processing stages, we fetch the information using HTTP.
I want to use Kafka to store this information by dynamically creating topics so that the snapshots aren't mixed up with each other. When the service fetches the entities it creates a unique topic and then each entity we fetch gets pushed to that topic. The later processing stages would be passed the topic as a parameter and can then read all the info at their own leisure.
The benefits of this would be:
The later processing stages can be restarted and made to simply resume at the offset they had processed so far.
No need to worry about batching requests (or stream-processing the incoming HTTP response) for the entities if there are a lot of them, since we simply read them one at a time.
Multiple consumer groups can easily be added later for other processing purposes.
However, Kafka/ZooKeeper has limits on the total number of topics/partitions it can support. As such, I would need to delete them either after the processing is done or after some arbitrary amount of time has passed. In addition, since some of the processors would have to know when all the information has been read, I would need to include some sort of "End of Stream" message on the topic.
Two general questions:
Is it bad to dynamically create and delete Kafka topics like this?
Is it bad to include an "End of Stream" type of message?
Main question:
Is there an alternative to the above approach using static topics/partitions that doesn't entail having to hold onto the entities in memory until the processing should occur?
It seems that one “compacted” topic could be an alternative.
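For illustration, one possible shape of that alternative (topic name, partition count, and keying scheme are assumptions): a single, statically created, compacted topic keyed by entity id, so compaction eventually keeps only the latest snapshot of each entity, while a snapshot id carried in the value (or a header) lets the later stages tell snapshots apart.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateSnapshotTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One static topic instead of one topic per snapshot; log compaction keeps
            // only the latest value per entity key, so old snapshots are cleaned up
            // without having to delete topics.
            NewTopic snapshots = new NewTopic("entity-snapshots", 12, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(snapshots)).all().get();
        }
    }
}
```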