Kafka + Streams as Event Store in a CQRS application - Command Model consistency - apache-kafka

I've been reading a few articles about using Kafka and Kafka Streams (with state store) as Event Store implementation.
https://www.confluent.io/blog/event-sourcing-using-apache-kafka/
https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
The implementation idea is the following:
Store entity changes (events) in a Kafka topic.
Use Kafka Streams with a state store (RocksDB by default) to update and cache the entity snapshot.
Whenever a new command is executed, get the entity from the store, execute the operation on it, and continue with step #1.
The issue with this workflow is that the state store is updated asynchronously (step 2), so when a new command is processed the retrieved entity snapshot might be stale (it was not updated with the events from previous commands).
Is my understanding correct? Is there a simple way to handle such case with kafka?
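To make the workflow concrete, here is a rough sketch of how I picture steps 1-3, reading the snapshot via an interactive query (all topic/store names and the Entity handling are just placeholders, not an actual implementation):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CommandHandler {
    private final KafkaProducer<String, String> producer;
    private final KafkaStreams streams;

    public CommandHandler(KafkaProducer<String, String> producer, KafkaStreams streams) {
        this.producer = producer;
        this.streams = streams;
    }

    public void handle(String entityId, String command) {
        // Step 3: read the cached snapshot (it may lag behind events from previous commands).
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("entity-snapshots", QueryableStoreTypes.keyValueStore()));
        String snapshot = store.get(entityId);

        // Execute the command against the (possibly stale) snapshot.
        String event = command + " applied to " + snapshot; // placeholder for real domain logic

        // Step 1: append the resulting event to the events topic.
        producer.send(new ProducerRecord<>("entity-events", entityId, event));
        // Step 2 happens asynchronously: a Streams topology updates "entity-snapshots".
    }
}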

Is my understanding correct?
As far as I have been able to tell, yes -- which means that it is an unsatisfactory event store for many event-sourced domain models.
In short, there's no support for "first writer wins" when adding events to a topic, which means that Kafka doesn't help you ensure that the topic satisfies its invariants.
There have been proposals/tickets to address this, but I haven't found evidence of progress.
https://issues.apache.org/jira/browse/KAFKA-2260
https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish

Yes, there is a simple way.
Use a key for the Kafka messages. Messages with the same key always* go to the same partition.
One consumer can read from one or many partitions, but a single partition cannot be read by two consumers of the same group simultaneously.
The maximum number of active consumers is always <= the number of partitions of a topic. You can create more consumers, but the extra consumers will only act as backup nodes.
Here is an example.
Assumptions:
There is a Kafka topic abc with partitions p0 and p1.
Consumer C1 consumes from p0 and consumer C2 consumes from p1. The consumers work asynchronously.
km(key, command) denotes a Kafka message.
# Producing messages
km(key1,add) -> p0
km(key2,add) -> p1
km(key1,edit) -> p0
km(key3,add) -> p1
km(key3,edit) -> p1
# Consumer C1 will read messages km(key1,add), km(key1,edit) and their order will be preserved
# Consumer C2 will read messages km(key2,add), km(key3,add), km(key3,edit)
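For reference, a minimal producer sketch that sends the keyed commands above could look like this (the topic name abc comes from the example; the broker address is a placeholder):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedCommandProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> per-entity ordering is preserved.
            producer.send(new ProducerRecord<>("abc", "key1", "add"));
            producer.send(new ProducerRecord<>("abc", "key2", "add"));
            producer.send(new ProducerRecord<>("abc", "key1", "edit"));
            producer.send(new ProducerRecord<>("abc", "key3", "add"));
            producer.send(new ProducerRecord<>("abc", "key3", "edit"));
        }
    }
}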

If you write commands to Kafka and then materialize a view in Kafka Streams, the materialized view will be updated asynchronously. This helps you separate writes from reads so the read path can scale.
If you want consistent read-write semantics over your commands/events, you might be better off writing to a database. Events can either be extracted from the database into Kafka using a CDC connector (write-through), or you can write to the database and then to Kafka in a transaction (write-aside).
Another option is to implement long polling on the read (so if you write trade1.version2 and then want to read it again, the read will block until trade1.version2 is available). This isn't suitable for all use cases, but it can be useful.
Example here: https://github.com/confluentinc/kafka-streams-examples/blob/4eb3aa4cc9481562749984760de159b68c922a8f/src/main/java/io/confluent/examples/streams/microservices/OrdersService.java#L165
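As a rough illustration of the long-polling idea (not the linked OrdersService code, which uses async HTTP callbacks): block the read against the Streams state store until the expected version shows up. The store name, value format, and version parsing below are assumptions:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class BlockingReader {
    private final KafkaStreams streams;

    public BlockingReader(KafkaStreams streams) {
        this.streams = streams;
    }

    // Poll the materialized view until the entity reaches the expected version or we time out.
    public String readAtLeastVersion(String key, long expectedVersion, long timeoutMs)
            throws InterruptedException {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store", QueryableStoreTypes.keyValueStore()));
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            String value = store.get(key);
            // Assume the value carries its version, e.g. "trade1.version2".
            if (value != null && extractVersion(value) >= expectedVersion) {
                return value;
            }
            Thread.sleep(50); // simple polling; the linked example uses async callbacks instead
        }
        return null; // caller decides how to handle the timeout
    }

    private long extractVersion(String value) {
        // Hypothetical parsing; depends entirely on how versions are encoded in the value.
        int idx = value.lastIndexOf(".version");
        return idx >= 0 ? Long.parseLong(value.substring(idx + ".version".length())) : 0L;
    }
}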

The Command pattern that you want to implement is already part of the Akka framework. I don't know whether you have experience with Akka or not, but I strongly advise you to look there before you implement your own solution.
Also, given the volume of events we receive in today's IT, I would advise integrating it with a state machine.
If you would like to see how it can all be put together, please check my blog :)

Related

Instruct Kafka Consumer App To Start Reading From Offset

If I have an application AppA that contains a Kafka consumer class, is it possible to instruct this consumer's behaviour programmatically? For example, I may want to tell AppA over a REST API (or even via another topic) to wake up and begin consuming and processing messages from TopicB at offset or timestamp X and to stop at offset or timestamp Y. I may tell it to read the same sections of a topic repeatedly to perform different analyses of the data, and I might want the consumer to sit idle when it's not performing an instruction.
Is it possible to control a consumer in this fashion? Essentially, I'm interested to know if I can read sections of topics on demand to produce processing/reports on their contents, kind of similar to querying a relational DB via an admin console, I guess.
Thanks in advance!
The Kafka consumer is able to consume topics at arbitrary positions.
You can use the seek() method to start consuming from a specific offset. You can also use the offsetsForTimes() method to find the offsets for a specific timestamp.
You can combine these two methods to consume specific sections of topics on demand.
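For example, a sketch of reading a bounded section of TopicB on demand might look like this (the broker address, start timestamp, and stop offset are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class SectionReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "on-demand-reader");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("TopicB", 0);
        long startTimestamp = System.currentTimeMillis() - 3_600_000; // e.g. one hour ago
        long stopOffset = 1_000L;                                     // placeholder end offset

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no group rebalance

            // Translate the timestamp into an offset, then seek there.
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, startTimestamp));
            OffsetAndTimestamp start = offsets.get(tp);
            consumer.seek(tp, start != null ? start.offset() : 0L);

            boolean done = false;
            while (!done) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.offset() >= stopOffset) { done = true; break; }
                    // ... run whatever analysis/reporting is needed on record.value()
                }
            }
        }
    }
}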

Processing Unprocessed Records in Kafka on Recovery/Rebalance

I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In instances where, for example, my Spring Kafka-based application crashes (or even rebalances) and then comes back online with messages waiting in the topic, I'm currently using a strategy where the latest committed offsets for each partition are stored in an external store; on a consumer's assignment to a partition I look up that offset and seek to it to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not to miss any waiting messages? Or is there a better/more idiomatic way with Spring Kafka to handle this situation?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1. Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system (say, a database).
2. Use Kafka transactions. This is a simplified variant of option 1, as long as your side effects go to the same Kafka cluster you read from (see the sketch after this list).
3. Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (i.e. idempotence).
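A minimal sketch of option 2, assuming a plain consume-transform-produce loop (topic names and the transactional.id are placeholders; error handling and abortTransaction are omitted):

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TransactionalProcessor {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "eos-group");
        cProps.put("enable.auto.commit", "false");
        cProps.put("isolation.level", "read_committed");
        cProps.put("key.deserializer", StringDeserializer.class.getName());
        cProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("transactional.id", "eos-processor-1");
        pProps.put("key.serializer", StringSerializer.class.getName());
        pProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
                }
                // The consumed offsets are committed in the same transaction as the produced records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}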

Best practice for processing data with dependencies in Kafka?

We are developing an app that takes data from different sources; once the data is available we process it, put it together, and then move it to a different topic.
In our case we have 3 topics, and each of these topics brings data that is related to data from the other topics. Each generated entity may or may not be received at the same time (or within a short period of time), and this is where the problem arises, because these 3 entities need to be joined into one before we can move the result on to the next topic.
Our idea was to create a separate topic containing all the data that has not yet been processed, and then have a separate thread that checks that topic at fixed intervals and also checks whether the dependencies of each entity are available. If they are, we delete the entity from this separate topic; if not, we keep it there until it gets resolved.
At the end of all this explanation, my question is: is it reasonable to do it this way, or are there other good practices or strategies that Kafka provides to solve this kind of scenario?
Kafka messages may get cleaned up after some time based on the retention policy, so you need to store the unprocessed messages somewhere.
I can see the option below, though every problem has many possible approaches and solutions:
Process all messages and forward the "not processed" messages to another topic, say A.
Use the Kafka Processor API to consume messages from topic A and keep them in a state store.
Schedule a punctuate() method with a time interval.
Iterate over all messages kept in the state store.
Check the dependencies: if they are available, delete the message from the state store and process it, or publish it back to the original topic to be processed again. A sketch of this is shown below.
You can refer to the link below for reference:
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html
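A rough sketch of the parked-and-retried part of this approach, using the newer Processor API (the store name, the dependency check, and the 30-second interval are all assumptions):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import java.time.Duration;

public class PendingDependencyProcessor implements Processor<String, String, String, String> {
    private KeyValueStore<String, String> pendingStore;
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.pendingStore = context.getStateStore("pending-store");
        // Re-check pending entities every 30 seconds (wall-clock time).
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
    }

    @Override
    public void process(Record<String, String> record) {
        // Park every incoming entity until its dependencies are resolved.
        pendingStore.put(record.key(), record.value());
    }

    private void punctuate(long timestamp) {
        try (KeyValueIterator<String, String> it = pendingStore.all()) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                if (dependenciesAvailable(entry.key)) {          // hypothetical check
                    context.forward(new Record<>(entry.key, entry.value, timestamp));
                    pendingStore.delete(entry.key);
                }
            }
        }
    }

    private boolean dependenciesAvailable(String key) {
        // Placeholder: look up the data from the other two topics (e.g. via other state stores).
        return false;
    }
}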

kafka produce to topic and write to state store in a single transaction

Is it possible to produce to a Kafka topic and write to state store in a single transaction? But not start the transaction as part of a topic consumption.
EDIT: The reason I want to do this is to be able to filter out duplicate requests. E.g. a service exposes a REST interface and just writes a message to a topic. If it is possible to produce to the topic and write to the state store in a single transaction, then I can easily query the state store first to filter out a duplicate. This also assumes that the transaction timeout will be less than the REST timeout, but that is not really related to the question.
I am also aware of the solution provided here by Confluent. But this will work as long as the synchronisation time "from the topic to the store" is less than the blocking time.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/processor/StateStore.html
The state store is part of the Streams API, so it is tied to Kafka Streams. I would recommend using headers within a message to maintain state information.
Or
Create another topic to store intermediate information.
If I understand your use case properly, you can do it like this:
Write the REST call result to some topic, e.g. raw-data (using the producer).
Use Kafka Streams to process data from the raw-data topic. With Kafka Streams you can implement the whole logic of checking/filtering duplicates, etc., and write the result into a golden topic, as sketched below.
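A minimal sketch of that Streams step, assuming the REST request id is used as the message key and duplicates are detected via a key-value store (all topic and store names are placeholders):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {

    // Returns null for duplicates so they can be filtered out downstream.
    static class Dedup implements ValueTransformerWithKey<String, String, String> {
        private KeyValueStore<String, String> seen;

        @Override
        public void init(ProcessorContext context) {
            seen = (KeyValueStore<String, String>) context.getStateStore("seen-requests");
        }

        @Override
        public String transform(String requestId, String value) {
            if (seen.get(requestId) != null) {
                return null;            // this request id was already processed
            }
            seen.put(requestId, value); // remember it (a windowed/TTL store would bound growth)
            return value;
        }

        @Override
        public void close() {}
    }

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seen-requests"),
                Serdes.String(), Serdes.String()));

        builder.stream("raw-data", Consumed.with(Serdes.String(), Serdes.String()))
               .transformValues(Dedup::new, "seen-requests")
               .filter((key, value) -> value != null)
               .to("golden-topic", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}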

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming from it. This topic contains one type of event, a JSON message with a type field buried internally. Some messages will be consumed by some consumer groups and not by others; one consumer group will probably not be consuming many messages at all.
My question is:
Should I use the Consumer API, and on each event read the type field and drop or process the event based on it?
Or should I filter using the Streams API's filter method with a predicate?
After I consume an event, the plan is to process that event (DB delete, update, or other depending on the service) then if there is a failure I will produce to a separate queue which I will re-process later.
Thank you.
This seems to be more a matter of opinion. Personally, I would go with Streams/KSQL: likely less code for you to maintain. You can have another intermediary topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. Using Streams you can scale a single application across different machines, store state, have standby replicas, and more, which would be a PITA to do all yourself.
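For comparison, the Streams variant is only a few lines. This sketch assumes the type field sits at the top level of the JSON and that the topic names and the wanted type are placeholders:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class TypeFilterTopology {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, json) -> hasType(json, "ORDER_CREATED")) // keep only the types this service cares about
               .to("events-for-this-service", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }

    private static boolean hasType(String json, String wanted) {
        try {
            // The exact location of the type field depends on the message schema.
            return wanted.equals(MAPPER.readTree(json).path("type").asText());
        } catch (Exception e) {
            return false; // malformed messages are dropped; a dead-letter topic is another option
        }
    }
}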