Kafka Streams Processor API clear state store - apache-kafka

I am using the Kafka Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder = Stores.keyValueStoreBuilder(
        storeSupplier, Serdes.String(), storeObjectSerde);

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addProcessor("processor", () -> new CustomProcessor("storeName"), "SourceReadername")
        .addStateStore(storeBuilder, "processor") // define store for processor
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on an event arriving on a separate topic. I have not been able to find the right way to join with another stream using the Processor API, or some other way to listen for events on another topic, in order to trigger the cleanup code in the CustomProcessor class.
Is there a way to receive events from another topic in the Processor API? Or could I mix the DSL with the Processor API to join the two, so that events from either topic reach the process() method and I can run the cleanup code when an event arrives on the cleanup topic?
Thanks

You just need to add another input topic (addSource) and add a Processor that handles messages from that topic and, based on them, removes entries from the state store. One note: both topics should use the same keys (because of partitioning).
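A minimal sketch of what that could look like, building on the topology from the question (the cleanup topic, its serde, the "CleanupReader" source name, the CleanupEvent value type, and the CleanupProcessor class are assumptions, not part of the original code):

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addSource("CleanupReader", stringDeserializer, cleanupSerde.deserializer(), "cleanupTopic")
        .addProcessor("processor", () -> new CustomProcessor("storeName"), "SourceReadername")
        .addProcessor("cleanupProcessor", () -> new CleanupProcessor("storeName"), "CleanupReader")
        .addStateStore(storeBuilder, "processor", "cleanupProcessor") // one store shared by both processors
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");

// Sketch of the cleanup processor, using the same (older) Processor API style as the question
public class CleanupProcessor extends AbstractProcessor<String, CleanupEvent> {
    private final String storeName;
    private KeyValueStore<String, StoreObject> store;

    public CleanupProcessor(String storeName) {
        this.storeName = storeName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, StoreObject>) context.getStateStore(storeName);
    }

    @Override
    public void process(String key, CleanupEvent event) {
        store.delete(key); // works because both topics are keyed identically
    }
}

Since both processors are listed in addStateStore, they see the same store within a task (provided the two topics are co-partitioned), so the cleanup processor deletes exactly the entries the main processor wrote for that key.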

Related

What happens to the consumer offset when an error occurs within a custom class in a KStream topology?

I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly understands which records have gone through successfully, and can therefore correctly commit the consumer offset, so that when the microservice has to be restarted it will continue reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a long startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this:
KStream<String, Prediction> stream = new StreamsBuilder()
        .stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
        // talk to web service
        .map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
        .flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
        .to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one, or more predictions for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you might have a problem if you are using map: combining remote calls with a DSL operator is not recommended. You might want to look into the Processor API docs. With the ProcessorContext you can forward() records or commit() explicitly, which could give you the flexibility you need.
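As a rough sketch of that direction (the Message value type and WebServiceWrapper name are stand-ins for the question's MessageSerde/wrapper; the batch size and the choice to buffer in memory are assumptions):

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Buffers up to BATCH_SIZE records, calls the external service, forwards the
// resulting predictions downstream, and only then requests an offset commit.
public class BatchingProcessor implements Processor<String, Message, String, Prediction> {

    private static final int BATCH_SIZE = 1000; // assumption
    private final List<Record<String, Message>> batch = new ArrayList<>();
    private final WebServiceWrapper wrapper;    // stand-in for the question's "wrapper"
    private ProcessorContext<String, Prediction> context;

    public BatchingProcessor(WebServiceWrapper wrapper) {
        this.wrapper = wrapper;
    }

    @Override
    public void init(ProcessorContext<String, Prediction> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, Message> record) {
        batch.add(record);
        if (batch.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        for (Record<String, Message> input : batch) {
            // one call per record here for clarity; a real batched call would have to
            // map the service's responses back to their originating records
            for (Prediction prediction : wrapper.consume(input.value().getPayload())) {
                context.forward(input.withValue(prediction));
            }
        }
        batch.clear();
        context.commit(); // request a commit only after the whole batch has been forwarded
    }

    @Override
    public void close() {
        flush(); // best effort: drain whatever is left
    }
}

Note that an in-memory buffer is still lost if the instance crashes before flush(); for stronger guarantees the batch itself would have to live in a state store. The sketch only illustrates the forward()/commit() hooks mentioned above.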

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And since the consumer group.id is based on the application.id (which is identical for all instances), every instance sees only part of the topic.
This all makes perfect sense of course, and we make use of it for the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to receive those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
one control message per key (so every active partition would be getting at least one control message)
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every partition receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However this sounds complex and dirty enough to wonder if there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation:
Adds a global StateStore to the topology. The StateStore sources its data from all partitions of the provided input topic. There will be exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
                                     String topic,
                                     Consumed consumed,
                                     ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier, whose get() method returns a Processor that will be executed for every new message. That Processor's process() method is called each time a new message arrives on the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent. Because it is sourced directly from the input topic, the store can be rebuilt from that topic even if the streams instance's local data (state) is deleted.
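A minimal sketch of wiring this up (the topic, store name, and serdes are placeholders; this uses the processor.api types available since Kafka 2.7, and note that a global store must have logging disabled):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class ControlStoreExample {

    public static StreamsBuilder withControlStore(StreamsBuilder builder) {
        ProcessorSupplier<String, String, Void, Void> stateUpdateSupplier =
                () -> new Processor<String, String, Void, Void>() {
                    private KeyValueStore<String, String> store;

                    @Override
                    public void init(ProcessorContext<Void, Void> context) {
                        store = context.getStateStore("control-store");
                    }

                    @Override
                    public void process(Record<String, String> record) {
                        // Every instance sees every control message; react to it here and
                        // keep the store up to date (this processor is responsible for
                        // maintaining the global store).
                        store.put(record.key(), record.value());
                    }
                };

        builder.addGlobalStore(
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("control-store"),
                        Serdes.String(), Serdes.String())
                      .withLoggingDisabled(),               // required for global stores
                "control-topic",
                Consumed.with(Serdes.String(), Serdes.String()),
                stateUpdateSupplier);
        return builder;
    }
}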

Kafka + Streams as Event Store in a CQRS application - Command Model consistency

I've been reading a few articles about using Kafka and Kafka Streams (with a state store) as an Event Store implementation.
https://www.confluent.io/blog/event-sourcing-using-apache-kafka/
https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
The implementation idea is the following:
Store entity changes (events) in a kafka topic
Use Kafka Streams with a state store (which by default uses RocksDB) to update and cache the entity snapshot
Whenever a new Command is executed, get the entity from the store, execute the operation on it, and continue with step #1
The issue with this workflow is that the state store is updated asynchronously (step 2), so when a new command is processed, the retrieved entity snapshot might be stale (it may not yet include the events from previous commands).
Is my understanding correct? Is there a simple way to handle such case with kafka?
Is my understanding correct?
As far as I have been able to tell, yes -- which means that it is an unsatisfactory event store for many event-sourced domain models.
In short, there's no support for "first writer wins" when adding events to a topic, which means that Kafka doesn't help you ensure that the topic satisfies its invariants.
There have been proposals/tickets to address this, but I haven't found evidence of progress.
https://issues.apache.org/jira/browse/KAFKA-2260
https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish
Yes, there is a simple way.
Use a key for the Kafka messages. Messages with the same key always* go to the same partition.
One consumer can read from one or many partitions, but a single partition cannot be read by two consumers of the same group simultaneously.
The maximum number of working consumers is therefore <= the number of partitions of a topic. You can create more consumers, but the extra ones will only act as standby/backup nodes.
Here is an example.
Assumptions:
There is a Kafka topic abc with partitions p0, p1.
There is consumer C1 consuming from p0, and consumer C2 consuming from p1. The consumers work asynchronously.
km(key, command) denotes a Kafka message.
# Producing messages
km(key1,add) -> p0
km(key2,add) -> p1
km(key1,edit) -> p0
km(key3,add) -> p1
km(key3,edit) -> p1
# Consumer C1 will read km(key1,add), km(key1,edit), and the order will be preserved
# Consumer C2 will read km(key2,add), km(key3,add), km(key3,edit)
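As a concrete sketch of the producer side of this (the broker address and String serdes are assumptions; the topic and keys follow the example above):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedCommandProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // same key => same partition => per-key ordering is preserved for the consumer
            producer.send(new ProducerRecord<>("abc", "key1", "add"));
            producer.send(new ProducerRecord<>("abc", "key1", "edit"));
            producer.send(new ProducerRecord<>("abc", "key3", "add"));
            producer.send(new ProducerRecord<>("abc", "key3", "edit"));
        }
    }
}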
If you write commands to Kafka and then materialize a view in Kafka Streams, the materialized view will be updated asynchronously. This helps you separate writes from reads so that the read path can scale.
If you want consistent read-write semantics over your commands/events, you might be better off writing to a database. Events can either be extracted from the database into Kafka using a CDC connector (write-through), or you can write to the database and then to Kafka in a transaction (write-aside).
Another option is to implement long polling on the read side (so if you write trade1.version2 and then want to read it again, the read will block until trade1.version2 is available). This isn't suitable for all use cases, but it can be useful.
Example here: https://github.com/confluentinc/kafka-streams-examples/blob/4eb3aa4cc9481562749984760de159b68c922a8f/src/main/java/io/confluent/examples/streams/microservices/OrdersService.java#L165
The Command pattern that you want to implement is already part of the Akka framework. I don't know whether you have experience with that framework or not, but I strongly advise you to look there before you implement your own solution.
Also, given the volume of events that we receive in today's IT, I advise integrating it with a state machine.
If you would like to see how it can all be put together, please check my blog :)

kafka produce to topic and write to state store in a single transaction

Is it possible to produce to a Kafka topic and write to a state store in a single transaction? But not start the transaction as part of a topic consumption.
EDIT: The reason I want to do this is to be able to filter out duplicate requests. E.g. a service exposes a REST interface and just writes a message to a topic. If it is possible to produce to the topic and write to the state store in a single transaction, then I can easily query the state store first to filter out duplicates. This also assumes that the transaction timeout will be less than the REST timeout, but that is not strictly related to the question.
I am also aware of the solution provided here by Confluent, but it only works as long as the synchronisation time "from the topic to the store" is less than the blocking time.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/processor/StateStore.html
State stores are part of the Streams API, so a state store is tied to a Kafka Streams application. I would recommend using headers within the message to maintain state information (see the sketch below).
Or
Create another topic to store intermediate information.
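A rough sketch of the header idea (the topic name, header key, and request id are made up for illustration):

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HeaderedRequestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("requests", "some-key", "payload");
            // carry a request id as a header so downstream consumers can recognise duplicates
            record.headers().add("request-id", "req-12345".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}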
If I understand your use case properly, you can do it like this:
Write the REST call result to some topic, e.g. raw-data (using the producer).
Use Kafka Streams to process the data from the raw-data topic. With Kafka Streams you can implement the whole logic of checking/filtering duplicates, etc., and write the result into a golden topic.
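A minimal sketch of the duplicate-filtering step (topic and store names are placeholders, the record key is assumed to be the request id used for deduplication, and nothing here expires old ids, which a real implementation would need):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {

    public static Topology build() {
        ProcessorSupplier<String, String, String, String> dedupSupplier =
                () -> new DedupProcessor("seen-ids");

        Topology topology = new Topology();
        topology.addSource("raw-source", Serdes.String().deserializer(), Serdes.String().deserializer(), "raw-data");
        topology.addProcessor("dedup", dedupSupplier, "raw-source");
        topology.addStateStore(
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("seen-ids"),
                        Serdes.String(), Serdes.String()),
                "dedup");
        topology.addSink("golden-sink", "golden", Serdes.String().serializer(), Serdes.String().serializer(), "dedup");
        return topology;
    }

    static class DedupProcessor implements Processor<String, String, String, String> {
        private final String storeName;
        private KeyValueStore<String, String> store;
        private ProcessorContext<String, String> context;

        DedupProcessor(String storeName) {
            this.storeName = storeName;
        }

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore(storeName);
        }

        @Override
        public void process(Record<String, String> record) {
            if (store.get(record.key()) == null) {   // first time we see this request id
                store.put(record.key(), record.value());
                context.forward(record);             // only non-duplicates reach the golden topic
            }
        }
    }
}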

Kafka Streams - Processor API - Forward to different topics

I have a Processor API Processor which internally forwards to several separate sinks (think of an event classifier, although it also has stateful logic spanning the events). I was thinking of later having a join between two of those topics. Once a join is made, I forward an updated (enriched) version of the elements to the topics I'm actually joining.
How would you mix in the DSL if your Processor API code forwards to more than one sink (sink1, sink2), which in turn write to topics?
I guess you could create separate streams, like
val stream1 = builder.stream(outputTopic)
val stream2 = builder.stream(outputTopic2)
and build from there? However, this creates more sub-topologies - what are the implications here?
Another possibility is to have your own state store in the Processor API and manage the join there, in the same Processor (I'm actually doing that). It adds complexity to the code, but wouldn't it be more efficient? For example, you can delete data you no longer need (once a join is made, you can forward the newly joined data to the sinks and it is no longer eligible for a join). Are there any other efficiency gotchas?
The simplest way might be to mix the Processor API into the DSL by starting with a StreamsBuilder and using transform():
StreamsBuilder builder = new StreamsBuilder();
KStream[] streams = builder.stream("input-topic")
        .transform(/* put your Processor API code here */)
        .branch(...);
KStream joined = streams[0].join(streams[1], ...);
Writing the intermediate streams into topics first and reading them back is also possible. The fact that you get more sub-topologies should be of no concern.
Doing the join manually via state stores is possible but hard to code correctly. If possible, I would recommend using the join operator provided by the DSL.
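For completeness, here is a slightly fuller (and entirely hypothetical) version of that skeleton, with made-up predicates, join window, and topic names; the Transformer body is where the existing Processor API logic would go:

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class MixedDslTopology {

    @SuppressWarnings("unchecked")
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String>[] classified = builder.<String, String>stream("input-topic")
                .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                    @Override
                    public void init(ProcessorContext context) { /* fetch state stores here */ }

                    @Override
                    public KeyValue<String, String> transform(String key, String value) {
                        return KeyValue.pair(key, value); // existing stateful classification logic goes here
                    }

                    @Override
                    public void close() { }
                })
                .branch((key, value) -> value.startsWith("A"),   // what used to go to sink1
                        (key, value) -> value.startsWith("B"));  // what used to go to sink2

        classified[0]
                .join(classified[1],
                      (a, b) -> a + "|" + b,
                      JoinWindows.of(Duration.ofMinutes(5)),
                      StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
                .to("enriched-topic");

        return builder;
    }
}

(In newer releases transform() and branch() are deprecated in favour of process() and split(), but they match the snippet above.)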