What's a viable solution for deferred processing of events? - apache-kafka

Given a system that consumes an event stream from Kafka in order to analyze records stored in a database.
In some cases an event matches a condition which means that the corresponding record should be analyzed again later.
Perhaps the simplest way to implement this is to write a timestamp for the future processing to the database and periodically run some kind of select to find the records that are due for re-processing.
Is there a more convenient and scalable way to do it? Conceptually it looks like another timestamped event stream that should be processed once the current time becomes greater than or equal to the event's timestamp. What are the options for implementing such behavior?

In my opinion, depending on how long you need to store the events, you can just create a stream that filters for these events and pushes them into a new topic that can be processed later. If it is more for historical purposes, it might be better to push them into a DBMS.
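As a minimal sketch of the first idea (the topic names and the needsDeferredProcessing() predicate are assumptions for illustration):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
// route events that match the "analyze later" condition into a dedicated topic
// that a separate application or scheduled job can consume at a later time
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
       .filter((key, value) -> needsDeferredProcessing(value)) // hypothetical predicate
       .to("events-deferred", Produced.with(Serdes.String(), Serdes.String()));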

You can try a state store in Kafka Streams, which can be used by stream processing applications to store and query data later.
Kafka Streams automatically creates and manages such state stores when you call stateful operators such as count() or aggregate(), or when you window a stream. Stores can be kept in memory, but you can also use persistent stores (RocksDB on local disk, possibly backed by persistent volumes such as Portworx) to handle fault scenarios.
Below is how you initialize the StateStore:
StoreBuilder<KeyValueStore<String, String>> statStore = Stores
        .keyValueStoreBuilder(Stores.persistentKeyValueStore("uniqueName"), Serdes.String(), Serdes.String())
        .withLoggingDisabled(); // disable backing up the store to a changelog topic
Below is how you add the state store to a Kafka Streams topology:
Topology builder = new Topology();
builder.addSource("Source", topic)
       .addProcessor("SourceProcessName", () -> new ProcessorClass(), "Source")
       .addStateStore(statStore, "SourceProcessName")
       .addSink("SinkProcessName", sinkTopic, "SourceProcessName");
In the process() method, you can store a Kafka topic message as a key/value pair and read it back later:
KeyValueStore<String, String> dsStore = (KeyValueStore<String, String>) context.getStateStore("uniqueName");
dsStore.put(key, value); // key and value as received by process(key, value)
KeyValueIterator<String, String> iter = dsStore.all();
while (iter.hasNext()) {
    KeyValue<String, String> entry = iter.next();
}
iter.close(); // iterators over RocksDB-backed stores must be closed
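Tying this back to the original question about deferred processing: a punctuator can periodically scan such a store and forward the records whose timestamp has come due. The following is only a sketch under assumed conventions (store name "deferred-events", value format "<dueEpochMillis>|<payload>"):
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch of a processor that parks events in a state store and re-emits them
// once their due timestamp has passed.
public class DeferredProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("deferred-events");
        // check the store for due events once a minute, based on wall-clock time
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, this::releaseDueEvents);
    }

    @Override
    public void process(String key, String value) {
        store.put(key, value); // park the event until its due time
    }

    private void releaseDueEvents(long now) {
        try (KeyValueIterator<String, String> iter = store.all()) {
            while (iter.hasNext()) {
                KeyValue<String, String> entry = iter.next();
                long dueAt = Long.parseLong(entry.value.split("\\|", 2)[0]);
                if (dueAt <= now) {
                    context.forward(entry.key, entry.value); // send downstream (e.g. to a connected sink topic)
                    store.delete(entry.key);
                }
            }
        }
    }

    @Override
    public void close() { }
}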

Related

Kafka Ktable changelog (using toStream()) is missing some ktable updates when several messages with the same key arrive at the same time

I have an input stream that I use to create a KTable. Then I create an output stream from the KTable changelog, using the toStream() method. The problem is that the stream created by toStream() does not contain all the messages from the input stream that have updated my KTable. Here is my code:
final KTable<String, event> KTable = inputStream.groupByKey().aggregate(
        () -> null,
        aggregateKtableMethod,
        storageConf);
KStream<String, event> outputStream = KTable.toStream();
I would like to get one message in the outputStream for each message in the inputStream. For most of the messages it works well, but I am losing some events in a particular case: if I receive 2 messages with the same key within a small interval of time (less than 5 seconds), I only receive the second event in the outputStream.
I think it is because the KTable updates are made by some batch operation, but I can't find any configuration or documentation related to it. Is this the reason for the missing events, and do you know how to change the configuration so that I don't lose any messages?
I found the solution. The issue was in the "storageConf" I used to create my KTable: caching was enabled. I just had to disable it with:
storageConf.withCachingDisabled();
final KTable<String, event> KTable = inputStream.groupByKey().aggregate(
        () -> null,
        aggregateKtableMethod,
        storageConf);
Now I have all my events in the output stream.
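Disabling the cache is what makes every single update visible downstream: with caching enabled, several updates to the same key within the commit interval are collapsed and only the latest one is forwarded to toStream(). If you prefer to disable caching globally rather than per store, a sketch of the streams configuration (application id and bootstrap servers are placeholders):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers
// a cache size of 0 disables record caching for all state stores, so every
// KTable update is forwarded downstream instead of being collapsed
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);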

Kafka Streams Processor API clear state store

I am using the Kafka Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder = Stores.keyValueStoreBuilder(storeSupplier,
        Serdes.String(), storeObjectSerde);
topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addProcessor("processor", () -> new CustomProcessor("store"), "SourceReadername")
        .addStateStore(storeBuilder, "processor") // define store for processor
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on an event coming in on a separate topic. I am not able to find the right way to join with another stream using the Processor API, or some other way to listen to events on another topic, in order to trigger the cleanup code in the CustomProcessor class.
Is there a way to consume events from another topic in the Processor API? Or can I mix the DSL with the Processor API so that events from either topic reach the process() method, allowing me to run the cleanup code when an event is received on the cleanup topic?
Thanks
You just need to add another source topic (addSource) and a processor that reads the messages from that topic and, based on them, removes entries from the state store, as sketched below. One note: both topics should use the same keys (because of partitioning).
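A sketch of that topology, extending the code from the question (the cleanup topic, the "CleanupReaderName" source, and the CleanupProcessor class, which would call delete() on the store in its process() method, are assumptions):
topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        // hypothetical second source for the cleanup events (same keys as sourceTopic)
        .addSource("CleanupReaderName", stringDeserializer, stringDeserializer, "cleanupTopic")
        .addProcessor("processor", () -> new CustomProcessor("store"), "SourceReadername")
        // hypothetical processor that removes keys from the store when a cleanup event arrives
        .addProcessor("cleanupProcessor", () -> new CleanupProcessor("storeName"), "CleanupReaderName")
        // register the same store with both processors so both can access it
        .addStateStore(storeBuilder, "processor", "cleanupProcessor")
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");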

Maintain separate KTable

I have a topic which contains user connection and disconnection events for each session. I would like to use Kafka Streams to process this topic and update a KTable based on some condition. A single record cannot update the KTable on its own, so I need to process multiple records to know whether the KTable has to be updated.
For example, process the stream and aggregate by user and then by session id. If at least one session id of that user has only a Connected event, the KTable must be updated to mark the user online, if not already.
If all session ids of the user have a Disconnected event, the KTable must be updated to mark the user offline, if not already.
How can I implement such logic?
Can we materialize this KTable on all application instances so that each instance has the data available locally?
Sounds like a rather complex scenario.
Maybe it's best to use the Processor API for this case? A KTable is basically just a key-value store, and the Processor API allows you to apply complex processing to decide whether you want to update the state store or not. A KTable itself does not allow you to apply complex logic; it applies every update it receives.
Thus, using the DSL, you would need to do some pre-processing and send an update record into the table only when necessary. Something like this:
KStream stream = builder.stream("input-topic");
// apply your processing and write an update record into `updates` when necessary
KStream updates = stream...
KTable table = updates.toTable();
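One possible DSL sketch of such pre-processing (the topic name, the value format "CONNECTED:<sessionId>" / "DISCONNECTED:<sessionId>", and counting open sessions instead of tracking individual session ids are simplifying assumptions):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();

// count the open sessions per user: +1 on a connect, -1 on a disconnect
KTable<String, Long> openSessions = builder
        .stream("session-events", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()
        .aggregate(
                () -> 0L,
                (userId, event, count) ->
                        event.startsWith("CONNECTED") ? count + 1 : Math.max(0L, count - 1),
                Materialized.with(Serdes.String(), Serdes.Long()));

// a user is ONLINE while at least one session is open, OFFLINE otherwise
KTable<String, String> userStatus =
        openSessions.mapValues(count -> count > 0 ? "ONLINE" : "OFFLINE");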

Can I empty the local kafka state store

Currently I have 3 kafka brokers with 150 partitions.
I also have 3 consumers, each assigned to a group of partitions.
Each consumer has its own local state store backed by RocksDB. This key-value store is queried during gRPC calls. During rebalancing (if a consumer disappears), the data is written to the local stores of the other consumers.
If the consumers have been running for around 2 weeks, it seems that the services run out of memory.
Is there a solution to the local storage growing too much? Can we remove data of partitions that are no longer needed? Or is there a way to remove the stored data after the consumer is restored?
You can use the cleanUp() method when starting or shutting down the Kafka Streams application to clean up the state storage.
cleanUp()
Do a cleanup of the local StateStore by deleting all data with regard to the application ID. May only be called either before this KafkaStreams instance is started by calling start(), or after the instance is closed by calling close().
KafkaStreams app = new KafkaStreams(builder.build(), props);
// Delete the application's local state.
// Note: In real application you'd call `cleanUp()` only under
// certain conditions. See tip on `cleanUp()` below.
app.cleanUp();
app.start();
Note: To avoid the corresponding recovery overhead, you should not call cleanUp() by default but only if you really need to. Otherwise, you wipe out local state and trigger an expensive state restoration. You won't lose data and the program will still be correct, but you may slow down startup significantly (depending on the size of your state).
In case you want to delete entries from the state store during the life cycle of your Kafka Streams application, you can very well remove them from the state store; after all, it is just a key-value store backed by RocksDB.
Assume you are using a Kafka Streams Processor:
KeyValueStore<String, String> dsStore = (KeyValueStore<String, String>) context.getStateStore("localstorename");
try (KeyValueIterator<String, String> iter = dsStore.all()) {
    while (iter.hasNext()) {
        KeyValue<String, String> entry = iter.next();
        dsStore.delete(entry.key); // remove the entry from the local store
    }
}

How to store only latest key values in a kafka topic

I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
I thought a KTable's whole purpose was to store the latest value for a given key rather than the whole stream of events. However, I can't seem to get this to work. Running the code below produces the state store, but that store (maintopiclatest) has a stream of events in it (not just the latest values). So if I send a request with 1000 records to the topic twice, rather than seeing 1000 records, I see 2000 records.
var serializer = new KafkaSpecificRecordSerializer();
var deserializer = new KafkaSpecificRecordDeserializer();
var stream = kStreamBuilder.stream("maintopic",
        Consumed.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));
var table = stream
        .groupByKey()
        .reduce((aggV, newV) -> newV, Materialized.as("maintopiclatest"));
The other problem is that if I want to store the KTable in a new topic, I'm not sure how to do that. It seems that I have to turn it back into a stream so that I can call .to() on it, but then that stream contains the whole sequence of events, not just the latest values.
This is not how a KTable works.
A KTable itself has an internal state store and stores exactly one record per key. However, a KTable is constantly updated and is subject to the so-called stream-table duality. Each update to the KTable is sent downstream as a changelog record: https://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables. Thus, each input record results in an output record.
Because it's stream processing, there is no final "latest value per key".
I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
At which point in time do you want a KTable to emit an update? There is no answer to this question because the input stream is conceptually infinite.
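To illustrate that point with the question's code: turning the table back into a stream and writing it out emits one changelog record per input record, not one record per key; only log compaction on the target topic would, over time, retain just the latest record per key (a sketch; the output topic name is an assumption):
// continuing from the question's code: write the table's changelog stream out
table.toStream()
     .to("maintopiclatest-output",
         Produced.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));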