Maintain separate KTable - apache-kafka

I have a topic that contains connection and disconnection events for each user session. I would like to use Kafka Streams to process this topic and update a KTable based on some condition. A single record cannot update the KTable on its own, so I need to process multiple records to know whether the KTable has to be updated.
For example, process the stream and aggregate by user and then by session id. If at least one session id of that user has only a Connected event, the KTable must be updated to mark the user online if it is not already.
If all session ids of the user have a Disconnected event, the KTable must be updated to mark the user offline if it is not already.
How can I implement such a logic?
Can we implement this KTable in all application instances so that each instance has this data available locally?

Sounds like a rather complex scenario.
Maybe it's best to use the Processor API for this case? A KTable is basically just a key-value store, and the Processor API allows you to apply complex processing to decide whether you want to update the state store or not. A KTable itself does not allow you to apply complex logic; it simply applies each update it receives.
Thus, using the DSL, you would need to do some pre-processing and send an update record downstream only when the KTable should actually change. Something like this:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("input-topic");
// apply your processing and write a record into `updates` only when the table should change
KStream<String, String> updates = stream... // e.g., via a custom processor or flatMapValues()
KTable<String, String> table = updates.toTable(); // requires Kafka Streams 2.5+
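If you do go the Processor API route instead, a simplified sketch (one possible approach, not the definitive implementation) could track the number of currently connected sessions per user in a state store and forward a record only when the user's status actually flips. The store name "connected-sessions", the value format "sessionId:CONNECTED" / "sessionId:DISCONNECTED", and the output values "ONLINE"/"OFFLINE" are assumptions made for illustration; this uses the newer Processor API available in Kafka Streams 2.7+.

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical sketch: record key = userId, record value = "sessionId:CONNECTED" or "sessionId:DISCONNECTED".
public class UserStatusProcessor implements Processor<String, String, String, String> {

    private KeyValueStore<String, Integer> connectedSessions; // per-user count of open sessions
    private ProcessorContext<String, String> context;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        // "connected-sessions" must be registered on the topology and connected to this processor
        this.connectedSessions = context.getStateStore("connected-sessions");
    }

    @Override
    public void process(final Record<String, String> record) {
        final String userId = record.key();
        final Integer stored = connectedSessions.get(userId);
        int count = stored == null ? 0 : stored;
        final boolean wasOnline = count > 0;

        if (record.value().endsWith(":CONNECTED")) {
            count++;                         // one more open session for this user
        } else {
            count = Math.max(0, count - 1);  // a session was closed
        }
        connectedSessions.put(userId, count);

        final boolean isOnline = count > 0;
        if (wasOnline != isOnline) {
            // forward only real status changes; a downstream toTable() then maintains the KTable
            context.forward(record.withValue(isOnline ? "ONLINE" : "OFFLINE"));
        }
    }
}

Depending on your Kafka Streams version, you would wire this in either through the DSL's process() (which returns a KStream from 3.3 on) or by adding it to the Topology directly, with the store registered via addStateStore(); a downstream toTable() over the forwarded updates then maintains the user-status KTable.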

Related

Reprocess enriched kstream data on insert/updates of KTable

Here is a hypothetical but analogous scenario to what I am trying to achieve using Kafka Streams.
I have streaming data, sales, that I want to enrich with infrequently changing lookup data, say users and items, for which I am planning to create KTables. I plan to push this enriched data to a topic and to a search engine using a Connect sink.
How do I ensure that updates to the user/item data trigger enrichment of past sales data as well, not only the new data ingested into the stream? As I understand it, KTable inserts/updates don't trigger any reprocessing of past stream data.
I believe this may be a common use case; at least, I may not be the first one to have such a need. Any guidance on a solution or workarounds?
If you want to update old data, that implies you want to do a table-table join. Note though that, in this case, all the data of both inputs needs to be held by the application.
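A minimal sketch of such a table-table join, assuming both topics are keyed by the same id and default String serdes are configured elsewhere (topic names are placeholders; if the keys differ, the foreign-key variant of KTable#join, available since Kafka Streams 2.4, is the fit):

StreamsBuilder builder = new StreamsBuilder();

// Both inputs are read as tables, so an update on either side
// recomputes the join result for the affected key.
KTable<String, String> sales = builder.table("sales-topic");
KTable<String, String> users = builder.table("users-topic");

KTable<String, String> enriched =
    sales.join(users, (sale, user) -> sale + " | " + user);

// every recomputed result is emitted downstream, e.g. to the topic your Connect sink reads
enriched.toStream().to("enriched-sales-topic");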

Kafka Streams Processor API: Write directly to kafka statestore and then to topic to avoid latency

I have an application which produces to a topic. In the same application, I query its materialized state store for getting the latest value for a particular key.
I have to query the state store immediately after the data is produced to the topic. Most of the time, the interactive query returns an old value, as it takes some time for the value to be updated in the state store.
I am thinking of changing the architecture, and I would like to know if it is possible to write directly to the state store first and then to the final topic, so that I can get the updated value as soon as it is available. If this is possible, and say I have multiple instances of the same application, how can I query the local state stores of the other instances, as I might not have a source stream?
I know interactive queries work on top of a Kafka Streams application with a source topic. But in my case, if I write directly to a state store, will RPC work? How can I achieve this?
Any help is much appreciated.

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and wonder if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use the plain KafkaProducer/KafkaConsumer. I would not dare to pick one over the other in general terms.
First of all, Kafka Streams is built on top of Kafka producers/consumers, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all of the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
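To make the trade-off concrete, here is a minimal sketch of a complete Streams application; the topic names and application id are placeholders, not anything from the question:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // only business logic here: subscribing, polling, committing and producing are handled by the framework
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null)
               .mapValues(value -> value.toUpperCase())
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}

The equivalent with plain consumers/producers would need an explicit poll loop, offset management, error handling, and a producer on top, which is easily several times the code.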
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages will be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API as well, but it would require much more code; a rough sketch follows below.
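For instance, the filter-and-fork step might look like this; the Order type, orderSerde, and the topic names are hypothetical, and the usual StreamsBuilder setup is assumed:

// hypothetical Order type and orderSerde; "orders" and "berlin-orders" are placeholder topics
KStream<String, Order> orders =
    builder.stream("orders", Consumed.with(Serdes.String(), orderSerde));

orders.filter((orderId, order) -> "Berlin".equals(order.getDeliveryCity()))
      .to("berlin-orders", Produced.with(Serdes.String(), orderSerde));

// Kafka Connect JDBC and Elasticsearch sink connectors then read "berlin-orders"
// and write to the database table and the search index.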
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and a transaction from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as the count of orders per individual product; in such scenarios duplicates will always give you a wrong result. Enabling it is essentially a configuration switch, shown below.
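Roughly like this (the exactly_once_v2 value requires Kafka Streams 3.0+ and brokers 2.5+; on older versions the value is "exactly_once"):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// consume-process-produce runs atomically, so aggregates such as order counts see no duplicates
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");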
You can also enrich your incoming data with very low latency. Let's assume that, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB that ships with Kafka Streams, and since the KTable is backed by a Kafka topic, the data in your streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern; a sketch of the lookup join follows below.
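A sketch of that lookup join, assuming a compacted customers topic keyed by customer id; the Customer, Order, and EnrichedOrder types and their serdes are hypothetical:

// customer data from a compacted topic, replicated in full to every application instance
GlobalKTable<String, Customer> customers =
    builder.globalTable("customers", Consumed.with(Serdes.String(), customerSerde));

KStream<String, Order> orders =
    builder.stream("orders", Consumed.with(Serdes.String(), orderSerde));

// local RocksDB lookup instead of a per-record REST call
KStream<String, EnrichedOrder> enriched = orders.join(
    customers,
    (orderId, order) -> order.getCustomerId(),                    // which key to look up
    (order, customer) -> new EnrichedOrder(order, customer.getEmail()));

enriched.to("enriched-orders", Produced.with(Serdes.String(), enrichedOrderSerde));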
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join: if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state is kept in the RocksDB store of Kafka Streams. A sketch of such a windowed join follows below.
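Again with hypothetical Payment and PaidOrder types and serdes; JoinWindows.ofTimeDifferenceWithNoGrace requires Kafka Streams 3.0+, and older versions would use JoinWindows.of(Duration.ofHours(1)) instead:

KStream<String, Order> orders =
    builder.stream("orders", Consumed.with(Serdes.String(), orderSerde));
KStream<String, Payment> payments =
    builder.stream("payments", Consumed.with(Serdes.String(), paymentSerde));

// both sides are buffered in RocksDB for up to one hour; an order proceeds only if a
// payment with the same key (e.g. order id) arrives within that window
KStream<String, PaidOrder> paidOrders = orders.join(
    payments,
    (order, payment) -> new PaidOrder(order, payment),
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)),
    StreamJoined.with(Serdes.String(), orderSerde, paymentSerde));

paidOrders.to("paid-orders", Produced.with(Serdes.String(), paidOrderSerde));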

Cost of KStream vs cost of KTable with respect to the state store

I am trying to better understand how to set up my cluster for running my Kafka Streams application. I'm trying to get a better sense of the volume of data that will be involved.
In that regard, while I can quickly see that a KTable requires a state store, I wonder whether creating a KStream from a topic immediately means copying the whole log of that topic into a state store, obviously in an append-only fashion I suppose. That is, especially if we want to expose the stream for querying?
Does Kafka automatically replicate the data into a state store as it arrives in the source topic when it is a KStream? As said above, this seems obvious for a KTable because of the updates, but for a KStream I just want confirmation of what happens.
State stores are created whenever a stateful operation is called or when windowing a stream.
You are right that a KTable requires a state store. A KTable is an abstraction of a changelog stream where each record represents an update. Internally it is implemented using RocksDB: the current values are kept in the state store and also written to a changelog topic, so the state store can be rebuilt from the changelog topic at any time.
A KStream is a different concept: it is an abstraction of a record stream, an unbounded data set in append-only format. It doesn't create any state store when reading a source topic.
Unless you need to see the updates as a changelog, it is fine to use a KStream instead of a KTable, as it avoids creating an unwanted state store. KTables are always more expensive than KStreams. It also depends on how you want to use the data.
If you want to expose the stream for querying, you need to materialize the stream into a state store, for example as sketched below.
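A minimal sketch of that materialization (the store and topic names are illustrative; toTable and StoreQueryParameters require Kafka Streams 2.5+):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// materialize the latest value per key into a named, queryable RocksDB store
builder.<String, String>stream("events")
    .toTable(Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("latest-events")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String()));

// once the KafkaStreams instance `streams` (built from this topology) is RUNNING, the store is queryable:
ReadOnlyKeyValueStore<String, String> store = streams.store(
    StoreQueryParameters.fromNameAndType("latest-events", QueryableStoreTypes.keyValueStore()));
String latestValue = store.get("some-key");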

kafka produce to topic and write to state store in a single transaction

Is it possible to produce to a Kafka topic and write to a state store in a single transaction, without starting the transaction as part of consuming a topic?
EDIT: The reason I want to do this is to be able to filter out duplicate requests. E.g., a service exposes a REST interface and just writes a message to a topic. If it is possible to produce to the topic and write to the state store in a single transaction, then I can easily query the state store first to filter out duplicates. This also assumes that the transaction timeout will be less than the REST timeout, but that is not really related to the question.
I am also aware of the solution provided by Confluent here, but it only works as long as the synchronisation time "from the topic to the store" is less than the blocking time.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/processor/StateStore.html
State stores are part of the Streams API, so a state store is tied to Kafka Streams. I would recommend using headers within a message to maintain state information.
Or
Create another topic to store intermediate information.
If I understand your use case properly, you can do it like this:
Write the REST call result to some topic, e.g. raw-data (using the producer).
Use Kafka Streams to process the data from the raw-data topic. With Kafka Streams you can implement the whole logic of checking/filtering duplicates, etc., and write the result into a golden topic. A rough sketch of the duplicate check is below.
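One way to sketch that duplicate check inside the Streams topology (the store name, the assumption that the record key is the request id, and the topic names are all illustrative; transformValues is deprecated in favor of processValues on 3.3+, but still works broadly):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// local store of already-seen request ids, backed by a changelog topic for fault tolerance
StoreBuilder<KeyValueStore<String, Long>> dedupStore = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("seen-requests"), Serdes.String(), Serdes.Long());
builder.addStateStore(dedupStore);

builder.<String, String>stream("raw-data")
    // assumes the record key is the request id
    .transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
        private KeyValueStore<String, Long> seen;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            seen = (KeyValueStore<String, Long>) context.getStateStore("seen-requests");
        }

        @Override
        public String transform(final String requestId, final String value) {
            if (seen.get(requestId) != null) {
                return null;                              // duplicate: dropped by the filter below
            }
            seen.put(requestId, System.currentTimeMillis());
            return value;                                 // first occurrence: pass through
        }

        @Override
        public void close() { }
    }, "seen-requests")
    .filter((requestId, value) -> value != null)
    .to("golden-topic");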