Here is a hypothetical but analogous scenario to what I am trying to achieve using Kafka Streams.
I have streaming data (sales) that I want to enrich with infrequently changing lookup data, say users and items, for which I am planning to create KTables. I plan to push the enriched data to a topic and to a search engine using a Connect sink.
How do I ensure that updates to the user/item data trigger re-enrichment of past sales data as well, not only of the new data ingested into the stream? As I understand it, KTable inserts/updates don't trigger any reprocessing of the stream's past data.
I believe this may be a common use case; at least I am probably not the first one to have such a need. Any guidance on solutions or workarounds?
If you want to update old data, it implies that you want to do a table-table join. Note, though, that in this case all data of both inputs needs to be held by the application.
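For the scenario above, that would be a KTable-KTable foreign-key join (available since Kafka 2.4): keep the sales as a table keyed by sale id and join it to the users table via the user id inside each sale, so that a later update to a user re-emits enriched records for all of that user's past sales. A minimal sketch, assuming hypothetical Sale/User/EnrichedSale types and topic names, with serde configuration omitted:
StreamsBuilder builder = new StreamsBuilder();
KTable<String, Sale> sales = builder.table("sales");    // keyed by saleId
KTable<String, User> users = builder.table("users");    // keyed by userId
// foreign-key join: extract the userId from each sale to look up the user
KTable<String, EnrichedSale> enriched = sales.join(
    users,
    sale -> sale.getUserId(),
    (sale, user) -> new EnrichedSale(sale, user));
enriched.toStream().to("enriched-sales");                // the Connect sink can read this topic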
Related
A question on KStream-KTable joins. Usually this kind of join is used for data enrichment, where the KTable provides reference data.
So the question is, when the KTable record gets an update, how do we go about updating the older records that we already processed, enriched and probably stored in some data store?
Are there any patterns that we can follow?
(Please assume a KTable-KTable join wouldn't be an option, as the KStream side would emit a large volume of changes.)
I tend to think of such joins as enriching the stream of data. In that view, records which have come through the join before an update to the KTable are "correct" at the time.
I can see two options to consider:
First, as a Kafka Streams option, would a KStream-KStream join work?
It sounds like that's the processing semantics that you'd like.
(Incidentally, I really like the docs for showing clear examples of when records are and are not emitted: https://kafka.apache.org/31/documentation/streams/developer-guide/dsl-api.html#kstream-kstream-join)
Second, since it sounds like you may be persisting the streaming data anyway, it may make sense to do query-time enrichment. Creating a view/join over the two tables in the data store may be a sane alternative to reprocessing data in the database.
There is a legacy service that writes values to the database.
I need to convert these values to events and then send them to Kafka.
I'm planning to build a service that checks for new records at a fixed delay, sends them, and also writes the IDs of the submitted records to a technical table, but maybe there is some other way, best practice, or pattern.
You may want to look into Debezium, which implements Change Data Capture (CDC) on relational and NoSQL data stores and streams the changes into Kafka.
https://github.com/debezium/debezium
https://debezium.io/documentation
Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and I am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use the plain KafkaProducer/KafkaConsumer clients. I would not dare to pick one over the other in general terms.
First of all, Kafka Streams is built on top of the Kafka producer/consumer clients, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing parts are abstracted away behind the scenes.
When you are developing applications with the plain consumer/producer clients, you need to think about how you build your clients at the level of subscribe(), poll(), send(), flush(), etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose for building your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams API over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with relative ease. So, let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, write the filtered data to a separate Kafka topic (using KStream.to(), or KTable.toStream().to()), and finally, using Kafka Connect, have the messages stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API too, but it takes much more code.
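To make that concrete, here is a rough sketch of both versions; Order and getDeliveryCity() are made-up types, topic names are illustrative, and serde/client configuration is assumed to happen elsewhere:
// Streams DSL: the consuming/producing plumbing is handled for you
StreamsBuilder builder = new StreamsBuilder();
builder.<String, Order>stream("customer-orders")
       .filter((orderId, order) -> "Berlin".equals(order.getDeliveryCity()))
       .to("berlin-orders");   // a Connect sink reads this topic into the DB and Elasticsearch

// Roughly the same with the plain clients: you own the poll loop, the sends and the offset commits
consumer.subscribe(Collections.singletonList("customer-orders"));
while (true) {
    ConsumerRecords<String, Order> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, Order> record : records) {
        if ("Berlin".equals(record.value().getDeliveryCity())) {
            producer.send(new ProducerRecord<>("berlin-orders", record.key(), record.value()));
        }
    }
    consumer.commitSync();
}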
In a data processing pipeline, you can do the consume-process-produce cycle in a single transaction. So, in the above example, Kafka can give you exactly-once semantics from the source topic through to the output topics (the final hop into the DB and Elasticsearch then depends on the sink connector writing idempotently). There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders per individual product; in such scenarios duplicates will always give you a wrong result.
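The switch for that is the processing guarantee. A sketch of the relevant configuration (constant names as of Kafka 3.x; older versions use "exactly_once" instead of EXACTLY_ONCE_V2):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");           // hypothetical application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// enable transactional, exactly-once processing for the whole consume-process-produce cycle
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);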
You can also enrich your incoming data with very low latency. Let's assume in the above example that you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which is definitely an expensive operation that hurts your throughput. In such a case, you might instead store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in that table for the customer email address. Note that the KTable data here is stored in the embedded RocksDB that comes with Kafka Streams, and because the KTable is backed by a Kafka topic, the data in the streaming application is continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
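A sketch of that local lookup, assuming a compacted "customers" topic keyed by customer id and made-up Order/Customer/EnrichedOrder types (serdes omitted):
StreamsBuilder builder = new StreamsBuilder();
GlobalKTable<String, Customer> customers = builder.globalTable("customers");
KStream<String, Order> orders = builder.stream("customer-orders");
KStream<String, EnrichedOrder> enriched = orders.join(
    customers,
    (orderId, order) -> order.getCustomerId(),                 // which customer record to look up
    (order, customer) -> new EnrichedOrder(order, customer.getEmail()));
enriched.to("enriched-orders");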
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for the one-hour window, and that state is kept in the RocksDB store of Kafka Streams.
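A sketch of such a windowed join; Order/Payment/PaidOrder are made-up types, both streams are assumed to be keyed by order id, and JoinWindows.ofTimeDifferenceWithNoGrace() is the Kafka 3.x spelling (older versions use JoinWindows.of()):
KStream<String, Order> orders = builder.stream("orders");
KStream<String, Payment> payments = builder.stream("payments");
// emit a joined record only if order and payment arrive within one hour of each other
KStream<String, PaidOrder> paidOrders = orders.join(
    payments,
    (order, payment) -> new PaidOrder(order, payment),
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)));
paidOrders.to("paid-orders");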
I have a topic that contains user connection and disconnection events for each session. I would like to use Kafka Streams to process this topic and update a KTable based on some condition. A single record is not enough to update the KTable, so I need to process multiple records to know whether the KTable has to be updated.
For example: process the stream and aggregate by user and then by session ID. If at least one session ID of that user has only a Connected event, the KTable must be updated to mark the user online, if it isn't already.
If all session IDs of the user have a Disconnected event, the KTable must be updated to mark the user offline, if it isn't already.
How can I implement such a logic?
Can we materialize this KTable on all application instances so that each instance has this data available locally?
Sounds like a rather complex scenario.
Maybe it's best to use the Processor API for this case? A KTable is basically just a key-value store, and using the Processor API allows you to apply complex processing to decide whether you want to update the state store or not. A KTable itself does not allow you to apply complex logic; it simply applies each update it receives.
Thus, using the DSL, you would need to do some pre-processing and emit an update record only for the cases in which the KTable should be updated. Something like this:
KStream<String, String> stream = builder.stream("input-topic");
// apply your processing and write an update record into `updates` only when necessary
KStream<String, String> updates = stream...   // e.g. a filter()/flatMapValues() chain that drops everything that is not an update
KTable<String, String> table = updates.toTable();
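If the per-session bookkeeping gets more involved, a Processor API sketch could look like the following. It bakes in a lot of assumptions: events are keyed by user id, values look like "<sessionId>:CONNECTED" or "<sessionId>:DISCONNECTED", a count of currently open sessions per user stands in for the full per-session logic, and String/Integer serdes are configured as defaults.
public class OnlineStatusProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, Integer> openSessions;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        this.openSessions = context.getStateStore("open-sessions");
    }

    @Override
    public void process(final Record<String, String> record) {
        final String userId = record.key();
        final String eventType = record.value().split(":")[1];     // CONNECTED or DISCONNECTED
        final Integer stored = openSessions.get(userId);
        final int before = stored == null ? 0 : stored;
        final int after = "CONNECTED".equals(eventType) ? before + 1 : Math.max(0, before - 1);
        openSessions.put(userId, after);
        // forward an update only when the user flips between offline and online
        if (before == 0 && after > 0) {
            context.forward(record.withValue("ONLINE"));
        } else if (before > 0 && after == 0) {
            context.forward(record.withValue("OFFLINE"));
        }
    }
}

// wiring the processor and its store into a topology
Topology topology = new Topology();
topology.addSource("events", "user-session-events");
topology.addProcessor("status", OnlineStatusProcessor::new, "events");
topology.addStateStore(
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("open-sessions"),
        Serdes.String(), Serdes.Integer()),
    "status");
topology.addSink("status-sink", "user-status", "status");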
I am trying to better understand how to set up my cluster for running my Kafka Streams application, and to get a better sense of the volume of data that will be involved.
In that regard, while I can quickly see that a KTable requires a state store, I wonder whether creating a KStream from a topic immediately means copying the whole log of that topic into a state store (in an append-only fashion, I suppose), especially if we want to expose the stream for querying.
Does Kafka automatically replicate the data into a state store as records arrive in the source topic when it is a KStream? As said above, this seems obvious for a KTable because of the updates, but for a KStream I just want confirmation of what happens.
State stores are created whenever a stateful operation is called or when a stream is windowed.
You are right that a KTable requires a state store. A KTable is an abstraction of a changelog stream where each record represents an update. Internally it is implemented using RocksDB: the updated values are stored in the state store and in a changelog topic. At any time, the state store can be rebuilt from the changelog topic.
A KStream is a different concept: it is an abstraction over a record stream, an unbounded dataset in append-only form. It doesn't create any state store when reading a source topic.
Unless you want to see the updated changelog, it is okay to use a KStream instead of a KTable, as it avoids creating an unwanted state store. KTables are always more expensive than KStreams. It also depends on how you want to use the data.
If you want to expose the stream for querying, you need to materialize the stream into a state store.
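For example, one way to make the latest value per key queryable (store and topic names are made up, and `streams` is the running KafkaStreams instance):
KStream<String, String> stream = builder.stream("source-topic");
KTable<String, String> latest = stream.toTable(Materialized.as("latest-values-store"));

// interactive query against the local state store
ReadOnlyKeyValueStore<String, String> store = streams.store(
    StoreQueryParameters.fromNameAndType("latest-values-store", QueryableStoreTypes.keyValueStore()));
String value = store.get("some-key");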