What is the point of using Kafka in this example and why not use DB straightaway? - apache-kafka

Here is an example of how Kafka should run for a Social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka as they will be destroyed after some time. So kafka should be an intermediate storage between View and DB.
But why would we need it? Wouldn't it be better to use DB straightaway.
I guess that we could use kafka as some kind of cache so the data accumulates in Kafka and then we can insert it to DB in one big batch query. But I am pretty sure that is not the reason kafka here.

What's not shown in the diagram is the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end Javascript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source, and is performed in near real time, rather than a polling batch. In a streaming framework, you could also send out individual event hooks (websockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc without needing the user to update the page, or have the page load AJAX calls with large API responses for those details for all page rendered items.
More specifically, each Kafka Stream instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault tolerant.
Worth pointing out that Apache Pinot loaded from Kafka is more suited for such real time analytical queries than Kafka Streams.
Also as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although, Redis could be added as a cache, just like the later mentioned search service). And there's nothing preventing you from adding another database that's connected to Kafka Connect sink. For instance, a popular design is to write data to a RDBMS as well as Elasticsearch for text based search-indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.

Related

ksqlDB for finding average last hour, and store results back to a kafka topic?

We have a readpanda (kafka compatible) source, with sensor data. Can we do the following:
Every hour, find the average sensor data last hour for each sensor
Store them back to a topic
You want to create a materialized view over the stream of events that can be queried by other applications. Your source publishes the individual events to Kafka/Redpanda, another process observers the events and makes them available as queryable "tables" for other applications. Elaborating a few options:
KSQLdb is likely a default choice as it comes as "native" in the Kafka/Confluent stack. Be careful with using it over your production Kafka cluster. It has heavy impact on the cluster performance. See the basic tutorial or the advanced tutorial.
Use an out-of-the box solution for materialized views such as Materialize. It's easiest to setup/use and doesn't stress the Kafka broker. However, it is single-node only as of now (06/2022). See the tutorial.
Another popular option is using a stream processor and store hourly aggregates to an attached database (for example Flink storing data to Redis). This is a do-it-yourself approach. Have a look on Hazelcast. It is one process running both stream processing services and a queryable store.

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system and we have a couple of downstream systems which would require our data for their needs, currently we are publishing events onto Kafka topics as and when there is a change in source system for them to consume and make changes to their tables (all delta updates)
Our downstream systems is currently accessing our database directly sometimes to make complete refresh of their tables on demand basis once in a while to make sure data is in sync apart from subscribing to Kafka topics, as you know we always need a data refresh sometimes when we feel data is out of sync for some reason.
We are planning to stop giving access to our database directly, how can we achieve this ? Is there a way that consumers request us their data needs by any triggers like passing request to us and we can publish the stream of data for them to consume on their end and they sync the tables or get the bulk data into their memory to perform some tasks based on their needs.
We currently have written RESTful APIs to provide data based on the requests, but we are exposing small data volumes as I think APIs we only send smaller volumes of data, but it won't work when we want to send millions of data to consumers and I believe only way is to stream data on Kafka, but with Kafka how can we respond to the request from consumers and only pump that specific data on to Kafka topics for them to consume ?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.

Kafka Streams - Best way to do lookups in remote store via interactive queries?

I have a bit of confusion and I would like some clarification. I have something I'm working on. I want to have one Kafka Streams topology that will have five separate KStreams reading from their own respective topic and dumping that data into a large monolithic topic. Next I'll have a GlobalKTable that will read from that monolithic topic and materialize a global store let's say called lookupStore. I want to have this materialized global store as basically a "lookup" table for other Kafka Streams applications. I've done some reading on exposing this with an RPC layer with the application.server configuration which will be in the form of some unique host:port.
Now I want to have however many separate microservices each that are Kafka Streams applications that will perform are processing events from a KStream and then doing a lookup on lookupStore via an interactive query. For instance a .filter() operation based on whether the lookup on that lookupStore returned a value or not. So here's my confusion... let's assume I hardcode that exposed RPC layer on host:port how do I query lookupStore specifically to query it. If this was in the same topology/local instance you could just do something like lookupStore.get("key")... but how do you do this within a remote Kafka Streams instance?
Or does connecting to that RPC layer expose that state store to the remote application so that it "knows" of it and you can query the lookupStore like as if it was a local instance? Is this feasible or am I going down the wrong path?
If your microservices (which are streams applications) share the same Kafka cluster as the main streaming app (that generates GlobalKTable), then they can access the Table topic corresponding to the same application and do KTable join or lookupStore.get("key"). Also it is not recommended to do remote API calls within a stream application to do lookups, because of latency. If the two Kafak clusters are different, then you could explore replicating the topics (GlobalKTable and State Store change log topics) using something like mirror maker.

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a questions about Kafka Streams, more specifically, interviewer wanted to know why/when would you use Kafka Streams DSL over plain Kafka Consumer API to read and process streams of messages? I could not provide a convincing answer and wondering if others with using these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is build on top of KafkaProducers/Consumers so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what means "less".
When it comes to developing Kafka Streams API you can directly jump into your business logic applying methods like filter, map, join, or aggregate because all the consuming and producing part is abstracted behind the scenes.
When you are developing applications with plain Consumer/Producers you need to think about how you build your clients at the level of subscribe, poll, send, flush etc.
If you want to have even less complexity (but also less flexibilty) ksqldb is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So. let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filter data to a separate Kafka topic (using KStream.to() or KTable.to()), and finally using Kafka Connect, the messages will be stored into the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API also, but it would be much more coding.
In a data processing pipeline, you can do the consume-process-produce in a same transaction. So, in the above example, Kafka will ensure the exactly-once semantics and transaction from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders at the level of individual product. In such scenarios duplicates will always give you wrong result.
You can also enrich your incoming data with much low latency. Let's assume in the above example, you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network which will be definitely an expensive operation impacting your throughput. In such case, you might want to store the required customer data in a compacted Kafka topic and load it in the streaming application using KTable or GlobalKTable. And now, all you need to do a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams and also as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of materialized view pattern.
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such case, you may want to do a one hour windowed join. So, that if the order and the corresponding payment events come within a one hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one hour window and that state will be stored in the Rocks DB of Kafka Streams.

Kafka and microservices - Architecture questions

In a Microservices based architecture, who writes to Kafka? services themselves or the Microservices databases? I've been thinking about this and see pros and cons to both approaches but leaning towards having database write to Kafka topics because
Database and data in the Kafka topic won't go out of sync in case write to Kafka fails for whatever reason
Application teams won't have to have one more step to worry about
Applications can keep focusing on the core function rather than worrying about Kafka.
Thanks for your inputs
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capturing services such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach allows to ensure (eventual) consistency between a service's own database and Kafka messages sent to other services. On specific CDC application I'd recommend to look into is the outbox pattern. The idea there is to not capture changes to the service's actual business tables, but instead work with a separate "outbox table", into which the service writes specific messages meant for consumption by other services. CDC would then be used to sent these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers while also avoiding the issues of "dual writes" which a service would suffer from when directly writing to its database and Kafka. In Debezium there's some means of built-in support for the outbox pattern via a message transformation that helps to route the events from the outbox table into event-type specific Kafka topics.
Not all services need a database, they just emit data (logs, metrics, sensors, etc)
So, the answer would be either.
Plus, I'm not sure what database directly can export to Kafka, so you'd have some other service like Debezium deployed which would be polling those CDC records off the database
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, manage offsets, among other things