Filter Stream (Kafka?) to Dynamic Number of Clients - apache-kafka

We are currently designing a big data processing chain. We are currently looking at NiFi/MiNiFi to ingest data, do some normalization, and then export to a DB. After the normalization we are planning to fork the data so that we can have a real time feed which can be consumed by clients using different filters.
We are looking at both NiFi and/or Kafka to send data to the clients, but are having design problems with both of them.
With NiFi we are considering adding a websocket server which listens for new connections and adds their filter to a custom stateful processor block. That block would filter the data, tag it with the appropriate socket id if it matched a user's filter, and then generate X number of flow files to be sent to the matching clients. That part seems like it would work, but we would also like to be able to queue data in the event that a client connection drops for a short period of time.
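A minimal sketch of the stateful filter block described above, written in plain Python rather than as an actual NiFi processor. The class name FilterRouter, its method names, and the max_buffered limit are hypothetical; the bounded per-client buffer is one possible way to keep data for a client whose connection drops briefly, under the assumption that dropping the oldest records is acceptable.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Record = dict
Filter = Callable[[Record], bool]


class FilterRouter:
    """Routes each record to every client whose filter matches it (hypothetical helper)."""

    def __init__(self, max_buffered: int = 1000) -> None:
        self.max_buffered = max_buffered
        self.filters: Dict[str, Filter] = {}
        self.buffers: Dict[str, Deque[Record]] = {}

    def register(self, socket_id: str, flt: Filter) -> None:
        # Called when a client connects and submits its filter.
        self.filters[socket_id] = flt
        # A bounded deque silently drops the oldest records once it is full.
        self.buffers.setdefault(socket_id, deque(maxlen=self.max_buffered))

    def unregister(self, socket_id: str) -> None:
        # Called when a client is gone for good; its backlog is discarded.
        self.filters.pop(socket_id, None)
        self.buffers.pop(socket_id, None)

    def route(self, record: Record) -> None:
        # One reference to the record is queued per matching client.
        for socket_id, flt in self.filters.items():
            if flt(record):
                self.buffers[socket_id].append(record)

    def drain(self, socket_id: str) -> List[Record]:
        # Called while the client is connected (or after it reconnects).
        buf = self.buffers[socket_id]
        out = list(buf)
        buf.clear()
        return out


router = FilterRouter(max_buffered=2)
router.register("sock-1", lambda r: r["type"] == "alert")
router.register("sock-2", lambda r: r["value"] > 25)

for rec in [
    {"type": "alert", "value": 5},
    {"type": "info", "value": 50},
    {"type": "alert", "value": 20},
    {"type": "alert", "value": 30},
]:
    router.route(rec)

sock1 = router.drain("sock-1")  # only the 2 newest alerts survive the bound
sock2 = router.drain("sock-2")
```

The buffer here lives in memory, so it would not survive a restart of the filtering process; that is the part a durable log would have to cover.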
As an alternative we are looking at Kafka, but it doesn't appear to support the idea of a dynamic number of clients connecting. If we used Kafka Streams to filter the data, it appears we would need 1 topic per client, which would likely eventually overrun our ZooKeeper instance. Or we could use NiFi to filter and then insert into different partitions, sort of like here: Kafka work queue with a dynamic number of parallel consumers. But there is still a limit to the number of partitions that are supported, correct? Not to mention we would have to juggle the producer and consumers to read from the right partition as we scaled up and down.
Am I missing anything for either NiFi or Kafka? Or is there another project out there I should consider for this filtered data sending?

Related

What is the point of using Kafka in this example and why not use DB straightaway?

Here is an example of how Kafka should run for a Social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be destroyed after some time. So Kafka would be an intermediate storage between the View and the DB.
But why would we need it? Wouldn't it be better to use the DB straightaway?
I guess that we could use Kafka as some kind of cache, so the data accumulates in Kafka and then we can insert it into the DB in one big batch query. But I am pretty sure that is not the reason Kafka is used here.
What's not shown in the diagram is the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end JavaScript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source, and is performed in near real time, rather than a polling batch. In a streaming framework, you could also send out individual event hooks (websockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc without needing the user to update the page, or have the page load AJAX calls with large API responses for those details for all page rendered items.
More specifically, each Kafka Stream instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault tolerant.
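To illustrate the difference between a polling GROUP BY query and an aggregation kept up to date per event, here is a small pure-Python sketch. It is not Kafka Streams code; the class LikesPerPost and the event fields type and post_id are assumptions made for the example. The on_event return value stands in for the per-event push (for instance over a websocket), and query stands in for an interactive lookup against local state.

```python
from collections import defaultdict


class LikesPerPost:
    """Keeps a running count per post, updated one event at a time (hypothetical)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, event):
        # Only "like" events change the aggregate; everything else is ignored.
        if event["type"] != "like":
            return None
        self.counts[event["post_id"]] += 1
        # The updated value could be pushed to subscribed pages right away.
        return event["post_id"], self.counts[event["post_id"]]

    def query(self, post_id):
        # Point lookup against local state, no scan over stored events.
        return self.counts.get(post_id, 0)


agg = LikesPerPost()
events = [
    {"type": "like", "post_id": "p1"},
    {"type": "like", "post_id": "p2"},
    {"type": "share", "post_id": "p1"},
    {"type": "like", "post_id": "p1"},
]
updates = [u for u in (agg.on_event(e) for e in events) if u is not None]
```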
Worth pointing out that Apache Pinot loaded from Kafka is more suited for such real time analytical queries than Kafka Streams.
Also, as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned below). And there's nothing preventing you from adding another database that's connected to a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.

Apache Kafka as a REST Replacement?

I would like to harness the speed and power of Apache Kafka to replace REST calls in my Java application.
My app has the following architecture:
Producer P1 writes a search query to topic search
Consumer C1 reads/consumes the search query and produces search results which it writes to another topic search_results.
Both Producer P1 and Consumer C1 are part of a group of producers/consumers living on different physical machines. I need the Producer P1 server to be the one to consume/read the search results output produced by Consumer C1 so it can serve the search results to the client who submitted the search query.
The above example was simplified for demonstration purposes - in reality the process entails several additional intermediate Producers and Consumers where the query may be thrown to/from multiple servers to be processed. The main point is that the value produced by the last Producer needs to be read/consumed by the first Producer.
In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic.
I do not want to use REST for this purpose because it is very sloooooow when processing thousands of queries. Apache Kafka can handle millions of queries with 10 millisecond latency. In my particular use case it is critical that the query is transmitted with sub-millisecond speed. Scaling with REST is also much more difficult. Suppose our traffic increases and we need to add a dozen more servers to intercept client queries. With Apache Kafka it's as simple as adding new servers and adding them to the Producer P1 group. With REST not so simple. Apache Kafka also provides a very high level of decoupling which REST does not.
What design/architecture can be used to force a specific server/producer to consume the end result of the initial query?
Thanks
"In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic."
You can use a custom partitioner in your producer to determine which partition each search query (or result) lands in.
Similarly, you can use a custom partition assignor in the consumer to determine which partitions are assigned to which consumer. The relevant consumer configuration is partition.assignment.strategy.
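The idea of pairing a partitioner with an assignor can be sketched without any Kafka client, as below. This is an illustration only: the server names, the origin field, and the fixed one-partition-per-server mapping are assumptions, not part of any Kafka API. In a real Java client the producer side would roughly correspond to a class configured through partitioner.class and the consumer side to one configured through partition.assignment.strategy.

```python
# Hypothetical ids of the servers that both submit queries and read results.
SERVERS = ["server-a", "server-b", "server-c"]

# One results partition per originating server (an assumption of this sketch).
PARTITION_OF = {server: i for i, server in enumerate(SERVERS)}
OWNER_OF = {i: server for server, i in PARTITION_OF.items()}


def choose_partition(result: dict) -> int:
    # "Partitioner": the final producer sends each result to the partition
    # reserved for the server that originally produced the query.
    return PARTITION_OF[result["origin"]]


def assigned_partitions(server: str) -> list:
    # "Assignor": each server is only ever given its own partition.
    return [p for p, owner in OWNER_OF.items() if owner == server]


result = {"origin": "server-b", "query": "kafka", "hits": ["doc-1", "doc-7"]}
partition = choose_partition(result)
reader = OWNER_OF[partition]
```

With this mapping the number of partitions has to grow with the number of front-end servers, which is the scaling trade-off to keep in mind.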
The fact Kafka is faster than REST is due to the way it is implemented. What is important here is to decide which pattern works for you - request-response or publish-subscribe or something else. You can check this answer for REST vs Kafka.
Maybe it makes sense to have multiple topics for the answers, not just one big topic:
This way the "results" topics act as "mailboxes".
Probably you'll need to set auto.create.topics.enable=true since creating topics for all P1,...PN could be complicated.
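A rough sketch of the "mailbox" pattern, using in-memory lists in place of Kafka topics. The topic naming scheme search_results.<server id>, the reply_to and correlation_id fields, and the helper functions are all hypothetical; the point is only that each request carries the name of the topic where its answer should be written.

```python
import uuid
from collections import defaultdict

# topic name -> list of messages; a stand-in for a broker, not a real client.
topics = defaultdict(list)


def send_query(server_id: str, query: str) -> str:
    # P1 side: tag the request with a correlation id and its own mailbox topic.
    correlation_id = str(uuid.uuid4())
    topics["search"].append({
        "correlation_id": correlation_id,
        "reply_to": f"search_results.{server_id}",
        "query": query,
    })
    return correlation_id


def process_queries() -> None:
    # C1 side: answer every pending request on the topic named in reply_to.
    while topics["search"]:
        request = topics["search"].pop(0)
        topics[request["reply_to"]].append({
            "correlation_id": request["correlation_id"],
            "results": [request["query"].upper()],  # placeholder "search"
        })


def read_mailbox(server_id: str) -> list:
    # P1 side: each server reads only its own results topic.
    mailbox = topics[f"search_results.{server_id}"]
    messages = list(mailbox)
    mailbox.clear()
    return messages


cid_a = send_query("p1", "kafka")
cid_b = send_query("p2", "nifi")
process_queries()
mail_p1 = read_mailbox("p1")
mail_p2 = read_mailbox("p2")
```

The correlation id lets a server match each answer to the client request that is still waiting for it.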

Streaming audio streams through MQ (scalability)

My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are docker containers run in k8s). The relation is many to many - any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160b per second (from different producers).
Current solution:
In our current solution, each producer has a cache of task: (IP:PORT) pairs for the consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer - well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ - suggest your best)?
Edit: Would 1000 streams, which equal 100 000 messages per second, be feasible? (throughput for 1000 streams is 16 Mb/s)
Edit 2: Fixed packet size to 160b (typo)
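As a quick sanity check, the figures from the question can be multiplied out. The sketch keeps the question's own unit "b" without interpreting it further; the variable names are made up for the example.

```python
streams = 1000
messages_per_stream_per_sec = 100
message_size_b = 160            # packet size, in the question's unit "b"
streams_per_consumer = 10       # "~10 streams" per consumer

messages_per_sec = streams * messages_per_stream_per_sec
total_b_per_sec = messages_per_sec * message_size_b
per_consumer_b_per_sec = (
    streams_per_consumer * messages_per_stream_per_sec * message_size_b
)

print(messages_per_sec)          # 100000 messages/s in total
print(total_b_per_sec / 1e6)     # 16.0 M"b"/s in total
print(per_consumer_b_per_sec)    # 160000 "b"/s arriving at each consumer
```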
Unless you need disk persistence, do not even look in the message broker direction. You are just adding one problem to another. Direct network code is a proper way to solve audio broadcast. Now, if your code is messy and you want a simplified programming model, a good alternative to raw sockets is the ZeroMQ library. This will give you all the message broker functionality you care about: a) discrete messaging instead of streams, b) client discoverability; without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second with 160kb per message is a lot of data; it comes to about 16 Gb/sec even without any messaging protocol on top of it. In general Kafka shines at throughput of small messages, as it batches messages on many layers. Given that, the sustained performance of Kafka is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large and you need to both write and read messages at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput and not the number of messages.
Because you are data limited, even other classic MQ software like ActiveMQ, IBM MQ etc. is actually able to cope very well with your situation. In general classic brokers are much more "chatty" than Kafka and are not able to match the message throughput of Kafka when handling small messages. But as long as you are using large non-persistent messages (and proper broker configuration) you can expect decent performance in MB/sec from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk. In contrast, Kafka will always persist to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" just brings us full circle to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker-in-the-middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need - ZeroMQ is made for this.
There is also a messaging protocol called MQTT which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From Kafka's perspective, each stream in your problem can map to one topic in Kafka, and therefore there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messy here too as the number of topics increases.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is identified by a key (like the IP:Port combination you use now) and then have multiple consumers, each subscribing to a specific set of partitions as determined by the key. Partitions are the unit of scalability in Kafka.
Con: Though you can increase the no. of partitions, you cannot decrease them.
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message is not lost once it has been received and processed; rather, it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group do a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys, in other words a set of streams or, in terms of your current solution, a set of one or more IP:Port pairs.
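The keyed approach can be illustrated with a small pure-Python model. Everything here is a simplification under stated assumptions: zlib.crc32 is only a stand-in for the hash Kafka's default partitioner applies to the key, the round-robin function is a toy version of a partition assignor, and the stream keys and consumer names are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Same key -> same partition, so one stream is never split across partitions.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_round_robin(num_partitions: int, consumers: list) -> dict:
    # Toy assignor: every partition is owned by exactly one group member.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


def owner_of(key: str, num_partitions: int, assignment: dict) -> str:
    # The consumer that ends up receiving all messages of this stream.
    p = partition_for(key, num_partitions)
    return next(c for c, parts in assignment.items() if p in parts)


NUM_PARTITIONS = 6
consumers = ["c1", "c2", "c3"]
stream_keys = ["10.0.0.1:5000", "10.0.0.2:5000", "10.0.0.3:5001", "10.0.0.4:5002"]

assignment = assign_round_robin(NUM_PARTITIONS, consumers)
owners = {k: owner_of(k, NUM_PARTITIONS, assignment) for k in stream_keys}
```

Note that several keys may hash to the same partition, so a consumer has to be prepared to handle more than one stream per partition.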

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system with a couple of downstream systems that need our data. Currently we publish events to Kafka topics whenever something changes in the source system, and the downstream systems consume them to update their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently also access our database directly once in a while to do a complete, on-demand refresh of their tables and make sure the data is in sync. As you know, a full refresh is sometimes needed when the data appears to be out of sync for some reason.
We are planning to stop giving direct access to our database. How can we achieve this? Is there a way for consumers to send us a request (some kind of trigger) describing the data they need, so that we can publish a stream of that data for them to consume, and they can then sync their tables or load the bulk data into memory for whatever tasks they need?
We have already written RESTful APIs that provide data on request, but they only expose small data volumes, since I think APIs are only suited to sending smaller volumes of data. That won't work when we want to send millions of records to consumers, and I believe the only way is to stream the data through Kafka. But with Kafka, how can we respond to a consumer's request and publish only that specific data to Kafka topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.
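What "start with the first offset and go from there" means for a downstream table can be sketched with a list standing in for the topic. The record layout and the materialize function are hypothetical, and treating a None value as a delete is an assumption of this sketch (it mirrors the tombstone convention, but the change events in question may encode deletes differently).

```python
# A stand-in for a topic that retains every change event forever.
log = [
    {"offset": 0, "key": "user-1", "value": {"name": "Ann"}},
    {"offset": 1, "key": "user-2", "value": {"name": "Bob"}},
    {"offset": 2, "key": "user-1", "value": {"name": "Anna"}},
    {"offset": 3, "key": "user-2", "value": None},  # assumed delete marker
]


def materialize(records, from_offset=0, table=None):
    """Apply change events in offset order and return the resulting table."""
    table = dict(table or {})
    for record in records:
        if record["offset"] < from_offset:
            continue
        if record["value"] is None:
            table.pop(record["key"], None)
        else:
            table[record["key"]] = record["value"]
    return table


# Full refresh: replay everything from the first offset.
full_refresh = materialize(log)

# Normal operation: keep a table and apply only the new deltas.
snapshot = materialize(log[:2])
caught_up = materialize(log, from_offset=2, table=snapshot)
```

Both paths end in the same table, which is why a consumer that suspects it is out of sync can simply rewind and replay instead of asking for a specially produced dataset.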

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others with experience using these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is built on top of KafkaProducers/Consumers, so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumer/Producers you need to think about how you build your clients at the level of subscribe, poll, send, flush etc.
If you want to have even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
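The contrast between the two styles can be sketched in plain Python. Neither half uses a Kafka library: the list slicing stands in for poll(), appending to a list stands in for send(), and the tiny Stream class is a made-up imitation of a DSL, not the Kafka Streams API.

```python
orders = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Berlin"},
]


# Style 1: plain consumer/producer. The application owns the loop.
def run_poll_loop(source, sink, city, batch_size=2):
    position = 0
    while position < len(source):
        batch = source[position:position + batch_size]  # stand-in for poll()
        for order in batch:
            if order["city"] == city:
                sink.append(order["id"])                # stand-in for send()
        position += len(batch)                          # stand-in for a commit
    return sink


# Style 2: declarative pipeline. Only the business logic is spelled out.
class Stream:
    def __init__(self, records):
        self._records = records

    def filter(self, predicate):
        return Stream(r for r in self._records if predicate(r))

    def map(self, fn):
        return Stream(fn(r) for r in self._records)

    def to(self, sink):
        sink.extend(self._records)
        return sink


loop_out = run_poll_loop(orders, [], "Berlin")
dsl_out = (
    Stream(orders)
    .filter(lambda o: o["city"] == "Berlin")
    .map(lambda o: o["id"])
    .to([])
)
```

Both produce the same output; the difference is how much of the plumbing the application code has to carry itself.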
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to(), or KTable.toStream().to() in current versions), and finally, using Kafka Connect, store the messages in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would take much more code.
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka Streams can provide exactly-once semantics for the steps that read from and write to Kafka topics; carrying that guarantee through to the DB and Elasticsearch additionally depends on the sink connectors, for example by writing idempotently. Within that scope, there won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders at the level of individual product. In such scenarios duplicates will always give you a wrong result.
You can also enrich your incoming data with much lower latency. Let's assume that in the above example you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API over the network for each incoming order, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it in the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
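The state that such a windowed join has to keep can be made concrete with a simplified pure-Python model. This is not how Kafka Streams implements joins (it ignores grace periods, out-of-order timestamps, and durable state stores); the function name, the event tuple layout, and the in-memory pending dictionaries are assumptions of the sketch, and events are expected in timestamp order.

```python
WINDOW_SECONDS = 3600  # one hour


def windowed_join(events, window=WINDOW_SECONDS):
    """events: (timestamp_seconds, kind, order_id) tuples, kind is 'order' or 'payment'."""
    # Unmatched events per side; this is the state a real engine would
    # keep in a state store such as RocksDB.
    pending = {"order": {}, "payment": {}}
    joined = []
    for ts, kind, order_id in events:
        other = "payment" if kind == "order" else "order"
        other_ts = pending[other].get(order_id)
        if other_ts is not None and abs(ts - other_ts) <= window:
            # Both sides arrived within the window: let the order proceed.
            joined.append(order_id)
            del pending[other][order_id]
        else:
            pending[kind][order_id] = ts
        # Drop state that can no longer find a partner inside the window.
        for side in pending.values():
            for key in [k for k, t in side.items() if ts - t > window]:
                del side[key]
    return joined


events = [
    (0, "order", "o1"),
    (600, "payment", "o1"),     # payment 10 minutes after the order
    (1000, "payment", "o2"),    # payment arrives before its order
    (1500, "order", "o2"),
    (2000, "order", "o3"),
    (6000, "payment", "o3"),    # more than one hour apart: no join
]
matched = windowed_join(events)
```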