I am working on an application that processes very few records per minute. The request rate is around 2 calls per minute. These requests are creates and updates made to a set of data. The requirements are guaranteed, reliable, in-order delivery and prevention of any message loss.
Our team has decided to use Kafka, and I think it does not fit the use case, since Kafka is best suited for streaming data. We could have been equally well off with a traditional messaging model. Though Kafka does provide ordering per partition, the same can be achieved on a traditional messaging system if the number of messages is low and the number of data sources is also low. Would that be a fair statement?
We are using Kafka Streams for processing the data, and the processing requires lookups to external systems. If the external systems are not available, we stop processing and automatically deliver messages to the target systems once the external lookup systems are available again.
At the moment, we stop processing by continuously looping in the middle of processing and checking whether the systems are available.
a) Is that the best way to stop a stream midway through processing so that it doesn't pick up any more messages?
b) Are data stream frameworks even designed to be stopped or paused midway, so that they stop consuming the stream completely for some time?
Regarding your point 2:
a) Is that the best way to stop a stream midway through processing so that it doesn't pick up any more messages?
If, as in your case, you have a very low incoming data rate (a few records per minute), then it might be ok to pause processing an input stream when the required dependency systems are currently unavailable.
In Kafka Streams the preferable API for implementing such behavior -- which, as you are alluding to yourself, is not really a recommended pattern -- is the Processor API.
Even so, there are a couple of important questions you need to answer for yourself, such as:
What is the desired/required behavior of your stream processing application if the external systems are down for extended periods of time?
Could the incoming data rate increase at some point, which could mean that you would need to abandon the pausing approach above?
But again, if pausing is what you want or need to do, then you can give it a try.
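For illustration, here is a minimal sketch of that pausing behavior, assuming the newer Processor API (Kafka 2.7+). isExternalSystemUp() and enrich() are placeholders for your health check and lookup, not Kafka APIs, and the 5-second back-off is an arbitrary choice. Blocking inside process() stalls the whole stream thread, which is only tolerable at a very low incoming rate like yours:

    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;

    public class BlockingLookupProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(Record<String, String> record) {
            // Blocking here stalls the stream thread, so no further records
            // are consumed until the dependency is reachable again.
            while (!isExternalSystemUp()) {
                try {
                    Thread.sleep(5_000L); // back-off interval is an arbitrary choice
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            context.forward(record.withValue(enrich(record.value())));
        }

        // Placeholders: substitute your real health check and external lookup.
        private boolean isExternalSystemUp() { return true; }
        private String enrich(String value) { return value; }
    }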
b) Are data stream frameworks even designed to be stopped or paused midway, so that they stop consuming the stream completely for some time?
Some stream processing tools allow you to do that. Whether it's the best pattern to use them is a different question.
For instance, you could also consider the following alternative: You could automatically ingest the external systems' data into Kafka, too, for example via Kafka's built-in Kafka Connect framework. Then, in Kafka Streams, you could read this exported data into a KTable (think of this KTable as a continuously updated cache of the latest data from your external system), and then perform a stream-table join between your original, low-rate input stream and this KTable. Such stream-table joins are a common (and recommended) pattern to enrich an incoming data stream with side data (disclaimer: I wrote this article); for example, to enrich a stream of user click events with the latest user profile information. One of the advantages of this approach -- compared to your current setup of querying external systems combined with a pausing behavior -- is that your stream processing application would be decoupled from the availability (and scalability) of your external systems.
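To make the suggested stream-table join concrete, here is a hedged sketch; the topic names are assumptions, and the "external-export" topic is presumed to be populated by Kafka Connect:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class EnrichmentTopology {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Low-rate input stream (topic name assumed).
            KStream<String, String> requests = builder.stream("requests");

            // Continuously updated cache of the external system's data, ingested
            // into the "external-export" topic via Kafka Connect (assumed name).
            KTable<String, String> externalData = builder.table("external-export");

            // Enrich each record with the latest side data: a local lookup,
            // no remote call and no pausing needed.
            requests.join(externalData, (request, sideData) -> request + "|" + sideData)
                    .to("enriched-requests");

            // builder.build() would then be passed to new KafkaStreams(...) as usual.
        }
    }

The key property here is that the join is a local lookup against continuously replicated data, so the availability of the external system no longer gates your processing.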
That is only a fair statement for traditional message brokers when there is a single consumer (i.e. an exclusive queue). As soon as the queue is shared by more than one consumer, there is the possibility of out-of-order delivery of messages. This is because any one consumer might fail to process and ACK a message, resulting in the message being put back at the head of the shared queue and subsequently delivered (out of order) to another consumer. Kafka guarantees in-order parallel consumption across multiple consumers using topic partitions (which are not present in traditional message brokers).
My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are Docker containers run in k8s). The relation is many-to-many: any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160 bytes per second (from different producers).
Current solution:
In our current solution, each producer has a cache of task: (IP:PORT) pairs for the consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ, ...)? E.g., having a channel for each task where producers send data while the consumer, well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ; suggest your best estimate)?
Edit: Would 1000 streams, which equals 100 000 messages per second, be feasible? (Throughput for 1000 streams is 16 MB/s.)
Edit 2: Fixed packet size to 160b (typo)
Unless you need disk persistence, do not even look in the message broker direction. You are just adding one problem on top of another. Direct network code is a proper way to solve audio broadcast. If your code is messy and you want a simplified programming model, a good alternative to raw sockets is the ZeroMQ library. It will give you all the message-broker functionality you care about: a) discrete messaging instead of streams, b) client discoverability; without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second at 160 kb each is a lot of data; it comes to 16 Gb/sec even without any messaging protocol on top of it. In general Kafka shines at message throughput of small messages, as it batches messages at many layers. Given that, the sustained performance of Kafka is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large, and you need to both write and read messages at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput and not the number of messages.
Because you are data-limited, even other classic MQ software like ActiveMQ, IBM MQ etc. is actually able to cope quite well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot hit Kafka's message throughput when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration), you can expect decent performance in MB/sec from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk; in contrast, Kafka always persists to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" is just a full-circle turn back to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need, ZeroMQ is made for this.
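If ZeroMQ is the direction you take, here is a minimal JVM sketch using the JeroMQ implementation. The port, host, and "task-42" topic prefix are all assumptions for illustration: producers PUB-lish with the task id as a prefix, and each consumer SUB-scribes only to the task ids it owns, which gives you the discrete messaging and discoverability mentioned above:

    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    public class ZmqTaskPipe {
        public static void main(String[] args) throws InterruptedException {
            // In reality, producer and consumer run in separate containers;
            // they share a process here only to keep the sketch self-contained.
            try (ZContext context = new ZContext()) {
                ZMQ.Socket pub = context.createSocket(SocketType.PUB);
                pub.bind("tcp://*:5556");                       // port is an assumption

                ZMQ.Socket sub = context.createSocket(SocketType.SUB);
                sub.connect("tcp://localhost:5556");
                sub.subscribe("task-42".getBytes(ZMQ.CHARSET)); // only this consumer's task

                Thread.sleep(100); // let the subscription propagate (slow-joiner caveat)

                // The task id doubles as a routing prefix, so producers no longer
                // need a cache of consumer IP:PORT pairs.
                pub.send("task-42 <160-byte payload>");
                System.out.println(sub.recvStr());
            }
        }
    }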
There is also a messaging protocol called MQTT which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From the Kafka perspective, each stream in your problem can map to one topic in Kafka, and therefore there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messier here too as you increase the number of topics.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (like the IP:Port combination you use) and then have multiple consumers, each subscribing to a specific set of partition(s) as determined by the key, as sketched below. Partitions are the unit of scalability in Kafka.
Con: Though you can increase the number of partitions, you cannot decrease them.
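A hedged sketch of that keyed approach; the topic name "streams", the "task-42" key, and the partition number are assumptions. The producer keys every message by stream id so one stream always hashes to one partition, and a consumer can either join a group or pin itself to specific partitions:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.TopicPartition;

    public class KeyedStreams {
        public static void main(String[] args) {
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092"); // assumption
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            // Producer: the key determines the partition, so all messages of
            // one stream land, in order, in the same partition.
            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("streams", "task-42", new byte[160]));
            }

            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            // Consumer: pin this instance to the partition(s) holding its
            // streams, instead of group-managed subscription.
            try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.assign(List.of(new TopicPartition("streams", 3)));
                consumer.poll(Duration.ofMillis(500));
            }
        }
    }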
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message doesn't get lost once it is received and processed; rather, it continues to stay in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, they can all be in one consumer group to share the workload.
Consumers belonging to the same group perform a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys; in other words, a set of streams or, as per your current solution, a set of one or more IP:Port pairs.
Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducers/Consumers, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start a long discussion on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc., as in the sketch below.
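For contrast, a minimal sketch of the plain-consumer side; the topic name, group id, and broker address are assumptions. Everything here is hand-rolled that the Streams DSL would otherwise handle for you:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PlainConsumerLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");              // assumption
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));                            // assumption
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // process the record, produce results yourself, handle retries...
                    }
                    consumer.commitSync(); // offset management is also on you
                }
            }
        }
    }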
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on the delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on the city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages would be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would require much more code.
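The filtering step of that pipeline might look like this sketch; the topic names, the city, and the Order type are assumptions invented for illustration:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    public class CityFilterTopology {
        // Assumed domain type; in practice it would come with its own serde.
        public record Order(String customerId, String deliveryCity) {}

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, Order> orders = builder.stream("orders"); // source topic (assumed)
            orders.filter((orderId, order) -> "Berlin".equals(order.deliveryCity()))
                  .to("berlin-orders"); // Kafka Connect sinks to the DB and Elasticsearch from here
        }
    }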
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and a transaction from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as the count of orders at the level of an individual product; in such scenarios, duplicates will always give you a wrong result.
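In Kafka Streams this is a configuration switch. A sketch, where the application id and broker address are assumptions; EXACTLY_ONCE_V2 is the constant available since Kafka 2.8, older versions use EXACTLY_ONCE:

    import java.util.Properties;

    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");      // assumption
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            // Enables transactional consume-process-produce inside the topology:
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        }
    }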
You can also enrich your incoming data with much lower latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Then all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB which ships with Kafka Streams, and also, as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
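A sketch of that enrichment with a GlobalKTable; the topic names and the Order/Customer shapes are assumptions:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderEnrichment {
        public record Order(String customerId, String city) {} // assumed types
        public record Customer(String email) {}

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, Order> orders = builder.stream("orders");
            GlobalKTable<String, Customer> customers =
                    builder.globalTable("customers"); // compacted topic (assumed)

            orders.join(customers,
                        (orderId, order) -> order.customerId(),               // derive the lookup key
                        (order, customer) -> order + " -> " + customer.email()) // local RocksDB lookup
                  .to("orders-with-email");
        }
    }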
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In that case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
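A sketch of that windowed stream-stream join; topic names and value types are again assumptions, and JoinWindows.ofTimeDifferenceWithNoGrace is the Kafka 3.0+ spelling (older versions use JoinWindows.of):

    import java.time.Duration;

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderPaymentJoin {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("orders");     // keyed by order id
            KStream<String, String> payments = builder.stream("payments"); // keyed by order id

            // Pair an order with its payment if both arrive within one hour
            // of each other; the window state lives in the local RocksDB store.
            orders.join(payments,
                        (order, payment) -> order + "+" + payment,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)))
                  .to("paid-orders");
        }
    }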
I'm looking to try out using Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are asynchronous at high rates and some are based upon request from user/events.
Now looking at Kafka, it breaks things out into topics and partitions etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, allowing hundreds of topics, or do I cluster them together into related message types? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I'm also in a dilemma where there will be upwards of 10 copies of a single application (a display), all of which get a very large amount of data (in the form of a lightweight video stream of sorts) and would be sending out user commands from each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10, but sometimes these particular applications may not want the video stream at all.
A third and final question: I read a bit about replay-ability of messages. Is this only within a single topic, or can the replay-ability go over a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your datasets. Note, however, that the default max message size is just 1MB, so "streaming video/images/media" is arguably the wrong use case for Kafka alone. A protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally rather than in response to load: consumers poll data at the rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, they can independently seek() back to an earlier offset, as sketched below.
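On the replay question: rewinding happens per topic-partition, and a single consumer can be assigned and rewound across several topics at once, so replayability is not confined to one topic. A minimal sketch, where the topic names and the offset are assumptions:

    import java.time.Duration;
    import java.util.List;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class Replay {
        // Assumes an already-configured consumer; topic names are made up.
        static void replay(KafkaConsumer<String, String> consumer) {
            TopicPartition frames = new TopicPartition("video-frames", 0);
            TopicPartition commands = new TopicPartition("user-commands", 0);

            consumer.assign(List.of(frames, commands));
            consumer.seekToBeginning(List.of(frames)); // replay one topic from the start
            consumer.seek(commands, 1_000L);           // replay another from a known offset
            consumer.poll(Duration.ofMillis(500));
        }
    }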
I am currently working with Akka Streams Kafka to interact with Kafka, and I was wondering what the differences with Kafka Streams are.
I know that the Akka-based approach implements the Reactive Streams specification and handles back-pressure, functionality that Kafka Streams seems to be lacking.
What would be the advantage of using Kafka Streams over Akka Streams Kafka?
Your question is very general, so I'll give a general answer from my point of view.
First, I've got two usage scenario:
cases where I'm reading data from kafka, processing it and writing some output back to kafka, for these I'm using kafka streams exclusively.
cases where either the data source or sink is not kafka, for those I'm using akka streams.
This already allows me to answer the part about back-pressure: for the 1st scenario above, there is a back-pressure mechanism in kafka streams.
Let's now focus only on the first scenario described above, and see what I would lose if I decided to stop using Kafka Streams:
some of my stream processor stages need a persistent (distributed) state store; Kafka Streams provides it for me, and it is something that Akka Streams doesn't provide.
scaling: Kafka Streams automatically balances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as across other nodes: scaling up and out. This is not provided by Akka Streams.
Those are the biggest differences that matter to me, I'm hoping that it makes sense to you!
The big advantage of Akka Streams over Kafka Streams would be the possibility to implement very complex processing graphs that can be cyclic, with fan-in/out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong. It would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
Found this article, which gives a good summary of the distributed design concerns that Kafka Streams addresses (complementing Akka Streams):
https://www.beyondthelines.net/computing/kafka-streams/
Message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset. The offset indicates the position of a message in the log. Kafka Streams uses these message offsets to maintain ordering.
Partitioning: Kafka splits a topic into partitions, and each partition is replicated among different brokers. Partitioning allows the load to be spread, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processing in a similar way. Kafka Streams uses the processor topology, which relies on Kafka group management. This is the same group management used by the Kafka consumer to distribute load evenly among consumers (this work is mainly managed by the brokers).
Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built in, as it redistributes the workload among the remaining live instances.
State management: Kafka Streams provides local storage backed up by a Kafka change-log topic which uses log compaction (keeping only the latest value for a given key).
Reprocessing: when starting a new version of the app, we can reprocess the logs from the start to compute the new state, then redirect the traffic to the new instance and shut down the old application.
Time management: "Stream data is never complete and can always arrive out-of-order", therefore one must distinguish event time vs. processing time and handle it correctly.
The author also says: "Using this change-log topic Kafka Stream is able to maintain a "table view" of the application state."
My take is that this applies mostly to an enterprise application where the "application state" is ... small.
For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models, and the business logic that orchestrates all of this will likely not be managed well with Kafka Streams.
Also, I think that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through principled management of state mutation and IO "effects" (functional programming).
In other words, the business logic does not become tangled with the Kafka APIs.
Akka Streams emerged as a dataflow-centric abstraction for the Akka Actors model.
It is a high-performance library built for the JVM, specially designed for general-purpose microservices.
Kafka Streams, by contrast, is a client library used to process unbounded data: it reads data from Kafka topics, processes it, and writes the results to new topics.
Well, I have used both of them and I have a pretty good idea about their strengths and weaknesses.
If you are solely concentrated on Kafka and don't have too much experience with stream processing, Kafka Streams is a good out-of-the-box solution for understanding the streaming concepts. Its Achilles heel, in my opinion, is its datastore, RocksDB, which backs stateful scenarios with KTables or internal state stores.
If you use the Kafka Streams library, RocksDB installs itself in the background transparently, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database, like Cassandra; it has most of Cassandra's strengths but also its weaknesses, a major one being that you can only query things by primary key, which for most real-life scenarios is a huge limitation. There are some ways to implement your own datastore, but they are not that well documented and can be a great challenge. Also, RocksDB is really great at loading single values, but if you have to iterate over things, the performance degrades significantly beyond a dataset size of around 100 000.
Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.
And as mentioned above, it has no concept of back-pressure: the Kafka consumer hands records over one by one, and in a scenario where you have to scale out, that can be a real bottleneck. Be really careful about the claim that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog post shows, it can really cause unpleasant effects.
"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements.
The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."
Well, so what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? First of all, Akka is not that much of an out-of-the-box framework: you have to understand the concepts much better. It is not coupled to a single persistence option; you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projections), so you are not bound to query your data only by primary key. The Akka developers thought a lot about scaling out and back-pressure, and committed a lot of code to the Kafka code base to improve performance.
So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and switch to Akka Streams.
If you want to see working details/examples, I have two blogs about it; you can check those out: blog1 blog2
Most articles depict Kafka as better in read/write throughput than other message brokers (MB) like ActiveMQ. Per my understanding, reading/writing with the help of offsets makes it faster, but I am not clear how an offset makes it faster.
After reading about Kafka's architecture, I have got some understanding, but it is not clear what makes Kafka scalable and high in throughput, based on the points below:
Probably with the offset, the client knows which exact message it needs to read, which may be one of the factors making it high in performance.
In the case of other MBs, the broker needs to coordinate among consumers so that a message is delivered to only one consumer. But this is the case for queues only, not for topics. So what makes a Kafka topic faster than another MB's topic?
Kafka provides partitioning for scalability, but other message brokers (MB) like ActiveMQ also provide clustering. So how is Kafka better for big data/high loads?
In other MBs we can have listeners, so as soon as a message arrives, the broker delivers it; in the case of Kafka we need to poll, which means more load on both the broker and client side?
Lots of details on what makes Kafka different from, and faster than, other messaging systems are in Jay Kreps' blog post here:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
There are actually a lot of differences that make Kafka perform well, including but not limited to:
Maximized use of sequential disk reads and writes
Zero-copy processing of messages
Use of Linux OS page cache rather than Java heap for caching
Partitioning of topics across multiple brokers in a cluster
Smart client libraries that offload certain functions from the brokers
Batching of multiple published messages to yield less frequent network round trips to the broker (see the config sketch after this list)
Support for multiple in-flight messages
Prefetching data into client buffers for faster subsequent requests.
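As a hedged illustration of the batching and in-flight points, these are the standard producer knobs involved; the values are arbitrary examples, not recommendations, and the broker address is an assumption:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.ProducerConfig;

    public class BatchingProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            // Batching: wait up to 20 ms to fill a batch of up to 64 KB per partition.
            props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
            // Compress whole batches to cut network round trips further.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
            // Allow several unacknowledged requests in flight per connection.
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        }
    }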
It's largely marketing that Kafka is fast for a message broker. For example, IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013, on one machine, a year before Kreps even started the GitHub repo:
https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/
Kafka is good for a lot of things. True low-latency messaging is not one of them. You flatly can't use batch delivery (e.g. a range of offsets) in any purely latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency. That doesn't mean waiting around for a couple of seconds to batch-read a block of events, or enduring the overhead of requesting every message. Try using Kafka with an offset range of 1 (so: 1 message) if you want to compare it to a normal push-based broker and you'll see what I mean.
Instead, I recommend focusing on the thing pull-based stream buffering does give you:
Replayability!!!
Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have one). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize and this is exceptionally useful!!!
There's no free lunch in messaging, pull and push each have their advantages and disadvantages vs. each other. It might not surprise you that people have also tried push-pull messaging and it's no free lunch either :).