Is Kafka useful if we have less messages to process - apache-kafka

Is Kafka useful if we have less messages to process. If I have 1000 messages per second to process, is Kafka feasible?

As any experienced software engineer will say, it depends ;-). There are many factors to consider. Here is just a sample:
Do you need to have these messages persisted? If not, then probably Kafka is not what you're looking for.
Even if you require persistence, it doesn't mean that Kafka can handle your throughput requirements (although my gut feeling says it can cope with your volume). The only way to determine that is to run performance tests with your message volumes against Kafka and see how it copes. It's also quite possible that other brokers like ActiveMQ can handle your volumes as well. Then it falls down to how appropriate is the broker for your use case (e.g., event sourcing?) Checkout out Kafka's docs to see how Kafka is used in the industry.
You have to keep in mind that Kafka is currently not as popular as other brokers such as ActiveMQ. So even if Kafka is useful to your scenario, you could have a hard time finding help on Kafka questions/issues you'll have along the way.

Related

Consuming messages in a Kafka topic ASAP

Imagine a scenario in which a producer is producing 100 messages per second, and we're working on a system that consuming messages ASAP matters a lot, even 5 seconds delay might result in a decision not to take care of that message anymore. also, the order of messages does not matter.
So I don't want to use a basic queue and a single pod listening on a single partition to consume messages, since in order to consume a message, the consumer needs to make multiple remote API calls and this might take time.
In such a scenario, I'm thinking of a single Kafka topic, with 100 partitions. and for each partition, I'm gonna have a separate machine (pod) listening for partitions 0 to 99.
Am I thinking right? this is my first project with Kafka. this seems a little weird to me.
For your use case, think of partitions = max number of instances of the service consuming data. Don't create extra partitions if you'll have 8 instances. This will have a negative impact if consumers need to be rebalanced and probably won't give you any performace improvement. Also 100 messages/s is very, very little, you can make this work with almost any technology.
To get the maximum performance I would suggest:
Use a round robin partitioner
Find a Parallel consumer implementation for your platform (for jvm)
And there a few producer and consumer properties that you'll need to change, but they depend your environment. For example batch.size, linger.ms, etc. I would also check about the need to set acks=all as it might be ok for you to lose data if a broker dies given that old data is of no use.
One warning: In Java, the standard kafka consumer is single threaded. This surprises many people and I'm not sure if the same is true for other platforms. So having 100s of partitions won't give any performance benefit with these consumers, and that's why it's important to use a Parallel Consumer.
One more warning: Kafka is a complex broker. It's trivial to start using it, but it's a very bumpy journey to use it correctly.
And a note: One of the benefits of Kafka is that it keeps the messages rather than delete them once they are consumed. If messages older than 5 seconds are useless for you, Kafka might be the wrong technology and using a more traditional broker might be easier (activeMQ, rabbitMQ or go to blazing fast ones like zeromq)
Your bottleneck is your application processing the event, not Kafka.
when you have ten consumers, there is overhead for connecting each consumer to Kafka so it will lower the performance.
I advise focusing on your application performance rather than message broker.
Kafka p99 Latency is 5 ms with 200 MB/s load.
https://developer.confluent.io/learn/kafka-performance/

Streaming audio streams trough MQ (scalability)

my question is rather specific, so I will be ok with a general answer, which will point me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are docker containers run in k8s). The relation is many to many - any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 of 160b messages per second (from different producers).
Current solution:
In our current solution, each producer has a cache of a task: (IP: PORT) pair values for consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, rabbitMQ...)? E.g., having a channel for each task where producers send data while consumer - well consumes them? How many streams would be feasible to handle for the MQ (i know it would differ - suggest your best).
Edit: Would 1000 streams which equal 100 000 messages per second be feasible? (troughput for 1000 streams is 16 Mb/s)
Edit 2: Fixed packed size to 160b (typo)
Unless you need disk persistence, do not even look in message broker direction. You are just adding one problem to an other. Direct network code is a proper way to solve audio broadcast. Now if your code is messy and if you want a simplified programming model good alternative to sockets is a ZeroMQ library. This will give you all MessageBroker functionality for which you care: a) discrete messaging instead of streams, b) client discoverability; without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second with 160kb message is a lot of data and it comes to 1.6 Gb/sec even without any messaging protocol on top of it. In general Kafka shines at message throughput of small messages as it batches messages on many layers. Knowing this sustained performances of Kafka are usually constrained by disk speed, as Kafka is intentionally written this way (slowest component is disk). However your messages are very large and you need to both write and read messages at same time so I don't see it happen without large cluster installation as your problem is actual data throughput, and not number of messages.
Because you are data limited, even other classic MQ software like ActiveMQ, IBM MQ etc is actually able to cope very well with your situation. In general classic brokers are much more "chatty" than Kafka and are not able to hit message troughpout of Kafka when handling small messages. But as long as you are using large non-persistent messages (and proper broker configuration) you can expect decent performances in mb/sec from those too. Classic brokers will, with proper configuration, directly connect a socket of producer to a socket of a consumer without hitting a disk. In contrast Kafka will always persist to disk first. So they even have some latency pluses over Kafka.
However this direct socket-to-socket "optimisation" is just a full circle turn to the start of an this answer. Unless you need audio stream persistence, all you are doing with a broker-in-the-middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need - ZeroMQ is made for this.
There is also messaging protocol called MQTT which may be something of interest to you if you choose to pursue a broker solution. As it is meant to be extremely scalable solution with low overhead.
A basic approach
As from Kafka perspective, each stream in your problem can map to one topic in Kafka and
therefore there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with lot of topics and IMO the solution can get messier here too as you are increasing the no. of topics.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (like you use IP:Port combination) and then have multiple consumers each subscribing to a specific set of partition(s) as determined by the key. Partitions are the point of scalability in Kafka.
Con: Though you can increase the no. of partitions, you cannot decrease them.
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), once the message is received, it doesn't get lost once it is received and processed, rather it continues to stay in the topic till the configured retention period.
Take proper care in determining the keying strategy i.e. which messages land in which partitions. As said, earlier, if all of your consumers do the same thing, all of them can be in a consumer group to share the workload.
Consumers belonging to the same group do a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys in other words, set of streams or as per your current solution, a set of one or more IP:Port pairs.

Use transactional API and exactly-once with regular Producers and Consumers

Confluent documents that I was able to find all focus on Kafka Streams application when it comes to exactly-once/transactions/idempotence.
However, the APIs for transactions were introduced on a "regular" Producer/Consumer level and all the explanations and diagrams focus on them.
I was wondering whether it's Ok to use those API directly without Kafka Streams.
I do understand the consequences of Kafka processing boundaries and the guarantees, and I'm Ok with violating it. I don't have a need for 100% exactly-once guarantee, it's Ok to have a duplicate once in a while, for example, when I read from/write to external systems.
The problem I'm facing is that I need to create an ETL pipeline for Big Data project where we are getting a lot of duplicates when the apps are restated/relocated to different hosts automatically by Kubernetes.
In general, it's not a problem to have some duplicates, it's a pipeline for analytics where duplicates are acceptable, but if the issue can be mitigated at least on the Kafka side - that would be great. Will using transactional API guarantee exactly-once for Kafka at least(to make sure that re-processing doesn't happen when reassignments/shut-downs/scaling activities are happening)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics is achievable with regular producers and consumers also. Kafka Streams are built on top of these clients themselves.
We can use an idempotent producer to do achieve this.
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence applies to internal retries by Kafka clients but doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source we need to ensure that the source doesn't produce a duplicate message. For example, if it is a database, use a WAL and last maintain last read offset for that WAL and restart from that point. Debezium, for example does that. You may check to see if it supports your datasource.

What makes Kafka high in throughput?

Most articles depicts Kafka better in read/write throughput than other message broker(MB) like ActiveMQ. Per mine understanding reading/writing
with the help of offset makes it faster. But I am not clear how offset makes it faster ?
After reading Kafka architecture, I have got some understanding but not clear what makes Kafka scalable and high in throughput based on below points :-
Probably with the offset, client knows which exact message it needs to read which may be one of the factor to make it high in performance.
And in case of other MB's , broker need to coordinate among consumers so
that message is delivered to only consumer. But this is the case for queues only not for topics. Then What makes Kafka topic faster than other MB's topic.
Kafka provides partitioning for scalability but other message broker(MB) like ActiveMQ also provides the clustering. so how Kafka is better for big data/high loads ?
In other MB's we can have listeners . So as soon as message comes, broker will deliver the message but in case of Kafka we need to poll which means more
load on both broker/client side ?
Lots of details on what makes Kafka different and faster than other messaging systems are in Jay Kreps blog post here
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
There are actually a lot of differences that make Kafka perform well including but not limited to:
Maximized use of sequential disk reads and writes
Zero-copy processing of messages
Use of Linux OS page cache rather than Java heap for caching
Partitioning of topics across multiple brokers in a cluster
Smart client libraries that offload certain functions from the
brokers
Batching of multiple published messages to yield less frequent network round trips to the broker
Support for multiple in-flight messages
Prefetching data into client buffers for faster subsequent requests.
It's largely marketing that Kafka is fast for a message broker. For example IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013. On one machine. A year before Kreps even started the Github.:
https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/
Kafka is good for a lot of things. True low latency messaging is not one of them. You flatly can't use batch delivery (e.g. a range of offsets) in any pure latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency. That doesn't mean waiting around for a couple seconds to batch read a block of events or enduring the overhead of requesting every message. Try using Kafka with an offset range of 1 (so: 1 message) if you want to compare it to a normal push-based broker and you'll see what I mean.
Instead, I recommend focusing on the thing pull-based stream buffering does give you:
Replayability!!!
Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have one). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize and this is exceptionally useful!!!
There's no free lunch in messaging, pull and push each have their advantages and disadvantages vs. each other. It might not surprise you that people have also tried push-pull messaging and it's no free lunch either :).

Reliable usage of DirectKafkaAPI

I am pllaned to develop a reliable streamig application based on directkafkaAPI..I will have one producer and another consumer..I wnated to know what is the best approach to achieve the reliability in my consumer?..I can employ two solutions..
Increasing the retention time of messages in Kafka
Using writeahead logs
I am abit confused regarding the usage of writeahead logs in directkafka API as there is no receiver..but in the documentation it indicates..
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
so I wanted to know what is the best approach..if it suffices to increase the TTL of messages in kafka or I have to also enable write ahead logs..
I guess it would be good practice if I avoid one of the above since the backup data (retentioned messages, checkpoint files) can be lost and then recovery could face failure..
Direct Approach eliminates the duplication of data problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, Direct approach by default supports exactly-once message delivery semantics, it does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints.