Kafka streams app stuck in IO Wait - apache-kafka

I have a stream topology that reads from 2 topics, repartitions them, and then joins them using a join window of 2 days.
The Kafka Streams state store seems to be causing reads and writes of very small files, which leaves my stream threads stuck in IO wait, and performance is lagging badly as a result.
Any performance suggestions?

Have you tried using an SSD?
Longer comment: because Kafka is designed to be robust, it is very IO intensive, as every message is stored on disk.
By the way, what does "lagging" mean in numbers for you?
What throughput are you talking about?

Related

Guarantees on integrity in Kafka vs ActiveMQ?

The information I found comparing Apache Kafka and ActiveMQ (and similar message queuing products) is never clear about the integrity properties of each solution (especially, consistency).
With Kafka you can get the guarantee that no message is lost even in the presence of failures. Do you lose that guarantee using the "LazyPersistence" option?
By "no loss" I mean that the messages would be available to clients, even upon failure after restart - ideally, all messages arriving at the client, in the correct order.
Does ActiveMQ (either "classic" or Artemis) guarantee no loss of messages upon failure? Any configuration options that do give that guarantee? If the answer would differ for "classic" vs Artemis, that would be nice to know.
With Kafka, you can get the guarantee that no message is lost, even in the presence of failures; I guess you lose that guarantee using the "LazyPersistence" option, is that correct?
This is a large topic.
guarantee that no message is lost
This depends on a few things. First, you can configure retention, so that messages are deleted after a period you have decided is acceptable. You can consider infinite retention, but make sure you have enough storage for it; topic compaction may be an alternative.
even in the presence of failures; I guess you lose that guarantee using the "LazyPersistence" option, is that correct?
Kafka is a distributed system, and it is common for distributed systems to rely more on distributed replication than on synchronous disk writes. Even if you write synchronously to disk, the disk may die and be lost. How much distributed replication you use (e.g. 3 or 6 replicas?) and whether disk writes are synchronous or asynchronous depend on your requirements, but there is also a trade-off in throughput. For example, AWS Aurora is a distributed database that uses 6 replicas.
There is no reasonable or practical way to have "no loss of messages" with any solution.
Kafka's approach is to replicate the data once it gets to the server. As @Jonas mentioned, there is a total throughput trade-off. Kafka's producers are typically asynchronous out of the box, so it is reasonable to expect that a process failure (e.g. a container restart) or a network outage would result in observable message loss on the producing application's side. Also, LazyPersistence can lead to observable message loss on a process or server-side Kafka failure.
ActiveMQ's approach is to sync data to disk using the OS system call fsync(), which is supposed to result in a write to disk. When you combine that with RAID storage, you have the most practical guarantee of data not being lost.
However, there is an alternative pattern that has nothing to do with persistence and can achieve a higher degree of guarantee. It is used by some financial trading systems and defense applications.
It is often referred to as 'fanout'. ActiveMQ has a fanout transport included in its client. It works like this:
Producer sends message to 3 servers (they should be as isolated and separated from each other as possible).
Consumer(s) receive up to 3 messages.
First message through "wins" and the consumer app drops the other 2 messages.
With this approach, you can skip persistence altogether, since you have 3 independent routes and the odds of all 3 failing are low. (There are strategies to improve producer-side QoS in the event that the producer's network is offline.)
Consumer has the option of processing first-message (fast) or requiring at least 2 messages to process and validate that the request is legit (secure, but higher latency).
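The consumer side of the fanout pattern described above can be sketched roughly as follows. This is only an illustration in plain Python, not ActiveMQ's fanout transport: the class name `FanoutReceiver`, the `required_copies` parameter, and the assumption that the producer attaches a unique `message_id` to every copy are all hypothetical.

```python
from collections import defaultdict


class FanoutReceiver:
    """Handles copies of one message that arrive over several independent routes."""

    def __init__(self, required_copies=1):
        # required_copies=1: the first copy "wins" (fast).
        # required_copies=2: wait for a second identical copy before processing
        # (validated, but higher latency).
        self.required_copies = required_copies
        self._copies = defaultdict(list)  # message_id -> payloads seen so far
        self._delivered = set()           # message_ids already handed to the app

    def on_message(self, message_id, payload):
        """Return the payload once per message_id, otherwise None."""
        if message_id in self._delivered:
            return None  # late duplicate from another route: drop it
        self._copies[message_id].append(payload)
        matching = self._copies[message_id].count(payload)
        if matching >= self.required_copies:
            self._delivered.add(message_id)
            del self._copies[message_id]
            return payload
        return None


fast = FanoutReceiver()                    # first message through wins
print(fast.on_message("m1", "buy 10"))     # processed
print(fast.on_message("m1", "buy 10"))     # dropped (None)
```

In a long-running consumer the set of delivered ids would need to be bounded (for example by expiring old ids); that is left out here to keep the sketch short.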

How to use Kafka Streams to Split Messages into Slow and Fast Tracks

I have a stream of messages to be processed by an app written in Kafka Streams. A small subset of those messages requires external DB lookups to be processed.
I believe this DB is too big to be streamed and too large to cache.
Is there a way to split the stream into fast and slow streams so that the slow one doesn't interfere with the fast one?
I have thought of the following 3 options, but I was hoping there might be something simpler or more efficient:
1) Let the messages be distributed evenly; since the volume of the ones that require reading from the DB is low, they wouldn't affect the overall throughput badly (latency is not a problem).
2) Use a special key for the slow ones so they get assigned to one partition (I own the producer), but then it's hard to scale the slow ones, there is no guarantee that they will not interfere with the fast ones, and it requires messing with the producer.
3) Write the slow ones to a separate topic altogether.
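Option 3 boils down to a routing step that decides, per message, which topic it belongs on. The sketch below shows that step in plain Python, with no Kafka client involved; the topic names `fast-track` and `slow-track`, the function names, and the rule that a missing `customer_id` means a DB lookup is needed are all made up for illustration.

```python
def split_by_track(records, needs_db_lookup):
    """Group records by the (hypothetical) topic they would be written to."""
    tracks = {"fast-track": [], "slow-track": []}
    for record in records:
        topic = "slow-track" if needs_db_lookup(record) else "fast-track"
        tracks[topic].append(record)
    return tracks


def needs_lookup(record):
    # Hypothetical rule: a record without a customer_id must be resolved in the DB.
    return record.get("customer_id") is None


records = [
    {"id": 1, "customer_id": "c1"},
    {"id": 2, "customer_id": None},
    {"id": 3, "customer_id": "c3"},
]
tracks = split_by_track(records, needs_lookup)
print(len(tracks["fast-track"]), len(tracks["slow-track"]))
```

With the slow messages on their own topic, they could be consumed by a separate application that is scaled independently of the fast path.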

Apache Kafka persist all data

When using Kafka as an event store, how is it possible to configure the logs never to lose data (v0.10.0.0)?
I have seen the (old?) log.retention.hours, and I have been considering playing with compaction keys, but is there simply an option for Kafka never to delete messages?
Or is the best option to set a ridiculously high value for the retention period?
You don't have a better option than using a ridiculously high value for the retention period.
Fair warning: using an infinite retention will probably hurt you a bit.
For example, the default behaviour only allows a new subscriber to start from the start or the end of a topic, which will be annoying at the very least from an event sourcing perspective.
Also, Kafka, if used at scale (let's say tens of thousands of messages per second), benefits greatly from high-performance storage, the cost of which will be ridiculously high with an eternal retention policy.
FYI, Kafka provides tools (e.g. Kafka Connect) to easily persist data to cheap data stores.
Update: It’s Okay To Store Data In Apache Kafka
Obviously this is possible, if you just set the retention to “forever”
or enable log compaction on a topic, then data will be kept for all
time. But I think the question people are really asking, is less
whether this will work, and more whether it is something that is
totally insane to do.
The short answer is that it’s not insane, people do this all the time,
and Kafka was actually designed for this type of usage. But first, why
might you want to do this? There are actually a number of use cases,
here’s a few:
For people concerned with data replaying and disk cost for eternal messages, I just wanted to share a few things.
Data replaying:
You can seek your consumer to a given offset. It is even possible to query the offset for a given timestamp. So if your consumer doesn't need all the data from the beginning and a subset is enough, you can use this.
I use the Kafka Java libs, e.g. kafka-clients. See:
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes(java.util.Map)
and
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,%20long)
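To make the timestamp lookup concrete: offsetsForTimes returns, per partition, the earliest offset whose timestamp is greater than or equal to the requested one, and seek then positions the consumer there. The snippet below is only a plain-Python model of that lookup for a single partition, not the kafka-clients API; it assumes the record timestamps are non-decreasing and that offsets start at 0, and the function name is made up.

```python
from bisect import bisect_left


def offset_for_time(timestamps, target_ts):
    """Earliest offset whose timestamp is >= target_ts, or None if there is none.

    `timestamps[i]` is the timestamp of the record at offset i
    (assumed to be sorted in non-decreasing order).
    """
    offset = bisect_left(timestamps, target_ts)
    return offset if offset < len(timestamps) else None


log_timestamps = [100, 200, 200, 300]      # one partition, offsets 0..3
start = offset_for_time(log_timestamps, 150)
replayed = log_timestamps[start:]          # what a consumer would read after seeking
print(start, replayed)
```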
Disk cost:
You can at least reduce disk space usage a lot by using something like Avro (https://avro.apache.org/docs/current/) and turning compaction on.
Maybe there is a way to use symbolic links to separate between file systems. But that is only an untried idea.

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the task is sent to the nodes, and I get a warning that the recommended task size is 100kB while my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize into more partitions (e.g. 600 partitions, to get 100kB per task), the computations are performed successfully on the workers, but an 'OutOfMemory' exception is raised after some time in the driver. In this case, I can open the Spark UI and observe how the driver's memory is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way to tell the driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and sending the data from the driver.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available for your job. As you have done, you can trade time for memory usage.
Is there a way to tell the driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.
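The two ways of picking a partition count that appear in this thread (by target task size, as in the question, and by the 2-4x cores rule of thumb from the answer) come down to simple arithmetic. A small sketch in plain Python, with made-up function names and decimal units (1 MB = 1,000,000 bytes) assumed:

```python
def partitions_for_task_size(data_bytes, target_task_bytes):
    """Number of partitions so that each task carries at most target_task_bytes."""
    # Integer ceiling division, with at least one partition.
    return max(1, -(-data_bytes // target_task_bytes))


def partitions_for_cores(total_cores, factor=3):
    """Rule of thumb from the answer: 2-4 partitions per available core."""
    if not 2 <= factor <= 4:
        raise ValueError("factor is expected to be between 2 and 4")
    return total_cores * factor


# Numbers from the question: a 60 MB file and a 100 kB recommended task size.
print(partitions_for_task_size(60_000_000, 100_000))
# Hypothetical job with 8 cores in total.
print(partitions_for_cores(8))
```

Neither number removes the driver-side memory pressure described in the question, since with parallelize the whole collection still starts out in the driver.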

How to minimize the latency involved in kafka messaging framework?

Scenario: I have a low-volume topic (~150 msgs/sec) for which we would like a low propagation delay from producer to consumer.
I added a timestamp at the producer and read it at the consumer to record the propagation delay. With default configurations, a 20-byte message showed a propagation delay of 1230ms to 1960ms. No network delay is involved, since I ran 1 producer and 1 simple consumer on the same machine.
When I adjusted the topic flush interval to 20ms, the delay dropped to 980ms to 1100ms. Then I adjusted the consumer's "fetcher.backoff.ms" to 10ms, and it dropped to 860ms to 1070ms.
Issue: For a 20-byte message, I would like the propagation delay to be as low as possible, and ~950ms is too high.
Question: Is there anything I am missing in the configuration?
Comments are welcome, including the minimum delay you have managed to achieve.
Assumption: Kafka involves disk I/O before the consumer gets the message from the producer, so the delay depends on the hard disk RPM and so on.
Update:
I tried to tune the Log Flush Policy for durability and latency. The configuration was as follows:
# The number of messages to accept before forcing a flush of data to disk
log.flush.interval=10
# The maximum amount of time a message can sit in a log before we force a flush
log.default.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be
# flushed to disk.
log.default.flush.scheduler.interval.ms=100
For the same 20-byte message, the delay was 740ms to 880ms.
The following statements are made clear in the configuration itself.
There are a few important trade-offs:
Durability: Unflushed data is at greater risk of loss in the event of a crash.
Latency: Data is not made available to consumers until it is flushed (which adds latency).
Throughput: The flush is generally the most expensive operation.
So, I believe there is no way to get down to 150ms to 250ms (without a hardware upgrade).
I am not trying to dodge the question, but I think that Kafka is a poor choice for this use case. While I think Kafka is great (I have been a huge proponent of its use at my workplace), its strength is not low latency. Its strengths are high producer throughput and support for both fast and slow consumers. While it does provide durability and fault tolerance, so do more general-purpose systems like RabbitMQ. RabbitMQ also supports a variety of different clients, including node.js. Where RabbitMQ falls short compared to Kafka is when you are dealing with extremely high volumes (say 150K msg/s). At that point, Rabbit's approach to durability starts to fall apart and Kafka really stands out. The durability and fault tolerance capabilities of Rabbit are more than sufficient at 20K msg/s (in my experience).
Also, to achieve such high throughput, Kafka deals with messages in batches. While the batches are small and their size is configurable, you can't make them too small without incurring a lot of overhead. Unfortunately, message batching makes low-latency very difficult. While you can tune various settings in Kafka, I wouldn't use Kafka for anything where latency needed to be consistently less than 1-2 seconds.
Also, Kafka 0.7.2 is not a good choice if you are launching a new application. All of the focus is on 0.8 now, so you will be on your own if you run into problems, and I definitely wouldn't expect any new features. For future stable releases, see the stable Kafka release page.
Again, I think Kafka is great for some very specific, though popular, use cases. At my workplace we use both Rabbit and Kafka. While that may seem gratuitous, they really are complementary.
I know it's been over a year since this question was asked, but I've just built up a Kafka cluster for dev purposes, and we're seeing <1ms latency from producer to consumer. My cluster consists of three VM nodes running on a cloud VM service (Skytap) with SAN storage, so it's far from ideal hardware. I'm using Kafka 0.9.0.0, which is new enough that I'm confident the asker was using something older. I have no experience with older versions, so you might get this performance increase simply from an upgrade.
I'm measuring latency by running a Java producer and consumer I wrote. Both run on the same machine, on a fourth VM in the same Skytap environment (to minimize network latency). The producer records the current time (System.nanoTime()), uses that value as the payload in an Avro message, and sends (acks=1). The consumer is configured to poll continuously with a 1ms timeout. When it receives a batch of messages, it records the current time (System.nanoTime() again), then subtracts the send time from the receive time to compute latency. When it has 100 messages, it computes the average of all 100 latencies and prints to stdout. Note that it's important to run the producer and consumer on the same machine so that there is no clock sync issue with the latency computation.
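The bookkeeping part of that measurement (latency as receive time minus send time, averaged over every 100 messages) can be sketched like this. It is a plain-Python illustration, not the Java code described above, which is not available; the function names are made up, and the timestamps are assumed to be nanosecond readings taken from the same clock.

```python
def latencies_ms(samples):
    """Convert (send_ns, receive_ns) pairs into latencies in milliseconds.

    Both values in a pair must come from the same clock, otherwise the
    difference is meaningless.
    """
    return [(receive_ns - send_ns) / 1_000_000 for send_ns, receive_ns in samples]


def average_per_batch(values, batch_size=100):
    """Average of each full batch of `batch_size` values; a trailing partial batch is ignored."""
    return [
        sum(values[i:i + batch_size]) / batch_size
        for i in range(0, len(values) - batch_size + 1, batch_size)
    ]


# Hypothetical readings: 200 messages, each delivered in 1.5 ms.
samples = [(n * 10_000_000, n * 10_000_000 + 1_500_000) for n in range(200)]
for average in average_per_batch(latencies_ms(samples)):
    print(f"average latency over 100 messages: {average:.3f} ms")
```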
I've played quite a bit with the volume of messages generated by the producer. There is definitely a point where there are too many and latency starts to increase, but it's substantially higher than 150/sec. The occasional message takes as much as 20ms to deliver, but the vast majority are between 0.5ms and 1.5ms.
All of this was accomplished with Kafka 0.9's default configurations. I didn't have to do any tweaking. I used batch-size=1 for my initial tests, but I found later that it had no effect at low volume and imposed a significant limit on the peak volume before latencies started to increase.
It's important to note that when I run my producer and consumer on my local machine, the exact same setup reports message latencies in the 100ms range -- the exact same latencies reported if I simply ping my Kafka brokers.
I'll edit this message later with sample code from my producer and consumer along with other details, but I wanted to post something before I forget.
EDIT, four years later:
I just got an upvote on this, which led me to come back and re-read. Unfortunately (but actually fortunately), I no longer work for that company, and no longer have access to the code I promised I'd share. Kafka has also matured several versions since 0.9.
Another thing I've learned in the ensuing time is that Kafka latencies increase when there is not much traffic. This is due to the way the clients use batching and threading to aggregate messages. It's very fast when you have a continuous stream of messages, but any time there is a moment of "silence", the next message will have to pay the cost to get the stream moving again.
It's been some years since I was deep in Kafka tuning. Looking at the latest version (2.5 -- producer configuration docs here), I can see that they've decreased linger.ms (the amount of time a producer will wait before sending a message, in hopes of batching up more than just the one) to zero by default, meaning that the aforementioned cost to get moving again should not be a thing. As I recall, in 0.9 it did not default to zero, and there was some tradeoff to setting it to such a low value. I'd presume that the producer code has been modified to eliminate or at least minimize that tradeoff.
Modern versions of Kafka seem to have pretty minimal latency as the results from here show:
2 ms (median)
3 ms (99th percentile)
14 ms (99.9th percentile)
Kafka can achieve latency of around a millisecond by using synchronous messaging. With synchronous messaging, the producer does not collect messages into a batch before sending.
bin/kafka-console-producer.sh --broker-list my_broker_host:9092 --topic test --sync
The following has the same effect:
--batch-size 1
If you are using librdkafka as your Kafka client library, you must also set socket.nagle.disable=True.
See https://aivarsk.com/2021/11/01/low-latency-kafka-producers/ for some ideas on how to see what is taking those milliseconds.