Sequential I/O performance for old data - apache-kafka

I am learning how sequential I/O works in general and in Kafka. As I understand it, all data is written to the hard disk sequentially (as a log), so the disk arm is always near the current data and does not have to move much (which gives little or no seek time, comparable to the write time).
But what if there is a lot of Kafka data on the HDD (with the maximum allowed retention) and a new consumer group starts reading it? As I understand it, the new consumer group starts reading from offset 0, and offset 0 can be on the totally opposite side of the HDD (over time, the arm has slowly moved across the disk). In that case the arm has to go back and forth each time, because the existing consumers read the current data while the new consumer group reads old data. Won't that have the opposite effect, so that the I/O is no longer sequential and the entire cluster slows down (at least until the new consumer group has read all the data)?

Related

How does persistence work in ActiveMQ Artemis?

In ActiveMQ 5.x, when using KahaDB for persistence, all the files are managed in a single database. This can have serious consequences.
I have hundreds of queues that see millions of messages per day. If the consumer of one queue is temporarily stopped for maintenance, the other queues continue to fill and empty, while the one whose consumer is suspended sees its messages accumulate. On disk it is different: KahaDB does mark consumed messages as deleted, but it cannot free the space while a message that is still pending is kept in the database. This is the case for the messages that accumulate in the suspended queue.
Very quickly the disk is full.
To remedy this, you have to change the configuration and use mKahaDB. In this case there is one database per queue, so on disk only the suspended queue takes up space.
I am considering switching to Artemis. But the persistence has been completely redesigned. So what happens in terms of disk occupancy when suspending a consumer?
This question is pretty broad, but I'll take a crack at it...
By default ActiveMQ Artemis uses a file-based journal. The journal consists of a pool of files that can grow and shrink based on configuration (see journal-min-files and journal-pool-files in the documentation). The size of each file is also configurable (i.e. via journal-file-size).
An initial pool of files will be created when the broker starts and as messages are stored and the initial pool of files fills up then additional files will be created. As messages are consumed the pool can shrink through a process called "compaction" which is also configurable (see journal-compact-min-files and journal-compact-percentage in the documentation). As long as 1 record in a journal file is considered "live" (e.g. an unconsumed message) then the whole journal file will remain. However, you can tune the impact of this to fit your environment (e.g. by lowering the journal-file-size, making compaction more aggressive, etc.). To be clear, if compaction runs and there is a journal file with only 1 "live" record that means all the other journal files are "full" and at most you will only ever have 1 journal file like that.
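The "live record" rule described above can be illustrated with a toy model. This sketch is not Artemis code. It assumes, purely for illustration, that a journal file holds a fixed number of records, that one queue (A) is never consumed, that another queue (B) is consumed immediately, and that compaction simply packs all live records into as few files as possible. Real compaction is triggered by the journal-compact-min-files and journal-compact-percentage thresholds, and the function name and parameters here are hypothetical.

```python
import math


def journal_files_kept(seconds, file_capacity, b_rate, compact):
    """Toy model of journal file retention (not the Artemis implementation).

    Each second, queue A appends 1 record that stays live (its consumer is
    stopped) and queue B appends b_rate records that are consumed at once.
    A journal file holds file_capacity records.
    """
    live_per_file = []  # number of live records in each journal file
    room = 0            # free record slots left in the newest file
    for _ in range(seconds):
        # One live record from A followed by b_rate dead records from B.
        for is_live in [True] + [False] * b_rate:
            if room == 0:
                live_per_file.append(0)
                room = file_capacity
            if is_live:
                live_per_file[-1] += 1
            room -= 1
    if compact:
        # Assumption: compaction rewrites all live records into full files.
        return math.ceil(sum(live_per_file) / file_capacity)
    # Without compaction, any file holding at least one live record remains.
    return sum(1 for live in live_per_file if live > 0)
```

Under these assumptions, `journal_files_kept(10, 5, 4, compact=False)` keeps every file because each one holds a single live record from A, while the same call with `compact=True` keeps only the files needed to hold A's ten records.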
Also, you can configure max-disk-usage to block producers from sending more messages once disk utilization reaches a certain point.
Ultimately, if a consumer becomes inactive (for whatever reason) then the messages that consumer was supposed to receive will accumulate in the queues (and potentially on disk). If you want to prevent messages from accumulating in the first place you could implement flow control or blocking for producers. However, even if they do accumulate the file-based journal should be able to grow and shrink as needed.
If I understood correctly, there is no way to guarantee that only the payload is kept (as with mKahaDB).
But we can limit the size of the pages and fix their number.
Considering the very large number of queues I have to manage, I think the best approach is to split them across a cluster. But I am worried, because when one application is in maintenance (and I have 10,000), the messages of the others cannot be erased while messages accumulate in its queue. Clearly, whatever the configuration, within a few seconds the broker will crash or stop.
I am surprised that communication between two applications can stop because two other applications no longer communicate with each other. This is a strong limitation compared to ActiveMQ.
This will limit the problem but not solve it.
With mKahaDB, if I have two queues A and B, where A receives one message per second and B receives 5000/s and B's consumers consume them immediately, queue B is always empty or nearly so and occupies very little disk. If A's consumer is stopped, queue A grows but queue B does not occupy more disk.
With Artemis, if I reduce the journal file size to 5000 messages, a journal file fills up and is deleted every second. If A's consumer stops, each file contains one message from A, so we keep 5000 messages on disk every second even though queue B is almost always empty. If I reduce the journal file size to 500, I keep fewer messages, but disk usage still grows 500 times faster than with mKahaDB. And if I reduce the journal file size to 1 to get the same result as with mKahaDB, I force Artemis to handle millions of files, which collapses performance.
I have the impression that Artemis is not designed for very large numbers of queues, unlike ActiveMQ.
Thanks.

How to use Kafka Streams to Split Messages into Slow and Fast Tracks

I have a stream of messages to be processed by an app written with Kafka Streams; a small subset of those messages requires external DB lookups to be processed.
I believe this DB is too big to be streamed and too large to cache.
Is there a way to split the stream into fast and slow streams so that the slow one doesn't interfere with the fast one?
I have thought of the following 3 options, but I was hoping there might be something simpler or more efficient:
1) Let the messages be distributed evenly; since the volume of those that require reading from the DB is low, they wouldn't hurt overall throughput much (latency is not a problem).
2) Use a special key for the slow ones so they get assigned to one partition (I own the producer). But then it's hard to scale the slow ones, there is no guarantee that they will not interfere with the fast ones, and it requires messing with the producer.
3) Write the slow ones to a separate topic altogether.
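As a rough illustration of option 3, the routing decision can be sketched as a predicate that picks a destination topic per message. This is plain Python rather than Kafka Streams code, and the topic names, the `requires_lookup` field, and both function names are hypothetical. Kafka Streams offers predicate-based branching on a stream that plays the same role.

```python
from collections import defaultdict

FAST_TOPIC = "events-fast"  # hypothetical topic for messages needing no lookup
SLOW_TOPIC = "events-slow"  # hypothetical topic for messages needing a DB lookup


def needs_db_lookup(message):
    # Hypothetical predicate: assumes each message carries a flag that says
    # whether it requires the external DB lookup.
    return message.get("requires_lookup", False)


def route(messages):
    """Group messages by destination topic, preserving their arrival order."""
    routed = defaultdict(list)
    for message in messages:
        topic = SLOW_TOPIC if needs_db_lookup(message) else FAST_TOPIC
        routed[topic].append(message)
    return dict(routed)
```

With the slow messages on their own topic, a separate consumer (with its own partition count and instance count) could process them without holding up the fast path.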

Avoiding small files from Kafka connect using HDFS connector sink in distributed mode

We have a topic with 3 partitions that receives messages at a rate of 1 msg per second, and I am using the HDFS connector to write the data to HDFS in Avro format (the default). It generates files whose sizes are in the KB range, so I tried altering the properties below in the HDFS connector configuration.
"flush.size":"5000",
"rotate.interval.ms":"7200000"
But the output is still small files, so I need clarity on the following points to solve this issue:
Is the flush.size property mandatory? If we do not set flush.size, how does the data get flushed?
If we set flush.size to 5000 and the rotate interval to 2 hours, the data is flushed every 2 hours for the first 3 intervals, but after that it is flushed at irregular times. Here are the file creation times (19:14, 21:14, 23:15, 01:15, 06:59, 08:59, 12:40, 14:40); note the mismatched intervals. Is this because some properties override others? That takes me to the third question.
What is the precedence for flushing if we set all of the following properties (flush.size, rotate.interval.ms, rotate.schedule.interval.ms)?
Increasing the message rate and reducing the number of partitions does increase the size of the flushed files. Is that the only way to control small files? How should we set the properties if the input event rate varies and is not stable?
It would be a great help if you could share documentation on handling small files in Kafka Connect with the HDFS connector. Thank you.
If you are using a TimeBasedPartitioner, and the messages are not consistently going to have increasing timestamps, then you will end up with a single writer task dumping files when it sees a message with a lesser timestamp in the interval of rotate.interval.ms of reading any given record.
If you want consistent two-hour partition windows, then you should set rotate.interval.ms=-1 to disable it, and set rotate.schedule.interval.ms to some reasonable number that fits within the partition duration window.
E.g. you have 7200 messages every 2 hours, and it's not clear how large each message is, but let's say 1MB. Then, you'd be holding ~7GB of data in a buffer, and you need to adjust your Connect heap sizes to hold that much data.
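The numbers above can be checked with a little arithmetic. The sketch below assumes messages are spread evenly over the partitions and that flush.size is counted per topic partition; the function names are hypothetical and the 1MB message size is the same guess used above.

```python
def seconds_to_reach_flush_size(flush_size, msgs_per_sec, partitions):
    """Seconds until one partition has accumulated flush_size records.

    Assumes an even spread, so each partition sees msgs_per_sec / partitions.
    """
    return flush_size * partitions / msgs_per_sec


def bytes_in_window(window_sec, msgs_per_sec, msg_bytes):
    """Total data volume arriving during one rotation window."""
    return window_sec * msgs_per_sec * msg_bytes


# Values from the question: flush.size=5000, 1 msg/s, 3 partitions.
fill_time = seconds_to_reach_flush_size(5000, 1, 3)

# Values from the answer: a 2-hour window at 1 msg/s, assuming 1MB messages.
window_bytes = bytes_in_window(7200, 1, 1_000_000)
```

Under these assumptions a partition needs 15000 seconds (over 4 hours) to reach flush.size, so the 2-hour rotation fires first, and a full 2-hour window carries 7.2GB in total.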
The order of precedence is:
1) scheduled rotation, starting from the top of the hour
2) flush size or "message-based" time rotation, whichever occurs first, or when a record is seen as "before" the start of the current batch
And I believe flush size is mandatory for the storage connectors
Overall, systems such as Uber's Hudi or the previous Kafka-HDFS tool of Camus Sweeper are more equipped to handle small files. Connect Sink Tasks only care about consuming from Kafka, and writing to downstream systems; the framework itself doesn't recognize Hadoop prefers larger files.

How does Kafka guarantee sequential disk access?

I'm new to Kafka. When I read the Kafka documentation, I saw that Kafka performs well because of sequential disk access.
But how is that possible? In Java (or any other language), if I use file I/O, the OS handles it as it sees fit. I can't know whether the OS stores my files in contiguous sectors or scatters them across multiple sectors. So, in my opinion, Kafka cannot claim that disk access is always sequential.
Am I right or not?
Kafka does not always access disk sequentially, but it does some things that make sequential access much more likely. All Kafka messages are stored in large segment files (1GB each by default), and since Kafka messages are not deleted when consumed (as they are in other message brokers), Kafka does not end up fragmenting the filesystem over time by continuously creating and deleting many variable-length files. Instead it creates a segment file and appends to it until it reaches 1GB (a configurable limit). Only when all messages in the segment have expired does it delete the entire 1GB segment. This means that these 1GB sections of disk are often laid out as contiguous blocks. It is a recommended best practice to keep the Kafka commit log files on a dedicated filesystem so that it does not get fragmented by other apps reading and writing variable-length files in the same filesystem. More importantly, most reading and writing of these segment files is sequential and goes through the OS page cache, which reduces disk I/O even further by caching the most frequently accessed pages in memory. This is why it is recommended to tune the kernel and set swappiness to 1, to reduce the likelihood that these cached pages get swapped out of memory.
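The append-and-roll behaviour described above can be sketched in a few lines. This is an in-memory toy, not Kafka's implementation: the class name, its methods, and the tiny segment size are hypothetical, and expiry is assumed to be decided by the newest record in each closed segment.

```python
class SegmentedLog:
    """Toy append-only log that rolls segments by size and expires them whole."""

    def __init__(self, segment_bytes):
        self.segment_bytes = segment_bytes
        self.segments = [[]]   # each segment is a list of (timestamp, payload)
        self._active_size = 0  # bytes written to the active (last) segment

    def append(self, timestamp, payload):
        # Roll to a new segment when the active one would exceed the limit.
        too_big = self._active_size + len(payload) > self.segment_bytes
        if too_big and self.segments[-1]:
            self.segments.append([])
            self._active_size = 0
        # Records are only ever added at the end of the active segment.
        self.segments[-1].append((timestamp, payload))
        self._active_size += len(payload)

    def expire(self, cutoff):
        # A closed segment is dropped only if its newest record is older than
        # the cutoff; individual records are never removed. The active
        # segment is always kept in this sketch.
        closed = [seg for seg in self.segments[:-1] if seg[-1][0] >= cutoff]
        self.segments = closed + [self.segments[-1]]
```

Because records are never deleted one by one, an old record survives as long as a newer record shares its segment, which is the price paid for avoiding many small create and delete operations on the filesystem.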

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the tasks are sent to the nodes, and I get a warning that the recommended task size is 100kB while my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize into more partitions (e.g. 600 partitions, to get 100kB per task), the computations succeed on the workers, but an 'OutOfMemory' exception is raised in the driver after some time. In this case I can open the Spark UI and watch the driver's memory being slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and shipping the data from the driver.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available for your job. As you have done, you can trade time for memory usage.
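Both sizing approaches mentioned in this thread reduce to simple arithmetic. The helpers below are a sketch with hypothetical names; the core count in the example is an assumption, since the question only states the number of nodes.

```python
def partitions_for_cores(total_cores, factor=3):
    """Rule of thumb from the answer: 2-4 partitions per available core."""
    if not 2 <= factor <= 4:
        raise ValueError("factor is expected to be between 2 and 4")
    return total_cores * factor


def partitions_for_task_size(data_bytes, target_task_bytes):
    """Partitions needed so that no task exceeds target_task_bytes."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-data_bytes // target_task_bytes)


# The question's numbers: a 60MB file and a 100kB recommended task size.
by_task_size = partitions_for_task_size(60_000_000, 100_000)

# Assumption for illustration only: 4 nodes with 4 cores each.
by_cores = partitions_for_cores(16, factor=3)
```

The gap between the two results (600 versus 48 under the assumed core count) shows why pushing a large local collection through the driver is awkward: meeting the task-size recommendation requires far more partitions than the cores can use in parallel.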
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.