Avoiding small files from Kafka Connect using the HDFS sink connector in distributed mode - apache-kafka

We have a topic receiving messages at a rate of 1 msg per second across 3 partitions, and I am using the HDFS connector to write the data to HDFS in Avro format (the default). It generates files only a few KB in size, so I tried altering the following properties in the HDFS connector configuration:
"flush.size":"5000",
"rotate.interval.ms":"7200000"
but the output is still small files. I need clarity on the following points to solve this issue:
Is the flush.size property mandatory? If we do not set the flush.size property, how does the data get flushed?
If we set the flush size to 5000 and the rotate interval to 2 hours, it flushes the data every 2 hours for the first 3 intervals, but after that it flushes at seemingly random times. Please find the file creation times (19:14, 21:14, 23:15, 01:15, 06:59, 08:59, 12:40, 14:40) -- the mismatched intervals are the highlighted ones. Is it because the properties mentioned override each other? That takes me to the third question.
What is the order of precedence for flushing if we set all of the following properties: flush.size, rotate.interval.ms, rotate.schedule.interval.ms?
Increasing the message rate and reducing the number of partitions does increase the size of the files being flushed. Is that the only way to control small files? How should we set these properties if the rate of the input events varies and is not stable?
It would be a great help if you could share documentation on handling small files in Kafka Connect with the HDFS connector. Thank you.

If you are using a TimeBasedPartitioner, and the messages do not consistently have increasing timestamps, then you will end up with a single writer task dumping files whenever it sees a message with an earlier timestamp within the rotate.interval.ms window of reading any given record.
If you want consistent bihourly partition windows, then you should set rotate.interval.ms=-1 to disable it, and set rotate.schedule.interval.ms to some reasonable value that is within the partition duration window.
E.g. you have 7200 messages every 2 hours, and it's not clear how large each message is, but let's say 1MB. Then, you'd be holding ~7GB of data in a buffer, and you need to adjust your Connect heap sizes to hold that much data.
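If you go that route, the Connect worker heap is typically raised via the KAFKA_HEAP_OPTS environment variable before starting the worker; a sketch with illustrative values (the config file path is the stock one from the Apache Kafka distribution):
KAFKA_HEAP_OPTS="-Xms4g -Xmx8g" bin/connect-distributed.sh config/connect-distributed.properties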
The order of precedence is:
scheduled rotation, starting from the top of the hour
flush size or "message-based" time rotation, whichever occurs first, or whenever a record is seen as being "before" the start of the current batch
And I believe flush.size is mandatory for the storage connectors.
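For example, a sketch of the relevant sink properties for a two-hour window, assuming a TimeBasedPartitioner as above; the values are illustrative, not a verified recommendation, and the partitioner class name is the one used by recent Confluent storage connectors. flush.size is set high enough that the scheduled rotation, rather than the record count, is what closes the files at this message rate:
"flush.size": "10000",
"rotate.interval.ms": "-1",
"rotate.schedule.interval.ms": "7200000",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "7200000",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "en-US",
"timezone": "UTC"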
Overall, systems such as Uber's Hudi, or earlier Kafka-to-HDFS tooling such as Camus Sweeper, are better equipped to handle small files. Connect sink tasks only care about consuming from Kafka and writing to downstream systems; the framework itself doesn't recognize that Hadoop prefers larger files.

Related

Kafka Streams Yearly time Window

One of the applications we are working on has a requirement for aggregation to happen in a windowed manner, where the window size may vary: monthly, quarterly, half-yearly, or yearly.
Kafka Streams calendar-based time windows support this, and I would like more input on the performance front to know whether it would best suit the need:
The memory consumed by the cache to hold the records until the window closes.
The number of records streamed on a daily basis within the window is really high.
Please suggest whether Kafka Streams processing can be used in this case, and how to handle the resources for memory management.
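For what it's worth, a minimal sketch of a long, fixed-duration windowed aggregation in the Java DSL; the topic names, serdes, and the 30-day window are illustrative assumptions, and truly calendar-aligned monthly/quarterly/yearly windows would require a custom Windows implementation rather than the built-in TimeWindows:
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LongWindowAggregationSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("events", Consumed.with(Serdes.String(), Serdes.Long()))        // hypothetical input topic
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
               .windowedBy(TimeWindows.of(Duration.ofDays(30)))                        // fixed 30-day window, NOT calendar-aligned
               .reduce(Long::sum)                                                      // running total per key per window
               .toStream((windowedKey, total) ->                                       // flatten the windowed key for output
                       windowedKey.key() + "@" + windowedKey.window().start())
               .to("events-aggregated", Produced.with(Serdes.String(), Serdes.Long())); // hypothetical output topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "long-window-aggregation-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");            // assumed broker address

        new KafkaStreams(builder.build(), props).start();
    }
}
The memory question then largely comes down to how many distinct keys and open windows the state store and record cache must hold at once.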

Kafka: reduce the number of open files as it crosses 1,000,000

I have a Kafka broker receiving 1 GB of data every minute from certain events, due to which the number of open files is going above 1,000,000. I am not sure which setting needs to be changed to reduce this number. Can anyone suggest a quick fix? Should I increase log.segment.bytes=1073741824
to 10 GB to reduce the number of files being created, or increase log.retention.check.interval.ms=300000 to 15 minutes so that fewer segments are checked for deletion?
Increasing the size of the segments will reduce the number of files maintained by the broker, with the tradeoff being that only closed segments are cleaned or compacted.
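For example, in server.properties (values are illustrative; note that log.segment.bytes is an integer setting, so in the Kafka versions I'm aware of it tops out around 2 GB rather than 10 GB):
log.segment.bytes=2147483647            # ~2 GB segments instead of the 1 GB default
log.retention.check.interval.ms=900000  # check for deletable segments every 15 minutes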
The other alternative is to reconsider what types of data you're sending. For example, if you are sending files or other large binary blobs, consider sending filesystem URIs rather than pushing the whole payload through a topic.

How does persistence work in ActiveMQ Artemis?

In ActiveMQ 5.x, when using KahaDB for persistence, all the files are managed in a single database. This can have serious consequences.
I have hundreds of queues that see millions of messages per day. If a consumer of a queue is temporarily stopped for maintenance, the queues continue to fill and empty, and the one whose consumer is suspended sees messages accumulate. But on disk it is different: KahaDB does mark the deleted (consumed) messages, but it cannot free the space as long as a more recent message is still kept in the database. This is the case with the messages that accumulate in the suspended queue.
Very quickly, the disk fills up.
To remedy this, you have to change the configuration and use mKahaDB. In that case there is one database per queue, so on disk only the suspended queue takes up space.
I am considering switching to Artemis, but its persistence has been completely redesigned. So what happens in terms of disk usage when a consumer is suspended?
This question is pretty broad, but I'll take a crack at it...
By default ActiveMQ Artemis uses a file-based journal. The journal consists of a pool of files that can grow and shrink based on configuration (see journal-min-files and journal-pool-files in the documentation). The size of each file is also configurable (i.e. via journal-file-size).
An initial pool of files will be created when the broker starts and as messages are stored and the initial pool of files fills up then additional files will be created. As messages are consumed the pool can shrink through a process called "compaction" which is also configurable (see journal-compact-min-files and journal-compact-percentage in the documentation). As long as 1 record in a journal file is considered "live" (e.g. an unconsumed message) then the whole journal file will remain. However, you can tune the impact of this to fit your environment (e.g. by lowering the journal-file-size, making compaction more aggressive, etc.). To be clear, if compaction runs and there is a journal file with only 1 "live" record that means all the other journal files are "full" and at most you will only ever have 1 journal file like that.
Also, you can configure max-disk-usage to block producers from sending more messages once disk utilization reaches a certain point.
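For reference, the settings mentioned above all live in broker.xml; a sketch with illustrative values (defaults differ by version, so check the documentation for yours):
<core xmlns="urn:activemq:core">
  <journal-min-files>2</journal-min-files>
  <journal-pool-files>10</journal-pool-files>
  <journal-file-size>10485760</journal-file-size>            <!-- 10 MiB per journal file -->
  <journal-compact-min-files>10</journal-compact-min-files>
  <journal-compact-percentage>30</journal-compact-percentage>
  <max-disk-usage>90</max-disk-usage>                         <!-- block producers at 90% disk usage -->
</core>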
Ultimately, if a consumer becomes inactive (for whatever reason) then the messages that consumer was supposed to receive will accumulate in the queues (and potentially on disk). If you want to prevent messages from accumulating in the first place you could implement flow control or blocking for producers. However, even if they do accumulate the file-based journal should be able to grow and shrink as needed.
If I understood correctly, there is no way to guarantee that only the payload is kept (as with mKahaDB).
But we can limit the size of the pages and fix their number.
Considering the very large number of queues I have to manage, I think the best approach is to divide this into a cluster. But I am worried, because when an application is in maintenance (and I have 10,000 of them) the messages of the others cannot be erased, since messages keep accumulating in one queue. It is clear that, whatever the configuration, in a few seconds I will crash or have to stop.
I am surprised to see communication between two applications stop because two other applications no longer communicate with each other. This is a strong limitation compared to ActiveMQ.
This will limit the problem but not solve it.
With mKahaDB, if I have 2 queues A and B, where A receives one message every second and B receives 5000/s and the consumers of B consume them immediately, then queue B is always empty (or almost) and occupies very little disk. If the consumer of A is stopped, queue A grows but queue B does not occupy any more disk.
With Artemis, if I reduce the journal size to 5000 messages, then every second a journal file is filled and deleted. If A stops, there will be 1 message from A in each journal file, so we keep 5000 messages on disk every second, even though queue B is almost always empty. If I reduce the journal size to 500, I keep fewer messages, but disk usage still grows 500 times faster than with mKahaDB. And if I reduce the journal size to 1 to get the same result as with mKahaDB, I force Artemis to handle millions of files, which kills performance.
I have the impression that Artemis is not made to handle very large numbers of queues, contrary to ActiveMQ.
Thanks.

How can Kafka reads be constant-time irrespective of the data size?

As per the documentation of Kafka
the data structure used in Kafka to store the messages is a simple log where all writes are actually just appends to the log.
What I don't understand here is that many claim Kafka's performance is constant irrespective of the data size it handles.
How can random reads be constant time in a linear data structure?
If I have a single-partition topic with 1 billion messages in it, how can the time taken to retrieve the first message be the same as the time taken to retrieve the last message, if the reads are always sequential?
In Kafka, the log for each partition is not a single file. It is actually split into segments of fixed size.
For each segment, Kafka knows the start and end offsets. So for a random read, it's easy to find the correct segment.
Then each segment has a couple of indexes (time-based and offset-based). These are the files named *.index and *.timeindex. They enable jumping directly to a location near (or at) the desired read.
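For illustration, a partition directory on disk looks roughly like this (the offsets in the file names are examples; each segment's base name is the first offset it contains):
00000000000000000000.log
00000000000000000000.index
00000000000000000000.timeindex
00000000000054321000.log
00000000000054321000.index
00000000000054321000.timeindex
...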
So you can see that the total number of segments (also total size of the log) does not really impact the read logic.
Note also that the size of segments, the size of indexes and the index interval are all configurable settings (even at the topic level).
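For example, the topic-level overrides can be set with the kafka-configs tool (syntax for recent Kafka versions; the topic name and values are illustrative):
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --alter \
  --add-config segment.bytes=536870912,segment.index.bytes=10485760,index.interval.bytes=4096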

Spark 2.3.1 Structured Streaming Input Rate

I wonder if there is a way to specify the size of the mini-batch in Spark Structured Streaming. That is, rather than only stating the mini-batch interval (triggers), I would like to state how many rows can be in a mini-batch (DataFrame) per interval.
Is there a way to do that?
Aside from the general capability, I particularly need this in a testing scenario where I have a MemoryStream. I would like Spark to consume a certain amount of data from the MemoryStream, instead of taking all of it at once, to actually see how the overall application behaves. My understanding is that the MemoryStream data structure needs to be filled before launching the job on it. Hence, how can I see the mini-batch processing behavior? Is Spark able to ingest the entire content of the MemoryStream within the interval that I give?
EDIT1
In the Kafka Integration I have found the following:
maxOffsetsPerTrigger: Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
But that is just for the Kafka integration. I have also seen
maxFilesPerTrigger: maximum number of new files to be considered in every trigger
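For context, these source options are passed through the stream reader; a minimal Java sketch for the Kafka source, where the broker address, topic name, and limit are illustrative assumptions:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaRateLimitSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-rate-limit-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
                .option("subscribe", "events")                        // hypothetical topic name
                .option("maxOffsetsPerTrigger", 5000)                 // cap on offsets read per trigger
                .load();

        // Write to the console sink just to observe per-batch behavior
        df.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}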
So it seems these things are defined per source type. Hence, is there a way to control how data is consumed from a MemoryStream[Row]?
Look at the settings below; they may solve your problem:
1. spark.streaming.backpressure.initialRate
2. spark.streaming.backpressure.enabled
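A minimal sketch of where those settings go; note that the spark.streaming.* backpressure keys target the older DStream-based Spark Streaming API rather than Structured Streaming, and the values here are illustrative:
import org.apache.spark.SparkConf;

public class BackpressureConfSketch {
    public static void main(String[] args) {
        // Backpressure settings for the DStream-based Spark Streaming API (illustrative values)
        SparkConf conf = new SparkConf()
                .setAppName("backpressure-sketch")
                .set("spark.streaming.backpressure.enabled", "true")
                .set("spark.streaming.backpressure.initialRate", "1000");
        // ... build a StreamingContext from this conf and define the DStream pipeline
    }
}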