How can Kafka reads be constant irrespective of the datasize? - apache-kafka

As per the documentation of Kafka
the data structure used in Kafka to store the messages is a simple log where all writes are actually just appends to the log.
What I don't understand here is, many claim that Kafka performance is constant irrespective of the data size it handles.
How can random reads be constant time in a linear data structure?
If I have a single-partition topic with 1 billion messages in it, how can the time taken to retrieve the first message be the same as the time taken to retrieve the last message, if the reads are always sequential?

In Kafka, the log for each partition is not a single file. It is actually split into segments of fixed size.
For each segment, Kafka knows the start and end offsets, so for a random read it's easy to find the correct segment.
Each segment also has a couple of indexes (offset based and time based), stored in the files named *.index and *.timeindex. These files enable jumping directly to a location near (or at) the desired read.
So the total number of segments (and therefore the total size of the log) does not really impact the read logic.
Note also that the size of segments, the size of indexes and the index interval are all configurable settings (even at the topic level).
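To make the lookup described above more concrete, here is a minimal toy sketch in Python. It is not Kafka's actual implementation: the class names, the index interval of 4 records, and the in-memory lists are all assumptions made for illustration. It only shows the two steps from the answer: pick the segment by its base offset, then use a sparse index to jump close to the record and scan a few entries forward.

```python
import bisect


class Segment:
    """Toy model of one log segment with a sparse offset index (illustrative only)."""

    def __init__(self, base_offset, records, index_interval=4):
        self.base_offset = base_offset
        self.records = records  # records[i] holds offset base_offset + i
        # Sparse index: only every index_interval-th offset gets an entry.
        self.index_positions = list(range(0, len(records), index_interval))
        self.index_offsets = [base_offset + p for p in self.index_positions]

    def read(self, offset):
        # Binary search the sparse index for the closest entry at or before offset.
        i = bisect.bisect_right(self.index_offsets, offset) - 1
        pos = self.index_positions[i]
        # Short sequential scan from the indexed position to the exact record.
        while self.base_offset + pos < offset:
            pos += 1
        return self.records[pos]


class PartitionLog:
    """Toy partition log: a list of segments sorted by base offset."""

    def __init__(self, segments):
        self.segments = segments
        self.base_offsets = [s.base_offset for s in segments]

    def read(self, offset):
        # Find the segment whose base offset is the largest one <= offset.
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i].read(offset)


def build_log(total, segment_size):
    segments = []
    for base in range(0, total, segment_size):
        n = min(segment_size, total - base)
        segments.append(Segment(base, [f"msg-{base + k}" for k in range(n)]))
    return PartitionLog(segments)


log = build_log(total=1000, segment_size=100)
print(log.read(0), log.read(999))
```

In this sketch the work per read is one binary search over segments, one binary search over a segment's index, and a scan bounded by the index interval, so reading the first or the last message costs about the same however long the log grows.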

Related

Fixed size window for Apache Beam

How to define window of fixed size (fixed number of items) in Apache Beam?
I know that we have
FixedWindows.of(Duration.standardMinutes(10))
but I do not care about time, only about the number of items.
More details:
I am writing a significant amount of data (53 gigabytes) to S3. Currently my process uses
FileIO.<KV<...>writeDynamic()
.by(kv -> kv.getKey())
(grouping by key). This causes a severe performance bottleneck because of the skewed key distribution. My total data size is 53 GB, but the data for one key is 37 GB. This single key takes an hour to write (writing occurs on a single executor, in a single thread, while the rest of the cluster sits idle).
I do not need any special grouping. Ideally I want uniform distribution of data, so writing will happen concurrently and finish as soon as possible.
Guaranteeing exactly equal sized grouping is fairly hard, but you can get pretty close by using hashes of your data modulo some constant as the keys. For example:
FileIO.<KV<...>writeDynamic()
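.by(kv -> Math.floorMod(kv.hashCode(), 530))
This will give roughly equal partitions of about 100 MB each (53 GB / 530). Math.floorMod is used because hashCode() can be negative in Java, and a plain % would then produce negative keys as well.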
Additionally, if you are using the DataflowRunner, you don't need to specify keys at all; the system will automatically group up the data, and dynamically rebalance the load to avoid stragglers. For this, use FileIO.write() instead of FileIO.writeDynamic().
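As a rough illustration of the hash-modulo idea, here is a small Python sketch (plain Python, not Beam code). The record format, the 70/30 skew, and the shard_for helper are made up for the example; zlib.crc32 is used only because it is stable across runs and never negative.

```python
import zlib
from collections import Counter

NUM_SHARDS = 530  # from the answer above: 53 GB / 530 is about 100 MB per shard


def shard_for(record: bytes, num_shards: int = NUM_SHARDS) -> int:
    # crc32 returns a non-negative integer, so the result is in [0, num_shards).
    return zlib.crc32(record) % num_shards


# Skewed input: 70% of the records share one original key (hypothetical data).
records = [f"hot-key|value-{i}".encode() for i in range(70_000)]
records += [f"key-{i % 50}|value-{i}".encode() for i in range(30_000)]

# Shard by a hash of the whole record instead of by the original key.
counts = Counter(shard_for(r) for r in records)
print(len(counts), min(counts.values()), max(counts.values()))
```

Because the shard is derived from the whole record rather than from the skewed key, the hot key's records are spread over many shards instead of landing on a single writer.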

Spark 2.3.1 Structured Streaming Input Rate

I wonder if there is a way to specify the size of the mini-batch in Spark Structured Streaming. That is, rather than only stating the mini-batch interval (Triggers), I would like to state how many rows can be in a mini-batch (DataFrame) per interval.
Is there a way to do that?
Aside from the general capability, I particularly need this in a testing scenario where I have a MemoryStream. I would like Spark to consume a certain amount of data from the MemoryStream, instead of taking all of it at once, to see how the overall application behaves. My understanding is that the MemoryStream data structure needs to be filled before launching the job on it. Hence, how can I observe the mini-batch processing behavior if Spark is able to ingest the entire content of the MemoryStream within the interval that I give?
EDIT1
In the Kafka Integration I have found the following:
maxOffsetsPerTrigger: Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
But that is just for KAFKA integration. I have also seen
maxFilesPerTrigger: maximum number of new files to be considered in every trigger
So it seems these limits are defined per source type. Hence, is there a way to control how data is consumed from a MemoryStream[Row]?
Take a look at the settings below; they may help with your problem (note that they belong to the older DStream-based Spark Streaming API):
1. spark.streaming.backpressure.initialRate
2. spark.streaming.backpressure.enabled

kafka | How to use replica.high.watermark.checkpoint.interval.ms

I've been looking for a way to reduce duplicates or eliminate them entirely, and I found an interesting property:
replica.high.watermark.checkpoint.interval.ms = 5000 (default)
The frequency with which the high watermark is saved out to disk
I also came across a link which says:
replica.high.watermark.checkpoint.interval.ms property can affect throughput. Also, we can mark the last point where we read information while reading from a partition. In this way, we have a checkpoint from which to move forward without having to reread prior data, if we have to go back and locate the missing data. So, we will never lose a message, if we set the checkpoint watermark for every event.
First, how do I use replica.high.watermark.checkpoint.interval.ms?
Second, is there any way to reduce duplicates using this property?
As far as I know, the high watermark indicates the last record that consumers can see, as it is the last record that has been fully replicated for that partition. This seems to indicate that it is used to prevent a consumer from consuming a record that is not yet fully replicated across all of its brokers, so that you don't consume something that could end up lost, leading to a bad state.
Changing the interval at which the high watermark is checkpointed to disk does not seem like it would reduce duplication of messages. It could, however, have a slight performance impact (smaller interval = more disk writes).
For reducing duplication, I'd probably look at the Kafka exactly-once semantics introduced in 0.11.
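To illustrate what the high watermark means in the explanation above, here is a deliberately simplified Python model. The broker names and offsets are invented, and real brokers track this state internally; the sketch only assumes the usual definition that the high watermark is the smallest log end offset among the in-sync replicas.

```python
def high_watermark(log_end_offsets):
    """Simplified model: smallest log end offset among the in-sync replicas."""
    return min(log_end_offsets.values())


def visible_to_consumer(offset, log_end_offsets):
    # Consumers only see records below the high watermark,
    # i.e. records that every in-sync replica already has.
    return offset < high_watermark(log_end_offsets)


# Hypothetical in-sync replica set with each replica's log end offset.
isr = {"broker-1": 120, "broker-2": 118, "broker-3": 115}

print(high_watermark(isr))
print(visible_to_consumer(114, isr), visible_to_consumer(115, isr))
```

Nothing in this model depends on how often the value is written to disk, which is why tuning the checkpoint interval should not change which records a consumer sees or whether it sees duplicates.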

Avoiding small files from Kafka connect using HDFS connector sink in distributed mode

We have a topic with 3 partitions receiving messages at a rate of 1 message per second, and I am using the HDFS connector to write the data to HDFS in Avro format (the default). It generates files with sizes in the KB range, so I tried altering the properties below in the HDFS connector configuration:
"flush.size":"5000",
"rotate.interval.ms":"7200000"
but the output is still small files, so I need clarity on the following points to solve this issue:
1. Is the flush.size property mandatory? If we do not set flush.size, how does the data get flushed?
2. If we set the flush size to 5000 and the rotate interval to 2 hours, it flushes the data every 2 hours for the first 3 intervals, but after that it flushes at irregular times. Here are the file creation times (19:14, 21:14, 23:15, 01:15, 06:59, 08:59, 12:40, 14:40); note the mismatched intervals. Is it because some properties override others? That takes me to the third question.
3. What is the order of precedence for flushing if we set all of the properties below (flush.size, rotate.interval.ms, rotate.schedule.interval.ms)?
4. Increasing the message rate and reducing the number of partitions does increase the size of the flushed data. Is that the only way to control small files? How should we set the properties if the rate of input events varies and is not stable?
It would be a great help if you could share documentation on handling small files in Kafka Connect with the HDFS connector. Thank you.
If you are using a TimeBasedPartitioner and the messages do not consistently have increasing timestamps, then a single writer task will end up dumping files whenever it sees a message with an earlier timestamp within rotate.interval.ms of reading any given record.
If you want consistent two-hour partition windows, then you should set rotate.interval.ms=-1 to disable it, and then set rotate.schedule.interval.ms to some reasonable number that is within the partition duration window.
E.g. you have 7200 messages every 2 hours, and it's not clear how large each message is, but let's say 1MB. Then, you'd be holding ~7GB of data in a buffer, and you need to adjust your Connect heap sizes to hold that much data.
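The estimate above can be written out as a small Python sketch. The property names are the ones discussed in this thread, but the values, the buffered_bytes helper, and the 1 MB message size are assumptions for illustration, and the dictionary is only a fragment, not a complete connector configuration.

```python
import json


def buffered_bytes(msgs_per_sec: float, window_ms: int, avg_msg_bytes: int) -> int:
    """Rough upper bound on the data held open during one rotation window."""
    return int(msgs_per_sec * (window_ms / 1000) * avg_msg_bytes)


TWO_HOURS_MS = 2 * 60 * 60 * 1000

# Fragment only: scheduled rotation every 2 hours, record-time rotation disabled.
sink_config_fragment = {
    "flush.size": "5000",
    "rotate.interval.ms": "-1",
    "rotate.schedule.interval.ms": str(TWO_HOURS_MS),
}

# 1 message per second, assumed 1 MB per message (the size is a guess, as noted above).
estimate = buffered_bytes(msgs_per_sec=1, window_ms=TWO_HOURS_MS, avg_msg_bytes=1_000_000)

print(json.dumps(sink_config_fragment, indent=2))
print(f"~{estimate / 1e9:.1f} GB buffered per window")
```

Plugging in your real average message size gives a first idea of how much heap the Connect workers would need for a given rotation window.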
The order of precedence is:
1. scheduled rotation, starting from the top of the hour
2. flush size or "message-based" time rotation, whichever occurs first, or when a record is seen as "before" the start of the current batch
And I believe flush size is mandatory for the storage connectors.
Overall, systems such as Uber's Hudi or the previous Kafka-HDFS tool of Camus Sweeper are better equipped to handle small files. Connect Sink Tasks only care about consuming from Kafka and writing to downstream systems; the framework itself doesn't recognize that Hadoop prefers larger files.

How can I control number of rows and/or output file size in Spark streaming when writing to HDFS - hive?

I am using Spark Streaming to read and process messages from Kafka and write them to HDFS (Hive).
Since I want to avoid creating many small files that clutter the filesystem, I would like to know if there is a way to ensure a minimum file size, and/or to force a minimum number of output rows per file, with the exception of a timeout.
Thanks.
As far as I know, there is no way to control the number of lines in your output files. But you can control the number of output files.
Controlling that and considering your dataset size may help you with your needs, since you can calculate the size of each file in your output. You can do that with the coalesce and repartition commands:
df.coalesce(2).write.save(outputPath)
df.repartition(2).write.save(outputPath)
Both of them are used to create the number of partitions given as parameter. So if you set 2, you should have 2 files in your output.
The difference is that with repartition you can both increase and decrease the number of partitions, while with coalesce you can only decrease it.
Also, keep in mind that repartition performs a full shuffle to distribute the data equally among the partitions, which may be expensive in resources and time. On the other hand, coalesce does not perform a full shuffle; it combines existing partitions instead.
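As a rough sketch of the sizing calculation mentioned above, the number of partitions can be derived from an estimated dataset size and a target file size. The helper name and the example sizes below are assumptions, and the on-disk size will usually differ from the estimate because of compression and file format overhead.

```python
import math


def partitions_for_target(total_bytes: int, target_file_bytes: int) -> int:
    """Number of output files needed so each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical example: about 10 GiB of data, aiming for roughly 128 MiB per file.
num_files = partitions_for_target(10 * 1024**3, 128 * 1024**2)
print(num_files)
```

The resulting number would then be passed to coalesce or repartition before the write.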
You can find an awesome explanation in this other answer here