Kafka Streams changelog consumption rate drops during state rebuilding

I recently started working with Kafka, and I'm having a hard time debugging a drop in the changelog consumption rate during state rebuilds.
TL;DR: After deleting the PVC and the pod and waiting for the pod to start running again, the Grafana graph of the changelog lag has a shape I wouldn't expect:
The graph shows the changelog topic's lag being consumed quite fast at the beginning, but consumption slows down over time.
The whole process stretches over 30 minutes for a changelog of about 14GB.
More information about the most recent config:
Provider: AWS
storageClass: io1
storageSize: 3TB
podMemory: 25GB
JVM memory: 16GB
UPD: 24 partitions, no data skew
RocksDB params:
writeBufferSize: 2MB
blockSize: 32KB
maxWriteBufferNumber: 4
minWriteBufferNumberToMerge: 2
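(For reference, these values map onto a Kafka Streams RocksDBConfigSetter roughly like the sketch below; the class name is illustrative, not my exact code, and it assumes a recent Kafka Streams version.)

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

// Illustrative config setter mirroring the parameters listed above.
public class TunedRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        options.setWriteBufferSize(2 * 1024 * 1024L);   // writeBufferSize: 2MB
        options.setMaxWriteBufferNumber(4);             // maxWriteBufferNumber: 4
        options.setMinWriteBufferNumberToMerge(2);      // minWriteBufferNumberToMerge: 2

        // keep the existing table config and only change the block size
        final BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setBlockSize(32 * 1024L);           // blockSize: 32KB
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // nothing to release here
    }
}

It is registered via props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, TunedRocksDBConfig.class).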
The process I follow is simply deleting the PVCs and the pods, then measuring how long it takes for the pod to start running again and for the changelog topic's lag to go back to 0.
Results of my tuning sessions:
increased the storage size from 750GB to 3TB; result: rebuilding state for the 14GB topic went from 68 mins to 50 mins, no change in the graph shape;
changed the storage class from gp2 to io1; result: rebuilding state for the 14GB topic went from 50 mins to 30 mins, no change in the graph shape;
changed RocksDB maxWriteBufferNumber from 2 to 4 and minWriteBufferNumberToMerge from 1 to 2; result: no change in speed nor in the graph shape;
changed pod memory from 14GB to 25GB and JVM memory from 9GB to 16GB; result: no change in speed nor in the graph shape.
The situation looks to me like memory saturation, but garbage collection time stays under 5%, and increasing the memory didn't help at all. Where else should I look? Thank you!
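UPD: in case it's useful for reproducing, restore progress can also be logged from inside the app with a state restore listener; a minimal sketch (listener name and logging are illustrative):

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Illustrative listener that logs restore progress per changelog partition.
public class LoggingRestoreListener implements StateRestoreListener {
    @Override
    public void onRestoreStart(final TopicPartition partition, final String storeName,
                               final long startingOffset, final long endingOffset) {
        System.out.printf("Restore of %s %s started: offsets %d..%d%n",
                storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(final TopicPartition partition, final String storeName,
                                final long batchEndOffset, final long numRestored) {
        System.out.printf("Restored %d records of %s %s, now at offset %d%n",
                numRestored, storeName, partition, batchEndOffset);
    }

    @Override
    public void onRestoreEnd(final TopicPartition partition, final String storeName,
                             final long totalRestored) {
        System.out.printf("Restore of %s %s finished: %d records%n",
                storeName, partition, totalRestored);
    }
}

// registered before streams.start():
// streams.setGlobalStateRestoreListener(new LoggingRestoreListener());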

Related

Flink Incremental CheckPointing Compaction

We have a forever-running Flink job which reads from Kafka and creates sliding time windows (window sizes: 1 hr, 2 hr, up to 24 hr; slide intervals: 1 min, 10 min, up to 1 hr).
Basically it's: KafkaSource.keyBy(keyId).SlidingWindow(stream, slide).reduce.sink
I recently enabled checkpointing with the RocksDB backend, incremental=true, and HDFS persistent storage.
For the last 4-5 days I have been monitoring the job and it's running fine, but I am concerned about the checkpoint size. Since RocksDB does compaction and merging, the size is not growing forever, but it still grows and has now reached 100 GB.
So, what is the best way to checkpoint forever-running jobs?
It will have millions of unique keyIds. So, will there be one state per key for each operator when checkpointing?
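For clarity, the pipeline above is roughly this shape in the DataStream API (a hedged sketch only: the topic name, key extraction, reduce function, and sink are placeholders, not our real code):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class SlidingWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
        props.setProperty("group.id", "sliding-window-job");   // placeholder

        DataStream<String> source = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        source
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) {
                        return value.split(",")[0];  // keyId assumed to be the first CSV field
                    }
                })
                .window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(1)))  // (window size, slide)
                .reduce((a, b) -> b)   // placeholder reduce
                .print();              // placeholder sink

        env.execute("sliding-window-job");
    }
}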
If the total number of your keys is under control, you don't need to worry about the checkpoint size growing; it will eventually converge.
If you still want to cut the checkpoint size, you can set a TTL on your state, so that state which hasn't been touched for some period of time can be regarded as expired.
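A minimal sketch of what that could look like for a keyed ValueState (the descriptor name and TTL value are just examples, and it assumes a Flink version that supports cleanup in the RocksDB compaction filter):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// inside the open() method of the stateful function:
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))                                    // expire entries not written for 24h
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupInRocksdbCompactFilter(1000)                           // purge expired entries during compaction
        .build();

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("lastValue", Long.class);
descriptor.enableTimeToLive(ttlConfig);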
Flink state is associated with key-groups; a key-group is a group of keys and is the unit of Flink state. Each key's state is included in a completed checkpoint. However, with incremental mode some checkpoints share .sst files, so the incremental checkpoint size is not as large as the full checkpoint size. If some keys are not updated during the last checkpoint interval, their state won't be uploaded this time.

Kafka broker disk unstable

I have a Kafka broker running in k8s, and I notice a recurring problem when we write to disk.
I have two metrics: the incoming bytes on the broker and the bytes written to disk.
As you can see in the graph, the incoming rate is quite stable, but in the graph below you can see that the disk write rate is much more unstable and sometimes drops below 30 MB/s, while the incoming rate never goes below 80 MB/s.
Looking at the resources of the Kafka broker, there is enough memory and CPU.
Is this a problem with the disk? Running the dd command, we can write at 500 MB/s.
Here I created another panel where you can see that, four times, we spend around 1 minute writing to disk a lot less than the incoming bytes. Those gaps are killing my app with OOM.

NiFi: poor performance of ConsumeKafkaRecord_2_0 and ConsumeKafka_2_0

I'm trying to load messages from a relatively large topic (billion+ records, more than 100 GiB, single partition) using Apache NiFi (nifi-1.11.4-RC1, OpenJDK 8, RHEL7), but performance seems to be far too low:
1248429 messages (276.2 MB) per 5 minutes for ConsumeKafka_2_0 and 295 batches (282.5 MB) for ConsumeKafkaRecord_2_0. I.e. only 4161 messages (920 KB) per second.
Results of kafka-consumer-perf-test.sh (same node, same consumer group and same topic) are more impressive:
263.4 MB (1190937 records) per second. Too much difference for any reasonable overhead.
I've configured the cluster according to Best practices for setting up a high performance NiFi installation, but throughput didn't increase.
Each node has 256 GB RAM and 20 cores, Maximum Timer Driven Thread Count is set to 120, but NiFi GUI shows only 1 or 2 active threads, and CPU load is almost zero, so is disk queue.
I've tested several flows, but even ConsumeKafka_2_0 with autoterminated 'success' relationship shows the same speed.
Is it possible to increase performance of these processors? It looks like some artificial limit or throttle, because I couldn't find any bottleneck...
Help, please, I'm completely stuck!
UPD1:
# JVM memory settings
java.arg.2=-Xms10240m
java.arg.3=-Xmx10240m
Scheduling Strategy : Timer driven
Concurrent Tasks : 64
Run Schedule : 0 sec
Execution : All nodes
Maximum Timer Driven Thread Count : 120
Maximum Event Driven Thread Count : 20
UPD2:
When I consume a topic with many partitions or several topics together with one ConsumeKafka_2_0 processor, or when I use several processors with different consumer groups on the same topic, total throughput increases accordingly.
So, Maximum Timer Driven Thread Count and Concurrent Tasks aren't the primary culprits. The problem is somewhere in task scheduling, or in the processor itself.
We've had success increasing ConsumeKafka throughput by changing the processor's yield duration from 1 to 0 seconds and increasing the socket's buffer size to 1 MB.
receive.buffer.bytes=1048576
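As far as I know, dynamic properties like this on the ConsumeKafka processors are passed straight through to the underlying Kafka consumer client, i.e. the equivalent of setting the following on a plain Java consumer (shown only for reference):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 1048576); // receive.buffer.bytes = 1 MB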
You may find other things to try here:
https://blog.newrelic.com/engineering/kafka-best-practices/

Mongodb pod consuming memory even though it is in idle state

When inserting data into MongoDB, its memory usage increases; then the database is dropped and connections are closed, but the memory usage still continues to increase.
I have already configured the WiredTiger cache to 700MB.
As you can see in the graph in the attached screenshot, data insertion and deletion take place every 30 minutes and consume at most 10 minutes, after which the connection is closed; but as you can see in the graph, the memory usage keeps increasing until it reaches its max limit, and then the Kubernetes pod starts showing trouble.

Kafka + how to avoid running out of disk storage

I want to describe the following case that happened on one of our production clusters.
We have an Ambari cluster with HDP version 2.6.4.
The cluster includes 3 Kafka machines, and each Kafka machine has a 5 TB disk.
What we saw is that all the Kafka disks were at 100% usage, so the Kafka disks were full, and this is the reason all the Kafka brokers failed.
df -h /kafka
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 5T 5T 23M 100% /var/kafka
After investigation, we saw that log.retention.hours was set to 7 days.
So it seems that purging happens after 7 days, and maybe this is the reason the Kafka disks are 100% full even though they are huge (5 TB).
What we want to do now is figure out how to avoid this case in the future.
So we want to know:
how do we avoid filling up the capacity of the Kafka disks?
What do we need to set in the Kafka config in order to purge data according to the disk size; is that possible?
And how do we know the right value for log.retention.hours: should it be based on the disk size, or something else?
In Kafka, there are two types of log retention: size and time retention. The former is triggered by log.retention.bytes, while the latter by log.retention.hours.
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic: if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments which have reached retention, and a 3rd active segment that data is currently written to).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
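To make that math concrete, here is an illustrative sketch of the relevant broker settings together with the worst-case estimate; the numbers are examples only, not a recommendation for your 5 TB disks:

log.retention.bytes=1073741824          # 1GB kept per partition as a lower bound
log.segment.bytes=536870912             # 512MB segments -> up to ~3 segments (~1.5GB) per partition
log.retention.check.interval.ms=300000  # 5 minutes between retention checks
log.retention.hours=48                  # time-based retention as a second safety net

# rough worst case per broker:
#   (partitions hosted) x (log.retention.bytes + log.segment.bytes + bytes produced per check interval)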
I think you have three options:
1) Increase the size of the disks until you notice that you have a comfortable amount of space free thanks to your increase and current retention policy of 7 days. For me a comfortable amount free is around 40% (but that is personal preference).
2) Lower your retention policy, for example to 3 days, and see if your disks are still full after a period of time. The right retention period varies between different use cases. If you don't need a backup of the data on Kafka when something goes wrong, then just pick a very low retention period. If it is crucial that you have those 7 days' worth of data, then you should not change the period but the disk sizes.
3) A combination of options 1 and 2.
More information about optimal retention policies: Kafka optimal retention and deletion policy