Flink incremental checkpointing compaction - streaming

We have a forever-running Flink job which reads from Kafka and creates sliding time windows with window sizes of 1 hr, 2 hr, up to 24 hr, and slide intervals of 1 min, 10 min, up to 1 hour.
Basically it's: KafkaSource.keyBy(keyId).SlidingWindow(size, slide).reduce.sink
I recently enabled checkpointing with the RocksDB backend, incremental=true, and HDFS as persistent storage.
For the last 4-5 days I have been monitoring the job and it is running fine, but I am concerned about the checkpoint size. Since RocksDB does compaction and merging, the size is not growing forever, but it still grows and has now reached 100 GB.
So, what is the best way to checkpoint forever-running jobs?
The job will have millions of unique keyIds. So, will there be one state per key for each operator when checkpointing?
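For reference, a minimal sketch of this kind of setup (the topic, key selector, reduce function, checkpoint interval and the use of processing time are placeholder assumptions; only the 1 hr / 1 min window is shown):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints persisted to HDFS (the 5-minute interval is an assumption).
        env.enableCheckpointing(5 * 60 * 1000L);
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")                        // placeholder topic
                .setGroupId("sliding-window-job")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka")
                .keyBy(value -> value.split(",")[0])        // placeholder keyId extraction
                .window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(1)))
                .reduce((a, b) -> b)                        // placeholder reduce function
                .print();                                   // placeholder sink

        env.execute("sliding-window-job");
    }
}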

If the total number of your keys is under control, you don't need to worry about the checkpoint size growing; it will eventually converge.
If you still want to cut the checkpoint size, you can set a TTL on your state, so that state which has not been touched for a period of time can be treated as expired and cleaned up.
Flink state is associated with key groups, where each key group covers a range of keys; the key group is the unit in which Flink state is distributed. Each key's state is included in a completed checkpoint. However, with incremental mode, consecutive checkpoints share .sst files, so the reported incremental checkpoint size is not as large as the full checkpoint size. If some keys are not updated during the last checkpoint interval, their state won't be uploaded this time.
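If you go the TTL route, a minimal sketch looks roughly like the following, assuming a 24-hour TTL and cleanup in RocksDB's compaction filter (the descriptor name and type are placeholders, and this would live wherever you register the keyed state, e.g. in open() of a rich function). Note that TTL applies to keyed state you declare yourself, not to the built-in window state:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(24))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupInRocksdbCompactFilter(1000)   // drop expired entries as RocksDB compacts
        .build();

ValueStateDescriptor<Long> descriptor =
        new ValueStateDescriptor<>("perKeyAggregate", Long.class);   // placeholder state
descriptor.enableTimeToLive(ttlConfig);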

Related

Kafka streams changelog consumption rate drops during state rebuilding

I recently started working with Kafka, and I'm having a hard time debugging a drop in the changelog consumption rate during state rebuilding.
TL;DR: The Grafana graph of the changelog lag after deleting the PVC and the pod, and waiting for the pod to start running again, has a shape that doesn't look like what I'd expect:
The graph indicates that the lag on the changelog topic is consumed quite fast at the beginning, but consumption slows down over time.
The whole process stretches over 30 minutes for a changelog of 14 GB.
More information about the most recent config:
Provider: AWS
storageClass: io1
storageSize: 3TB
podMemory: 25GB
JVM memory: 16GB
UPD: 24 partitions, no data skew
RocksDB params (see the sketch after this list for how they are typically applied):
writeBufferSize: 2MB
blockSize: 32KB
maxWriteBufferNumber: 4
minWriteBufferNumberToMerge: 2
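For context, a sketch of how the parameters above are usually applied in Kafka Streams through a custom RocksDBConfigSetter; the class name is a placeholder and the values simply mirror the list above:

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class TunedRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockSize(32 * 1024L);              // blockSize: 32KB
        options.setTableFormatConfig(tableConfig);
        options.setWriteBufferSize(2 * 1024 * 1024L);      // writeBufferSize: 2MB
        options.setMaxWriteBufferNumber(4);                // maxWriteBufferNumber: 4
        options.setMinWriteBufferNumberToMerge(2);         // minWriteBufferNumberToMerge: 2
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing opened above needs explicit closing in this sketch.
    }
}

// Registered via: props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, TunedRocksDBConfig.class);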
The process I follow is simply deleting the PVCs and the pods and measuring the time it takes for a pod to start running again and for the changelog topic's lag to go back to 0.
Results of my tuning sessions:
increased the storage size from 750GB to 3TB; result: rebuilding the state for the 14GB topic went from 68 mins to 50 mins, no change in the graph shape;
changed the storage class from gp2 to io1; result: rebuilding the state for the 14GB topic went from 50 mins to 30 mins, no change in the graph shape;
changed RocksDB maxWriteBufferNumber from 2 to 4 and minWriteBufferNumberToMerge from 1 to 2; result: no change in speed nor in the graph shape;
changed pod memory from 14GB to 25GB and JVM memory from 9GB to 16GB; result: no change in speed nor in the graph shape.
Where else should I look? The situation looks to me like memory saturation, but garbage collection time stays under 5%, and increasing the memory didn't help at all. Thank you!

MongoDB pod consuming memory even though it is in an idle state

When inserting data into MongoDB, its memory usage increases; then the database is dropped and the connections are closed, but the memory usage still continues to increase.
I have already configured the WiredTiger cache to 700 MB.
As you can see in the graph in the screenshot attached below, data insertion and deletion take place every 30 minutes and consume at most 10 minutes, after which the connection is closed. Yet, as the graph shows, the memory usage continues to increase until it reaches its maximum limit and the Kubernetes pod starts showing trouble.

Kafka + how to avoid running out of disk storage

I want to describe the following case that occurred on one of our production clusters.
We have an Ambari cluster with HDP version 2.6.4.
The cluster includes 3 Kafka machines, each with a 5 TB disk.
What we saw is that all Kafka disks were at 100% usage, so the Kafka disks were full, and this is the reason all the Kafka brokers failed:
df -h /kafka
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 5T 5T 23M 100% /var/kafka
After investigation we saw that log.retention.hours was set to 7 days.
So it seems that purging happens only after 7 days, and maybe this is the reason the Kafka disks fill up to 100% even though they are huge (5 TB).
What we want to do now is figure out how to avoid this case in the future.
So:
How do we avoid fully used capacity on the Kafka disks?
What do we need to set in the Kafka config in order to purge the Kafka disks according to the disk size - is that possible?
And how do we determine the right value of log.retention.hours? According to the disk size, or something else?
In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to configure the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the size of the data produced within the 5-minute window, before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments which have reached the retention threshold and a 3rd one which is the active segment currently being written to).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
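For illustration, a sketch of applying the equivalent topic-level settings with the Java AdminClient; retention.bytes, segment.bytes and retention.ms are the topic-level counterparts of the broker-level log.* properties above, while the topic name, sizes and bootstrap address are assumptions:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ApplyRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(
                    // ~512 MB per partition (size-based retention)
                    new AlterConfigOp(new ConfigEntry("retention.bytes",
                            String.valueOf(512L * 1024 * 1024)), AlterConfigOp.OpType.SET),
                    // 256 MB segments so old data can be deleted in smaller chunks
                    new AlterConfigOp(new ConfigEntry("segment.bytes",
                            String.valueOf(256L * 1024 * 1024)), AlterConfigOp.OpType.SET),
                    // keep at most 48 hours of data (time-based retention)
                    new AlterConfigOp(new ConfigEntry("retention.ms",
                            String.valueOf(48L * 60 * 60 * 1000)), AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}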
I think you have three options:
1) Increase the size of the disks until you notice that, with your current retention policy of 7 days, you have a comfortable amount of free space. For me a comfortable amount free is around 40% (but that is personal preference).
2) Lower your retention policy to, for example, 3 days and see if your disks are still full after a period of time. The right retention period varies between use cases; if you don't need a backup of the data on Kafka when something goes wrong, then just pick a very low retention period. If it is crucial that you keep those 7 days' worth of data, then you should not change the retention period but the disk sizes.
3) A combination of the options 1 and 2.
More information about optimal retention policies: Kafka optimal retention and deletion policy

Storm large window size causing executor to be killed by Nimbus

I have a Java Spring application that submits topologies to a Storm (1.1.2) Nimbus based on a DTO which describes the structure of the topology.
This is working great except for very large windows. I am testing it with several varying sliding and tumbling windows. None give me any issues besides a 24-hour sliding window which advances every 15 minutes. The topology receives ~250 messages/s from Kafka and simply windows them using a simple timestamp extractor with a 3-second lag (much like all the other topologies I am testing).
I have played with the workers and memory allowances a great deal to try to figure this out, but my default configuration is 1 worker with a 2048 MB heap size. I've also tried reducing the lag, which had minimal effect.
I think it's possible the window size is getting too large and the worker is running out of memory, which delays the heartbeats or the ZooKeeper connection check-in, which in turn causes Nimbus to kill the worker.
What happens is that every so often (~11 window advances) the Nimbus logs report that the executor for that topology is "not alive", and the worker logs for that topology show either a KeeperException where the topology can't communicate with ZooKeeper, or a java.lang.ExceptionInInitializerError: null with a nested PrivilegedActionException.
When the topology is assigned a new worker, the aggregation I was doing is lost. I assume this is happening because the window holds at least 250*60*15*11 (messagesPerSecond*secondsPerMinute*15mins*windowAdvancesBeforeCrash) messages, which are around 84 bytes each. A complete window will end up holding 250*60*15*97 messages (messagesPerSecond*secondsPerMinute*15mins*15minIncrementsIn24HoursPlusAnExpiredWindow). That is ~1.8 GB if my math is right, so I feel the worker memory should cover the window, or at least more than 11 window advances' worth.
I could increase the memory slightly, but not much. I could also decrease the amount of memory per worker and increase the number of workers per topology, but I was wondering if there is something I'm missing. Could I just increase the worker heartbeat timeout so that there is more time for the executor to check in before being killed, or would that be bad for some reason? If I changed the heartbeat, it would be in the Config map for the topology. Thanks!
This was caused by the workers running out of memory. From looking at the Storm code, it looks like Storm keeps every message in a window around as a Tuple (which is a fairly large object). With a high rate of messages and a 24-hour window, that's a lot of memory.
I fixed this by adding a preliminary bucketing bolt that aggregates all the tuples in an initial 1-minute window, which reduces the load on the main window significantly because it now receives one tuple per minute. The bucketing window doesn't run out of memory since it only holds one minute of tuples at a time.
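Roughly, the wiring looks like the sketch below. KafkaMessageSpout, MinuteBucketBolt and DailyAggregateBolt are hypothetical placeholders for the actual spout and aggregation bolts (assumed to extend BaseWindowedBolt), and the field names and parallelism are assumptions:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;

public class TwoStageWindowTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaMessageSpout(), 1);          // hypothetical spout

        // Stage 1: tumbling 1-minute buckets that pre-aggregate the raw ~250 msg/s,
        // so only one tuple per minute reaches the large sliding window.
        builder.setBolt("minute-buckets",
                new MinuteBucketBolt()                                        // hypothetical bolt
                        .withTumblingWindow(BaseWindowedBolt.Duration.minutes(1))
                        .withTimestampField("ts")
                        .withLag(BaseWindowedBolt.Duration.seconds(3)),
                2)
            .shuffleGrouping("kafka-spout");

        // Stage 2: 24-hour window sliding every 15 minutes over the pre-aggregated buckets.
        builder.setBolt("daily-aggregate",
                new DailyAggregateBolt()                                      // hypothetical bolt
                        .withWindow(BaseWindowedBolt.Duration.hours(24),
                                    BaseWindowedBolt.Duration.minutes(15)),
                1)
            .shuffleGrouping("minute-buckets");
        return builder;
    }
}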

Spark - Checkpointing implication on performance

From Spark's DStreamCheckpointData, it seems like the checkpointing mechanism collects the time window to be checkpointed and updates/writes it to checkpoint files. I am trying to understand a couple of things specifically:
At every checkpoint interval, does it read all the previous checkpoint data and then update the current state? If so, what is the impact on performance when the checkpoint state grows very large? That would certainly slow down a long-running streaming context.
Is there any general rule or formula to calculate the checkpoint interval for different data ingestion rates, sliding windows and batch intervals?
Yes, checkpointing is a blocking operation, so it stops processing while it is active. The length of time for which computation is stopped by this serialization of state depends on the write performance of whichever medium you're writing to (have you heard of Tachyon/Alluxio?).
On the other hand, prior checkpoint data is not read on every new checkpointing operation: the stateful information is already being maintained in Spark's cache as the stream is being operated upon (checkpoints are just a backup of it). Let's imagine the simplest state possible, a sum of all the integers met in a stream of integers: on each batch you compute a new value for this sum based on the data you see in the batch, and you can store this partial sum in the cache (see above). Every five batches or so (depending on your checkpointing interval) you write this sum to disk. Now, if you lose one executor (one partition) in a subsequent batch, you can reconstruct the total by re-processing only the partitions for that executor for up to the last five batches (by reading the disk to find the last checkpoint and re-processing the missing parts of the last up-to-five batches). But in normal processing (no incidents), you have no need to access the disk.
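To make that concrete, a minimal sketch of such a running sum with updateStateByKey; the socket source, checkpoint directory and 30-second batch interval are assumptions for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class RunningSum {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("running-sum").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        jssc.checkpoint("hdfs:///tmp/checkpointDir");   // backup location for the state

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        // Key everything under a single key so the whole stream contributes to one sum.
        JavaPairDStream<String, Integer> ones =
                lines.mapToPair(s -> new Tuple2<>("sum", Integer.parseInt(s.trim())));

        // The state (the partial sum) lives in Spark's cache between batches;
        // checkpoints merely persist it to the checkpoint directory periodically.
        JavaPairDStream<String, Integer> total = ones.updateStateByKey(
                (values, state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : values) sum += v;
                    return Optional.of(sum);
                });

        total.print();
        jssc.start();
        jssc.awaitTermination();
    }
}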
There is no general formula that I know of since you would have to fix the maximum amount of data you're willing to recover from. Old documentation gives a rule of thumb.
But in the case of streaming, you can think of your batch interval as a computation budget. Let's say you have a batch interval of 30 seconds. On each batch you have 30 seconds to allocate to writing to disk or computing (the batch processing time). To make sure your job is stable, you have to ensure that your batch processing time does not go over budget; otherwise you will fill up the memory of your cluster (if it takes you 35 seconds to process and "flush" 30 seconds of data, then on each batch you ingest more data than you flush during the same time; since your memory is finite, this eventually leads to an overflow).
Let's say your average batch processing time is 25 seconds. So on each batch you have 5 seconds of unallocated time in your budget. You can use that for checkpointing. Now consider how long checkpointing takes (you can tease this out of the Spark UI): 10 seconds? 30 seconds? One minute?
If it takes you c seconds to checkpoint on a bi seconds batch interval, with a bp seconds batch processing time, you will "recover" from checkpointing (process the data that still comes in during that time of no processing) in:
ceil(c / (bi - bp)) batches.
If it takes you k batches to "recover" from checkpointing (i.e. to recover the lateness induced by the checkpoint), and you checkpoint every p batches, you need to make sure you enforce k < p to avoid an unstable job. So in our example:
if it takes you 10 seconds to checkpoint, it will take you 10 / (30 - 25) = 2 batches to recover, so you can checkpoint every 2 batches (or more, i.e. less frequently, which I would advise to account for unplanned loss of time);
if it takes you 30 seconds to checkpoint, it will take you 30 / (30 - 25) = 6 batches to recover, so you can checkpoint every 6 batches (or more);
if it takes you 60 seconds to checkpoint, you can checkpoint every 12 batches (or more).
Note that this assumes your checkpointing time is constant, or at least can be bounded by a maximal constant. Sadly, this is often not the case: a common mistake is to forget to delete part of the state in stateful streams using operations such as updateStateByKey or mapWithState, yet the size of the state should always be bounded. Note that on a multi-tenant cluster, the time spent writing to disk is not always constant: other jobs may be trying to access the disk concurrently on the same executor, starving you of disk IOPS (in this talk Cloudera reports IO throughput degrading dramatically beyond 5 concurrent write threads).
Note that you should set the checkpoint interval explicitly, as the default is the first batch that occurs more than the default checkpoint interval (i.e. 10 s) after the last batch. For our example of a 30 s batch interval, that means you checkpoint every other batch. That is often too frequent for pure fault-tolerance reasons (if reprocessing a few batches doesn't have that huge a cost), even if allowable per your computation budget, and it leads to periodic spikes in the processing-time graph.
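Continuing the hypothetical running-sum sketch above, the interval can be set explicitly on the stateful DStream; following the budget example (30-second batches, 6 batches needed to recover), something like:

// Checkpoint the state every 6 batches of 30 s = 180 s instead of the default.
total.checkpoint(Durations.seconds(180));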