Spark - Checkpointing implication on performance - scala

From Spark's DStreamCheckpointData, it seems like the checkpointing mechanism collects the time window to be checkpointed and updates/writes it to checkpoint files. I am trying to understand a couple of things specifically:
At every checkpoint interval, does it read all the previous checkpoint data and then update the current state? If so, what is the impact on performance when the checkpoint state grows very large? That would certainly slow down a long-running streaming context.
Is there any general rule or formula to calculate the checkpoint interval for different data ingestion rates, sliding windows and batch intervals?

Yes, checkpointing is a blocking operation, so it stops processing for its duration. The length of time for which computation is stopped by this serialization of state depends on the write performance of whichever medium you're writing to (have you heard of Tachyon/Alluxio?).
On the other hand, prior checkpoint data is not read on every new checkpointing operation: the stateful information is already maintained in Spark's cache as the stream is operated on (checkpoints are just a backup of it). Let's imagine the simplest possible state, a sum of all the integers met in a stream of integers: on each batch you compute a new value for this sum, based on the data you see in the batch, and you can store this partial sum in cache (see above). Every five batches or so (depending on your checkpointing interval) you write this sum to disk. Now, if you lose one executor (one partition) in a subsequent batch, you can reconstruct the total by re-processing only that executor's partitions for up to the last five batches (by reading the disk to find the last checkpoint, and re-processing the missing parts of the last up-to-five batches). But in normal processing (no incidents), you have no need to access the disk.
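To make that concrete, here is a minimal sketch of the running-sum example with the DStream API; it assumes a StreamingContext named ssc and an input DStream[Int] named ints already exist (both names and the checkpoint path are placeholders):

    // Minimal sketch: the partial sum lives in Spark's state store/cache between
    // batches; the checkpoint directory only holds a periodic backup of it.
    ssc.checkpoint("hdfs:///tmp/checkpoints")          // placeholder path

    val runningSum = ints
      .map(i => ("total", i))                          // single key, so all values fold into one sum
      .updateStateByKey[Int] { (batchValues: Seq[Int], previous: Option[Int]) =>
        Some(previous.getOrElse(0) + batchValues.sum)  // new partial sum for this batch
      }

    runningSum.print()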
There is no general formula that I know of, since it depends on how much data you are willing to re-process on recovery. The old documentation gives a rule of thumb.
But in the case of streaming, you can think of your batch interval as a computation budget. Let's say you have a batch interval of 30 seconds. On each batch you have 30 seconds to allocate to writing to disk or computing (the batch processing time). To make sure your job is stable, you have to ensure that your batch processing time does not go over budget; otherwise you will fill up the memory of your cluster (if it takes you 35 seconds to process and "flush" 30 seconds of data, then on each batch you ingest more data than you flush during the same time; since your memory is finite, this eventually leads to an overflow).
Let's say your average batch processing time is 25 seconds. On each batch, you then have 5 seconds of unallocated time in your budget. You can use that for checkpointing. Now consider how long checkpointing takes you (you can tease this out of the Spark UI). 10 seconds? 30 seconds? One minute?
If it takes you c seconds to checkpoint on a batch interval of bi seconds, with a batch processing time of bp seconds, you will "recover" from checkpointing (process the data that still comes in during that time of no processing) in:
ceil(c / (bi - bp)) batches.
If it takes you k batches to "recover" from checkpointing (i.e. to catch up on the lateness induced by the checkpoint), and you are checkpointing every p batches, you need to enforce k < p to keep the job stable. So in our example:
if it takes you 10 seconds to checkpoint, it will take you 10 / (30 - 25) = 2 batches to recover, so you can checkpoint every 2 batches (or more, i.e. less frequently, which I would advise to account for unplanned loss of time);
if it takes you 30 seconds to checkpoint, it will take you 30 / (30 - 25) = 6 batches to recover, so you can checkpoint every 6 batches (or more);
if it takes you 60 seconds to checkpoint, you can checkpoint every 12 batches (or more).
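To make the arithmetic above concrete, here is a small Scala helper reproducing the formula (the helper name is mine, not part of any Spark API):

    // Number of batches needed to absorb the lateness introduced by one checkpoint.
    def recoveryBatches(checkpointSecs: Double, batchIntervalSecs: Double, batchProcessingSecs: Double): Int =
      math.ceil(checkpointSecs / (batchIntervalSecs - batchProcessingSecs)).toInt

    recoveryBatches(10, 30, 25)  // = 2, so checkpoint at most every 2 batches
    recoveryBatches(30, 30, 25)  // = 6
    recoveryBatches(60, 30, 25)  // = 12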
Note that this assumes your checkpointing time is constant, or at least can be bounded by a maximal constant. Sadly, this is often not the case: a common mistake is to forget to delete part of the state in stateful streams using operations such as updateStateByKey or mapWithState, yet the size of the state should always be bounded. Note also that on a multi-tenant cluster, the time spent writing to disk is not always constant: other jobs may be trying to access the disk concurrently on the same executor, starving you of disk IOPS (in this talk Cloudera reports IO throughput degrading dramatically beyond 5 concurrent write threads).
Note that you should set the checkpoint interval yourself: by default, checkpointing happens on the first batch that occurs more than the default checkpoint interval (i.e. 10 s) after the last batch. For our example of a 30 s batch interval, that means you checkpoint every other batch. That is often too frequent for pure fault-tolerance reasons (if reprocessing a few batches doesn't have that huge a cost), even if allowable per your computation budget, and leads to spikes in the performance graph.
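For reference, a sketch of setting the checkpoint interval explicitly on a stateful stream such as the runningSum sketch above (the directory and durations are placeholders):

    import org.apache.spark.streaming.Seconds

    // Directory where the state backups go (HDFS, S3, Alluxio, ...).
    ssc.checkpoint("hdfs:///spark/checkpoints")

    // Checkpoint the stateful stream every 10 batches of 30 s instead of the default.
    runningSum.checkpoint(Seconds(300))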

Related

Flink Incremental Checkpointing Compaction

We have a forever-running Flink job which reads from Kafka and creates sliding time windows with (window sizes: 1 hr, 2 hr to 24 hr) and (slide intervals: 1 min, 10 min to 1 hr).
Basically it's: KafkaSource.keyBy(keyId).SlidingWindow(stream, slide).reduce.sink
I have recently enabled checkpointing with the RocksDB backend, incremental=true, and HDFS as persistent storage.
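A sketch of the described configuration, assuming a Flink version where RocksDBStateBackend takes a checkpoint URI and an incremental flag (the interval and path are placeholders):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Trigger a checkpoint every 10 minutes.
    env.enableCheckpointing(10 * 60 * 1000)

    // RocksDB state backend with incremental checkpoints, persisted to HDFS.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))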
For the last 4/5 days I have been monitoring the job and it's running fine, but I am concerned about the checkpoint size. As RocksDB does compaction and merging, the size is not growing forever, but it still grows and has reached 100 GB so far.
So, what is the best way to checkpoint forever-running jobs?
It will have millions of unique keyIds. So, will there be one state per key for each operator while checkpointing?
If the total number of your keys is under control, you don't need to worry about the checkpoint size growing; it will eventually converge.
If you still want to cut the size of the checkpoints, you can set a TTL on your state, so that state which has not been touched for a period of time can be regarded as expired and dropped.
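A minimal sketch of what that could look like, assuming you manage the keyed state yourself (e.g. in a ProcessFunction) rather than relying on window state, which Flink already cleans up when windows expire; the TTL value and state name are placeholders:

    import org.apache.flink.api.common.state.{StateTtlConfig, ValueStateDescriptor}
    import org.apache.flink.api.common.time.Time

    // Expire state entries that have not been written for 24 hours.
    val ttlConfig = StateTtlConfig
      .newBuilder(Time.hours(24))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .cleanupInRocksdbCompactFilter(1000)   // purge expired entries during RocksDB compaction
      .build()

    val descriptor = new ValueStateDescriptor[java.lang.Long]("runningTotal", classOf[java.lang.Long])
    descriptor.enableTimeToLive(ttlConfig)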
Flink state is associated with a key-group, which is a group of keys; the key-group is the unit of Flink state. Each key's state is included in a completed checkpoint. However, with incremental mode, some checkpoints share .sst files, so the incremental checkpointed size is not as large as the full checkpoint size. If some keys were not updated during the last checkpoint interval, their state won't be uploaded this time.

Sequential I/O performance for old data

I am learning how sequential I/O works in general and with Kafka. As I understand it, all data is written to the hard disk sequentially (as a log), and because of that the hard disk arm is always near the actual data and doesn't have to move a lot (hence little or no seek time, and the same for writes).
But what if we have a lot of Kafka data on the HDD (with the maximum allowed retention policy) and a new consumer group starts to read this data: as I understand it, the new consumer group will start reading from offset 0, and this offset 0 can be on the totally opposite side of the HDD (as time passed, the HDD arm will have slowly moved away from it). So, in this case, the HDD arm has to go back and forth each time, as the old consumers read the recent data and the new consumer group reads the old data. Won't this lead to the opposite effect, i.e. sequential I/O slowing down the entire cluster (at least until the new consumer group has read all the data)?

Slow reads on MongoDB from Spark - weird task allocation

I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. The Spark application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin their reads. The red bars correspond to "Task Deserialization Time", but my understanding is that those executors are simply idle (if there are concurrent stages, they work on something else and come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks starts at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
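For context, a sketch of the read setup as described, assuming MongoDB Spark connector 2.x configuration keys (the URI, database and collection names are placeholders):

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("mongo-full-scan")
      // Connection string and collection are placeholders.
      .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycollection")
      // Partition the read along the collection's shard chunks.
      .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
      .getOrCreate()

    val df = MongoSpark.load(spark)   // full collection scan
    println(df.count())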

Repartitioning Large Files in Spark

I am very new to Spark and got a file of 1 TB to process.
My system specification is:
Each node: 64 GB RAM
Number of nodes: 2
Cores per node: 5
As far as I know, I have to repartition the data for better parallelism, as Spark will otherwise create only the default number of partitions (total number of cores * 2 or 3 or 4).
But in my case, since the data file is very large, I have to repartition this data to a number such that it can be processed efficiently.
How do I choose the number of partitions to be passed to repartition? How should I calculate it? What approach should I take to solve this?
Thanks a lot in advance.
Partitions and parallelism are two different things, per my understanding. However, both go hand in hand when it comes to parallel execution of tasks in Spark.
Parallelism is number of executors * number of cores, which in your case is 2 * 5 = 10. So at any given moment you could have at most 10 tasks running.
If your data is divided into 10 partitions, then all of it is processed at once. However, if you have 20 partitions, Spark would start processing 10 partitions and, as each task finishes, schedule the next partition to process. This happens until it finishes processing all the partitions.
By default one partition is one block of data. I am guessing your 1 TB of data is stored on HDFS. If the underlying block size is 256 MB, then you would have 1 TB / 256 MB = 4096 blocks, which in turn are your partitions.
Please note that once the data is read you can always repartition it based on your requirements.
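As a small illustration of that point, a sketch of inspecting and changing the partition count; it assumes a SparkSession named spark, an HDFS block size of 256 MB, and a placeholder path and target counts:

    val raw = spark.sparkContext.textFile("hdfs:///data/big-file")   // placeholder path, ~1 TB

    println(raw.getNumPartitions)        // roughly 4096 with 256 MB blocks

    // repartition shuffles the data; coalesce lowers the partition count without a full shuffle.
    val widened  = raw.repartition(8192)
    val narrowed = raw.coalesce(1024)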
How do I choose the number of partitions to be passed to repartition? How should I calculate it? What approach should I take to solve this?
You need to see how your Spark application holds up with a given partition size and then determine whether you can decrease or increase that number. Executor memory is another consideration: if your partitions are too big, you can run into OutOfMemory errors. These are just guidelines, not an exhaustive list.
This https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/ multipart series has more detailed discussion on partitions and executors.

How to retry DynamoDb write when throttled?

I am trying to write large amounts of data to DynamoDB using AmazonDynamoDBAsyncClient, and I am trying to understand what the best practice for handling throttling is.
For example, I have a capacity of 3000 writes and at a given moment I have, let's say, 100,000 records I'd like to write. I don't need them all written immediately, but I am trying to figure out the best way to get them in.
This application is running in a distributed environment, so there may be 5 executors all trying to do this at the same time. Would the best way to handle this be as below, where I sleep the write process if we hit the throttle? Or should I be doing something to avoid the throttle completely? In fact, is my code even doing what I think it is, which is retrying the write after waiting a second?
try {
  amazonDynamoAsyncDb.updateItemAsync(updateRequest)
} catch {
  case e: ThrottlingException =>
    Thread.sleep(1000)
    // retry here, but how?
}
The AWS SDK for Java will retry throttled requests 10 times by default, before throwing a ProvisionedThroughputExceededException. If your items are small (1 KB or less) and you are performing the writes from EC2 in the same region as your table, you can assume each write takes around 10 ms. That means each thread of processing can do about 100 writes per second. To scale your writes to 3000 writes per second, you would need 30 threads and 30 HTTP connections. 3000 small (1 KB) writes per second translates to a data throughput of 2.92 MB per second, so for this write load it does not appear that the EC2 hardware would become a bottleneck. I recommend you do some measurements to figure out how long it takes to write each of your items on average, and scale your threads and HTTP connections appropriately.
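If you do want to retry on top of the SDK's built-in retries, a minimal sketch with exponential backoff could look like the following. Note that with the async client the exception only surfaces when you block on the returned Future, not at the updateItemAsync call site; the helper name, attempt limit and backoff values are mine, not an AWS API:

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsync
    import com.amazonaws.services.dynamodbv2.model.{ProvisionedThroughputExceededException, UpdateItemRequest, UpdateItemResult}
    import java.util.concurrent.ExecutionException
    import scala.annotation.tailrec

    // Block on each write and back off exponentially when write capacity is exhausted.
    @tailrec
    def updateWithBackoff(client: AmazonDynamoDBAsync,
                          request: UpdateItemRequest,
                          attempt: Int = 0,
                          maxAttempts: Int = 8): UpdateItemResult = {
      val result =
        try {
          Some(client.updateItemAsync(request).get())   // exception surfaces here, wrapped in ExecutionException
        } catch {
          case e: ExecutionException
              if e.getCause.isInstanceOf[ProvisionedThroughputExceededException] && attempt < maxAttempts =>
            Thread.sleep((1L << attempt) * 100)          // 100 ms, 200 ms, 400 ms, ...
            None
        }
      result match {
        case Some(r) => r
        case None    => updateWithBackoff(client, request, attempt + 1, maxAttempts)
      }
    }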