OOM and data loss issues using checkpoints with spark streaming (pyspark) on Databricks

OOM and data loss issues using checkpoints with spark streaming (pyspark) on Databricks - pyspark

I have encountered many issues using checkpoints with spark streaming on databricks. The code below led to OOM errors on our clusters. Investigating the cluster's memory usage, we could see that the memory was slowly increasing over time, indicating a memory leak (~10 days before OOM, while a batch only lasts a couple of minutes). After deleting the checkpoint so that a new one could be created, the memory leak disappeared, indicating the error originated from the checkpoint. In a similar streaming job, we also had a problem where some data were never being processed (again, fixed after re-creating the checkpoint).
Disclaimer: I do not understand perfectly the in-depth behaviours of checkpoints as online documentation is evasive. Hence, I am not sure my configuration is good.
Below is a minimal example of the problem:
pyspark 3.0.1, python 3.7
The json conf of the clusters has the following element:
"spark_conf": {
"spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
"spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
}
code:
import pandas as pd
from pyspark.sql import functions as F
def for_each_batch(data, epoch_id):
pass
spark.readStream.format("delta").load("path/to/delta").filter(
F.col("TIME") > pd.Timestamp.utcnow() - pd.Timedelta(hours=1)
).writeStream.option(
"ignoreChanges", "true"
).option(
"checkpointLocation", "path/to/checkpoint"
).trigger(
processingTime="3 minutes"
).foreachBatch(
for_each_batch
).start()
PS: If the content of the function 'for_each_batch' or the filtering condition is changed, should I re-create the checkpoint?

Related

having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following spark context
SparkSession
.builder
.appName("local-mode-spark")
.master("local[*]")
.config("spark.executor.instances", 4)
.config("spark.executor.cores", 2)
.config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
.config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
.getOrCreate()
and a mapPartitions API call like follows:
import spark.implicits._
val inputDF : DataFrame = spark.read.parquet(inputFile)
val resultDF : DataFrame =
inputDF.as[T].mapPartitions(sparkIterator => new MyIterator)).toDF
On the surface of it, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like to see that my application will crunch all available machine resources both in production, and also during this testing so that the chances of spotting additional concurrency bugs will improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency using your local machine? the objective being to test that in production, Spark will not bump into thread-safety or other concurrency issues in my code applied by spark from within MyIterator?
Or can it even in spark local mode, process separate partitions of my input dataframe in parallel? Can I get spark to work concurrently on the same dataframe on a single machine, preferably in local mode?

Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because it is already too late by then, the process has already started with some amount of memory.
Nature of Job
Check number of partitions in your dataframe. That will essentially determine how much max parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, that means your dataframe has only 1 partition and so you won't get concurrency when you do operations on this dataframe. In that case, you might have to tweak some config to create more number of partitions so that you can concurrently run tasks.

Running local mode cannot simulate a production environment for the following reasons.
There are lots of code which gets bypassed when code is run in local mode, which would normally run with any other cluster manager. Amongst various issues, few things that i could think
a. Inability to detect bugs from the way shuffle get handled.(Shuffle data is handled in a completely different way in local mode.)
b. We will not be able to detect serialization related issues, since all code is available to the driver and task runs in the driver itself, and hence we would not result in any serialization issues.
c. No speculative tasks(especially for write operations)
d. Networking related issues, all tasks are executed in same JVM. One would not be able detect issues like communication between driver/executor, codegen related issues.
Concurrency in local mode
a. Max concurrency than can be attained will be equal to the number of cores in your local machine.(Link to code)
b. The Job, Stage, Task metrics shown in Spark UI are not accurate since it will incur the overhead of running in the JVM where the driver is also running.
c: As for CPU/Memoryutilization, it depends on operation being performed. Is the operation CPU/memory intensive?
When to use local mode
a. Testing of code that will run only on driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl; dr The concurrency bugs that occur in local mode might not even be present in other cluster resource managers, since there are lot of special handling in Spark code for local mode(There are lots of code which checks isLocal in code and control goes to a different code flow altogether)

Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and cpu available in your local machine and supply values to the driver-memory and driver-cores conf while submitting your spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open up the SPARK UI for the job. You can now go to the EXECUTORS tab to actually check the amount of resources your spark job is utilizing.
You can monitor various tasks that get generated and the number of tasks that your job runs concurrently using the JOBS and STAGES tab.
In order to process data which is way larger than the resources available, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions in case your job has aggregations or joins. Also, ensure sufficient space on the local file system since spark creates intermediate shuffle files and writes them to disk.
Hope this helps!

Combining small Parquet files into bigger Parquet files with Spark fails with OOM Error

I tried to create a script that will combine a lot of small parquet files into less bigger parquet files at around my file system block size.
I am running my code(spark-2.4.3 and scala-2.12.7) on a weak VM, so the files cant fit all at once into memory, disk space is limited and spark is running on local mode.
I thought of this solution:
spark.read.parquet(shortFileList:_*)
.repartition(amountOfNewFiles)
.write
.parquet("outDir")
and then I got Out of memory exception on my executor, and I wasn't sure why, I thought that spark was going to crate a amountOfNewFiles Tasks and just execute them one by one but its seems like repartition makes spark read all the small files at the same time.
Why did repartition cause spark to read all the data into memory, and
is there any workaround to get the behavior I expected?
And one more thing after asking a friend for help he came up with this solution:
shortFileList
.tail
.foldRight(spark.read.parquet(shortFileList.head)){ (newDfName, df) =>
df.union(spark.read.parquet(newDfName))
}.write.parquet("outDir")
This solution looked weird and I didn't know what to expect of it because the transformation function gets a variable out of his scope, after running it I got an GC overhead limit exceeded on my executor which was really weird because the it kept running while tossing those exception over and over instead of
crashing with an out of memory exception and I had no clue why it did so.

Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:
val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
df.checkpoint()
Write to disk version:
df.write.parquet(savePath)
val df = spark.read.parquet(savePath)
I think both break the lineage in the same way.
In my experiments checkpoint is almost 30 bigger on disk than parquet (689GB vs. 24GB). In terms of running time, checkpoint takes 1.5 times longer (10.5 min vs 7.5 min).
Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?

Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want freeze the content of the current RDD i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD then could be used for some other purpose.
When you checkpoint the RDD is serialized and stored in Disk. It doesn't store in parquet format so the data is not properly storage optimized in the Disk. Contraty to parquet which provides various compaction and encoding to store optimize the data. This would explain the difference in the Size.
You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users which compete for resources and there are not enough resources to run all the jobs simultaneously.
You must think about checkpointing if your computations are really expensive and take long time to finish because it could be faster to write an RDD to
HDFS and read it back in parallel than recompute from scratch.
And there's a slight inconvenience prior to spark2.1 release;
there is no way to checkpoint a dataframe so you have to checkpoint the underlying RDD. This issue has been resolved in spark2.1 and above versions.
The problem with saving to Disk in parquet and read it back is that
It could be inconvenient in coding. You need to save and read multiple times.
It could be a slower process in the overall performance of the job. Because when you save as parquet and read it back the Dataframe needs to be reconstructed again.
This wiki could be useful for further investigation
As presented in the dataset checkpointing wiki
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.
Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.

One difference is that if your spark job needs a certain in memory partitioning scheme, eg if you use a window function, then checkpoint will persist that to disk, whereas writing to parquet will not.
I'm not aware of a way with the current versions of spark to write parquet files and then read them in again, with a particular in memory partitioning strategy. Folder level partitioning doesn't help with this.

Spark Structured Streaming Memory Bound

I am processing a stream of 100 Mb/s average load. I have six executors with each having 12 Gb of memory allocated. However, due to data load, I am getting Out of Memory errors (Error 52) in the spark executors in few minutes. It seems even though Spark dataframe is conceptually unbounded it is bounded by total executor memory?
My idea here was to save dataframe/stream as an in parquet in about every five minutes. However, it seems spark won't have a direct mechanism to purge the dataframe after that?
val out = df.
writeStream.
format("parquet").
option("path", "/applications/data/parquet/customer").
option("checkpointLocation", "/checkpoints/customer/checkpoint").
trigger(Trigger.ProcessingTime(300.seconds)).
outputMode(OutputMode.Append).
start

It seems that there is no direct way to do this. As this conflicts with the general Spark model that operations be rerunnable in case of failure.
However I would share the same sentiment of the comment at 08/Feb/18 13:21 on this issue.

Spark 2.2 write partitionBy out of memory exception

I think anyone that has used Spark has ran across OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to save by two different partitions, using the partitionBy function. It looks something like below (made up names):
df.write.partitionBy("account", "markers")
.mode(SaveMode.Overwrite)
.parquet(s"$location$org/$corrId/")
This particular dataframe has around 30gb of data, 2000 accounts and 30 markers. The accounts and markers are close to evenly distributed. I have tried using 5 core nodes and 1 master node driver of amazon's r4.8xlarge (220+ gb of memory) with the default maximize resource allocation setting (which 2x cores for executors and around 165gb of memory). I have also explicitly set the number of cores, executors to different numbers, but had the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
So, it seems very likely that the root cause is the 2gb ByteArrayBuffer issue that can happen on shuffles. I then tried repartitioning the dataframe with various numbers, such as 100, 500, 1000, 3000, 5000, and 10000 with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, it just seems to suddenly fail with no indication of the problem (which isn't surprising with some oom exceptions).
For dataframe writes, is there a trick to partitionBy's to either get passed the memory heap space error?