Changing the persistence level of an RDD - scala

So I have a question about RDD persistence. Let's say I have an RDD that's persisted with MEMORY_AND_DISK, and I know that enough memory has since been freed up that the data on disk would now fit in memory. Is it possible to tell Spark to re-evaluate the memory available to the RDD and move that data from disk into memory?
Essentially, I'm running into an issue where I persist my RDD and the entire RDD doesn't end up in memory until I have queried it multiple times. This makes the first few runs extremely slow. One thing I'm hoping to try is to initially persist the RDD with MEMORY_AND_DISK and then force the disk data back into memory.
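I am not aware of a public call that promotes blocks already spilled to disk back into memory. A minimal sketch of one possible workaround, assuming recomputation is acceptable: unpersist the RDD, persist it again under a different level, and trigger an action. The object name RepersistSketch and the sample data are hypothetical; Spark refuses to change the level of an RDD that already has one assigned, which is why the unpersist() comes first.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RepersistSketch {
  // Returns the storage level in effect after re-persisting the RDD.
  def run(spark: SparkSession): StorageLevel = {
    // Hypothetical sample data standing in for the real RDD.
    val rdd = spark.sparkContext.parallelize(1 to 100000, 8).map(_ * 2)

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // an action materializes the cached blocks

    // Drop the existing blocks (true = block until they are removed),
    // because a second persist() with a different level is rejected.
    rdd.unpersist(true)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // recomputes the partitions and caches them under the new level

    rdd.getStorageLevel
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repersist-sketch").getOrCreate()
    try println(run(spark).description) finally spark.stop()
  }
}
```

Note that this recomputes the RDD from its lineage rather than moving the on-disk blocks, so it only pays off when recomputation is cheaper than the slow first runs.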


Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:
val savePath = "/some/path"
Write to disk version:
val df =
I think both break the lineage in the same way.
In my experiments the checkpoint is almost 30 times bigger on disk than the Parquet output (689GB vs. 24GB). In terms of running time, checkpointing takes about 1.4 times longer (10.5 min vs. 7.5 min).
Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?
Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for other purposes.
When you checkpoint, the RDD is serialized and stored on disk. It is not stored in Parquet format, so the data is not storage-optimized on disk, contrary to Parquet, which applies various compression and encoding schemes to optimize storage. This would explain the difference in size.
You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users which compete for resources and there are not enough resources to run all the jobs simultaneously.
You must think about checkpointing if your computations are really expensive and take a long time to finish, because it could be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.
There was also a slight inconvenience prior to the Spark 2.1 release: there was no way to checkpoint a DataFrame, so you had to checkpoint the underlying RDD. This has been resolved in Spark 2.1 and above.
The problem with saving to disk as Parquet and reading it back is that:
- It could be inconvenient in coding. You need to save and read multiple times.
- It could slow down the overall job, because when you save as Parquet and read it back the DataFrame needs to be reconstructed.
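The two ways of cutting the lineage discussed here can be sketched side by side. This is only an illustration, assuming Spark 2.1 or later; the object name LineageBreakSketch and the paths passed in are hypothetical, and the original snippets from the question are not recoverable, so this is not necessarily the code that was benchmarked.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LineageBreakSketch {
  // Version 1: Dataset.checkpoint, which is eager by default and
  // returns a new DataFrame backed by the checkpointed data.
  def viaCheckpoint(spark: SparkSession, df: DataFrame, checkpointDir: String): DataFrame = {
    spark.sparkContext.setCheckpointDir(checkpointDir)
    df.checkpoint()
  }

  // Version 2: write Parquet explicitly, then read it back as a fresh DataFrame.
  def viaParquet(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
    val tmp = java.nio.file.Files.createTempDirectory("lineage-sketch").toString
    val df = spark.range(1000).toDF("id")
    try {
      println(viaCheckpoint(spark, df, tmp + "/ckpt").count())
      println(viaParquet(spark, df, tmp + "/parquet").count())
    } finally spark.stop()
  }
}
```

The checkpoint version needs one call on the existing DataFrame, while the Parquet version needs a write followed by a read, which is the coding inconvenience mentioned above.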
This wiki could be useful for further investigation
As presented in the dataset checkpointing wiki
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.
Checkpointing truncates the lineage of an RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
One difference is that if your Spark job needs a certain in-memory partitioning scheme, e.g. if you use a window function, then checkpoint will persist that to disk, whereas writing to Parquet will not.
I'm not aware of a way with the current versions of Spark to write Parquet files and then read them in again with a particular in-memory partitioning strategy. Folder-level partitioning doesn't help with this.

Caching one big RDD or many small RDDs

I have a large RDD (R) which I cut into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache depends only on the size of the RDD (e.g. 10 seconds per MB), then caching the individual chunks is better.
However, I suspect there is some additional overhead I'm not aware of, like seek time when persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, I'm currently running the application on my computer, but in the end it will run on a cluster consisting of 10 nodes, each of which has 8 cores. However, since we only have access to the cluster for a small amount of time, I wanted to start experimenting locally on my computer.
From my understanding, the application won't need a lot of shuffling, as I can partition it rather nicely, such that each chunk runs on a single node.
However, I'm still thinking about the partitioning, so it is not yet 100% decided.
Spark performs its computations in memory, so there is no real extra overhead when you cache data in memory. Caching in memory essentially says: reuse these intermediate results. The only issue you can run into is having too much data in memory, so that it spills to disk; there you will incur disk read costs. If you run into memory limitations, call unpersist() to release intermediate results as you finish with them.
When determining where to cache your data you need to look at the flow of your data. If you read in a file and then filter it 3 times and write out each one of those filters separately, without caching you will end up reading in that file 3 times.
val data ="file:///testdata/").limit(100)
"col1").write.parquet("file:///test1/")
"col2").write.parquet("file:///test2/")
"col3").write.parquet("file:///test3/")
If you read in the file, cache it, then filter 3 times and write out the results, you will read in the file once and then write out each result.
val data ="file:///testdata/").limit(100).cache()
"col1").write.parquet("file:///test4/")
"col2").write.parquet("file:///test5/")
"col3").write.parquet("file:///test6/")
The general test that you can use as to what to cache is, "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example if you break the large RDD into chunks and the large RDD isn't cached you will most likely be recalculating the large RDD every time that you perform an action on it. Then if you don't cache the chunks and you perform multiple actions on those then you will have to recalculate those chunks every time.
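A minimal sketch of the "multiple actions on the same data" rule, assuming a small in-memory DataFrame in place of the real input file. The object name CacheReuseSketch and the sample rows are hypothetical, and count() stands in for the three writes so the example needs no output paths.

```scala
import org.apache.spark.sql.SparkSession

object CacheReuseSketch {
  // Runs three actions against one cached DataFrame and returns their results.
  def run(spark: SparkSession): Seq[Long] = {
    import spark.implicits._

    // Hypothetical stand-in for the data read from disk.
    val data = Seq((1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0))
      .toDF("col1", "col2", "col3")
      .cache() // lazy: nothing is cached until the first action runs

    // Each count() is a separate action; after the first one the
    // remaining two can reuse the cached rows instead of the source.
    val counts = Seq("col1", "col2", "col3").map(c => data.select(c).count())

    data.unpersist() // release the cached blocks once they are no longer needed
    counts
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-reuse-sketch").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Without the cache() call the same code still works, but each action would go back to the source.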
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to properly repartition your large RDD according to the number of executors and then cache it. Then, if you perform more than one action on each one of the chunks that you create from the large RDD, you may want to cache those.

How to transform RDD, Dataframe or Dataset straight to a Broadcast variable without collect?

Is there any way (or any plans) to be able to turn Spark distributed collections (RDDs, Dataframe or Datasets) directly into Broadcast variables without the need for a collect? The public API doesn't seem to have anything "out of box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for this kind of operation. To explain what I mean in detail, let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
This causes all the data to be collected to the driver, and then the data is broadcast. This means the data is sent over the network essentially twice.
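For reference, the collect-then-broadcast pattern described here can be sketched as follows. This is an illustration under assumptions rather than the exact snippet from the question: the object name CollectThenBroadcast and the sample contents of myStringPairRdd and someOtherRdd are hypothetical.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CollectThenBroadcast {
  // Looks up every key of someOtherRdd in a broadcast map built from another RDD.
  def lookupAll(spark: SparkSession): Array[Option[String]] = {
    val sc = spark.sparkContext
    // Hypothetical sample data.
    val myStringPairRdd: RDD[(String, String)] = sc.parallelize(Seq("a" -> "1", "b" -> "2"))
    val someOtherRdd: RDD[String] = sc.parallelize(Seq("a", "z"))

    // Network hop 1: executors -> driver.
    val asMap: Map[String, String] = myStringPairRdd.collect().toMap
    // Network hop 2: driver -> executors (fetched when tasks first read the value).
    val myUberMap: Broadcast[Map[String, String]] = sc.broadcast(asMap)

    // Sparse use of the broadcast structure: not every key is present.
    someOtherRdd.map(k => myUberMap.value.get(k)).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collect-broadcast-sketch").getOrCreate()
    try println(lookupAll(spark).mkString(", ")) finally spark.stop()
  }
}
```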
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
Furthermore, there could be a Monoid-like API (a bit like combineByKey) for situations where the .toMap or whatever operation on Array[T] is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive, this kind of functionality could result in awesome scope for algorithm design. This CPU activity can also be run while the IO is running too - while the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
Joining is not the (main) use case here; it can be assumed that I use the broadcasted data structure sparsely. For example, the keys in someOtherRdd by no means cover the keys in myUberMap, but I don't know which keys I need until I traverse someOtherRdd, AND suppose I use myUberMap multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.
While this is an interesting idea in theory, I will argue that, although possible, it has very limited practical applications. Obviously I cannot speak for the PMC, so I cannot say whether there are any plans to implement this type of broadcasting mechanism at all.
Possible implementation:
Since Spark already provides a torrent broadcasting mechanism, whose behavior is described as follows:
The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available.
Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from.
it should be possible to reuse the same mechanism for direct node-to-node broadcasting.
It is worth noting that this approach cannot completely eliminate driver communication. Even though blocks could be created locally you still need a single source of truth to advertise a set of blocks to fetch.
Limited applications
One problem with broadcast variables is that they are quite expensive. Even if you can eliminate the driver bottleneck, two problems remain:
- Memory required to store the deserialized object on each executor.
- Cost of transferring the broadcasted data to every executor.
The first problem should be relatively obvious. It is not only about direct memory usage but also about GC cost and its effect on overall latency. The second one is rather subtle. I partially covered this in my answer to Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark, but let's discuss this further.
From a network traffic perspective, broadcasting a whole dataset is pretty much equivalent to creating a Cartesian product. So if the dataset is large enough for the driver to become a bottleneck, it is unlikely to be a good candidate for broadcasting, and a targeted approach like a hash join can be preferred in practice.
There are some methods which can be used to achieve results similar to direct broadcast and address the issues enumerated above, including:
- Passing data via a distributed file system.
- Using a replicated database collocated with worker nodes.
I don't know if we can do it for an RDD, but you can do it for a DataFrame:
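import org.apache.spark.sql.{DataFrame, functions}
val df: DataFrame = your_data_frame  // placeholder for your own DataFrame
val broadcasted_df = functions.broadcast(df)  // marks df as small enough to broadcast in joins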
Now you can use the variable broadcasted_df and it will be broadcast to the executors.
Make sure the broadcasted_df DataFrame is not too big to be sent to the executors.
broadcasted_df will be broadcast in operations like, for example
and in this case the join() operation executes faster because every executor has one partition of other_df and the whole broadcasted_df.
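A minimal sketch of such a join, assuming two small hypothetical DataFrames that share an "id" column. The object name BroadcastJoinSketch, the column names and the rows are illustrative only; functions.broadcast is a hint to the planner and not an explicit data transfer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a larger DataFrame against a small one marked for broadcast
  // and returns the number of matching rows.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical data: other_df plays the large side, small_df the broadcast side.
    val other_df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "payload")
    val small_df = Seq((1, "one"), (2, "two")).toDF("id", "label")

    val broadcasted_df = broadcast(small_df) // hint: ship this side to every executor
    other_df.join(broadcasted_df, Seq("id")).count() // inner join on "id"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-join-sketch").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```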
As for your question, I am not sure you can do what you want. You cannot use one RDD inside the map() method of another RDD, because Spark doesn't allow transformations inside transformations. In your case you need to call collect() to create a map from your RDD, because you can only use an ordinary Map object inside map(); you cannot use an RDD there.

Why does Spark RDD partition has 2GB limit for HDFS?

I get an error when using MLlib RandomForest to train on my data. My dataset is huge and the default number of partitions is relatively small, so an exception is thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in
stage 6.0 (TID 120,
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at at at at
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Integer.MAX_VALUE corresponds to 2GB, so it seems that some partition ran out of memory. So I repartitioned my RDD into 1000 partitions, so that each partition holds far less data than before. Finally, the problem was solved!
So, my question is:
Why does the partition size have the 2GB limit? There seems to be no configuration setting for the limit in Spark.
The basic abstraction for blocks in Spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
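The size of that limit follows from ByteBuffer taking an Int capacity. A small sketch of the arithmetic in plain Scala, without Spark; the object name ByteBufferLimit and the helper fitsInOneBuffer are hypothetical names used only for illustration.

```scala
import java.nio.ByteBuffer

object ByteBufferLimit {
  // ByteBuffer.allocate takes an Int capacity, so a single buffer
  // can never hold more than Int.MaxValue bytes.
  val maxBytes: Long = Int.MaxValue.toLong          // 2147483647
  val maxGiB: Double = maxBytes / math.pow(1024, 3) // just under 2.0

  // Hypothetical helper: can a block of this size live in one ByteBuffer?
  def fitsInOneBuffer(sizeBytes: Long): Boolean =
    sizeBytes >= 0 && sizeBytes <= maxBytes

  def main(args: Array[String]): Unit = {
    val small = ByteBuffer.allocate(16) // capacity() is an Int as well
    println(s"capacity=${small.capacity()} maxBytes=$maxBytes maxGiB=$maxGiB")
    println(fitsInOneBuffer(3L * 1024 * 1024 * 1024)) // a 3 GiB block does not fit
  }
}
```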
It is a critical issue which prevents the use of Spark with very large datasets.
Increasing the number of partitions can resolve it (as in the OP's case), but is not always feasible, for instance when there is a large chain of transformations, part of which can increase the data size (flatMap etc.), or in cases where the data is skewed.
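When repartitioning is feasible, the partition count can be estimated from the data size. A rough sketch, assuming the data is spread evenly across partitions, which is exactly the assumption that skew breaks; the object name PartitionSizing, the function partitionsFor and the 128 MiB target are hypothetical choices.

```scala
object PartitionSizing {
  // Smallest partition count that keeps every partition at or below the
  // target size, assuming the data is evenly distributed (no skew).
  def partitionsFor(totalBytes: Long, targetBytesPerPartition: Long): Int = {
    require(totalBytes >= 0 && targetBytesPerPartition > 0, "sizes must be positive")
    val n = (totalBytes + targetBytesPerPartition - 1) / targetBytesPerPartition // ceiling division
    math.max(1L, n).toInt
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024 * 1024
    val mib = 1024L * 1024
    // 150 GiB of input with a 128 MiB target per partition.
    println(partitionsFor(150 * gib, 128 * mib))
  }
}
```

The result would then be passed to something like rdd.repartition(n); with skewed keys a single partition can still exceed the limit even when the average is far below it.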
The proposed solution is to come up with an abstraction like LargeByteBuffer, which can support a list of ByteBuffers for a block. This impacts the overall Spark architecture, so it has remained unresolved for quite a while.
The problem is that when using datastores like Cassandra, HBase, or Accumulo, the block size is based on the datastore splits (which can be over 10 GB). When loading data from these datastores you have to immediately repartition into thousands of partitions so you can operate on the data without blowing the 2GB limit.
Most people who use Spark are not really using large data; to them, if it is bigger than what Excel or Tableau can hold, it is big data. They are mostly data scientists who use quality data or a sample small enough to work within the limit.
When processing large volumes of data, I end up having to go back to MapReduce and only use Spark once the data has been cleaned up. This is unfortunate; however, the majority of the Spark community has no interest in addressing the issue.
A simple solution would be to create an abstraction that uses a byte array by default but allows a Spark job to be overloaded with a 64-bit data pointer to handle the large jobs.
The Spark 2.4.0 release removes this limit by replicating block data as a stream. See Spark-24926 for details.

Spark out of memory

I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each).
I'm using Scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
- manually loop through all the files, do the calculations per file and merge the results in the end
- read the whole folder into one RDD, do all the operations on this single RDD and let Spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data), however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I successfully processed over 1 TB of CSV data on 5 machines with 32GB of RAM each. It depends heavily on what kind of processing you're doing and how.
- If you repartition an RDD, it requires additional computation that has overhead beyond your heap size. Try loading the file with more parallelism instead, by decreasing the split size in TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat).
- Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (i.e. increasing the number of partitions).
- Increase the driver memory and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the SparkContext.
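The mapPartitions and configuration tips can be sketched together. This is a minimal illustration, assuming tab-separated input; the object name LocalAggregationSketch, the function totalFields, the sample lines and the "8g" value are all hypothetical. As far as I understand, in local or client mode the driver JVM is already running when the SparkConf is built, so driver memory is normally passed on the command line (e.g. spark-submit --driver-memory) rather than set in code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalAggregationSketch {
  // Sums the number of tab-separated fields over all lines, producing
  // one partial result per partition instead of one object per line.
  def totalFields(sc: SparkContext, lines: Seq[String], numPartitions: Int): Long =
    sc.parallelize(lines, numPartitions)
      .mapPartitions { iter =>
        // The iterator is consumed lazily, so rows are streamed rather than buffered.
        val fields = iter.foldLeft(0L)((acc, line) => acc + line.split("\t").length)
        Iterator.single(fields)
      }
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("local-aggregation-sketch")
      .set("spark.executor.memory", "8g") // illustrative value, set before the context exists
    val sc = new SparkContext(conf)
    try println(totalFields(sc, Seq("a\tb\tc", "d\te"), 2)) finally sc.stop()
  }
}
```

For real files, sc.textFile(path, minPartitions) accepts a minimum partition count, which is another way to get more and smaller partitions at load time.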
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was that I was collecting all the results back in the master rather than letting the tasks save the output.
for item in processed_data.collect():
failed with OOM errors. On the other hand,
worked fine.
Yes, the PySpark RDD/DataFrame collect() function retrieves all the elements of the dataset (from all nodes) to the driver node. collect() should usually be used on smaller datasets, typically after filter(), groupBy(), count() etc. Retrieving a larger dataset can result in an out-of-memory error.
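The same contrast can be sketched in Scala, the language used in the rest of this thread. This is an illustration with hypothetical names (SaveInsteadOfCollect, processedData, outputDir) and tiny sample data, not the original Python code, which is only partly preserved above.

```scala
import org.apache.spark.sql.SparkSession

object SaveInsteadOfCollect {
  // Writes results from the executors and returns how many lines were written.
  def run(spark: SparkSession, outputDir: String): Long = {
    val sc = spark.sparkContext
    // Hypothetical stand-in for the processed data.
    val processedData = sc.parallelize(1 to 1000, 4).map(i => s"record-$i")

    // Risky on large data: collect() pulls every element into the driver's heap.
    // processedData.collect().foreach(println)

    // Safer: each task writes its own partition, nothing funnels through the driver.
    processedData.saveAsTextFile(outputDir) // outputDir must not exist yet

    sc.textFile(outputDir).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("save-sketch").getOrCreate()
    val out = java.nio.file.Files.createTempDirectory("save-sketch").resolve("out").toString
    try println(run(spark, out)) finally spark.stop()
  }
}
```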