When to persist and when to unpersist RDD in Spark - scala

Let's say I have the following:
val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(.....)
If you then do a transformation on dataset2, do you have to persist the result as dataset3 and unpersist the previous one, or not?
I am trying to figure out when to persist and unpersist RDDs. Do I have to persist every new RDD that is created?
Thanks

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
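For the scenario in the question, a minimal sketch of the usual pattern (the transform function and output path are placeholders, not from the question): persist once, run the actions that reuse the cached data, then unpersist explicitly.
val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(row => transform(row))   // `transform` is a placeholder

dataset3.count()                      // first action: materializes and caches dataset2
dataset3.saveAsTextFile("/tmp/out")   // reuses the cached partitions of dataset2
dataset2.unpersist()                  // drop it once nothing downstream still needs it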

Related

Why does Spark repeat transformations after persist operations?

I have the following code. I call count to trigger the persist operations and pin the transformations above them. But I noticed in the DAG and stages that the two different count jobs both execute the first persist (whereas I expected the second count call to use the second persist).
val df = sparkSession.read
.parquet(bigData)
.filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
.as[SafegraphRawData]
// Repartition here to be able to perform shuffle operations later
// (other transformations and minor filtering)
.repartition(nrInputPartitions)
// First persist here since the objects do not fit in memory (Persist 67)
.persist(StorageLevel.MEMORY_AND_DISK)
LOG.info(s"First count = " + df.count)
val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: " + df.rdd.getNumPartitions)
val rddPoints = df
.map(parse)
.filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
.map(convert)
// Since we will perform count and partitionBy actions, compute all of the above transformations (second persist)
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")
When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data into memory, and whatever doesn't fit is spilled to disk.
Now, you are doing multiple persists here. In Spark the memory cache is LRU, so the later persists will evict the previously cached data.
Even if you specify StorageLevel.MEMORY_AND_DISK, when data is evicted from cache memory by other cached data, Spark doesn't spill it to disk. So when you do the next count, it needs to re-evaluate the DAG so that it can retrieve the partitions which aren't present in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid such re-computation.
Here's the whole scenario.
persist and cache are lazy, like transformations in Spark: after applying either of them, you have to run an action in order to actually cache an RDD or DataFrame in memory.
Secondly, the unit of cache or persist is the partition. When cache or persist is executed, it saves only those partitions which can be held in memory. For the remaining partitions which could not be saved in memory, the whole DAG will be executed again once a new action is encountered.
try
val df = sparkSession.read
.parquet(bigData)
.filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
.as[SafegraphRawData]
// Repartition here to be able to perform shuffle operations later
// (other transformations and minor filtering)
.repartition(nrInputPartitions)
// First persist here since the objects do not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)
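Combining this with the first suggestion above (StorageLevel.DISK_ONLY), a hedged sketch of the adjusted pipeline might look like this; all names are reused from the question:
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  .repartition(nrInputPartitions)
  .persist(StorageLevel.DISK_ONLY)   // partitions live on disk, so no re-evaluation of the DAG after eviction

LOG.info(s"First count = " + df.count)   // action that actually materializes the cache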

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming in Spark. Structured Streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example from the Structured Streaming programming guide:
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, this is how the execution of such a query is performed (see the model diagram in the programming guide).
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM, since the RDD (DF/DS) is lazily evaluated. Of course, you need to repartition before processing to ensure an equal number of partitions and that data is spread across executors uniformly...
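A minimal sketch of that "repartition before processing" advice, reusing streamingDF from the question (nrPartitions is an illustrative value, not from the original code):
val nrPartitions = 16                  // illustrative; tune for your cluster
val query = streamingDF
  .repartition(nrPartitions)           // spread incoming data uniformly across executors
  .writeStream
  .format("console")
  .start()
query.awaitTermination()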

Spark streaming: Cache DStream results across batches

Using Spark Streaming (1.6), I have a file stream for reading lookup data with a 2s batch size; however, files are copied to the directory only every hour.
Once there's a new file, its content is read by the stream. This is what I want to cache in memory and keep there until new files are read.
There's another stream to which I want to join this dataset, therefore I'd like to cache it.
This is a follow-up question of Batch lookup data for Spark streaming.
The answer does work fine with updateStateByKey; however, I don't know how to deal with cases when a KV pair is deleted from the lookup files, as the sequence of values in updateStateByKey keeps growing.
Also, any hint on how to do this with mapWithState would be great.
This is what I tried so far, but the data doesn't seem to be persisted:
val dictionaryStream = ssc.textFileStream("/my/dir")
dictionaryStream.foreachRDD { x =>
  if (!x.partitions.isEmpty) {
    x.unpersist(true)
    x.persist()
  }
}
DStreams can be persisted directly using the persist method, which persists every RDD in the stream:
dictionaryStream.persist
According to the official documentation, this is applied automatically for
window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey
so there should be no need for explicit caching in your case. There is also no need for manual unpersisting. To quote the docs once again:
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared
and a retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState, you'll have to provide a StateSpec. A minimal example requires a function which takes the key, an Option of the current value and the previous state. Let's say you have a DStream[(String, Double)] and you want to record the maximum value seen so far:
val state = StateSpec.function(
  (key: String, current: Option[Double], state: State[Double]) => {
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    (key, max)
  }
)
val inputStream: DStream[(String, Double)] = ???
inputStream.mapWithState(state).print()
It is also possible to provide an initial state, a timeout interval, and to capture the current batch time. The last two can be used to implement a removal strategy for keys which haven't been updated for some period of time.
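For example, a hedged sketch of adding a timeout to the StateSpec above (the 3600-second value is illustrative):
import org.apache.spark.streaming.{Seconds, State, StateSpec}

val stateWithTimeout = StateSpec.function(
  (key: String, current: Option[Double], state: State[Double]) => {
    if (state.isTimingOut()) {
      // no update arrived for this key within the timeout; its state will be removed
      (key, state.getOption.getOrElse(Double.MinValue))
    } else {
      val max = Math.max(
        current.getOrElse(Double.MinValue),
        state.getOption.getOrElse(Double.MinValue)
      )
      state.update(max)
      (key, max)
    }
  }
).timeout(Seconds(3600))   // keys idle longer than this are dropped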

Send object to specific partition with Spark

Suppose I have an RDD with nPartitions partitions and I'm using the mapPartitionsWithIndex method, while also keeping on the driver an array x of dimension nPartitions.
Now suppose I would like to ship x(i) to partition i so that it may work on it. A naïve way to do so would be to just call x(i) in the closure, as in the following toy example:
val sc = new SparkContext()
val rdd = sc.parallelize(1 to 1000).repartition(10)
val nPartitions = rdd.partitions.length
val myArray = Array.fill(nPartitions)(math.random) //array to be shipped to executors
val result = rdd.mapPartitionsWithIndex((index,data) =>
Seq(data.map(_ * myArray(index)).sum).iterator
)
(Ignore the logic within mapPartitionsWithIndex; only the myArray(index) part is what interests us.)
However, if my understanding is correct, this will ship the entire array myArray to all executors, as the array is in the closure. Now, if we suppose the array contains large objects which may take up too much memory / serialization time, this becomes a problem.
Is there a way to avoid this, and to ship only the components of the array corresponding to the partitions within a given executor?
This is a case of premature optimization. Sending an array as big as the number of partitions is not going to save you much versus sending just the value for the partition, if it saves anything at all.
However, instead of sending the array as part of the closure, you should send it as a
broadcast variable: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
The main difference is that the closure is serialized and sent out for each task, while, from the doc page "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks".
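For instance, a minimal sketch applied to the toy example from the question (reusing sc, rdd and myArray defined above; not a drop-in for real production code):
val myArrayBc = sc.broadcast(myArray)   // shipped once per executor, not once per task

val result = rdd.mapPartitionsWithIndex { (index, data) =>
  // each task reads only its own element from the executor-local broadcast copy
  Iterator(data.map(_ * myArrayBc.value(index)).sum)
}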
Not exactly sending large objects to specific partitions, but an inverted approach would be to use mapPartitions in conjunction with partitioning by columns. Used this way, mapPartitions pulls in the large object once per partition rather than once per row.

Spark doesn't let me count my joined dataframes

New to Spark jobs, and I have the following problem.
When I run a count on any of the newly joined dataframes, the job runs for ages and spills memory to disk. Is there any logic error here?
// pass spark configuration
val conf = new SparkConf()
.setMaster(threadMaster)
.setAppName(appName)
// Create a new spark context
val sc = new SparkContext(conf)
// Specify a SQL context and pass in the spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create three dataframes for sent and clicked files. Mark them as raw, since they will be renamed
val dfSentRaw = sqlContext.read.parquet(inputPathSent)
val dfClickedRaw = sqlContext.read.parquet(inputPathClicked)
val dfFailedRaw = sqlContext.read.parquet(inputPathFailed)
// Rename the columns to avoid ambiguity when accessing the fields later
val dfSent = dfSentRaw.withColumnRenamed("customer_id", "sent__customer_id")
.withColumnRenamed("campaign_id", "sent__campaign_id")
.withColumnRenamed("ced_email", "sent__ced_email")
.withColumnRenamed("event_captured_dt", "sent__event_captured_dt")
.withColumnRenamed("riid", "sent__riid")
val dfClicked = dfClickedRaw.withColumnRenamed("customer_id", "clicked__customer_id")
.withColumnRenamed("event_captured_dt", "clicked__event_captured_dt")
val dfFailed = dfFailedRaw.withColumnRenamed("customer_id", "failed__customer_id")
// LEFT Join with CLICKED on two fields, customer_id and campaign_id
val dfSentClicked = dfSent.join(dfClicked, dfSent("sent__customer_id") === dfClicked("clicked__customer_id")
&& dfSent("sent__campaign_id") === dfClicked("campaign_id"), "left")
dfSentClicked.count() //THIS WILL NOT WORK
val dfJoined = dfSentClicked.join(dfFailed, dfSentClicked("sent__customer_id") === dfFailed("failed__customer_id")
&& dfSentClicked("sent__campaign_id") === dfFailed("campaign_id"), "left")
Why can't these two/three dataframes be counted anymore? Did I mess up some indexing by renaming?
Thank you!
That count call is the only actual materialization of your Spark job here, so it's not really count that is a problem but the shuffle that is being done for the join right before it. You don't have enough memory to do the join without spilling to disk. Spilling to disk in a shuffle is a very easy way to make your Spark jobs take forever =).
One thing that really helps prevent spilling with shuffles is having more partitions; then there is less data moving through the shuffles at any given time. You can set spark.sql.shuffle.partitions, which controls the number of partitions used by Spark SQL in a join or aggregation. It defaults to 200, so you can try a higher setting. http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
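For example, a hedged sketch using the sqlContext from the question (800 is purely illustrative, not a recommendation):
sqlContext.setConf("spark.sql.shuffle.partitions", "800")   // default is 200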
You could increase the heap size of your local Spark allocation and/or increase the fraction of memory usable for shuffles by increasing spark.shuffle.memoryFraction (defaults to 0.4) and decreasing spark.storage.memoryFraction (defaults to 0.6). The Storage fraction is used for example when you make a .cache call and you might not care about that.
If you are inclined to avoid the spills outright, you can turn off spilling by setting spark.shuffle.spill to false. I believe this will throw an exception if you run out of memory and need to spill, instead of silently taking forever, and could help you tune your memory allocation faster.
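A hedged sketch of how those settings could be applied to the SparkConf from the question (the values are illustrative only):
val conf = new SparkConf()
  .setMaster(threadMaster)
  .setAppName(appName)
  .set("spark.shuffle.memoryFraction", "0.5")   // more memory for shuffles (default 0.4)
  .set("spark.storage.memoryFraction", "0.5")   // less for .cache storage (default 0.6)
  // .set("spark.shuffle.spill", "false")       // only if you prefer failing fast over spilling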