How does persistence work in Spark (Scala)?

I am persisting some DataFrames, which are stored in vars. Now, when the value of such a var changes, how does persistence work? For example:
var checkedBefore_c = AddressValidation.validateAddressInAI(inputAddressesDF, addressDimTablePath, target_o, target_c, autoSeqColName).distinct.filter(col(CommonConstants.API_QUALITY_RATING) >= minQualityThreshold)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
var pre_checkedBefore_c = checkedBefore_c.except(checkedBefore_o)
pre_checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
checkedBefore_c = pre_checkedBefore_c.drop(target_o).drop(autoSeqColName)
  .withColumn(target_o, pre_checkedBefore_c(target_c))
  .withColumn(CommonConstants.API_STATUS, lit("AI-INSERT"))
  .withColumn(CommonConstants.API_ERROR_MESSAGE, lit(""))
checkedBefore_c = CommonUtils.addAutoIncremetColumn(checkedBefore_c, autoSeqColName)
checkedBefore_c = checkedBefore_c.select(addDimWithLoggingSchema.head, addDimWithLoggingSchema.tail: _*)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)

You are trying to persist the checkedBefore_c DataFrame, but in your code you have not called any action.
Brief explanation
Spark has two types of operations: transformations and actions.
Transformation: Transformations are lazily evaluated, e.g. map, reduceByKey, etc.
Action: Actions are eagerly evaluated, e.g. foreach, count, save, etc.
persist and cache are also lazy operations, so until you invoke an action, persist and cache will not take effect.
For more details, please refer to Actions in Spark.
Now, how does persist work?
With persist, Spark stores partitions in memory, on disk, or both.
There are various options; for all of them, refer to the org.apache.spark.storage.StorageLevel source code.
Each executor is responsible for storing its own partitions. If the in-memory option is given, it will first try to fit all partitions; if they do not all fit, it will evict old cached data (the cache is LRU). If all partitions still do not fit in memory, it caches the partitions that fit and leaves out the rest.
If the memory-and-disk option is selected, it first performs all the steps mentioned above and then stores the remaining partitions on local disk.
If the replication factor is two, each partition will be cached on two different executors.
In your case you have passed MEMORY_AND_DISK_SER, which means all objects will be serialized before caching. By default Java serialization is used, but you can override it and use Kryo serialization, which is recommended.
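As a minimal sketch of both points (assuming a SparkSession named spark; the DataFrame here is purely illustrative): persist only marks the data for caching, the first action materializes it, and Kryo can be enabled through spark.serializer.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("persist-demo")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo instead of Java serialization
  .getOrCreate()

val df = spark.range(0, 1000000).toDF("id").filter("id % 2 = 0")

df.persist(StorageLevel.MEMORY_AND_DISK_SER) // lazy: only marks the DataFrame for caching
df.count()                                   // first action: partitions are computed and cached
df.count()                                   // later actions read the cached partitions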

Related

Spark - Strategy for persisting derived dataframes when parent DF is already persisted

I have not found a clear answer to this question yet, even though there are multiple similar questions on SO.
I don't fill in all the details for the code below, as the actual transformations are not important for my questions.
// Adding _corrupt_record to have records that are not valid json
val inputDf = spark.read.schema(someSchema.add("_corrupt_record", StringType)).json(path)
/**
* The following lazy-persists the DF and does not return a new DF. Since
* Spark>=2.3 the queries from raw JSON/CSV files are disallowed when the
* referenced columns only include the internal corrupt record column
* (named _corrupt_record by default). Caching is the workaround.
*/
inputDf.persist
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull)
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull)
// Doing a count on both derived DFs - corruptedDf will also be output for further investigation
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
// Not corrupted data will be used for some complicated transformations
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
As you can see, I marked the input dataframe inputDf for persistence (see the reason here), but never did a count on it. Then I derived two dataframes, to both of which I did a count.
Question 1: When I do uncorruptedDf.count, what does it do to the parent dataframe inputDf? Does it trigger caching of the whole inputDf, the part of it that corresponds to uncorruptedDf.count, or nothing? The RDD documentation says:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Question 2: Does it make sense at this point (before the two count) to persist the derived dataframes corruptedDf and uncorruptedDf and unpersist inputDf? Since there are two actions happening on each derived dataframe, I would say yes, but I am not sure. If so.. what is the correct place to unpersist the parent DF below? (A), (B), or (C)?
uncorruptedDf.persist
corruptedDf.persist
// (A) I don't think I should inputDf.unpersist here, since derived DFs are not yet persisted
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
// (B) This seems a reasonable place, to free some memory
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
// (C) Is there any value from unpersisting here?
Question 3: Same as previous question but for finalDf vs corruptedDf. As can be seen I perform two actions on the finalDf: count and write.
Thanks in advance!
For question 1:
Yes, it would persist inputDf when the first count is called, which is uncorruptedDf.count(), but it won't persist any transformation that you do on inputDf. On the next count it won't read the data from the JSON file; it will read it from the partitions that it cached.
For question 2:
I think you should not persist inputDf, as there is nothing you gain from it. Persisting corruptedDf and uncorruptedDf makes sense, as you are performing multiple actions on each of them. You are only performing transformations on inputDf to filter corrupt and non-corrupt records, and Spark is smart enough to combine them into one step during its physical planning stage. To conclude: you should not persist inputDf, and that way you do not have to worry about unpersisting it.
For question 3:
You should not persist final dataframe as you are only performing one action on it of writing it to physical path as json file.
PS: Don't try to cache/persist every DataFrame, as caching itself has a performance impact and has to do additional work to keep the data in memory or save it to disk, based on the storage level that you specify. If there are few transformations and they are not complex, it is better to avoid caching. You can use the explain command on the DataFrame to see the physical and logical plans.
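To make that concrete, here is a rough sketch of the layout this answer suggests, reusing the names from the question (the $-columns assume spark.implicits._ is imported, as in the question; the groupBy/agg details stay elided, and this is only one reasonable arrangement):
// Only the DataFrames that feed multiple actions are persisted; inputDf is not.
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull).persist()
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull).persist()

log.info("Not corrupted records: " + uncorruptedDf.count) // first actions materialize both caches
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
corruptedDf.unpersist() // the corrupted branch is no longer needed

uncorruptedDf.explain(true) // inspect the logical and physical plans before the heavy transformations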

How to keep RDD persisted and consistent?

I have the following code (simplification for a complex situation):
val newRDD = prevRDD.flatMap { a =>
  Array.fill[Int](scala.util.Random.nextInt(10))(scala.util.Random.nextInt(2))
}.persist()
val a = newRDD.count
val b = newRDD.count
and even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so that when the second "action" is called, the results of the first action are returned?
* Edit *
The issue I have is apparently caused by a zipWithIndex call in my code - which creates indices higher than the count. I'll ask about it in a different thread. Thanks
There is no way to make it 100% consistent.
When you call persist, Spark will try to cache all of the partitions in memory if they fit.
Otherwise, it will recompute the partitions that do not fit in memory.
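If the values need to be reproducible even when evicted partitions are recomputed, one possible workaround (a sketch only, and it assumes prevRDD's own partition contents are deterministic) is to seed the random generator from the partition index, so a recomputed partition regenerates exactly the same data:
val newRDD = prevRDD.mapPartitionsWithIndex { (idx, it) =>
  val rnd = new scala.util.Random(idx) // deterministic seed per partition
  it.flatMap(_ => Array.fill[Int](rnd.nextInt(10))(rnd.nextInt(2)))
}.persist()

val a = newRDD.count
val b = newRDD.count // same as a, even if some partitions were evicted and recomputed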

Why does calling cache take a long time on a Spark Dataset?

I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this:
val conversations = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("forward_spark_s3_credentials", "true")
  .option("query", "SELECT * FROM my_table " +
    "WHERE date <= '2017-06-03' " +
    "AND date >= '2017-03-06' ")
  .load()
  .cache()
If I leave off the cache, the code executes quickly because Datasets are evaluated lazily. But if I add the cache(), the block takes a long time to run.
From the online Spark UI's Event Timeline, it appears that the SQL table is being transmitted to the worker nodes and then cached on the worker nodes.
Why is cache executing immediately? The source code appears to only mark it for caching when the data is computed:
The source code for Dataset calls through to this code in CacheManager.scala when cache or persist is called:
/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  val planToCache = query.logicalPlan
  if (lookupCachedData(planToCache).nonEmpty) {
    logWarning("Asked to cache already cached data.")
  } else {
    val sparkSession = query.sparkSession
    cachedData.add(CachedData(
      planToCache,
      InMemoryRelation(
        sparkSession.sessionState.conf.useCompression,
        sparkSession.sessionState.conf.columnBatchSize,
        storageLevel,
        sparkSession.sessionState.executePlan(planToCache).executedPlan,
        tableName)))
  }
}
This appears only to mark the data for caching rather than actually caching it, and I would expect caching to return immediately based on other answers on Stack Overflow as well.
Has anyone else seen caching happening immediately before an action is performed on the dataset? Why does this happen?
cache is one of those operators that causes execution of a dataset. Spark will materialize that entire dataset to memory. If you invoke cache on an intermediate dataset that is quite big, this may take a long time.
What might be problematic is that the cached dataset is only stored in memory. When it no longer fits, partitions of the dataset get evicted and are re-calculated as needed (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). With too little memory present, your program could spend a lot of time on re-calculations.
To speed things up with caching, you could give the application more memory, or you can try to use persist(MEMORY_AND_DISK) instead of cache.
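A sketch of that suggestion applied to the question's snippet (rawConversations is a stand-in for the same sqlContext.read ... .load() chain shown above):
import org.apache.spark.storage.StorageLevel

// Evicted partitions are spilled to local disk instead of being recomputed from Redshift.
val conversations = rawConversations.persist(StorageLevel.MEMORY_AND_DISK)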
I now believe that, as Erik van Oosten's answer says, the cache() command causes the query to execute.
A close look at the code in my OP does indeed appear to show that the data is being cached. There are two key lines where I think the caching is occurring:
cachedData.add(CachedData(...))
This line creates a new CachedData object, which is added to a cachedData collection of some sort. While the cached data object may be a placeholder to hold cached data later on, it seems more likely that the CachedData object truly holds cached data.
And more importantly, this line:
sparkSession.sessionState.executePlan(planToCache).executedPlan
appears to actually execute the plan. So based on my experience, Erik van Oosten's gut feeling about what's going on here, and the source code, I believe that calling cache() causes a Spark Dataset's plan to be executed.

Object cache on Spark executors

A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to be used in the processing of elements in an RDD.
Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a torrent-like protocol to move efficiently to all executors, and stay in memory / on local disk until you no longer need them.
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable extends Serializable { // must be serializable to be broadcast
  @transient private var lookupTable: LookupTable[A] = _ // not shipped; rebuilt lazily on each executor

  def get: LookupTable[A] = synchronized { // guard against concurrent loading by multiple tasks
    if (lookupTable == null)
      lookupTable = ??? // <load lookup table from disk>
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
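A hypothetical usage of that wrapper (assuming a SparkContext named sc; rdd, fn and the lookup-table type are the stand-ins from the earlier snippet):
val lookupBc = sc.broadcast(new BroadcastableLookupTable)

rdd.mapPartitions { it =>
  val lookupTable = lookupBc.value.get // loaded at most once per executor JVM
  it.map(elem => fn(lookupTable, elem))
}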
In case serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation. Then use the new lookup table in a broadcast.
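A minimal sketch of that idea (assuming a SparkContext named sc; the fields of A1 and the helpers loadAllA, keyOf and process are hypothetical):
// Serializable snapshot holding only the data needed from A
case class A1(id: String, value: Double)

// Built on the driver from the original A objects, then broadcast to the executors
val lookup: Map[String, A1] = loadAllA().map(a => a.id -> A1(a.id, a.value)).toMap
val lookupBc = sc.broadcast(lookup)

rdd.map(elem => process(elem, lookupBc.value(keyOf(elem))))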

In Spark API, What is the difference between makeRDD functions and parallelize function?

I have a question that came up while building a Spark app.
In Spark API, What is the difference between makeRDD functions and parallelize function?
There is no difference whatsoever. To quote the makeRDD docstring:
This method is identical to parallelize.
and if you take a look at the implementation it simply calls parallelize:
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to the Scala API; PySpark and the internal SparkR API provide only parallelize.
Note: There is a second implementation of makeRDD which allows you to set location preferences, but given a different signature it is not interchangeable with parallelize.
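For reference, that second overload looks roughly like this: each element of the input collection is paired with its preferred host names.
// In SparkContext: distribute a local collection with a list of preferred locations per element
def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T]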
As noted by @zero323, makeRDD has two implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.
For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.
To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
    JavaConversions.asScalaBuffer(rddElemList).toSeq();
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);
In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also has an Integer which acts like a key. My function myFunction consumes that key and can be used to generate a list of paths to my data local to that partition. Then that data can be loaded in the next line.