Dataset.unpersist() unexpectedly affects the count of other RDDs - scala

I ran into a strange problem where calling unpersist() on one Dataset affects the count of another Dataset in the same block of code. Unfortunately this happens during a complex long-running job with many Datasets, so I can't summarize the whole thing here. I know this makes for a difficult question; however, let me try to sketch it out. What I'm looking for is some confirmation that this behavior is unexpected, and any ideas about why it may be occurring or how we can avoid it.
Edit: This problem as reported occurs on Spark 2.1.1, but does not occur on 2.1.0. The problem is 100% repeatable, but only in my project with thousands of lines of code and data; I'm working to distill it down to a concise example but have not yet been able to, and I will post any updates or re-submit my question if I find something. The fact that the exact same code and data works in 2.1.0 but not 2.1.1 leads me to believe it is due to something within Spark.
val claims: Dataset[Row] = // read claims from file
val accounts: Dataset[Row] = // read accounts from file
val providers: Dataset[Row] = // read providers from file
val payers: Dataset[Row] = // read payers from file
val claimsWithAccount: Dataset[Row] = // join claims and accounts
val claimsWithProvider: Dataset[Row] = // join claims and providers
val claimsWithPayer: Dataset[Row] = // join claimsWithProvider and payers
claimsWithPayer.persist(StorageLevel.MEMORY_AND_DISK)
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 46
// This is considered unnecessary intermediate data and can leave the cache
claimsWithAccount.unpersist()
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 41
Essentially, calling unpersist() on one of the intermediate data sets in a series of joins affects the number of rows in one of the later data sets, as reported by Dataset.count().
My understanding is that unpersist() should remove data from the cache, but that it shouldn't affect the count or contents of other data sets. This is especially surprising since I explicitly persist claimsWithPayer before I unpersist the other data.

I believe the behaviour you are experiencing is related to the change described as "UNCACHE TABLE should un-cache all cached plans that refer to this table".
I think you may find more information in SPARK-21478 Unpersist a DF also unpersists related DFs, where Xiao Li said:
This is by design. We do not want to use the invalid cached data.
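In other words, since that change, unpersisting a Dataset also drops cached plans that were built on top of it. A minimal sketch of that cascading behaviour, using toy data and an assumed SparkSession named spark (not the original claims job):
import spark.implicits._

val base = spark.range(100).toDF("id")
base.persist()

val derived = base.filter($"id" % 2 === 0)
derived.persist()
derived.count()  // materializes both caches

base.unpersist() // on Spark >= 2.1.1 this also drops the cached plan for
                 // `derived`, because that plan refers to `base`; `derived`
                 // is then recomputed from scratch on its next action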

Related

Spark - Strategy for persisting derived dataframes when parent DF is already persisted

I have not found a clear answer to this question yet, even though there are multiple similar questions on SO.
I don't fill in all the details for the code below, as the actual transformations are not important for my questions.
// Adding _corrupt_record to have records that are not valid json
val inputDf = spark.read.schema(someSchema.add("_corrupt_record", StringType)).json(path)
/**
* The following lazy-persists the DF and does not return a new DF. Since
* Spark>=2.3 the queries from raw JSON/CSV files are disallowed when the
* referenced columns only include the internal corrupt record column
* (named _corrupt_record by default). Caching is the workaround.
*/
inputDf.persist
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull)
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull)
// Doing a count on both derived DFs - corruptedDf will also be output for further investigation
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
// Not corrupted data will be used for some complicated transformations
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
As you can see, I marked the input dataframe inputDf for persistence (see the reason here), but never did a count on it. Then I derived two dataframes and did a count on each.
Question 1: When I do uncorruptedDf.count, what does it do to the parent dataframe inputDf? Does it trigger caching of the whole inputDf, only the part of it that corresponds to uncorruptedDf.count, or nothing? The RDD documentation says:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Question 2: Does it make sense at this point (before the two counts) to persist the derived dataframes corruptedDf and uncorruptedDf and unpersist inputDf? Since there are two actions happening on each derived dataframe, I would say yes, but I am not sure. If so, what is the correct place to unpersist the parent DF below: (A), (B), or (C)?
uncorruptedDf.persist
corruptedDf.persist
// (A) I don't think I should inputDf.unpersist here, since derived DFs are not yet persisted
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
// (B) This seems a reasonable place, to free some memory
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
// (C) Is there any value from unpersisting here?
Question 3: Same as the previous question, but for finalDf vs corruptedDf. As can be seen, I perform two actions on finalDf: count and write.
Thanks in advance!
For question 1:
Yes, it persists inputDf when the first count is called, which is uncorruptedDf.count, but it won't persist any transformation that you do on top of inputDf. On the next count it won't read the data from the JSON file again; it will read it from the partitions that it cached.
For question 2:
I think you should not persist inputDf, as there is nothing you gain from it. Persisting corruptedDf and uncorruptedDf makes sense, as you are performing multiple actions on them. You are just performing a transformation on inputDf to filter corrupt and uncorrupt records, and Spark is smart enough to combine it into one step during its physical planning stage. To conclude, you should not persist inputDf, and that way you do not have to worry about unpersisting it.
For question 3:
You should not persist the final dataframe, as you are only performing one action on it: writing it to a physical path as a JSON file.
PS: Don't try to cache/persist every dataframe, as caching itself has a performance impact: Spark has to do additional work to keep the data in memory or save it to disk, based on the storage level that you specify. If the transformations are few and not complex, it is better to avoid caching. You can use the explain command on a dataframe to see its physical and logical plans.
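For instance, a quick sanity check (a sketch, using the dataframes above) is to print the plans and look for the two filters being applied on top of a single scan rather than two separate reads:
uncorruptedDf.explain(true) // prints parsed, analyzed, optimized and physical plans
corruptedDf.explain(true)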

Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters; it makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet; the base data is too big and queries work on around 1% of that data). I already measured everything, and it saved me around 30 minutes of execution time :). Plus my method already takes care of the case where the isin is larger than 200,000.
My problem is that I'm getting some Spark "task is too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcasted var gets resolved in Scala and passed to varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the task too big warning still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin which allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a dataframe broadcast hash join in this case instead of a broadcast variable.
Prepare a dataframe with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two dataframes to filter the matching rows.
I think it would be more efficient than isin, since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
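A minimal sketch of that approach, reusing collList from the question and assuming a SparkSession named spark (a left-semi join keeps only the rows of df whose col1 appears in the broadcast side):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val filterDf = collList.toDF("col1") // the ~200,000 keys as a one-column DF
val retTable = df.join(broadcast(filterDf), Seq("col1"), "left_semi")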
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe
I'll just leave it with the big tasks, since I only use it twice in my program (but it saves a lot of time) and I can afford it, but if someone else needs it badly... well, this seems to be the path.
The best alternatives I found for big-array pushdown:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcasted trash, but well..., as long as your app is not streaming it shouldn't be a problem, or you can save the broadcasts in a global list and clean them after a while.
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcasted pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcasted.
Use coalesce(X) after reading to decrease the number of tasks; this can work sometimes, depending on how the RelationProvider/RDD is implemented.

Why does calling cache take a long time on a Spark Dataset?

I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this:
val conversations = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("forward_spark_s3_credentials", "true")
  .option("query", "SELECT * FROM my_table " +
    "WHERE date <= '2017-06-03' " +
    "AND date >= '2017-03-06' ")
  .load()
  .cache()
If I leave off the cache, the code executes quickly because Datasets are evaluated lazily. But if I put on the cache(), the block takes a long time to run.
From the online Spark UI's Event Timeline, it appears that the SQL table is being transmitted to the worker nodes and then cached on the worker nodes.
Why is cache executing immediately? The source code appears to only mark the dataset for caching, deferring the actual work until the data is computed.
The source code for Dataset calls through to this code in CacheManager.scala when cache or persist is called:
/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  val planToCache = query.logicalPlan
  if (lookupCachedData(planToCache).nonEmpty) {
    logWarning("Asked to cache already cached data.")
  } else {
    val sparkSession = query.sparkSession
    cachedData.add(CachedData(
      planToCache,
      InMemoryRelation(
        sparkSession.sessionState.conf.useCompression,
        sparkSession.sessionState.conf.columnBatchSize,
        storageLevel,
        sparkSession.sessionState.executePlan(planToCache).executedPlan,
        tableName)))
  }
}
Which only appears to mark the data for caching rather than actually caching it. And I would expect caching to return immediately, based on other answers on Stack Overflow as well.
Has anyone else seen caching happening immediately before an action is performed on the dataset? Why does this happen?
cache is one of those operators that causes execution of a dataset. Spark will materialize that entire dataset to memory. If you invoke cache on an intermediate dataset that is quite big, this may take a long time.
What might be problematic is that the cached dataset is only stored in memory. When it no longer fits, partitions of the dataset get evicted and are re-calculated as needed (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). With too little memory present, your program could spend a lot of time on re-calculations.
To speed things up with caching, you could give the application more memory, or you can try to use persist(MEMORY_AND_DISK) instead of cache.
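A minimal sketch of that suggestion, with a hypothetical dataframe standing in for the Redshift load:
import org.apache.spark.storage.StorageLevel

val df = spark.range(1000000L).toDF("id") // hypothetical stand-in for the loaded table
df.persist(StorageLevel.MEMORY_AND_DISK)  // evicted partitions spill to disk instead of being recomputed
df.count()                                // materializes the cache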
I now believe that, as Erik van Oosten's answer says, the cache() command causes the query to execute.
A close look at the code in my OP does indeed appear to show that the data is being cached eagerly. There are two key lines where I think the caching is occurring:
cachedData.add(CachedData(...))
This line creates a new CachedData object, which is added to a cachedData collection of some sort. While the cached data object could be a placeholder for data to be cached later on, it seems more likely that the CachedData object truly holds cached data.
And more importantly, this line:
sparkSession.sessionState.executePlan(planToCache).executedPlan
appears to actually execute the plan. So based on my experience, Erik van Oosten's gut feeling about what's going on here, and the source code, I believe that calling cache() causes a Spark Dataset's plan to be executed.

What's the advantage of the streaming support in Anorm (Play Scala)?

I have been reading the section Streaming results in the Play docs. What I expected to find is a way to create a Scala Stream based on the results, so if I run a query that returns 10,000 rows that need to be parsed, it would parse them in batches (e.g. 100 at a time) or would just parse the first one and parse the rest as they're needed (so, a Stream).
What I found (from my understanding, I might be completely wrong) is basically a way to parse the results one by one, but at the end it creates a list with all the parsed results (with an arbitrary limit if you like, in this case 100 books). Let's take this example from the docs:
val books: Either[List[Throwable], List[String]] =
  SQL("Select name from Books").foldWhile(List[String]()) { (list, row) =>
    if (list.size == 100) (list -> false) // stop with `list`
    else (list :+ row[String]("name")) -> true // continue with one more name
  }
What advantages does that provide over a basic implementation such as:
val books: List[String] = SQL("Select name from Books").as(SqlParser.str("name").*)
Parsing a very large number of rows is just inefficient. You might not see it as easily for a simple class, but when you start adding a few joins and have a more complex parser, you will start to see a huge performance hit when the row count gets into the thousands.
From my personal experience, queries that return 5,000 - 10,000 rows (and more) that the parser tries to handle all at once consume so much CPU time that the program effectively hangs indefinitely.
Streaming avoids the problem of trying to parse everything all at once, or even waiting for all the results to make it back to the server over the wire.
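For example, Anorm's withResult hands you a cursor so each row is parsed as you advance, without accumulating the full parsed List. A sketch, assuming an implicit java.sql.Connection is in scope:
import java.sql.Connection
import anorm._

// Walk the cursor row by row; replace the counting with real per-row work.
@annotation.tailrec
def go(c: Option[Cursor], n: Int): Int = c match {
  case Some(cursor) =>
    val name = cursor.row[String]("name") // parse only the current row
    go(cursor.next, n + 1)
  case None => n
}

def countBooks(implicit con: Connection): Either[List[Throwable], Int] =
  SQL("Select name from Books").withResult(go(_, 0))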
What I would suggest is using the Anorm query result as a Source with Akka Streams.
I successfully stream hundreds of thousands of rows that way.

How do you perform blocking IO in apache spark job?

What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is not suitable.
I just wonder if anyone has had such a problem, and how you solved it.
What I'm trying to achieve is parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably there is another solution more suitable for Spark, like having multiple worker nodes on a single host.
It's interesting to know how you cope with such a challenge. Thanks.
Here is the answer to my own question:
import scala.concurrent.{ Await, Future }
import scala.concurrent.ExecutionContext.Implicits.global

val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets.map { item =>
  Future {
    // call native code
  }
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future.sequence(f)
  Await.result(searchFuture, JOB_TIMEOUT)
}
The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest unit of work. Each unit contains data that can be processed by calling the native code.
The values collection contains the data returned from the native code, and that work is done across the cluster.
Based on your answer that the blocking call is to compare the provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in Java/Scala so that it can be run as part of your Spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your Spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create an RDD transformation which turns your data into an RDD of futures, in pseudo-code:
def callRemote(data: Data): Future[Double] = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
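One hedged way to carry on from there (JOB_TIMEOUT is an assumed constant) is to resolve each partition's futures in one batch, so the remote calls within a partition overlap instead of running one by one:
import scala.concurrent.{ Await, Future }
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val JOB_TIMEOUT = 10.minutes // assumed; size it to your remote service
val results: RDD[Double] = transformed.mapPartitions { futures =>
  // Future.sequence consumes the iterator, starting every future in the
  // partition, then we block once for the whole batch.
  Await.result(Future.sequence(futures), JOB_TIMEOUT)
}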
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism: Int = 100 // some constant
def callRemoteBlocking(data: Data): Double = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Double] = inputData.
  coalesce(remoteParallelism).
  map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option: if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and look them up at Spark job time using a join.
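A hedged sketch of that precompute-and-join idea (the path and the offline job are assumptions; precomputed holds the (input, output) pairs produced once by the remote service):
// Assumed: an offline job has written a (Data, Double) pair for every
// expected input to this path.
val precomputed: RDD[(Data, Double)] = sc.objectFile[(Data, Double)]("/path/to/precomputed")
val keyed: RDD[(Data, Unit)] = inputData.map(d => (d, ()))
val results: RDD[(Data, Double)] = keyed.join(precomputed).mapValues(_._2)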