How to keep an RDD persisted and consistent? (Scala)

I have the following code (simplification for a complex situation):
val newRDD = prevRDD.flatMap { a =>
  Array.fill[Int](scala.util.Random.nextInt(10)){scala.util.Random.nextInt(2)}
}.persist()
val a = newRDD.count
val b = newRDD.count
Even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so that when the second action is called, the results of the first action are returned?
* Edit *
The issue I have is apparently caused by the zipWithIndex method in my code, which creates indices higher than the count. I'll ask about it in a different thread. Thanks.

There is no way to make it 100% consistent.
When you call persist, Spark will try to cache all of the partitions in memory if they fit.
Otherwise, it will recompute any partitions that do not fit in memory.
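If you need the two counts to match, here is a minimal sketch of two options (these are my suggestions, not part of the answer above; prevRDD is the RDD from the question): persist with a storage level that spills to disk so evicted partitions are not recomputed, or make the random generation deterministic per partition so any recomputation reproduces the same data.
import scala.util.Random
import org.apache.spark.storage.StorageLevel

// 1) Spill to disk instead of recomputing evicted partitions.
val cachedRDD = prevRDD.flatMap { _ =>
  Array.fill[Int](Random.nextInt(10))(Random.nextInt(2))
}.persist(StorageLevel.MEMORY_AND_DISK)

// 2) Make the randomness itself deterministic, so a recomputed partition
//    produces exactly the same data as the first run.
val deterministicRDD = prevRDD.mapPartitionsWithIndex { (partIdx, iter) =>
  val rng = new Random(partIdx)   // one seeded generator per partition
  iter.flatMap(_ => Array.fill[Int](rng.nextInt(10))(rng.nextInt(2)))
}.persist()

val a = deterministicRDD.count
val b = deterministicRDD.count   // identical even if some partitions were recomputed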


Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
//collList.size == 200.000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet; the base data is too big and queries work on around 1% of that data). I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list is larger than 200,000.
My problem is that I'm getting some Spark "task is too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to the varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the task-too-big warning still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable (hacks allowed)? Any alternative to isin which allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a dataframe broadcast hash join in this case instead of a broadcast variable.
Prepare a dataframe with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two dataframes to filter the matching rows.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe.
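A rough sketch of the join-based filter described above (the left_semi join type, the toDF helper, and the assumption that a SparkSession named spark is in scope are mine, not the answer's literal code):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Put the ~200k filter values into a small dataframe instead of passing them to isin as varargs.
val filterDf = collList.toSeq.toDF("col1").distinct()

// Below spark.sql.autoBroadcastJoinThreshold the small side is broadcast automatically;
// the broadcast() hint forces a broadcast hash join regardless of the threshold.
val retTable = df.join(broadcast(filterDf), Seq("col1"), "left_semi")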
I'll just live with the big tasks, since I only use it twice in my program (but it saves a lot of time) and I can afford it. But if someone else needs it badly... well, this seems to be the path.
The best alternatives I found to push down big arrays:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not streaming it shouldn't be a problem; or you can keep the broadcasts in a global list and clean them up after a while.
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks. This can sometimes work, depending on how the RelationProvider/RDD is implemented (see the sketch below).
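A minimal sketch of that last option (the Parquet path, the partition count, and the reuse of collList are assumptions, not from the original post):
import org.apache.spark.sql.functions.col

// Shrinking the partition count right after the read also shrinks the number of tasks
// that each have to carry the big serialized isin list.
val df = spark.read.parquet("/data/base_table").coalesce(64)
val retTable = df.filter(col("col1").isin(collList: _*))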

How to properly apply HashPartitioner before a join in Spark?

To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?
val rddA = ...
val rddB = ...
val numOfPartitions = rddA.getNumPartitions
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))
val rddAB = rddApartitioned.join(rddBpartitioned)
"To reduce shuffling during the joining of two RDDs"
It is a surprisingly common misconception that repartitioning reduces or even eliminates shuffles. It doesn't. Repartitioning is a shuffle in its purest form. It doesn't save time, bandwidth, or memory.
The rationale behind using a proactive partitioner is different: it allows you to shuffle once and reuse that state to perform multiple by-key operations without additional shuffles (though, as far as I am aware, not necessarily without additional network traffic, since co-partitioning doesn't imply co-location, except for cases where the shuffles occurred in a single action).
So your code is correct, but in a case where you join only once it doesn't buy you anything.
Just one comment: it is better to append .persist() after .partitionBy if there are multiple actions on rddApartitioned and rddBpartitioned; otherwise, every action will evaluate the entire lineage of rddApartitioned and rddBpartitioned, which will cause the hash partitioning to take place again and again.
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions)).persist()
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions)).persist()
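A short sketch of the case where the pre-partitioning actually pays off, i.e. when the shuffled state is reused by several by-key operations (rddC is a hypothetical third pair RDD; everything else follows the question's code):
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(rddA.getNumPartitions)

val rddApartitioned = rddA.partitionBy(partitioner).persist()
val rddBpartitioned = rddB.partitionBy(partitioner).persist()

val joinedAB = rddApartitioned.join(rddBpartitioned) // co-partitioned: no further shuffle of A or B
val joinedAC = rddApartitioned.join(rddC)            // A keeps its partitioning; only rddC is shuffled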

Recursive Dataframe operations

In my Spark application I would like to do operations on a dataframe in a loop and write the result to HDFS.
pseudocode:
var df = emptyDataframe
for n = 1 to 200000 {
  someDf = read(n)
  df = df.mergeWith(someDf)
}
df.writeToHdfs
In the above example I get good results when "mergeWith" does a unionAll.
However, when in "mergeWith" I do a (simple) join, the job gets really slow (>1h with 2 executors with 4 cores each) and never finishes (job aborts itself).
In my scenario I throw in ~50 iterations with files that just contain ~1mb of text data.
Because the order of merges is important in my case, I suspect this is due to the DAG generation, causing the whole thing to be run at the moment I store away the data.
Right now I'm attempting to use a .persist on the merged data frame but that also seems to go rather slowly.
EDIT:
As the job was running I noticed (even though I did a count and a .persist) that the dataframe in memory didn't look like a static dataframe.
It looked like a strung-together path of all the merges it had been doing, effectively slowing down the job linearly.
Am I right to assume the var df is the culprit of this?
breakdown of the issue as I see it:
dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....
While I would expect DFs A, C and D to be objects, Spark thinks differently and does not care whether I persist or repartition or not.
To Spark it looks like this:
dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
Update 2:
To get around persisting not working on the DFs, I could "break" the lineage by converting the DF to an RDD and back again.
This has a little bit of overhead, but an acceptable one (the job finishes in minutes rather than hours/never).
I'll run some more tests on the persisting and formulate an answer in the form of a workaround.
Result:
This only seems to fix these issues on the surface. In reality I'm back at square one and get OOM exceptions: java.lang.OutOfMemoryError: GC overhead limit exceeded.
If you have code like this:
var df = sc.parallelize(Seq(1)).toDF()
for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)
}
Then df will have 400,000 partitions afterwards, which makes the following actions inefficient (because you have one task per partition).
Try to reduce the number of partitions to e.g. 200 before persisting the dataframe (using e.g. df.coalesce(200).write.saveAsTable(....)).
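A rough sketch of that suggestion folded back into the loop above (the coalesce cadence and the table name are my assumptions, not the answer's literal code). Note this only caps the number of partitions and tasks; the lineage still grows with every union, which is what the workaround below addresses.
var df = sc.parallelize(Seq(1)).toDF()
for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)
  if (i % 1000 == 0) df = df.coalesce(200) // keep the partition/task count bounded
}
df.coalesce(200).write.saveAsTable("merged_result") // table name is an assumption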
So the following is what I ended up using. It's performant enough for my use case; it works and does not need persisting.
It is very much a workaround rather than a fix.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame

val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataFrame)

for (df <- dataFramesToMerge) { // placeholder for whatever drives the loop and supplies the next df
  val interm = mergeDataFrame(df, mutableBufferArray.last)
  val intermSchema = interm.schema
  // round-trip through an RDD to break the generated query pipeline
  val intermRDD = interm.rdd.repartition(8)
  mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
  mutableBufferArray.remove(0) // keep only the latest merged dataframe
}
This is how I wrestle Tungsten into compliance.
By going from a DF to an RDD and back, I end up with a real object rather than a whole Tungsten-generated processing pipe from front to back.
In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start over fresh.
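Not the workaround above, but a closely related way to truncate the lineage if you are on Spark 2.1 or later (the version requirement, checkpoint directory, and toy loop below are my assumptions, not part of the answer): Dataset.checkpoint() materialises the data and cuts the logical plan.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // directory is an assumption

var merged = spark.range(1).toDF("id")
for (i <- 1 to 200) {
  merged = merged.union(spark.range(i).toDF("id"))
  if (i % 50 == 0) merged = merged.checkpoint() // break the plan every 50 merges
}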

Apache Spark RDD - not updating

I create a PairRDD which contains a Vector.
var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
Later on I update the RDD:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
However, although it outputs an updated Vector (as shown in the console), when I next call newRDD I can see the Vector value has changed. Through testing I have concluded that it has changed to something given by math.random - as every time I call newRDD the Vector changes. I understand there is a lineage graph and maybe that has something to do with it. I need to update the Vector held in the RDD to new values and I need to do this repeatedly.
Thanks.
RDDs are immutable structures meant to distribute operations on data over a cluster.
There are two elements playing a role in the behavior you are observing here:
RDD lineage may be computed every time. In this case, it means that an action on newRDD might trigger the lineage computation, therefore applying the Vector(Array.fill(2){math.random}) transformation and resulting in new values each time. The lineage can be broken using cache, in which case the value of the transformation will be kept in memory and/or disk after the first time it's applied.
This results in:
val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
randomVectorRDD.cache()
The second aspect that needs further consideration is the in-place mutation:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
Although this might work on a single machine because all Vector references are local, it will not scale to a cluster, as the lookup results will be serialized and the mutations will not be preserved. It therefore raises the question of why to use Spark for this.
To be implemented on Spark, this algorithm needs to be redesigned so it is expressed in terms of transformations instead of point lookups and mutations.
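A minimal sketch of that redesign (the update function, the Array-of-doubles value type, and the iteration count are placeholders, not the asker's actual algorithm): each step builds a new RDD via a transformation instead of mutating values obtained from lookup().
def update(v: Array[Double]): Array[Double] = v.map(_ + 0.2 * math.random) // placeholder adjustment

var state = oldRDD.mapValues(_ => Array.fill(2)(math.random)).cache()

for (_ <- 1 to 10) {
  val next = state.mapValues(update).cache()
  next.count()      // materialise the new state before dropping the old one
  state.unpersist()
  state = next
}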

Unexpected Scala collections memory behaviour

The following Scala code (on 2.9.2):
var a = (0 until 100000).toStream
for (i <- 0 until 100000) {
  val memTot = Runtime.getRuntime().totalMemory().toDouble / (1024.0 * 1024.0)
  println(i, a.size, memTot)
  a = a.map(identity)
}
uses an ever-increasing amount of memory on every iteration of the loop. If a is defined as (0 until 100000).toList, then the memory usage is stable (give or take GC).
I understand that streams evaluate lazily but retain elements once they are generated. But it appears that in my code above, each new stream (generated by the last line of code) somehow keeps a reference to previous streams. Can someone help explain?
Here is what happens. A Stream is always evaluated lazily, but already-calculated elements are "cached" for later. The lazy evaluation is crucial. Look at this piece of code:
a = a.flatMap( v => Some( v ) )
Although it looks as if you were transforming one Stream into another and discarding the old one, this is not what happens. The new Stream still keeps a reference to the old one. That's because the resulting Stream must not eagerly compute all the elements of the underlying stream, but do so on demand. Take this as an example:
io.Source.fromFile("very-large.file").getLines().toStream.
map(_.trim).
filter(_.contains("X")).
map(_.substring(0, 10)).
map(_.toUpperCase)
You can chain as many operations as you want, but the file is barely touched (just enough to read the first line). Each subsequent operation just wraps the previous Stream, holding a reference to the child stream. The moment you ask for the size or do a foreach, the evaluation starts.
Back to your code. In the second iteration you create a third stream, holding a reference to the second one, which in turn keeps a reference to the one you initially defined. Basically, you have a growing stack of pretty big objects.
But this doesn't explain why the memory leaks so fast. The crucial part is... println(), or a.size to be precise. Without printing (and thus evaluating the whole Stream), the Stream remains unevaluated. An unevaluated stream doesn't cache any values, so it's very slim. Memory would still leak due to the growing chain of streams wrapping one another, but much, much more slowly.
This begs the question: why does it work with toList? It's quite simple. List.map() eagerly creates a new List. Period. The previous one is no longer referenced and becomes eligible for GC.
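A short sketch contrasting the two behaviours (the identity map mirrors the question's loop; forcing via toList is my suggestion, not part of the answer):
// With a strict List, each map produces an independent collection,
// so the previous one becomes unreachable and can be collected.
var a: List[Int] = (0 until 100000).toList
for (_ <- 0 until 100000) {
  a = a.map(identity) // the old list is no longer referenced
}

// If a Stream is really needed, force it at each step so the new value
// no longer wraps the whole chain of previous streams.
var s = (0 until 100000).toStream
s = s.map(identity).toList.toStream // evaluates eagerly, then re-wraps a plain List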