Do I need to save intermediate subsets of data while building a decision tree on Spark recursively? - scala

I am building a decision tree on Scala/Spark (on a 50-node cluster). Since my dataset is somewhat big (~2 TB), I want to parallelise the build.
My code looks like this:
def buildTree(data: RDD[Array[Double]], numInstances: Int): Node = {
  // Base case
  if (numInstances < minInstances) {
    return new Node(isLeaf = true)
  }
  /*
   * Find best split for all columns in data
   */
  val leftRDD  = data.filter(leftSplitCriteria)
  val rightRDD = data.filter(rightSplitCriteria)
  val subset   = Seq(leftRDD, rightRDD)
  val counts   = Seq(numLeft, numRight)
  val children = (0 until 2)
    .map(i => (i, subset(i), counts(i)))
    .par
    .map(x => buildTree(x._2, x._3))
  return new Node(children(0), children(1), Split)
}
My questions are:
Since Scala is a lazy language, it doesn't immediately compute the output of the map/filter operations. So while building a new Node, do all the filters of the parents, and of the parents of the parents, get stacked up (and applied recursively)?
What would be the best approach to build the tree in parallel? Should I cache/save the dataset in the intermediate steps?
While running this code, is it sufficient to just provide num-executors, or would it make a difference if I also set executor-cores, driver-cores, etc.?

I assume that numLeft is computed using leftRDD.count(); counting is an action and will force the computation of all the dependent RDDs.
You will actually do the filtering more than once in this case: once for the count and again for each child's dependency. You should cache your RDD to avoid the double computation, and since you only need the latest subset, you can unpersist the previous one at every stage.
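As a rough illustration of that cache/count/unpersist pattern (a sketch only, reusing the placeholders from the question: minInstances, the split criteria, Node and Split):

def buildTree(data: RDD[Array[Double]]): Node = {
  data.cache()                         // keep this subset around for the count and both child filters
  val numInstances = data.count()      // action: materialises the cached subset exactly once
  if (numInstances < minInstances) {
    data.unpersist()
    return new Node(isLeaf = true)
  }
  // ... find the best split for all columns ...
  val left  = data.filter(leftSplitCriteria).cache()
  val right = data.filter(rightSplitCriteria).cache()
  left.count(); right.count()          // force the children while the parent is still cached
  data.unpersist()                     // the parent subset is no longer needed
  new Node(buildTree(left), buildTree(right), Split)
}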
See Apache Spark Method returning an RDD (with Tail Recursion) for more explanation
Side note: Spark uses a lazy evaluation model; I don't think we would say Scala itself is a lazy language.

I ended up parallelising the split finding at each level by feature (see the sketch after the links below).
Refer to:
http://zhanpengfang.github.io/418home.html
http://tullo.ch/articles/speeding-up-decision-tree-training/
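Very roughly, per-feature split search at a single level can look like the following sketch (findBestSplitForFeature is a hypothetical helper, this is not the code from the linked articles, and .par is used the same way as in the question's code):

// Hypothetical helper: scans `data` once and returns (threshold, impurityGain) for feature f.
def findBestSplitForFeature(data: RDD[Array[Double]], f: Int): (Double, Double) = ???

val numFeatures = data.first().length - 1     // assuming the last column is the label
val bestPerFeature = (0 until numFeatures).par.map { f =>
  (f, findBestSplitForFeature(data, f))       // each feature's search submits its own Spark jobs
}
val (bestFeature, (bestThreshold, _)) = bestPerFeature.maxBy(_._2._2)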

Related

Spark caching in combination with multiple sources and actions

I have read a lot of articles, blog posts and Stack Overflow posts, but I still can't wrap my head around how Spark will cache the datasets in my specific use case, which involves lots of transformations but only a few read and save actions. Here's my use case in pseudo-code:
val ds1 = spark.loadFromDatabase("table_1") // Action (1)
val ds2 = spark.loadFromDatabase("table_2") // Action (2)
val ds3 = spark.loadFromDatabase("table_3") // Action (3)
val intermediateDs1 = transform(ds3)
val intermediateDs2 = transform(ds1, intermediateDs1)
val intermediateDs3 = transform(ds2, intermediateDs1, intermediateDs2)
val intermediateResultDs1 = transform(intermediateDs2)
val intermediateResultDs2 = transform(intermediateDs3)
val finalResult1 = transform(intermediateResultDs1)
val finalResult2 = transform(intermediateResultDs2)
spark.writeToDatabase(finalResult1, "table_1") // Action (4)
spark.writeToDatabase(finalResult2, "table_2") // Action (5)
I want to achieve two things:
Prevent Spark from loading the data from the tables more than once, for performance reasons, but also because the write actions will replace the table contents and would therefore lead to unexpected behavior while executing Action (5).
Prevent Spark from executing some of the transformations multiple times, for performance reasons (e.g. intermediateDs2 and intermediateDs3 both depend on intermediateDs1).
So I experimented with cache() and unpersist(), but I'm quite unsure how to optimize the execution. At first I thought it would be a good idea to cache the datasets that are used multiple times and unpersist them when they are no longer needed, to free up memory:
val ds1 = spark.loadFromDatabase("table_1")
val ds2 = spark.loadFromDatabase("table_2")
val ds3 = spark.loadFromDatabase("table_3")
val intermediateDs1 = transform(ds3).cache()
val intermediateDs2 = transform(ds1, intermediateDs1).cache()
val intermediateDs3 = transform(ds2, intermediateDs1, intermediateDs2)
val intermediateResultDs1 = transform(intermediateDs2)
val intermediateResultDs2 = transform(intermediateDs3)
intermediateDs2.unpersist() // not needed anymore
intermediateDs1.unpersist() // not needed anymore
val finalResult1 = transform(intermediateResultDs1)
val finalResult2 = transform(intermediateResultDs2)
spark.writeToDatabase(finalResult1, "table_1")
spark.writeToDatabase(finalResult2, "table_2")
But I get the feeling that my assumptions regarding unpersist() are wrong, see Understanding Spark's caching.
Which datasets should be cached AND unpersisted in which order in that specific scenario to achieve these goals?
Thanks!
You actually did this correctly. For readability I wouldn't put the cache() on the same line as the assignment, but I guess it doesn't matter.
Now it's important to understand that Spark is lazy: no transforms will happen until an action occurs (the write to the database). Spark will try not to revisit the database for data and will cache it, if it can. But it will revisit the source if the entire set doesn't fit in memory, and that's just a reality. I wouldn't get too hung up about it; it's better to see if it works first and hits your SLA. If it does: great. If it doesn't, I'd look at optimizing your code first before playing with memory settings, but that's a problem for another day.
You correctly cached and unpersisted.
(As an aside.) During development I might suggest writing the data to a separate output table (not the same table). This will save you time on data loads and help you check that you did things correctly. I'm not concerned about concurrency, but it's likely just a better idea not to clobber your input data if you have the space.

Using a broadcast variable or an RDD filter for computing the intersection of two nodes' neighbors?

I have used GraphLoader to load my graph into RDDs. Each node in the graph has some neighbors. The main goal is to find their intersection and do some parallel and distributed operations on them.
Each node initially has the attribute 1, and I have changed the attribute to (label, ISimportant) using the code below:
case class nodes_properties(label: Int, ISimportant: Boolean = false)
var work_graph = graph.mapVertices { case (node, property) => nodes_properties(node.toInt, false) }
Every time any node updates its label, work_graph gets updated.
I have used two methods for finding the common neighbors (the intersection of two nodes' neighbor sets). I should mention that I will execute them on a cluster, not locally. For example:
neighbors(1) = [2, 3, 6, 9]
neighbors(2) = [1, 3, 5, 9]
intersection(1, 2) = [3, 9]
First method:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either).cache()
val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)
val common_neighbors = nvalues.value(1).intersect(nvalues.value(2))
common_neighbors.foreach { x =>
  work_graph = work_graph.mapVertices((vid: VertexId, v: nodes_properties) =>
    if (vid == x) nodes_properties(core_node_label)
    else v
  )
}
Second method:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either).cache()
val common_neighbors2 = all_neighbors.filter(x => x._1 == 1).intersection(all_neighbors.filter(x => x._1 == 2))
common_neighbors2.foreach { x =>
  work_graph = work_graph.mapVertices((vid: VertexId, v: nodes_properties) =>
    if (vid == x) nodes_properties(core_node_label)
    else v
  )
}
Question:
My question is this: which of the above methods runs in a parallel and distributed way? That is, if I use method 1 and the broadcast variable for computing common neighbors, will the foreach that applies the updates run on all the workers in a distributed way? And if I use method 2 and use filter for computing common neighbors, will the subsequent foreach run distributed?
As far as I can see, neither of the methods will run in parallel.
First method: it will be executed at the location of the graph, but sequentially, because you call foreach on a plain Scala collection (the value looked up from the broadcast Map), not on an RDD.
Second method: I guess it should execute the filtering in parallel (sorry, I haven't worked with GraphX), but then it gets into a sequential foreach execution.
It's also noted at Spark's GraphX doc page that
Note that collectNeighborIds and collectNeighbors operators can be quite costly as they duplicate information and require substantial communication. If possible try expressing the same computation using the mapReduceTriplets operator directly.
I would suggest rearranging your code according to the advice above; map-reduce (I'm referring to mapReduceTriplets) will be executed in a parallel and distributed way for sure.
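For what it's worth, here is a rough sketch of that idea using aggregateMessages (the operator that replaced mapReduceTriplets in newer GraphX versions). It assumes an existing graph and keeps everything as distributed RDD operations:

import org.apache.spark.graphx._

// Collect each vertex's neighbor ids as a Set, entirely on the cluster.
val neighborSets: VertexRDD[Set[VertexId]] = graph.aggregateMessages[Set[VertexId]](
  sendMsg  = ctx => { ctx.sendToSrc(Set(ctx.dstId)); ctx.sendToDst(Set(ctx.srcId)) },
  mergeMsg = _ ++ _
)

// Intersection of the neighbor sets of vertices 1 and 2, still distributed until the final reduce.
val commonNeighbors: Set[VertexId] = neighborSets
  .filter { case (vid, _) => vid == 1L || vid == 2L }
  .map(_._2)
  .reduce(_ intersect _)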

Scala immutable list internal implementation

Suppose I have a huge list containing the elements from 1 to 1 million:
val initialList = List(1,2,3,.....1 million)
and
val myList = List(1,2,3)
Now suppose I apply an operation such as foldLeft on myList, giving initialList as the starting value, like this:
val output = myList.foldLeft(initialList)(_ :+ _)
// result ==>> List(1,2,3,.....1 million, 1 , 2 , 3)
Now here is my question: with both lists being immutable, the intermediate lists that were produced were
List(1,2,3,.....1 million, 1)
List(1,2,3,.....1 million, 1 , 2)
List(1,2,3,.....1 million, 1 , 2 , 3)
By the concept of immutability, a new list is created each time and the old one is discarded. So isn't this operation a performance killer in Scala, since every time a new list of 1 million elements has to be copied to create the next list?
Please correct me if I am wrong as I am trying to understand the internal implementation of an immutable list.
Thanks in advance.
Yup, this is a performance killer, but it is the cost of having immutable structures (which are amazing and safe, and make programs much less buggy). That's why you should avoid appending to a list whenever you can. There are many tricks to avoid this issue (try to use accumulators).
For example:
Instead of:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = myList.foldLeft(initialList)(_ :+ _)
You can write:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = List(initialList,myList).flatten
flatten is implemented to copy the first list only once, instead of copying it for every single append.
P.S.
At least adding an element to the front of a list is fast (O(1)), because sharing of the old list is possible. Let's look at this example:
You can see how memory sharing works for immutable linked lists: the computer only keeps one copy of the (b, c, d) tail, which the lists share. But if you want to append bar to the end of baz, you cannot modify baz in place, because you would destroy foo, bar and raz! That's why you have to copy the first list.
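A small code illustration of that sharing (the names here are just for the example):

val tail = List("b", "c", "d")
val foo  = "a" :: tail     // O(1): foo reuses tail's cells, nothing is copied
val bar  = "x" :: tail     // O(1): bar reuses the very same cells
val baz  = tail :+ "e"     // O(n): every cell of tail must be copied, or foo and bar would change too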
Appending to a List is not a good idea because List has linear cost for appending. So, if you can,
either prepend to the List (List has constant-time prepend),
or choose another collection that is efficient for appending, such as a Queue (see the sketch below).
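A quick sketch of both options (Queue.from assumes Scala 2.13; either way the result equals the :+ fold above):

import scala.collection.immutable.Queue

// Option 1: build in reverse with O(1) prepends, then reverse once at the end.
val output1 = myList.foldLeft(initialList.reverse)((acc, x) => x :: acc).reverse

// Option 2: use an immutable Queue, which has amortised O(1) append.
val output2 = myList.foldLeft(Queue.from(initialList))(_ :+ _).toList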
For the performance characteristics of each operation on most Scala collections, see:
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
Note that, depending on your requirements, you may also build your own smarter collection, such as a chained iterable, for example.

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This meant going through the collection three times, and that seemed quite unreasonable to me. So I tried replacing them with a single collect (the collect that takes a partial function as an argument). And much to my surprise, this made it run much slower. I tried it locally on regular Scala collections; as expected, the latter way of doing it is much faster.
So why is that? My idea was that the map, filter and map are not applied sequentially, but rather fused into one operation; in other words, when an action forces evaluation, every element of the list is checked and the pending operations are executed. Is that right? But even so, why does the collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if (s.split(" ")(0).contains("hello")) => s(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter-then-map sequence, but it is less efficient in your case.
In Scala, both the isDefinedAt and the apply methods of a PartialFunction evaluate the if part (the guard).
So, in your collect example, the split will be performed twice for each input element.
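If the goal is a genuinely single-pass version, one option (just a sketch, not part of the original answer) is flatMap, which splits each line exactly once and mirrors the predicate of the naive version:

sc.textFile(...).flatMap { l =>
  val s = l.split(" ")
  if (s(1).contains("hello")) Some(s(0)) else None
}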

How to parallelize several apache spark rdds?

I have the following code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it does not run asynchronously. I can do it with Future, like this:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }
val result = for {
  bcs <- bcsFuture
  imps <- impsFuture
} yield (bcs, imps)
Await.result(result, Duration.Inf) // this returns (Array[Bc], Array[Imp])
I want to do this without Future; how can I do it?
Update: This was originally composed before the question was updated. Given those updates, I agree with @stholzm's answer to use cartesian in this case.
There do exist a limited number of actions that produce a FutureAction[A] for an RDD[A] and are executed in the background. These are available on the AsyncRDDActions class, and as long as you import SparkContext._, any RDD can be implicitly converted to AsyncRDDActions as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In addition to evaluating the DAG up to the action in the background, the FutureAction produced has a cancel method to attempt to end processing early.
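Since FutureAction extends scala.concurrent.Future, the two background collects can also be combined much like the explicit-Future version in the question (a sketch; it assumes an implicit ExecutionContext is in scope):

import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val bcsF  = bcs.map(x => wrapBC(x)).collectAsync()
val impsF = imps.map(x => wrapIMP(x)).collectAsync()

val combined = for {
  b <- bcsF
  i <- impsF
} yield (b, i)

Await.result(combined, Duration.Inf) // (Seq[Bc], Seq[Imp])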
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them, you are more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact, then in my experience so far running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos), then each of the asynchronous collects will contend with the other for that monopoly, run its tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or a database, for example, where either the resources in question are completely separate or they generally have enough capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the individual map calls take, because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens to the data. When you finally apply an action, the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps, and to avoid collect as much as possible, because it will fetch all the data into the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union to solve this problem. For example:
val bcsAny  = bcs.map(x => wrapBC(x).asInstanceOf[Any])
val impsAny = imps.map(x => wrapIMP(x).asInstanceOf[Any])
val result = (bcsAny union impsAny).collect()
val bcsResult  = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or other operations, you can give the wrappers a common trait or base class instead of using Any.
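As a rough sketch of that last suggestion (the trait and its field are hypothetical; Bc and Imp stand in for the asker's wrapper types):

sealed trait Wrapped { def key: Long }
case class Bc(key: Long)  extends Wrapped
case class Imp(key: Long) extends Wrapped

// With a shared supertype there is no need for asInstanceOf[Any]:
val wrapped = bcs.map(x => wrapBC(x): Wrapped) union imps.map(x => wrapIMP(x): Wrapped)
val sorted  = wrapped.sortBy(_.key)                // operations like sortBy now typecheck
val result  = sorted.collect()                     // action: Array[Wrapped] on the driver
val bcsOnly = result collect { case bc: Bc => bc }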