Spark job slower on a cluster than standalone - Scala

I have this piece of code which works fine in standalone mode but runs slowly on a cluster of 4 slaves (8 cores, 30 GB memory each) on AWS.
For a file of 10000 entries:
Standalone: 257 s
AWS, 4 slaves: 369 s
def tabHash(nb: Int, dim: Int) = {
  var tabHash0 = Array(Array(0.0)).tail
  for (ind <- 0 to nb - 1) {
    var vechash1 = Array(0.0).tail
    for (ind <- 0 to dim - 1) {
      val nG = Random.nextGaussian
      vechash1 = vechash1 :+ nG
    }
    tabHash0 = tabHash0 :+ vechash1
  }
  tabHash0
}
def hashmin3(x: Vector, w: Double, b: Double, tabHash1: Array[Array[Double]]) = {
  var tabHash0 = Array(0.0).tail
  val x1 = x.toArray
  for (ind <- 0 to tabHash1.size - 1) {
    var sum = 0.0
    for (ind2 <- 0 to x1.size - 1) {
      sum = sum + (x1(ind2) * tabHash1(ind)(ind2))
    }
    tabHash0 = tabHash0 :+ (sum + b) / w
  }
  tabHash0
}
def pow2(tab1: Array[Double], tab2: Array[Double]) = {
  var sum = 0.0
  for (ind <- 0 to tab1.size - 1) {
    sum = sum - Math.pow(tab1(ind) - tab2(ind), 2)
  }
  sum
}
val w = ww
val b = Random.nextDouble * w
val tabHash2 = tabHash(nbseg, dim)
var rdd_0 = parsedData.map(x => (x.get_id, (x.get_vector, hashmin3(x.get_vector, w, b, tabHash2)))).cache
var rdd_Yet = rdd_0
for (ind <- 1 to maxIterForYstar) {
  var rdd_dist = rdd_Yet.cartesian(rdd_0).flatMap { case (x, y) => Some((x._1, (y._2._1, pow2(x._2._2, y._2._2)))) } //.coalesce(64)
  var rdd_knn = rdd_dist.topByKey(k)(Ordering[Double].on(x => x._2))
  var rdd_bary = rdd_knn.map(x => (x._1, Vectors.dense(bary(x._2, k))))
  rdd_Yet = rdd_bary.map(x => (x._1, (x._2, hashmin3(x._2, w, b, tabHash2))))
}
I tried to broadcast some variables:
val w = sc.broadcast(ww.toDouble)
val b = sc.broadcast(Random.nextDouble * ww)
val tabHash2 = sc.broadcast(tabHash(nbseg, dim))
without any effect.
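For reference, broadcast values are read with .value inside closures, so the map over parsedData becomes (a sketch reusing the names above):

val rdd_0 = parsedData
  .map(x => (x.get_id, (x.get_vector, hashmin3(x.get_vector, w.value, b.value, tabHash2.value))))
  .cache()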
I know it's not the bary function, because I tried another version of this code without hashmin3; that version works fine with 4 slaves but worse with 8 slaves, which is a topic for another question.

This is bad code, especially for distributed and large-scale computations. I can't quickly tell what the root of the problem is, but you should rewrite this code anyway.
Array is terrible for general-purpose, shareable data. It is mutable and requires contiguous memory allocation (the latter can be a problem even when you have enough memory). Better to use Vector (or sometimes List). Really, never use arrays for this.
var vechash1 = Array(0.0).tail — you create a collection with one element, then call a function to get an empty collection. If it's rare, don't worry about performance, but it's ugly! Prefer var vechash1: Array[Double] = Array(), var vechash1: Vector[Double] = Vector(), or var vechash1 = Vector.empty[Double].
def tabHash(nb:Int, dim:Int) = — always declare a function's return type when it isn't obvious. The power of Scala is its rich type system, and it's very helpful to have compile-time checks of what you actually get as a result, not what you imagine you get. This matters especially with huge data, because compile-time checks save you time and money. It also makes the code easier to read later: def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] =
def hashmin3(x: Vector, ...) — a typo? It doesn't compile without a type parameter (unless Vector here is org.apache.spark.mllib.linalg.Vector, which the question's use of Vectors.dense suggests).
The first function, more compactly:
def tabHash(nb: Int, dim: Int): Vector[Vector[Double]] = {
  (0 to nb - 1).map { _ =>
    (0 to dim - 1).map(_ => Random.nextGaussian()).toVector
  }.toVector
}
The second function is ((x * M) + scalar_b) / scalar_w. It may be more efficient to use a library that is specifically optimized for matrix work.
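For illustration, a minimal sketch with Breeze (the answer doesn't name a library; Breeze is just one option), where hashmin3 collapses into a single matrix-vector product:

import breeze.linalg.{DenseMatrix, DenseVector}

// m holds the hash vectors as rows; x is the point being hashed
def hashmin3Mat(x: DenseVector[Double], w: Double, b: Double,
                m: DenseMatrix[Double]): DenseVector[Double] =
  (m * x).map(s => (s + b) / w)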
The third (I suspect a mistake here with the sign of the computation, if you are counting squared error):
def pow2(tab1: Vector[Double], tab2: Vector[Double]): Double =
  -tab1.zip(tab2).map { case (t1, t2) => Math.pow(t1 - t2, 2) }.sum // negated, as in the question
var rdd_Yet = rdd_0 — the cached RDD is reassigned inside the loop, so that caching is useless storage.
The last loop is hard to analyze; I think it must be simplified.
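For instance, the whole iteration could be expressed as a fold, so rdd_Yet stops being a var (a sketch reusing the question's names; bary, k, and the element types are taken from the question, and topByKey comes from MLlib's MLPairRDDFunctions):

val rdd_final = (1 to maxIterForYstar).foldLeft(rdd_0) { (rddYet, _) =>
  rddYet.cartesian(rdd_0)
    .map { case ((id, (_, h1)), (_, (vec, h2))) => (id, (vec, pow2(h1, h2))) }
    .topByKey(k)(Ordering[Double].on(_._2))
    .map { case (id, knn) => (id, Vectors.dense(bary(knn, k))) }
    .map { case (id, v) => (id, (v, hashmin3(v, w, b, tabHash2))) }
}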

Related

How to append a string to a list or array in a for loop in Scala?

var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String]()
for (element <- ts_rdd) {
  var elem1 = element._1
  var kpssSignificance: Double = 0.05
  var dOpt: Option[Int] = (0 to 2).find { diff =>
    var testTs = differencesOfOrderD(element._2, diff)
    var (stat, criticalValues) = kpsstest(testTs, "c")
    stat < criticalValues(kpssSignificance)
  }
  var d = dOpt match {
    case Some(v) => v
    case None => 300000
  }
  if (d.equals(300000)) {
    println("Bad Key: " + elem1)
    RetainKeyList += elem1
  }
}
Hi all,
I created an empty mutable ListBuffer, var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String](), and I am trying to add the string elem1 to it in a for loop.
When I run the code it hangs with no error message, but if I remove the line RetainKeyList += elem1 I am able to print all of the elem1 strings properly.
What am I doing wrong here? Is there a cleaner way to collect all the elem1 strings generated in the for loop?
Long story short: your code runs in a distributed environment, so the local collection is never modified. The closure you pass to the for loop (which desugars to foreach on the RDD) is serialized and shipped to the executors; each executor appends to its own deserialized copy of RetainKeyList, and the driver's copy is untouched. Someone asks this question every week; please, if you do not understand the implications of distributed computing, do not use a distributed framework like Spark.
Also, you are abusing mutability everywhere, and mutability and a distributed environment don't play nicely together.
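To see the problem in isolation, consider this minimal sketch (rdd stands for any RDD[String]):

val buf = scala.collection.mutable.ListBuffer.empty[String]
rdd.foreach(s => buf += s) // the closure is serialized to the executors,
                           // which each mutate their own copy of buf
println(buf.size)          // on a cluster this prints 0 on the driver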
Anyway, here is a better way to solve your problem.
val retainKeysRdd = ts_rdd.map {
  case (elem1, elem2) =>
    val kpssSignificance = 0.05d
    val dOpt = (0 to 2).find { diff =>
      val testTs = differencesOfOrderD(elem2, diff)
      val (stat, criticalValues) = kpsstest(testTs, "c")
      stat < criticalValues(kpssSignificance)
    }
    (elem1 -> dOpt)
} collect {
  case (key, None) => key
}
This returns an RDD with the retained keys. If you are really sure you need this as a local collection and that the keys won't blow up your memory, you can do this:
val retainKeysList = retainKeysRdd.collect().toList

Is there any way to replace a nested for loop with higher-order methods in Scala?

I have a MutableList and want to take the sum of all of its rows and replace its rows with some other values based on some criteria. The code below is working fine for me, but I want to ask: is there any way to get rid of the nested for loops, as for loops slow down performance? I want to use Scala higher-order methods instead of the nested for loop. I tried the foldLeft() higher-order method to replace a single for loop but cannot make it replace the nested for loop.
def func(nVect: Int, nDim: Int): Unit = {
  var Vector = MutableList.fill(nVect, nDimn)(math.random)
  var V1Res = 0.0
  var V2Res = 0.0
  var V3Res = 0.0
  for (i <- 0 to nVect - 1) {
    for (j <- i + 1 to nVect - 1) {
      var resultant = Vector(i).zip(Vector(j)).map { case (x, y) => x + y }
      V1Res = choice(Vector(i))
      V2Res = choice(Vector(j))
      V3Res = choice(resultant)
      if (V3Res > V1Res) {
        Vector(i) = res
      }
      if (V3Res > V2Res) {
        Vector(j) = res
      }
    }
  }
}
There are no "for loops" in this code; the for statements are already converted to foreach calls by the compiler, so it is already using higher-order methods. These foreach calls could be written out explicitly, but it would make no difference to the performance.
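For example, the compiler rewrites

for (i <- 0 until n) f(i)

into

(0 until n).foreach(i => f(i))

so there is no imperative loop to eliminate.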
Making the code compile and then cleaning it up gives this:
def func(nVect: Int, nDim: Int): Unit = {
  val vector = Array.fill(nVect, nDim)(math.random)
  for {
    i <- 0 until nVect
    j <- i + 1 until nVect
  } {
    val res = vector(i).zip(vector(j)).map { case (x, y) => x + y }
    val v1Res = choice(vector(i))
    val v2Res = choice(vector(j))
    val v3Res = choice(res)
    if (v3Res > v1Res) {
      vector(i) = res
    }
    if (v3Res > v2Res) {
      vector(j) = res
    }
  }
}
Note that using a single for does not make any difference to the result, it just looks better!
At this point it gets difficult to make further improvements. The only parallelism possible is with the inner map call, but vectorising this is almost certainly a better option. If choice is expensive then the results could be cached, but this cache needs to be updated when vector is updated.
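A sketch of such a cache, keyed by row index (choice's signature is assumed from the question):

val cache = scala.collection.mutable.Map.empty[Int, Double]

def cachedChoice(i: Int): Double =
  cache.getOrElseUpdate(i, choice(vector(i)))

// every write to a row must invalidate its cached value:
vector(i) = res
cache.remove(i)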
If the choice could be done in a second pass after all the cross-sums have been calculated then it would be much more parallelisable, but clearly that would also change the results.

Substitute while loop with functional code

I am refactoring some Scala code and I am having problems with a while loop. The old code was:
for (s <- sentences) {
  // ...
  while (/* some condition */) {
    // ...
    function(trees, ...)
  }
}
I have translated that code into this one, using foldLeft to traverse sentences:
sentences./:(initialSeed) { (seed, s) =>
  // ...
  // Here I've replaced the while with another foldLeft
  trees./:(seed) { (v, n) =>
    // ...
    val updatedVariable = function(...., v)
  }
}
Now, it may be the case that I need to stop traversing trees (the inner foldLeft) before it is traversed entirely; for that I've found this question:
Abort early in a fold
But I also have the following problem: as I traverse trees, I need to accumulate values in the variable v. function takes v and returns an updated v, called updatedVariable here. The problem is that I have the feeling this is not a proper way of coding this functionality.
Could you recommend a functional/immutable way of doing this?
NOTE: I've simplified the code to show the actual problem; the complete code is this:
val trainVocabulary = sentences./:(Vocabulary()) { (vocab, s) =>
  var trees = s.tree
  var i = 0
  var noConstruction = false
  trees./:(vocab) { (v, n) =>
    if (i == trees.size - 1) {
      if (noConstruction) return v
      noConstruction = true
      i = 0
    } else {
      // Build vocabulary
      val updatedVocab = buildVocabulary(trees, v, i, Config.LeftCtx, Config.RightCtx)
      val y = estimateTrainAction(trees, i)
      val (newI, newTrees) = takeAction(trees, i, y)
      i = newI
      trees = newTrees
      // Execute the action and modify the trees
      if (y != Shift)
        noConstruction = false
      Vocabulary(v.positionVocab ++ updatedVocab.positionVocab,
        v.positionTag ++ updatedVocab.positionTag,
        v.chLVocab ++ updatedVocab.chLVocab,
        v.chLTag ++ updatedVocab.chLTag,
        v.chRVocab ++ updatedVocab.chRVocab,
        v.chRTag ++ updatedVocab.chRTag)
    }
    v
  }
}
And the old one:
for (s <- sentences) {
  var trees = s.tree
  var i = 0
  var noConstruction = false
  var exit = false
  while (trees.nonEmpty && !exit) {
    if (i == trees.size - 1) {
      if (noConstruction) exit = true
      noConstruction = true
      i = 0
    } else {
      // Build vocabulary
      buildVocabulary(trees, i, LeftCtx, RightCtx)
      val y = estimateTrainAction(trees, i)
      val (newI, newTrees) = takeAction(trees, i, y)
      i = newI
      trees = newTrees
      // Execute the action and modify the trees
      if (y != Shift)
        noConstruction = false
    }
  }
}
1st - You don't make this easy. Neither your simplified nor your complete example is complete enough to compile.
2nd - You include a link to some reasonable solutions to the break-out-early problem. Is there a reason why none of them look workable for your situation?
3rd - Does that complete example actually work? You're folding over a var ...
trees./:(vocab){
... and inside that operation you modify/update that var ...
trees = newTrees
According to my tests that's a meaningless statement. The original iteration is unchanged by updating the collection.
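A tiny demonstration of that:

var xs = List(1, 2, 3)
val sum = xs.foldLeft(0) { (acc, x) =>
  xs = List(100, 200) // reassigning the var does not affect the running fold
  acc + x
}
// sum == 6: the fold still traverses the original List(1, 2, 3)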
4th - I'm not convinced that fold is what you want here. fold iterates over a collection and reduces it to a single value, but your aim here doesn't appear to be finding that single value. The result of your /: is thrown away. There is no val result = trees./:(vocab){...
One solution you might look at is trees.forall { ... }: at the end of each iteration you just return true if the next iteration should proceed.
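A minimal illustration of that early-exit behaviour (not the full rewrite):

val xs = List(1, 2, 3, 4, 5)
var sum = 0
xs.forall { x =>
  sum += x
  sum < 6 // keep iterating while this holds
}
// stops after the 3: sum == 6, and 4 and 5 are never visited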

Is this correct use of .par in Scala?

The code below computes a distance metric between two Users, as specified by this case class:
case class User(name: String, features: Vector[Double])

val ul = for (a <- 1 to 100) yield User("a", Vector(1, 2, 4))
var count = 0

def distance(userA: User, userB: User) = {
  val subElements = (userA.features zip userB.features) map {
    m => (m._1 - m._2) * (m._1 - m._2)
  }
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)
  count += 1
  println("count is " + count)
  ((userA.name, userB.name), sqRoot)
}

val distances = ul.par.map(m => ul.map(m2 => distance(m, m2))).toList.flatten
val sortedDistances = distances.groupBy(_._1._1).map(m => (m._1, m._2.sortBy(s => s._2)))
println(sortedDistances.get("a").get.size)
This performs a Cartesian product of comparisons over 100 users: 10000 comparisons. I'm counting each comparison with var count.
Often the count value will be less than 10000, but the number of items iterated over is always 10000. Is the reason for this that, as par spawns multiple threads, some of them finish before the println statement is executed? However, all will finish within the par block, before distances.groupBy(_._1._1).map(m => (m._1, m._2.sortBy(s => s._2))) is evaluated.
In your example you have a single unsynchronized variable that you're mutating from multiple threads, as you said. This means that each thread, at any time, may have a stale copy of count: count += 1 is a read-modify-write (read the value, add one, write it back), so two threads can read the same value, and when they increment it they squash each other's writes, resulting in a count less than it should be.
You can solve this using the synchronized function,
...
val subElements = (userA.features zip userB.features) map {
  m => (m._1 - m._2) * (m._1 - m._2)
}
val summed = subElements.sum
val sqRoot = Math.sqrt(summed)
this.synchronized {
  count += 1
}
println("count is " + count)
((userA.name, userB.name), sqRoot)
...
Using this.synchronized takes the containing object as the lock object. For more information on Scala synchronization I suggest reading Twitter's Scala School.
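An alternative to locking, not covered in the answer, is an atomic counter, which makes the read-modify-write a single atomic operation and also gives you a consistent value to print:

import java.util.concurrent.atomic.AtomicInteger

val count = new AtomicInteger(0)
// inside distance, instead of count += 1:
val n = count.incrementAndGet() // atomic; no lock needed
println("count is " + n)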

workaround for prepending to a LinkedHashMap in Scala?

I have a LinkedHashMap which I've been using in a typical way: adding new key-value
pairs to the end, and accessing them in order of insertion. However, now I have a
special case where I need to add pairs to the "head" of the map. I think there's
some functionality inside the LinkedHashMap source for doing this, but it has private
accessibility.
I have a solution where I create a new map, add the pair, then add all the old mappings.
In Java syntax:
newMap.put(newKey, newValue)
newMap.putAll(this.map)
this.map = newMap
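In Scala, the same workaround looks roughly like this (assuming this.map is a mutable.LinkedHashMap):

val newMap = scala.collection.mutable.LinkedHashMap(newKey -> newValue)
newMap ++= this.map
this.map = newMap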
It works. But the problem here is that I then need to make my main data structure
(this.map) a var rather than a val.
Can anyone think of a nicer solution? Note that I definitely need the fast lookup
functionality provided by a Map collection. The performance of prepending is not
such a big deal.
More generally, as a Scala developer how hard would you fight to avoid a var
in a case like this, assuming there's no foreseeable need for concurrency?
Would you create your own version of LinkedHashMap? Looks like a hassle frankly.
This will work but is not especially nice either:
import scala.collection.mutable.LinkedHashMap

def prepend[K, V](map: LinkedHashMap[K, V], kv: (K, V)) = {
  val copy = map.toList // toList preserves iteration order (toMap could reorder the entries)
  map.clear()
  map += kv
  map ++= copy
}
val map = LinkedHashMap('b -> 2)
prepend(map, 'a -> 1)
map == LinkedHashMap('a -> 1, 'b -> 2)
Have you taken a look at the code of LinkedHashMap? The class has a field firstEntry, and just by taking a quick peek at updateLinkedEntries, it should be relatively easy to create a subclass of LinkedHashMap which only adds a new method prepend and updateLinkedEntriesPrepend resulting in the behavior you need, e.g. (not tested):
private def updateLinkedEntriesPrepend(e: Entry) {
  if (firstEntry == null) { firstEntry = e; lastEntry = e }
  else {
    val oldFirstEntry = firstEntry
    firstEntry = e
    firstEntry.later = oldFirstEntry
    oldFirstEntry.earlier = e
  }
}
Here is a sample implementation I threw together real quick (that is, not thoroughly tested!):
class MyLinkedHashMap[A, B] extends LinkedHashMap[A, B] {

  def prepend(key: A, value: B): Option[B] = {
    val e = findEntry(key)
    if (e == null) {
      val e = new Entry(key, value)
      addEntry(e)
      updateLinkedEntriesPrepend(e)
      None
    } else {
      // The key already exists, so we might as well call LinkedHashMap#put
      put(key, value)
    }
  }

  private def updateLinkedEntriesPrepend(e: Entry) {
    if (firstEntry == null) { firstEntry = e; lastEntry = e }
    else {
      val oldFirstEntry = firstEntry
      firstEntry = e
      firstEntry.later = oldFirstEntry
      oldFirstEntry.earlier = firstEntry
    }
  }
}
Tested like this:
object Main {
  def main(args: Array[String]) {
    val x = new MyLinkedHashMap[String, Int]()
    x.prepend("foo", 5)
    x.prepend("bar", 6)
    x.prepend("olol", 12)
    x.foreach(x => println("x:" + x._1 + " y: " + x._2))
  }
}
Which, on Scala 2.9.0 (yeah, need to update) results in
x:olol y: 12
x:bar y: 6
x:foo y: 5
A quick benchmark shows order of magnitude in performance difference between the extended built-in class and the "map rewrite" approach (I used the code from Debilski's answer in "ExternalMethod" and mine in "BuiltIn"):
benchmark       length         us   linear runtime
ExternalMethod      10    1218.44   =
ExternalMethod     100    1250.28   =
ExternalMethod    1000   19453.59   =
ExternalMethod   10000  349297.25   ==============================
BuiltIn             10       3.10   =
BuiltIn            100       2.48   =
BuiltIn           1000       2.38   =
BuiltIn          10000       3.28   =
The benchmark code:
def timeExternalMethod(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) prepend(map, (i, i))
    r -= 1
  }
}

def timeBuiltIn(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) map.prepend(i, i)
    r -= 1
  }
}
Using a Scala benchmarking template.