Optimizing Scala Spark code by removing a "for loop"

I want to optimize this code (Scala Spark) by removing the for loop. How do I do it?
var varianceExplained = Array[(Int, Double)]()
var varExplained = Array[Double]() // contains Double values assigned earlier
var sums = 0.00
for (x <- 0 to varExplained.length - 1) {
  sums = sums + varExplained(x)
  varianceExplained +:= (x, sums)
}

I'm not really sure how you would parallelize a computation in which each value depends on the preceding ones. The only thing I can add is how to remove the loop and turn it into a recursive function, in line with functional programming practice.
def go(acc: Array[(Int, Double)], iter: Int, sums: Double): Array[(Int, Double)] = {
  if (iter == varExplained.length) acc
  else {
    go((iter, sums + varExplained(iter)) +: acc, iter + 1, sums + varExplained(iter))
  }
}
go(Array[(Int, Double)](), 0, 0)

One possible solution is to translate the for loop into a map.
You can try the following:
val varianceExplained = varExplained.zipWithIndex.map { case (elem, i) => sums += elem; (i, sums) }
The function passed to map receives the element, not its index, so zipWithIndex is needed to get both; sums is still the var from your code. In this case you do not need a mutable varianceExplained array: you get the required Array[(Int, Double)] as the result of the map operation. I have used similar strategies at work to make my code more efficient.
Also, do try using vals instead of vars in your code.
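If you want to drop the remaining var as well, a running total is what scanLeft computes. Below is a minimal sketch, assuming varExplained is a plain local Array[Double] (the sample values are made up for illustration). It returns the pairs in ascending index order; the loop in the question prepends, so add .reverse if you need that order.

```scala
// Hypothetical sample data standing in for the real varExplained array.
val varExplained: Array[Double] = Array(0.5, 0.25, 0.25)

// scanLeft yields every intermediate sum, starting with the seed 0.0,
// so .tail drops the seed and leaves one running total per element.
val cumulative: Array[Double] = varExplained.scanLeft(0.0)(_ + _).tail

// Pair each running total with its index, index first.
val varianceExplained: Array[(Int, Double)] =
  cumulative.zipWithIndex.map { case (sum, i) => (i, sum) }
```

No mutable state is involved, so every name here can be a val.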

Related

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to write code such that if I provide 3 as input, it returns
Array((3,Ken), (3,Ben))
If input is 6, output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked, can anyone help me out with a solution for this problem?
Given the following, which you would have to define yourself:
// Provide your SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use the union and filter to reach your goal
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do on several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting if we could make sure that this function could work on any type of keys, not only Ints.
You can express this in the old RDD API as follows:
import org.apache.spark.rdd.RDD
def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
  rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method a little bit more in detail.
This method filters by key an RDD that contains generic pairs, where the type of the first item is K and the type of the second item is V (for key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, which takes a predicate (a function that goes from the element type, in this case the pair (K, V), to a Boolean) and makes sure that the resulting RDD only contains items that satisfy this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your arrays into RDDs, call union on them and then invoke the filterByKey function with the number you want to filter by (as shown in the example).
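If you want to check the predicate logic without a running SparkContext, here is a sketch of the same function over ordinary Scala collections. Seq.filter takes a predicate of the same shape as RDD.filter, so the body is identical. The name filterByKeyLocal is made up for this illustration, and the names from the question are written as strings.

```scala
// Local stand-in for filterByKey: same body, but on a Seq instead of an RDD.
def filterByKeyLocal[K, V](pairs: Seq[(K, V)], k: K): Seq[(K, V)] =
  pairs.filter(_._1 == k)

// The two arrays from the question.
val list1 = Seq((3, "Ken"), (5, "Jonny"), (4, "Adam"), (3, "Ben"), (6, "Rhonda"), (5, "Johny"))
val list2 = Seq((4, "Rudy"), (7, "Micheal"), (5, "Peter"), (5, "Shawn"), (5, "Aaron"), (7, "Gilbert"))

// ++ plays the role of union here.
val combined = list1 ++ list2

val threes = filterByKeyLocal(combined, 3) // the pairs for Ken and Ben
val sixes  = filterByKeyLocal(combined, 6) // the pair for Rhonda
```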

How to search efficiently in a nested collection in a functional way

I'd like to find the indices (coordinates) of the first element whose value is 4, in a nested Vector of Int, in a functional way.
val a = Vector(Vector(1,2,3), Vector(4,5), Vector(3,8,4))
a.map(_.zipWithIndex).zipWithIndex.collect {
  case (col, i) =>
    col.collectFirst {
      case (num, index) if num == 4 =>
        (i, index)
    }
}.collectFirst {
  case Some(x) => x
}
It returns:
Some((1, 0))
the coordinates of the first occurrence of 4 (outer index 1, inner index 0).
This solution is quite simple, but it has a performance penalty, because the nested col.collectFirst is performed for every element of the top Vector, when we are only interested in the first match.
One possible solution is to write a guard in the pattern match. But I don't know how to write a guard based on a slow condition and then return something that has already been calculated in the guard.
Can it be done better?
Recursive maybe?
If you insist on using Vectors, something like this will work (for a non-indexed seq, you'd need a different approach):
import scala.annotation.tailrec
@tailrec
def findit(
  what: Int,
  lists: IndexedSeq[IndexedSeq[Int]],
  i: Int = 0,
  j: Int = 0
): Option[(Int, Int)] =
  if (i >= lists.length) None
  else if (j >= lists(i).length) findit(what, lists, i + 1, 0)
  else if (lists(i)(j) == what) Some((i, j))
  else findit(what, lists, i, j + 1)
A simple thing you can do without changing the algorithm is to use Scala streams, so that you can exit as soon as you find the match. Streams are lazily evaluated, unlike strict sequences such as Vector. (In Scala 2.13 and later, Stream is deprecated in favor of LazyList.)
Just make a change similar to this:
a.map(_.zipWithIndex.toStream).zipWithIndex.toStream.collect{ ...
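Another way to get the same early exit is an Iterator, which is also lazy and works the same way across Scala versions. The following is only a sketch: the function name firstIndexOf is made up here, and it uses indexOf on each row instead of the zipWithIndex/collectFirst pair from the question.

```scala
val a = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(3, 8, 4))

// rows.iterator is lazy: map is only applied to a row when collectFirst asks
// for it, so rows after the first match are never scanned.
def firstIndexOf(target: Int, rows: IndexedSeq[IndexedSeq[Int]]): Option[(Int, Int)] =
  rows.iterator.zipWithIndex
    .map { case (row, i) => (i, row.indexOf(target)) } // indexOf gives -1 when absent
    .collectFirst { case (i, j) if j >= 0 => (i, j) }

val found   = firstIndexOf(4, a)
val missing = firstIndexOf(9, a)
```

Here the slow part (scanning a row) runs in map, and the guard in collectFirst only inspects the result that has already been computed.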
In terms of algorithmic changes, if you can somehow have your data sorted (even before you start to search) then you can use Binary search instead of looking at each element.
import scala.collection.Searching._
val dummy = 123
implicit val anOrdering = new Ordering[(Int, Int, Int)]{
override def compare(x: (Int, Int, Int), y: (Int, Int, Int)): Int = Integer.compare(x._1, y._1)
}
val seqOfIntsWithPosition = a.zipWithIndex.flatMap(vectorWithIndex => vectorWithIndex._1.zipWithIndex.map(intWithIndex => (intWithIndex._1, vectorWithIndex._2, intWithIndex._2)))
val sorted: IndexedSeq[(Int, Int, Int)] = seqOfIntsWithPosition.sortBy(_._1)
val element = sorted.search((4, dummy, dummy))
This code is not very pretty or readable, I just quickly wanted to show an example of how it could be done.
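To read the value that search returns, you pattern match on it: it is either a Found or an InsertionPoint. Here is a small sketch on a plain sorted Vector[Int] (the data and the describe helper are made up for illustration). Keep in mind that binary search in general does not promise the first of several equal elements, so this suits "is it there, and where" better than "where is the first one".

```scala
import scala.collection.Searching._

// search assumes the sequence is already sorted by the ordering in scope.
val sortedValues = Vector(1, 3, 4, 8)

def describe(target: Int): String =
  sortedValues.search(target) match {
    case Found(i)          => s"found at index $i"
    case InsertionPoint(i) => s"absent, would be inserted at index $i"
  }

val hit  = describe(4)
val miss = describe(5)
```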

How to get a Long typed product of a Seq[Int] in Scala?

Suppose val s: Seq[Int] and I would like to get the product of all its elements. The value is guaranteed to be greater than Int.MaxValue but less than Long.MaxValue, so I want the result to be a Long.
It seems I cannot use product/foldLeft/reduceLeft, because Long and Int are different types with no subtype relationship; therefore I would need to write a for loop myself. Is there a decent way to achieve this?
Note: I'm just asking whether this is possible with the built-in library; I'm still fine with the "ugly" code below.
def product(a: Seq[Int]): Long = {
var p = 1L
for (e <- a) p = p * e
p
}
There's no need to mess about with asInstanceOf or your own loop. foldLeft works just fine
val xs = Seq(1,1000000000,1000000)
xs.foldLeft(1L)((a,e) => a*e)
//> res0: Long = 1000000000000000
How about
def product(s: Seq[Int]) = s.map(_.asInstanceOf[Long]).fold(1L)( _ * _ )
In fact, having re-read your question and learnt about the existence of product itself, you could just do:
def product(s: Seq[Int]) = s.map(_.asInstanceOf[Long]).product
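To see why the conversion to Long has to happen before multiplying, here is a short sketch that compares the two on the sample data from the foldLeft answer. It uses toLong rather than asInstanceOf, which is just a stylistic choice in this sketch, and an iterator so that no intermediate Seq[Long] is built.

```scala
val xs = Seq(1, 1000000000, 1000000)

// product on a Seq[Int] multiplies as Int and overflows silently.
val asInt: Int = xs.product

// Widening each element first makes the multiplication happen in Long.
val asLong: Long = xs.iterator.map(_.toLong).product
```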

Scala most efficient operator to add to a list

I am building a Scala list by modifying it incrementally. What is the most efficient "add" operator in terms of CPU and resource consumption?
For example, from List(1,2,3) we want to create a list of tuples of consecutive numbers. Which gives the result: List((1,2), (2,3))
Method 1 - using :+ operator
def createConsecutiveNumPair1[T](inList: List[T]) : List[(T, T)] = {
var listResult = List[(T, T)]()
var prev = inList(0)
for (curr <- inList.tail)
{
listResult = listResult :+ (prev, curr)
prev = curr
}
listResult
}
Method 2 - using ::= operator
def createConsecutiveNumPair2[T](inList: List[T]) : List[(T, T)] = {
var listResult = List[(T, T)]()
var prev = inList(0)
for (curr <- inList.tail)
{
listResult ::= (prev, curr)
prev = curr
}
listResult
}
TEST
scala> val l1 = List(1,2,3)
l1: List[Int] = List(1, 2, 3)
scala> createConsecutiveNumPair1(l1)
res77: List[(Int, Int)] = List((1,2), (2,3))
scala> createConsecutiveNumPair2(l1)
res78: List[(Int, Int)] = List((2,3), (1,2))
QUESTION: which operator has lowest CPU, resource consumption? Would also appreciate if you can suggest a better scala way to rewrite the method above.
The problem with your first version is that it appends, which is O(n) on List. So the algorithm as a whole is O(n^2).
Prepending to a List is very efficient, because it runs in constant time, O(1); this is what your second version does. You could use that and do a reverse at the end to make the results of the two methods equal, which would make it run in roughly O(n).
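The prepend-then-reverse idea could look roughly like this. It is a sketch, not a drop-in replacement: the name consecutivePairs is made up, it uses foldLeft instead of the vars from the question, and unlike inList(0) it returns an empty list for empty input.

```scala
def consecutivePairs[T](in: List[T]): List[(T, T)] = in match {
  case Nil => Nil
  case head :: tail =>
    // Carry the previous element and the pairs built so far (newest first).
    val (_, reversedPairs) = tail.foldLeft((head, List.empty[(T, T)])) {
      case ((prev, acc), curr) => (curr, (prev, curr) :: acc) // O(1) prepend
    }
    reversedPairs.reverse // a single O(n) pass restores the original order
}

val pairs = consecutivePairs(List(1, 2, 3))
```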
However there is already a nice method in the library that does what you want. sliding is what you are looking for. The parameter for sliding defines the size of the tuples.
This would give you a List[List[Int]]:
List(1,2,3,4).sliding(2).toList //List(List(1,2), List(2,3), List(3,4))
If you insist on tuples, you can additionally use collect or map. Be aware that map will throw an exception when the list only has one element.
List(1,2,3,4).sliding(2).collect{
case List(a,b) => (a,b)
}.toList //List((1,2), (2,3), (3,4))
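If tuples are all you need, zipping the list with itself shifted by one is another option that produces pairs directly. A minimal sketch follows (the name pairUp is made up); drop(1) is used instead of tail so that an empty list does not throw.

```scala
// zip stops at the shorter side, so the last element has no partner
// and lists with fewer than two elements give an empty result.
def pairUp[T](xs: List[T]): List[(T, T)] = xs.zip(xs.drop(1))

val zipped = pairUp(List(1, 2, 3, 4))
```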
These methods often just call each other (though check the implementation to be sure). List is a singly-linked list, optimized for accessing the head rather than the tail, so adding elements to the front rather than the end is much more efficient. If you want to access / add elements at the end of the list, it's better to use Vector instead.
(As always, if you're asking the question at all you should have automated tooling to be able to tell you the answer. If you're not using a profiler that tells you which parts of your app are slow, it's not worth spending time on this kind of microoptimization - you're almost certainly optimizing the wrong part).

Cross product of two Strings

I am just starting out in Scala and for my first project, I am writing a Sudoku solver. I came across a great site explaining Sudoku and how to go about writing a solver: http://norvig.com/sudoku.html and from this site I am trying to create the corresponding Scala code.
The squares of a Sudoku grid are basically the cross product of the row name and the column name, this can be generated really easily in Python using a list comprehension:
# cross("AB", "12") = ["A1", "A2", "B1", "B2"]
def cross(A, B):
    "Cross product of elements in A and elements in B."
    return [a+b for a in A for b in B]
It took me awhile to think about how to do this elegantly in Scala, and this is what I came up with:
// cross("AB", "12") => List[String]("A1", "A2", "B1", "B2")
def cross(r: String, c: String) = {
for(i <- r; j <- c) yield i + "" + j
}.toList
I was just curious whether there is a better way of doing this in Scala. It would seem much cleaner if I could do yield i + j, but that results in an Int for some reason. Any comments or suggestions would be appreciated.
Yes, addition for Char is defined by adding their integer equivalents. I think your code is fine. You could also use string interpolation, and spare the toList (you will get an immutable indexed sequence instead which is just fine):
def cross(r: String, c: String) = for(i <- r; j <- c) yield s"$i$j"
EDIT
An IndexedSeq is at least as powerful as List. Just check how you use the result afterwards. Does it require a List? E.g. do you want to use head and tail and pattern match with ::? If not, there is no reason to enforce List. If you use map and flatMap on the input arguments instead of the syntactic sugar with for, you can use the collection.breakOut argument to directly map to a List:
def cross(r: String, c: String): List[String] =
r.flatMap(i => c.map(j => s"$i$j"))(collection.breakOut)
Not as pretty, but faster than an extra toList.
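Note that collection.breakOut was removed in Scala 2.13, so the snippet above only compiles on 2.12 and earlier. One version-independent sketch is to make the first generator a List, so that the for comprehension produces a List[String] on its own. Whether the extra r.toList matters for performance is something to measure rather than assume.

```scala
// The first generator decides the result collection, so r.toList
// makes the whole comprehension a List[String].
def cross(r: String, c: String): List[String] =
  for (i <- r.toList; j <- c) yield s"$i$j"

val squares = cross("AB", "12")
```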