Apache Spark flatMap time complexity - scala

I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
val cand_br = sc.broadcast(cand)
transactions.flatMap(trans => freq(trans, cand_br.value))
            .reduceByKey(_ + _)
}

def freq(trans: Set[String], cand: Array[Set[String]]): Array[(Set[String], Int)] = {
  var res = ArrayBuffer[(Set[String], Int)]()
  for (c <- cand) {
    if (c.subsetOf(trans)) {
      res += ((c, 1))
    }
  }
  return res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V)], with K every element in cand and V the number of occurrences of that element in the transaction list.
Watching performance in the UI, the flatMap stage alone takes about 3 min to finish, whereas the rest takes < 1 ms.
transactions.count() ~= 88000 and cand.length ~= 24000, to give an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm fairly sure the problem I'm facing is algorithmic.
Is there a better way to solve this subproblem?
PS: I'm fairly new to the Scala / Spark framework, so there might be some strange constructions in this code

Probably, the right question to ask in this case is: "what is the time complexity of this algorithm?" It is largely unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given two collections of Sets of sizes m and n, this algorithm counts how many elements of one collection are subsets of elements of the other, so it looks like O(m x n). Looking one level deeper, we also see that subsetOf is linear in the number of elements of the subset (x subsetOf y is essentially x forall y.contains), so the complexity is really O(m x n x s), where s is the cardinality of the subsets being checked.
In other words, this flatMap operation has a lot of work to do: with ~88,000 transactions and ~24,000 candidates, that is on the order of 2 x 10^9 subset checks per pass.
Going Parallel
Now, going back to Spark, we can also observe that this algo is embarrassingly parallel and we can take advantage of Spark's capabilities to our advantage.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algo on val cand = transactions.filter(_.size < 4).collect. The data size is close to that of the question:
transactions.count = 88162
cand.size = 15451
Some comparative runs in local mode:
Vanilla: 1.2 minutes
Increasing the transactions partitions to the number of cores (8): 33 secs (sketched below)
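The partitioned run simply repartitions the input before the flatMap; a minimal sketch, assuming the 8 local cores mentioned above:
// Repartition the transactions RDD so each of the 8 cores gets a slice of the work,
// then run the same flatMap/reduceByKey as in the question.
transactions
  .repartition(8)
  .flatMap(trans => freq(trans, cand_br.value))
  .reduceByKey(_ + _)
  .collect()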
I also tried an alternative implementation, using cartesian instead of flatMap:
transactions
  .cartesian(candRDD)
  .map { case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0) }
  .reduceByKey(_ + _)
  .collect
But that resulted in much longer runs, as seen in the top 2 lines of the Spark UI (cartesian, and cartesian with a higher number of partitions): 2.5 min.
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, as it involves serializing closures and unpacking collections, but it is negligible compared with the function being executed.
Let's see if we can do a better job: I implemented the same algo using plain Scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
where the reduceByKey operation is a naive local implementation taken from [2].
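Such a naive local reduceByKey might look roughly like this (a sketch, not the exact code from [2]):
// Sequential reduceByKey over an in-memory collection:
// group the pairs by key and sum the values of each group.
def reduceByKey[K](pairs: Seq[(K, Int)]): Map[K, Int] =
  pairs.groupBy(_._1).mapValues(_.map(_._2).sum).toMap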
Execution time: 3.67 minutes.
Spark gives you parallelism out of the box. This implementation is totally sequential and therefore takes longer to complete.
Last sanity check: a trivial flatMap operation:
transactions
  .flatMap(trans => Seq((trans, 1)))
  .reduceByKey(_ + _)
  .collect
Execution time: 0.88 secs
Conclusions:
Spark buys you parallelism and clustering, and this algorithm can take advantage of both. Use more cores and partition the input data accordingly.
There's nothing wrong with flatMap. The time complexity prize goes to the function inside it.

Related

How to parallelize KMeans?

I have a DataFrame with millions of rows. I have columns A, B and C. I have to cluster C over A and B.
For now, what I'm doing is a for loop over every (A, B) set. Obviously it's slow, but I can't really parallelize it. I tried using Window and pandas_udf, but it doesn't work with scalar functions.
Edit: some code to help understand my problem
# suppose we have a variable my_df which is a dataframe with millions of rows
# its schema is type, code, quantity
# I have to cluster quantity by types
# hundreds of types in real case
types = [1, 2, 3]
for index, t in enumerate(types):
    filtered_my_df = my_df.filter(my_df.type == t)
    # then I apply kmeans on filtered_my_df...
When I have hundreds or even thousands of types, this code obviously does not scale. But I can't figure out how to do it in a scalable way.
In the case of k-means, the algorithm randomly selects k objects from the whole dataset as the initial cluster centers. Each remaining object is then assigned to the cluster it is most similar to, based on the distance between the object and the cluster center. New means are then calculated for each cluster. This process iterates until the criterion function converges, which is why it becomes very time-costly.
You can use parallel k-means based on MapReduce.
The distance computation between one object and the centers is independent of the distance computations between the other objects and the centers, so distance computations for different objects can be executed in parallel.
The MLlib implementation of k-means corresponds to the algorithm called k-means||, which is a parallel version of the original one. The method header is defined as follows (let's say parsedData is your Spark RDD):
KMeans.train(rdd, k, maxIterations, initializationMode, runs)
from pyspark.mllib.clustering import KMeans

clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode='random')

# Evaluation
from math import sqrt

def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print('Within Set Sum of Squared Error = ' + str(WSSSE))

A proper monadic flatMap operation for spark datasets?

I'm looking for a function with the following signature:
bind[A, B](f: A => Dataset[B], ds: Dataset[A]): Dataset[B]
Is there such a thing in the Spark library? (flatMap unfortunately requires a mapping from A to TraversableOnce[B], which means I have to concretize my dataset, unless I'm missing something.)
If not, how would one go about implementing such a function?
RDD is not a monad.
An RDD object makes sense only on the driver, while map/flatMap functions are executed on the workers, so you can't emit RDDs from inside map/flatMap operations.
Dataset is a facade over RDD, so I guess it's impossible as well.
Suppose that you have N nodes, with M memory on each node.
Furthermore, assume that f(a) is roughly of same size for all a <- ds.
You say that you want f(a) to be a distributed Dataset. The only reason why you would insist on f returning a Dataset is that the returned value doesn't fit into memory on a single node, thus
|f(a)| >= M.
At the same time you assume that bind(f, ds) will fit into memory, thus
N * M >= ds.size * |f(a)| >= ds.size * M
If we cancel M, then it says:
N >= ds.size
that is, the number of elements in ds must be relatively small (smaller than the number of compute nodes). This in turn means that you can simply collect it on the master node, map it to dataset, and then take the union. Something along these lines (untested):
def bind[A, B](f: A => Dataset[B], ds: Dataset[A]): Dataset[B] = {
  ds.collect.map(f).reduce(_ union _)
}
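For instance, under these assumptions it could be used along these lines (a hypothetical sketch; the session, ordersFor and the path are made up for illustration):
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("bind-example").getOrCreate()
import spark.implicits._

// A small driver-side dataset of ids (small enough to collect).
val ids: Dataset[Int] = Seq(1, 2, 3).toDS()

// Each f(a) may itself be a large distributed dataset.
def ordersFor(id: Int): Dataset[String] =
  spark.read.textFile(s"/data/orders/id=$id")

val allOrders: Dataset[String] = bind(ordersFor, ids)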
Trying to make it into a general monad does not make much sense, because if you read Dataset as "huge distributed dataset that barely fits into a huge cluster with multiple nodes", then
ds is already huge
each f(a) is huge
ds.flatMap(f) is huge to the power of two and doesn't fit into memory.
Thus, a general bind can be:
impossible, because the result doesn't fit into memory;
replaced by flatMap(f: A => TraversableOnce[B]), because each f(a) is small;
replaced by ds.collect, because ds is small.
And you are the one who has to decide "what is small" in each particular case. That's probably the reason why no generic flatMap(f: A => Dataset[B]) is provided: there is a nontrivial design decision that has to be made on every invocation of such a flatMap.

How to turn a known structured RDD to Vector

Assuming I have an RDD containing (Int, Int) tuples.
I wish to turn it into a Vector where the first Int of the tuple is the index and the second is the value.
Any idea how I can do that?
I updated my question and added my solution to clarify:
My RDD is already reduced by key, and the number of keys is known.
I want a vector in order to update a single accumulator instead of multiple accumulators.
Therefore my final solution was:
reducedStream.foreachRDD(rdd => rdd.collect({ case (x: Int, y: Int) => {
  val v = Array(0, 0, 0, 0)
  v(x) = y
  accumulator += new Vector(v)
}}))
Using Vector from the accumulator example in the documentation.
// Pre-size the Vector (the number of keys is known up front; `numKeys` stands for that count),
// then set each collected (key, value) pair at its index.
rdd.collectAsMap.foldLeft(Vector.fill(numKeys)(0)) { case (acc, (k, v)) => acc.updated(k, v) }
Turn the RDD into a Map, then fold over it, building a pre-sized Vector as we go.
You could just use collect(), but if there are many repetitions of tuples with the same key, that might not fit in memory.
One key thing: do you really need Vector? Map could be much more suitable.
If you really need a local Vector, you first need to use .collect() and then do the local transformation into a Vector. Of course you need enough memory for this. But the real problem here is where to find a Vector that can be built efficiently from pairs of (index, value). As far as I know, Spark MLlib has the class org.apache.spark.mllib.linalg.Vectors, which can create a Vector from an array of indices and values, and even from tuples. Under the hood it uses breeze.linalg, so that is probably the best starting point for you.
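A minimal sketch of that approach, assuming rdd is the already-reduced RDD[(Int, Int)] from the question and size is the known number of keys:
import org.apache.spark.mllib.linalg.{Vector => MLVector, Vectors}

// Collect the (index, value) pairs and build a sparse MLlib vector from them.
// `rdd` and `size` are assumptions taken from the question (reduced RDD, known key count).
val pairs: Seq[(Int, Double)] = rdd.mapValues(_.toDouble).collect().toSeq
val vec: MLVector = Vectors.sparse(size, pairs)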
If you just need ordering, you can use .sortByKey(), as you already have an RDD[(K, V)]. This way you get an ordered stream, which does not strictly follow your intention but may suit you even better. You can then collapse elements with the same key with .reduceByKey(), producing only the resulting elements.
Finally, if you really need a large vector, do .sortByKey() and then produce the actual vector with a .flatMap() that maintains a counter, drops extra elements for the same index, and inserts the needed number of 'default' elements for missing indices (a driver-side variant is sketched below).
Hope this is clear enough.
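A driver-side sketch of that last idea (hedged: it collects after reduceByKey and fills missing indices with a default of 0; size is again the assumed known key count):
// Reduce, sort and collect the pairs, then build a dense Vector,
// substituting 0 for any index that has no entry.
val sortedPairs: Array[(Int, Int)] = rdd.reduceByKey(_ + _).sortByKey().collect()
val lookup: Map[Int, Int] = sortedPairs.toMap
val dense: Vector[Int] = Vector.tabulate(size)(i => lookup.getOrElse(i, 0))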

Recursive GO vs Scala

The following Scala code completes in 1.5 minutes, while the equivalent Go code finishes in 2.5 minutes.
Up to fib(40) both take 2 sec. The gap appears at fib(50).
I was under the impression that Go, being native, should be faster than Scala.
Scala
def fib(n: Int): Long = {
  n match {
    case 0 => 0
    case 1 => 1
    case _ => fib(n - 1) + fib(n - 2)
  }
}
GO
func fib(n int) (ret int) {
    if n > 1 {
        return fib(n-1) + fib(n-2)
    }
    return n
}
Scala optimization?
Golang limitation?
As "My other car is a cadr" said the question is "how come Scala is faster than GO in this particular microbenchmark?"
Forget the Fibonacci lets say I do have a function that require recursion.
Is Scala superior in recursion situations?
Its probably an internal compiler implementation or even Scala specific optimization.
Please answer just if you know.
Go in loop run 15000000000 in 12 sec
func fib(n int) (two int) {
    one := 0
    two = 1
    for i := 1; i != n; i++ {
        one, two = two, (one + two)
    }
    return
}
For Go, use iteration, not recursion. Recursion can be replaced by iteration with an explicit stack, which avoids the overhead of function calls and call stack management. For example, using iteration and increasing n from 50 to 1000 takes almost no time:
package main

import "fmt"

func fib(n int) (f int64) {
    if n < 0 {
        n = 0
    }
    a, b := int64(0), int64(1)
    for i := 0; i < n; i++ {
        f = a
        a, b = b, a+b
    }
    return
}

func main() {
    n := 1000
    fmt.Println(n, fib(n))
}
Output:
$ time ./fib
1000 8261794739546030242
real 0m0.001s
user 0m0.000s
sys 0m0.000s
Use appropriate algorithms. Avoid exponential time complexity. Don't use recursion for Fibonacci numbers when performance is important.
Reference:
Recursive Algorithms in Computer Science Courses: Fibonacci Numbers
and Binomial Coefficients
We observe that the computational inefficiency of branched recursive
functions was not appropriately covered in almost all textbooks for
computer science courses in the first three years of the curriculum.
Fibonacci numbers and binomial coefficients were frequently used as
examples of branched recursive functions. However, their exponential
time complexity was rarely claimed and never completely proved in the
textbooks. Alternative linear time iterative solutions were rarely
mentioned. We give very simple proofs that these recursive functions
have exponential time complexity.
Recursion is an efficient technique for definitions and algorithms
that make only one recursive call, but can be extremely inefficient if
it makes two or more recursive calls. Thus the recursive approach is
frequently more useful as a conceptual tool rather than as an
efficient computational tool. The proofs presented in this paper were
successfully taught (over a five-year period) to first year students
at the University of Ottawa. It is suggested that recursion as a
problem solving and defining tool be covered in the second part of the
first computer science course. However, recursive programming should
be postponed for the end of the course (or perhaps better at the
beginning of the second computer science course), after iterative
programs are well mastered and stack operation well understood.
The Scala solution will consume stack, since it's not tail recursive (the addition happens after the recursive call), but it shouldn't be creating any garbage at all.
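For what it's worth, a tail-recursive variant (a sketch, equivalent to the iterative Go version above) sidesteps the stack growth entirely:
import scala.annotation.tailrec

// Tail-recursive Fibonacci: the recursive call is in tail position,
// so scalac compiles it down to a loop (@tailrec makes the compiler verify this).
def fibIter(n: Int): Long = {
  @tailrec
  def loop(i: Int, a: Long, b: Long): Long =
    if (i == 0) a else loop(i - 1, b, a + b)
  loop(n, 0L, 1L)
}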
Most likely, whichever HotSpot compiler you're using (probably the server compiler) is just a better compiler for this code pattern than the Go compiler.
If you're really curious, you can download a debug build of the JVM, and have it print out the assembly code.

Strange Behaviour Using Scala Parallel Collections and setParallelism

I recently found out about parallel collections in Scala 2.9 and was excited to see that the degree of parallelism can be set using collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism.
However, when I tried an experiment of adding two vectors of size one million each, I found that:
Using a parallel collection with parallelism set to 64 is as fast as sequential (shown in the results).
Increasing setParallelism seems to increase performance in a non-linear way. I would have at least expected monotonic behaviour (that is, performance should not degrade if I increase parallelism).
Can someone explain why this is happening?
import scala.util.Random

object examplePar extends App {
  val Rnd = new Random()
  val numSims = 1
  val x = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val y = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val parInt = List(1, 2, 4, 8, 16, 32, 64, 128, 256)
  var avg: Double = 0.0
  var currTime: Long = 0

  for (j <- parInt) {
    collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(j)
    avg = 0.0
    for (k <- 1 to numSims) {
      currTime = System.currentTimeMillis()
      (x zip y).par.map(x => x._1 + x._2)
      avg += (System.currentTimeMillis() - currTime)
    }
    println("Average Time to execute with Parallelism set to " + j.toString + " = " + (avg / numSims).toString + "ms")
  }

  currTime = System.currentTimeMillis()
  (x zip y).map(x => x._1 + x._2)
  println("Time to execute using Sequential = " + (System.currentTimeMillis() - currTime).toString + "ms")
}
The results from running the example using Scala 2.9.1 and a four-core processor are:
Average Time to execute with Parallelism set to 1 = 1047.0ms
Average Time to execute with Parallelism set to 2 = 594.0ms
Average Time to execute with Parallelism set to 4 = 672.0ms
Average Time to execute with Parallelism set to 8 = 343.0ms
Average Time to execute with Parallelism set to 16 = 375.0ms
Average Time to execute with Parallelism set to 32 = 391.0ms
Average Time to execute with Parallelism set to 64 = 406.0ms
Average Time to execute with Parallelism set to 128 = 813.0ms
Average Time to execute with Parallelism set to 256 = 469.0ms
Time to execute using Sequential = 406ms
Though these results are for one run, they are consistent when averaged over more runs.
Parallelism does not come free. It requires extra cycles to split the problem into smaller chunks, organize everything, and synchronize the result.
You can picture this as calling all your friends to help you move, waiting for them to get there, helping you load the truck, then taking them out to lunch, and finally, getting on with your task.
In your test case you are adding two doubles, which is a trivial exercise and takes so little time that the overhead from parallelization is greater than simply doing the task in one thread.
Again, the analogy would be to call all your friends to help you move 3 suitcases. It would take you half a day to get rid of them, while you could finish by yourself in minutes.
To get any benefit from parallelization your task has to be complicated enough to warrant the extra overhead. Try doing some expensive calculations, for example a formula involving a mix of 5-10 trigonometric and logarithmic functions.
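For instance, something along these lines (an arbitrary, illustrative workload; any sufficiently expensive per-element function would do, assuming a Scala version where parallel collections ship with the standard library):
import scala.util.Random

// An artificially expensive per-element function mixing trig and log calls.
def expensive(v: Double): Double =
  math.log(math.abs(math.sin(v) * math.cos(v)) + 1.0) +
    math.atan(v) * math.exp(math.cos(v)) +
    math.sqrt(math.abs(math.tan(v)) + 1.0)

val data = Array.fill(1000000)(Random.nextDouble())

// With this much work per element, .par has something worth parallelizing.
val parallelResult = data.par.map(expensive)
val sequentialResult = data.map(expensive)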
I would suggest studying and using the scala.testing.Benchmark trait to benchmark snippets of code. You have to take JIT, GC and other things into account when benchmarking on the JVM - see this paper. In short, you have to do each of the runs in a separate JVM after doing a couple of warm-up runs.
Also, note that the (x zip y) part does not occur in parallel, because x and y are not yet parallel - the zip is done sequentially. Next, I would suggest turning x and y into arrays (toArray) and then calling par - this will ensure that the benchmark uses parallel arrays, rather than parallel vectors (which are slower for transformer methods such as zip and map).
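A sketch of that last suggestion (an assumed illustration, not code from the original answer): convert to arrays first and parallelize before zipping, so both the zip and the map run on parallel arrays:
// Parallel arrays: zip and map are both transformer methods,
// so running them on ParArray keeps the whole pipeline parallel.
val xa = x.toArray
val ya = y.toArray
val sums = (xa.par zip ya.par).map { case (a, b) => a + b }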