A proper monadic flatMap operation for spark datasets?

A proper monadic flatMap operation for spark datasets? - scala

I'm looking for a function with the following signature:
bind[A, B](f: A => Dataset[B], ds: Dataset[A]): Dataset[A]
Is there such a thing in the spark library? (flatMap unfortunately requires a mapping from A to TraversableOnce[B], which means I have to concreteize my dataset, unless I'm missing something).
If not, how would one go about implementing such a function?

RDD is not a monad.
RDD object makes sense only on a driver and map/flatmap functions are performed on workers. So you can't emit RDDs in map/flatmap operations.
Dataset is a facade to RDD so I guess it's impossible as well.

Suppose that you have N nodes, with M memory on each node.
Furthermore, assume that f(a) is roughly of same size for all a <- ds.
You say that you want f(a) to be a distributed Dataset. The only reason why you would insist on f returning a Dataset is that the returned value doesn't fit into memory on a single node, thus
|f(a)| >= M .
At the same time you assume that bind(f, ds) will fit into memory, thus
N * M >= ds.size * |f(a)| >= ds.size * M
If we cancel M, then it says:
N >= ds.size
that is, the number of elements in ds must be relatively small (smaller than the number of compute nodes). This in turn means that you can simply collect it on the master node, map it to dataset, and then take the union. Something along these lines (untested):
def bind[A, B](f: A => Dataset[B], ds: Dataset[A]): Dataset[A] = {
ds.collect.map(f).reduce(_ union _)
}
Trying to make it into a general monad does not make much sense, because if you read Dataset as "huge distributed dataset that barely fits into a huge cluster with multiple nodes", then
ds is already huge
each f(a) is huge
ds.flatMap(f) is huge to the power of two, doesn't fit into memory
Thus, a general bind can be:
impossible, because result doesn't fit into memory.
replaced by fold(f: A => TraversableOnce[B]), because f(a) is small
replaced by ds.collect, because ds is small
And you are the one who has to make the decision "what is small" in each particular case. That's probably the reason why no generic flatMap(f: A => Dataset[B]) is provided: there is nontrivial design decision that has to be made in every invocation of such flatMap.

Related

How to parallelize KMeans?

I have a DataFrame with millions of rows. I have columns A, B and C. I have to cluster C over A and B.
For now what I'm doing is a for over every A and B set. Obviously it's slow, but I can't really parallelize it. I tried using Window and pandas_udf, but it doens't work with scalar functions.
Edit: some code to help understand my problem
# suppose we have a variable my_df which is a dataframe with millions of rows
# its schema is type, code, quantity
# I have to cluster quantity by types
# hundreds of types in real case
types = [1, 2, 3]
for index, t in enumerate(types):
filtered_my_df = my_df.filter(my_df.type == t)
# then I apply kmeans on filtered_my_df...
When I have hundres or even thousands of types, this code obviously is not scalable. But I can't figure out how to do this in scalable way.

In case of Kmeans, it randomly selects k objects from the whole objects which represent initial cluster centers. Then, each remaining object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster center. Then new mean for each cluster is calculated. This process iterates until the criterion function converges. This is why, it becomes very time costly.
You can use Parallel K-Means Based on MapReduce.
The distance computations between one object with the centers is irrelevant to the distance computations between other objects with the corresponding centers. distance computations between different objects with centers can be parallel executed.
MLlib implementation of k-means corresponds to the algorithm called K-Means\5 which is a parallel version of the original one. The method header is defined as follows: Let's say parseddata is you spark RDD.
KMeans.train(k, maxIterations, initializationMode, runs)
from pyspark.mllib.clustering import KMeans
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode='random')
# Evaluation
from math import sqrt
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = (parsedData.map(lambda point:error(point)).reduce(lambda x, y: x+y))
print('Within Set Sum of Squared Error = ' + str(WSSSE))

How does aggregate generalise fold and fold generalise reduce?

As far as I understand aggregate is a generalisation of fold which in turn is a generalisation of reduce.
Similarily combineByKey is a generalisation of aggregateByKey which in turn is a generalisation of foldByKey which in turn is a generalisation of reduceByKey.
However I have trouble finding simple examples for each of those seven methods which can in turn only be expressed by them and not their less general versions. For example I found http://blog.madhukaraphatak.com/spark-rdd-fold/ giving an example for fold, but I have been able to use reduce in the same situation as well.
What I found out so far:
I read that the more generalised methods can be more efficient, but that would be a non-functional requirement and I would like to get examples which can not be implemented with the more specific method.
I also read that e.g. the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally: https://stackoverflow.com/a/25158790/4533188 (However, I still don't know any good simple example.) whereas in https://stackoverflow.com/a/26635928/4533188 I read that fold needs both properties to hold...
We could think of the zero value as a feature (e.g. for fold over reduce) as in "add all elements and add 3" and using 3 as the zero value, but that would be misleading, because 3 would be added for each partition, not just once. Also this is simply not the purpose of fold as far as I understood - it wasn't meant as a feature, but as a necessity to implement it to be able to take non-commutative functions.
What would simple examples for those seven methods be?

Let's work through what is actually needed logically.
First, note that if your collection is unordered, any set of (binary) operations on it need to be both commutative and associative, or you'll get different answers depending on which (arbitrary) order you choose each time. Since reduce, fold, and aggregate all use binary operations, if you use these things on a collection that is unordered (or is viewed as unordered), everything must be commutative and associative.
reduce is an implementation of the idea that if you can take two things and turn them into one thing, you can collapse an arbitrarily long collection into a single element. Associativity is exactly the property that it doesn't matter how you pair things up as long as you eventually pair them all and keep the left-to-right order unchanged, so that's exactly what you need.
a b c d a b c d a b c d
a # b c d a # b c d a b # c d
(a#b) c # d (a#b) # c d a (b#c) d
(a#b) # (c#d) ((a#b)#c) # d a # ((b#c)#d)
All of the above are the same as long as the operation (here called #) is associative. There is no reason to swap around which things go on the left and which go on the right, so the operation does not need to be commutative (addition is: a+b == b+a; concat is not: ab != ba).
reduce is mathematically simple and requires only an associative operation
Reduce is limited, though, in that it doesn't work on empty collections, and in that you can't change the type. If you're working sequentially, you can a function that takes a new type and the old type, and produces something with the new type. This is a sequential fold (left-fold if the new type goes on the left, right-fold if it goes on the right). There is no choice about the order of operations here, so commutativity and associativity and everything are irrelevant. There's exactly one way to work through your list sequentially. (If you want your left-fold and right-fold to always be the same, then the operation must be associative and commutative, but since left- and right-folds don't generally get accidentally swapped, this isn't very important to ensure.)
The problem comes when you want to work in parallel. You can't sequentially go through your collection; that's not parallel by definition! So you have to insert the new type at multiple places! Let's call our fold operation #, and we'll say that the new type goes on the left. Furthermore, we'll say that we always start with the same element, Z. Now we could do any of the following (and more):
a b c d a b c d a b c d
Z#a b c d Z#a b Z#c d Z#a Z#b Z#c Z#d
(Z#a) # b c d (Z#a) # b (Z#c) # d
((Z#a)#b) # c d
(((Z#a)#b)#c) # d
Now we have a collection of one or more things of the new type. (If the original collection was empty, we just take Z.) We know what to do with that! Reduce! So we make a reduce operation for our new type (let's call it $, and remember it has to be associative), and then we have aggregate:
a b c d a b c d a b c d
Z#a b c d Z#a b Z#c d Z#a Z#b Z#c Z#d
(Z#a) # b c d (Z#a) # b (Z#c) # d Z#a $ Z#b Z#c $ Z#d
((Z#a)#b) # c d ((Z#a)#b) $ ((Z#c)#d) ((Z#a)$(Z#b)) $ ((Z#c)$(Z#d))
(((Z#a)#b)#c) # d
Now, these things all look really different. How can we make sure that they end up to be the same? There is no single concept that describes this, but the Z# operation has to be zero-like and $ and # have to be homomorphic, in that we need (Z#a)#b == (Z#a)$(Z#b). That's the actual relationship that you need (and it is technically very similar to a semigroup homomorphism). There are all sorts of ways to pick badly even if everything is associative and commutative. For example, if Z is the double value 0.0 and # is actually +, then Z is zero-like and # is associative and commutative. But if $ is actually *, which is also associative and commutative, everything goes wrong:
(0.0+2) * (0.0+3) == 2.0 * 3.0 == 6.0
((0.0+2) + 3) == 2.0 + 3 == 5.0
One example of a non-trival aggregate is building a collection, where # is the "append an element" operator and $ is the "concat two collections" operation.
aggregate is tricky and requires an associative reduce operation, plus a zero-like value and a fold-like operation that is homomorphic to the reduce
The bottom line is that aggregate is not simply a generalization of reduce.
But there is a simplification (less general form) if you're not actually changing the type. If Z is actually z and is an actual zero, we can just stick it in wherever we want and use reduce. Again, we don't need commutativity conceptually; we just stick in one or more z's and reduce, and our # and $ operations can be the same thing, namely the original # we used on the reduce
a b c d () <- empty
z#a z#b z
z#a (z#b)#c
z#a ((z#b)#c)#d
(z#a)#((z#b)#c)#d
If we just delete the z's from here, it works perfectly well, and in fact is equivalent to if (empty) z else reduce. But there's another way it could work too. If the operation # is also commutative, and z is not actually a zero but just occupies a fixed point of # (meaning z#z == z but z#a is not necessarily just a), then you can run the same thing, and since commutivity lets you switch the order around, you conceptually can reorder all the z's together at the beginning, and then merge them all together.
And this is a parallel fold, which is really a rather different beast than a sequential fold.
(Note that neither fold nor aggregate are strictly generalizations of reduce even for unordered collections where operations have to be associative and commutative, as some operations do not have a sensible zero! For instance, reducing strings by shortest length has as its "zero" the longest possible string, which conceptually doesn't exist, and practically is an absurd waste of memory.)
fold requires an associative reduce operation plus either a zero value or a reduce operation that's commutative plus a fixed-point value
Now, when would you ever use a parallel fold that wasn't just a reduceOrElse(zero)? Probably never, actually, though they can exist. For example, if you have a ring, you often have fixed points of the type we need. For instance, 10 % 45 == (10*10) % 45, and * is both associative and commutative in integers mod 45. Thus, if our collection is numbers mod 45, we can fold with a "zero" of 10 and an operation of *, and parallelize however we please while still getting the same result. Pretty weird.
However, note that you can just plug the zero and operation of fold into aggregate and get exactly the same result, so aggregate is a proper generalization of fold.
So, bottom line:
Reduce requires only an associative merge operation, but doesn't change the type, and doesn't work on empty collecitons.
Parallel fold tries to extend reduce but requires a true zero, or a fixed point and the merge operation must be commutative.
Aggregate changes the type by (conceptually) running sequential folds followed by a (parallel) reduce, but there are complex relationships between the reduce operation and the fold operation--basically they have to be doing "the same thing".
An unordered collection (e.g. a set) always requires an associative and commutative operation for any of the above.
With regard to the byKey stuff: it's just the same as this, except it only applies it to the collection of values associated with a (potentially repeated) key.
If Spark actually requires commutativity where the above analysis does not suggest it's needed, one could reasonably consider that a bug (or at least an unnecessary limitation of the implementation, given that operations like map and filter preserve order on ordered RDDs).

the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally.
It is not correct. fold on RDDs requires the function to be commutative as well. It is not the same operation as fold on Iterable what is pretty well described in the official documentation:
This behaves somewhat differently from fold operations implemented for non-distributed
collections in functional languages like Scala.
This fold operation may be applied to
partitions individually, and then fold those results into the final result, rather than
apply the fold to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold applied to a
non-distributed collection.
As you can see order of merging partial values is not part of the contract hence function which is used for fold has to be commutative.
I read that the more generalised methods can be more efficient
Technically speaking there should be no significant difference. For fold vs reduce you can check my answers to reduce() vs. fold() in Apache Spark and Why is the fold action necessary in Spark?
Regarding *byKey methods all are implemented using the same basic construct which is combineByKeyWithClassTag and can be reduced to three simple operations:
createCombiner - create "zero" value for a given partition
mergeValue - merge values into accumulator
mergeCombiners - merge accumulators created for each partition.

Apache Spark flatMap time complexity

I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
val cand_br = sc.broadcast(cand)
transactions.flatMap(trans => freq(trans, cand_br.value))
.reduceByKey(_ + _)
}
def freq(trans: Set[String], cand: Array[Set[String]]) : Array[(Set[String],Int)] = {
var res = ArrayBuffer[(Set[String],Int)]()
for (c <- cand) {
if (c.subsetOf(trans)) {
res += ((c,1))
}
}
return res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V), with K every element in cand and V the number of occurrences of each element in cand in the transaction list.
When watching performance on the UI, the flatMap stage quickly takes about 3min to finish, whereas the rest takes < 1ms.
transactions.count() ~= 88000 and cand.length ~= 24000 for an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm pretty positive that it's an algorithmic problem I am faced with.
Is there a more optimal solution to solve this subproblem?
PS: I'm fairly new to Scala / Spark framework, so there might be some strange constructions in this code

Probably, the right question to ask in this case would be: "what is the time complexity of this algorithm". I think it is very much unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given 2 collections of Sets of size m and n, this algorithm is counting how many elements of one collection are a subset of elements of the other collection, so it looks like complexity m x n. Looking one level deeper, we also see that 'subsetOf' is linear of the number of elements of the subset. x subSet y == x forAll y, so actually the complexity is m x n x s where s is the cardinality of the subsets being checked.
In other words, this flatMap operation has a lot of work to do.
Going Parallel
Now, going back to Spark, we can also observe that this algo is embarrassingly parallel and we can take advantage of Spark's capabilities to our advantage.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algo on val cand = transactions.filter(_.size<4).collect. Data size is a close neighbor of the question:
Transactions.count = 88162
cand.size = 15451
Some comparative runs on local mode:
Vainilla: 1.2 minutes
Increase transactions partitions up to # of cores (8): 33 secs
I also tried an alternative implementation, using cartesian instead of flatmap:
transactions
.cartesian(candRDD)
.map{case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0)}
.reduceByKey(_ + _)
.collect
But that resulted in much longer runs as seen in the top 2 lines of the Spark UI (cartesian and cartesian with a higher number of partitions): 2.5 min
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, as it involves serializing closures and unpacking collections, but negligible in comparison with the function being executed.
Let's see if we can do a better job: I implemented the same algo using plain scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
Where the reduceByKey operation is a naive implementation taken from [2]
Execution time: 3.67 seconds.
Sparks gives you parallelism out of the box. This impl is totally sequential and therefore takes longer to complete.
Last sanity check: A trivial flatmap operation:
transactions
.flatMap(trans => Seq((trans, 1)))
.reduceByKey( _ + _)
.collect
Execution time: 0.88 secs
Conclusions:
Spark is buying you parallelism and clustering and this algo can take advantage of it. Use more cores and partition the input data accordingly.
There's nothing wrong with flatmap. The time complexity prize goes to the function inside it.

How to turn a known structured RDD to Vector

Assuming I have an RDD containing (Int, Int) tuples.
I wish to turn it into a Vector where first Int in tuple is the index and second is the value.
Any Idea how can I do that?
I update my question and add my solution to clarify:
My RDD is already reduced by key, and the number of keys is known.
I want a vector in order to update a single accumulator instead of multiple accumulators.
There for my final solution was:
reducedStream.foreachRDD(rdd => rdd.collect({case (x: Int,y: Int) => {
val v = Array(0,0,0,0)
v(x) = y
accumulator += new Vector(v)
}}))
Using Vector from accumulator example in documentation.

rdd.collectAsMap.foldLeft(Vector[Int]()){case (acc, (k,v)) => acc updated (k, v)}
Turn the RDD into a Map. Then iterate over that, building a Vector as we go.
You could use justt collect(), but if there are many repetitions of the tuples with the same key that might not fit in memory.

One key thing: do you really need Vector? Map could be much more suitable.
If you really need local Vector, you first need to use .collect() and then do local transformations into Vector. Of course you shall have enough memory for this. But here the real problem is where to find Vector which can be built efficiently from pairs of (index, value). As far as I know Spark MLLib has itself class org.apache.spark.mllib.linalg.Vectors which can create Vector from array of indices and values and even from tuples. Under the hood it uses breeze.linalg. So probably it would be best start for you.
If you just need order, you just can use .orderByKey() as you already have RDD[(K,V)]. This way you have ordered stream. Which does not strictly follow your intention but maybe it could suit even better. Now you can drop elements with the same key by .reduceByKey() producing only resulting elements.
Finally if you really need large vector, do .orderByKey and then you can produce real vector by doing .flatmap() which maintain counter and drops more than one element for the same index / inserts needed amount of 'default' elements for missing indices.
Hope this is clear enough.

In what way is Scala's Option fold a catamorphism?

The answer to this question suggests that the fold method on Option in Scala is a catamoprhism. From the wikipedia a catamophism is "the unique homomorphism from an initial algebra into some other algebra. The concept has been applied to functional programming as folds". So that seems fair, but leads me to an initial algebra as the initial object in the category of F-algebras.
So if the fold on Option is really a catamophism there needs to be some functor F, to create the category of F-algebras where Option would be the initial object. I can't figure out what this functor would be.
For Lists of type A the functor F is F[X] = 1 + A * X. This makes sense because List is a recursive data type, so if X is List[A] then the above reads that a list of type A is either the empty list (1), or (+) a pair (*) of an A and a List[A]. But Option isn't recursive. Option[A] would just be 1 + A (Nothing or an A). So I don't see where the functor is.
Just to be clear I realize that Option is already a functor, in that it takes A to Option[A], but what is done for lists is different, the A is fixed and the functor is used to describe how to recursively construct the data type.
On a related note, if it is not a catamorphism it probably shouldn't be called a fold, as that leads to some confusion.

Well, the comments are in the right track. I'm just a beginner so I probably have some misconceptions. Yes, the whole point is to be able to model recursive types, but I think nothing precludes a "non-recursive" F-algebra. Since the initial algebra is the "least fixed point" solution to the equation X ~= F X. In the case of Option, the solution is trivial, as there's no recursion involved :)
Other examples of initial algebras:
List[X] = 1 + A * X to represent list = Nil | Cons a list
Tree[X] = A + A * X * X to represent tree = Leaf a | Node a tree tree
In the same way:
Option[X] = 1 + A to represent option = None | Some a
The justification for the existence of a "constant" functor is pretty easy, how do you represent a tree's node?
In fact, to algebraically model (simple) recursive datatypes you need only the following functors:
U (Unit, represents empty)
K (Constant, captures a value)
I (Identity, represent the recursive position)
* (product)
+ (coproduct)
A good reference I found is Functional Generic Programming
Shameless plug: I'm playing with those concepts in code in scala-reggen