Assume I have an RDD containing (Int, Int) tuples.
I wish to turn it into a Vector where the first Int in each tuple is the index and the second is the value.
Any idea how I can do that?
I have updated my question and added my solution to clarify:
My RDD is already reduced by key, and the number of keys is known.
I want a vector in order to update a single accumulator instead of multiple accumulators.
Therefore, my final solution was:
reducedStream.foreachRDD(rdd => rdd.collect({ case (x: Int, y: Int) => {
  val v = Array(0, 0, 0, 0)
  v(x) = y
  accumulator += new Vector(v)
}}))
Using Vector from accumulator example in documentation.
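rdd.collectAsMap.foldLeft(Vector.fill(numKeys)(0)){case (acc, (k,v)) => acc updated (k, v)}
Turn the RDD into a Map. Then iterate over that, building a Vector as we go. The Vector has to be pre-sized (numKeys stands for the known number of keys), because updated throws an IndexOutOfBoundsException for an index beyond the current length.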
You could use just collect(), but if many tuples share the same key, the result might not fit in memory.
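As a minimal local sketch of the fold described above, assuming the pairs have already been brought to the driver (for example via collect()) and that every index is below a known size. The names PairsToVector, toVector, size and default are hypothetical and not part of any Spark API:

```scala
object PairsToVector {
  // Builds a dense Vector of length `size` from (index, value) pairs.
  // Indices that never appear keep `default`; if an index repeats, the last pair wins.
  // Assumes 0 <= index < size for every pair, otherwise `updated` throws.
  def toVector(pairs: Seq[(Int, Int)], size: Int, default: Int = 0): Vector[Int] =
    pairs.foldLeft(Vector.fill(size)(default)) {
      case (acc, (index, value)) => acc.updated(index, value)
    }
}
```

For instance, PairsToVector.toVector(Seq((0, 5), (2, 7)), 4) yields Vector(5, 0, 7, 0).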
One key thing: do you really need Vector? Map could be much more suitable.
If you really need a local Vector, you first need to call .collect() and then transform the result locally into a Vector. Of course, you must have enough memory for this. The real problem, though, is finding a Vector that can be built efficiently from (index, value) pairs. As far as I know, Spark MLlib has the class org.apache.spark.mllib.linalg.Vectors, which can create a Vector from arrays of indices and values, and even from tuples. Under the hood it uses breeze.linalg. So that is probably the best place for you to start.
If you just need ordering, you can use .sortByKey(), as you already have an RDD[(K,V)]. This gives you an ordered stream, which does not strictly follow your intention but might suit you even better. You can then drop elements with the same key with .reduceByKey(), keeping only the resulting elements.
Finally, if you really need a large vector, do .sortByKey() and then produce the real vector with .flatMap(), which maintains a counter, drops all but one element for the same index, and inserts the needed number of 'default' elements for missing indices.
Hope this is clear enough.
Related
If I have an Array[Array[Double]] in Scala, is there an idiomatic way to map over the second axis?
For instance, consider the following matrix:
val M : Array[Array[Double]] = Array(Array(1d,2d),Array(3d,4d),Array(5d,6d))
To normalize the rows I simply run:
M.map(x=>x.map(_/x.sum))
However, to normalize the columns it seems like I must execute:
M.transpose.map(x=>x.map(_/x.sum)).transpose
This is workable, but it becomes extremely tedious if I have more than two indices. In general, if I want to map over the last axis of a bunch of nested Arrays, i.e., Array[Array[...Array[Double]...]], then I need to bubble the last axis to the front via map and transpose, then map over it, then bubble it back to the back.
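For the two-dimensional case only, one way to avoid the second transpose is to compute the column sums once and divide each row by them. This is a sketch under the assumption of a rectangular matrix with non-zero column sums. It does not address deeper nesting, and the names ColumnNormalize and normalizeColumns are hypothetical:

```scala
object ColumnNormalize {
  // Divides every entry by the sum of its column.
  // Assumes all rows have the same length.
  def normalizeColumns(m: Array[Array[Double]]): Array[Array[Double]] = {
    val colSums = m.transpose.map(_.sum) // one sum per column
    m.map(row => row.zip(colSums).map { case (x, s) => x / s })
  }
}
```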
I have two DStreams. Let A:DStream[X] and B:DStream[Y].
I want to get the cartesian product of them, in other words, a new C:DStream[(X, Y)]
containing all the pairs of X and Y values.
I know there is a cartesian function for RDDs. I was only able to find this similar question but it's in Java and so does not answer my question.
The Scala equivalent of the linked question's answer (ignoring Time v3, which isn't used there) is
A.transformWith(B, (rddA: RDD[X], rddB: RDD[Y]) => rddA.cartesian(rddB))
or shorter
A.transformWith(B, (_: RDD[X]).cartesian(_: RDD[Y]))
I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
val cand_br = sc.broadcast(cand)
transactions.flatMap(trans => freq(trans, cand_br.value))
.reduceByKey(_ + _)
}
def freq(trans: Set[String], cand: Array[Set[String]]) : Array[(Set[String],Int)] = {
  var res = ArrayBuffer[(Set[String],Int)]()
  for (c <- cand) {
    if (c.subsetOf(trans)) {
      res += ((c,1))
    }
  }
  return res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V)], with K every element in cand and V the number of occurrences of each element in cand in the transaction list.
When watching performance on the UI, the flatMap stage takes about 3 min to finish, whereas the rest takes < 1 ms.
transactions.count() ~= 88000 and cand.length ~= 24000 for an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm pretty positive that it's an algorithmic problem I am faced with.
Is there a more optimal solution to solve this subproblem?
PS: I'm fairly new to Scala / Spark framework, so there might be some strange constructions in this code
Probably, the right question to ask in this case would be: "what is the time complexity of this algorithm". I think it is very much unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given 2 collections of Sets of size m and n, this algorithm counts how many elements of one collection are subsets of elements of the other collection, so the complexity looks like m x n. Looking one level deeper, we also see that 'subsetOf' is linear in the number of elements of the subset: x subsetOf y == x forall (y contains _), so the complexity is actually m x n x s, where s is the cardinality of the subsets being checked.
In other words, this flatMap operation has a lot of work to do.
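To make the three nested levels of work visible, here is a plain-Scala sketch (no Spark) of the same counting. The names SubsetCount, isSubset and countCandidates are hypothetical, and unlike the freq function in the question, this version also reports candidates with a count of zero:

```scala
object SubsetCount {
  // Equivalent to c.subsetOf(trans): every element of c is looked up in trans,
  // so one check costs time proportional to the size of c (the factor s).
  def isSubset(c: Set[String], trans: Set[String]): Boolean =
    c.forall(trans.contains)

  // For each of the n candidates, scan all m transactions:
  // m x n subset checks in total.
  def countCandidates(transactions: Seq[Set[String]],
                      cand: Seq[Set[String]]): Map[Set[String], Int] =
    cand.map(c => c -> transactions.count(t => isSubset(c, t))).toMap
}
```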
Going Parallel
Now, going back to Spark, we can also observe that this algo is embarrassingly parallel, so we can use Spark's capabilities to our advantage.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algo on val cand = transactions.filter(_.size<4).collect. Data size is a close neighbor of the question:
Transactions.count = 88162
cand.size = 15451
Some comparative runs on local mode:
Vanilla: 1.2 minutes
Increase transactions partitions up to # of cores (8): 33 secs
I also tried an alternative implementation, using cartesian instead of flatmap:
transactions
.cartesian(candRDD)
.map{case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0)}
.reduceByKey(_ + _)
.collect
But that resulted in much longer runs as seen in the top 2 lines of the Spark UI (cartesian and cartesian with a higher number of partitions): 2.5 min
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, as it involves serializing closures and unpacking collections, but negligible in comparison with the function being executed.
Let's see if we can do a better job: I implemented the same algo using plain scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
Where the reduceByKey operation is a naive implementation taken from [2]
Execution time: 3.67 seconds.
Spark gives you parallelism out of the box. This impl is totally sequential and therefore takes longer to complete.
Last sanity check: A trivial flatmap operation:
transactions
.flatMap(trans => Seq((trans, 1)))
.reduceByKey( _ + _)
.collect
Execution time: 0.88 secs
Conclusions:
Spark is buying you parallelism and clustering and this algo can take advantage of it. Use more cores and partition the input data accordingly.
There's nothing wrong with flatmap. The time complexity prize goes to the function inside it.
Scaladocs explain how to add an element to a Vector.
def :+(elem: A): Vector[A]
[use case] A copy of this vector with an element appended.
Example:
scala> Vector(1,2) :+ 3
res12: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
For a large collection, it seems expensive to copy the whole Vector, and then add an element to it.
What's the best (fastest) way to add an element to a Vector?
Concatenation to an immutable Vector is O(logN). Take a look at this paper to see how it is done.
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
If you're going to be doing a lot of appends you should use a Queue as it guarantees constant time append. For information on the time complexity of collections you can refer to this cheat sheet.
http://www.scala-lang.org/docu/files/collections-api/collections_40.html
Appending to a vector in Scala takes effectively constant time. The vector is copied in the sense that many of its data structures are reused, not in the sense that all of the elements are copied into a new vector. See the link provided by coltfred for more information about time complexity of collections:
http://www.scala-lang.org/docu/files/collections-api/collections_40.html
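A small sketch of what "copied" means in practice: each :+ returns a new Vector while the old one stays valid and unchanged, so building a Vector by repeated appends is a normal thing to do. The names AppendDemo and buildByAppending are hypothetical:

```scala
object AppendDemo {
  // Appends 0 until n one element at a time. Each :+ returns a new Vector
  // and leaves the previous one untouched.
  def buildByAppending(n: Int): Vector[Int] =
    (0 until n).foldLeft(Vector.empty[Int])(_ :+ _)
}
```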
I am trying to create a tensor (can be conceived as a multidimensional array) package in scala. So far I was storing the data in a 1D Vector and doing index arithmetic.
But slicing and subarrays are not so easy to get. One needs to do a lot of arithmetic to convert multidimensional indices to 1D indices.
Is there an optimal way of storing a multidimensional array? If not, i.e. if a 1D array is the best solution, how can one optimally slice arrays (some concrete code would really help me)?
The key to answering this question is: when is pointer indirection faster than arithmetic? The answer is pretty much never. In-order traversals can be about equally fast for 2D, and things get worse from there:
2D random access
Array of Arrays - 600 M / second
Multiplication - 1.1 G / second
3D in-order
Array of Array of Arrays - 2.4 G / second
Multiplication - 2.8 G / second
(etc.)
So you're better off just doing the math.
Now the question is how to do slicing. Initially, if you have dimensions of n1, n2, n3, ... and indices of i1, i2, i3, ..., you compute the offset into the array by
i = i1 + n1*(i2 + n2*(i3 + ... ))
where typically i1 is chosen to be the last (innermost) dimension (but in general it should be the dimension most often in the innermost loop). That is, if it were an array of arrays of (...), you would index into it as a(...)(i3)(i2)(i1).
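For three dimensions, the offset formula above can be written out directly. This is only an illustrative sketch, and the names Offset3D and offset are hypothetical:

```scala
object Offset3D {
  // Flat offset of (i1, i2, i3) in an array with dimensions n1 x n2 x n3,
  // where i1 varies fastest: i = i1 + n1*(i2 + n2*i3).
  // n3 is not needed to compute the offset, only to size the storage.
  def offset(i1: Int, i2: Int, i3: Int, n1: Int, n2: Int): Int =
    i1 + n1 * (i2 + n2 * i3)
}
```

With n1 = 4, n2 = 3 and n3 = 2, the last element (3, 2, 1) maps to offset 23, which is 4*3*2 - 1.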
Now suppose you want to slice this. First, you might give an offset o1, o2, o3 to every index:
i = (i1 + o1) + n1*((i2 + o2) + n2*((i3 + o3) + ...))
and then you will have a shorter range on each (let's call these m1, m2, m3, ...).
Finally, if you eliminate a dimension entirely--let's say, for example, that m2 == 1, meaning that i2 == 0, you just simplify the formula:
i = (i1 + o1 + n1*o2) + (n1*n2)*((i3 + o3) + ... )
I will leave it as an exercise to the reader to figure out how to do this in general, but note that we can store the new constants o1 + n1*o2 and n1*n2 so we don't need to keep doing that math on the slice.
Finally, if you are allowing arbitrary dimensions, you just put that math into a while loop. This does, admittedly, slow it down a little bit, but you're still at least as well off as if you'd used a pointer dereference (in almost every case).
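A possible shape for that while loop, as a sketch only: it evaluates the sliced offset formula from the outermost dimension inward. The names NdOffset, offset, indices, dims and offsets are hypothetical. dims(0) plays the role of n1, and passing all-zero offsets corresponds to the unsliced array:

```scala
object NdOffset {
  // Computes i = (i1 + o1) + n1*((i2 + o2) + n2*((i3 + o3) + ...)).
  // All three arrays are assumed to have the same length, and index k of each
  // array refers to dimension k+1 of the formula (position 0 varies fastest).
  def offset(indices: Array[Int], dims: Array[Int], offsets: Array[Int]): Int = {
    var k = indices.length - 1
    var acc = 0
    while (k >= 0) {
      // Horner-style step: scale what came from the outer dimensions by n_k,
      // then add this dimension's shifted index.
      acc = acc * dims(k) + indices(k) + offsets(k)
      k -= 1
    }
    acc
  }
}
```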
From my own general experience: If you have to write a multidimensional (rectangular) array class yourself, do not aim to store the data as Array[Array[Double]] but use a one-dimensional storage and add helper methods for converting the multidimensional access tuples to a simple index and vice versa.
When using lists of lists, you need to do much too much bookkeeping to ensure that all lists are of the same size, and you need to be careful when assigning a sublist to another sublist (because this makes the assigned-to sublist identical to the first, and you wonder why changing the item at (0,5) also changes (3,5)).
Of course, if you expect a certain dimension to be sliced much more often than another and you want to have reference semantics for that dimension as well, a list of lists will be the better solution, as you may pass around those inner lists as a slice to the consumer without making any copy. But if you don’t expect that, it is a better solution to add a proxy class for the slices which maps to the multidimensional array (which in turn maps to the one-dimensional storage array).
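As a rough illustration of one-dimensional storage with helper methods for converting between access tuples and a flat index, here is a two-dimensional sketch. The class name Dense2D and its members are hypothetical, and row-major layout is an arbitrary choice made for the example:

```scala
final class Dense2D(val rows: Int, val cols: Int) {
  // Single flat backing array, row-major: element (r, c) lives at r * cols + c.
  private val data = new Array[Double](rows * cols)

  // (row, column) -> flat index
  def index(r: Int, c: Int): Int = r * cols + c

  // flat index -> (row, column)
  def coords(i: Int): (Int, Int) = (i / cols, i % cols)

  def apply(r: Int, c: Int): Double = data(index(r, c))

  def update(r: Int, c: Int, v: Double): Unit = data(index(r, c)) = v
}
```

Because apply and update are defined, elements can be read as m(1, 2) and written as m(1, 2) = 5.0.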
Just an idea: how about a map with Int-tuples as keys?
Example:
val twoDimMatrix = Map((1,1) -> -1, (1,2) -> 5, (2,1) -> 7.7, (2,2) -> 9)
and then you could
scala> twoDimMatrix.filterKeys{_._2 == 1}.values
res1: Iterable[AnyVal] = MapLike(-1, 7.7)
or
twoDimMatrix.filterKeys{tuple => { val (dim1, dim2) = tuple; dim1 == dim2}} //diagonal
This way the index arithmetic would be done by the map. I don't know how practical and fast this is, though.
As long as the number of dimensions is known at design time, you can use a collection of collection ...(n times) of collection. If you must be able to build a vector for any number of dimensions, then there's nothing convenient in the Scala API to do it (as far as I know).
You can simply store information in a multidimensional array (e.g. Array[Array[Double]]).
If the tensors are small and can fit in cache, you can have a performance improvement with 1D arrays because of data memory locality. It should also be faster to copy the whole tensor.
For slicing arithmetic, it depends on what kind of slicing you require. I suppose you already have a function for extracting an element based on indices. So write a basic slicing loop based on index iteration, manually insert the expression for extracting an element, and then try to simplify the whole loop. This is often simpler than writing a correct expression from scratch.