Scala ParArray Sorting

How can I sort a ParArray collection such as
ParArray(1, 3, 2)
in ascending order? Or, failing that, which parallel collections are more suitable for this purpose?
Update
How could a parallel sorting algorithm be implemented on ParArray so that it proves more efficient than converting to a non-parallel collection for sequential sorting?

How could a parallel sorting algorithm be implemented on ParArray so that it proves more efficient than converting to a non-parallel collection for sequential sorting?
My first observation is that there doesn't seem to be much of a performance penalty for "converting" parallel arrays to sequential ones and back:
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  val diff: Long = t1 - t0
  println(s"Elapsed time: ${diff * 1.0 / 1E9}s")
  result
}
import scala.util.Random

def main(args: Array[String]): Unit = {
  val size: Int = args.headOption.map(_.toInt).getOrElse(1000000)
  val input = Array.fill(size)(Random.nextInt())
  val arrayCopy: Array[Int] = Array.ofDim(size)
  input.copyToArray(arrayCopy)

  time { input.sorted }                                  // sequential sort of a plain Array
  val parArray = arrayCopy.par
  val result = time { parArray.seq.sorted.toArray.par }  // par -> seq -> sort -> back to par
}
gives
> run 1000000
[info] Running Runner 1000000
Elapsed time: 0.344659236s
Elapsed time: 0.321363896s
For all array sizes I tested, the results are very similar and usually slightly in favor of the second expression. So in case you were worried that converting to sequential collections and back would kill the performance gains you achieved with other operations - I don't think it will.
When it comes to using Scala's parallel collections to achieve a parallel sort that would in some cases perform better than the default, I don't think there's an obviously good way of doing it, but it doesn't hurt to try:
What I thought should work is splitting the input array into as many subarrays as you have cores in your machine (preferably without any unnecessary copying) and sorting the parts concurrently. Afterwards you could merge the sorted parts together (as in merge sort). Here's what the code might look like:
val maxThreads = 8 // for simplicity we're not configuring the thread pool explicitly
val groupSize: Int = size / maxThreads + 1
val ranges: IndexedSeq[(Int, Int)] = (0 until maxThreads).map(i => (i * groupSize, (i + 1) * groupSize))
time {
  // parallelizing sorting for each range
  ranges.par.foreach { case (from, to) =>
    input.view(from, to).sortWith(_ < _)
  }
  // TODO merge the parts together
}
Unfortunately there's this old bug that prevents us from doing anything fun with views. There doesn't seem to be any built-in mechanism in Scala (other than views) for sorting just a part of a collection. This is why I tried writing my own merge sort with the signature def mergeSort(a: Array[Int], r: Range): Unit, to use it as described above. Unfortunately it turned out to be more than 4 times slower than Scala's Array.sorted method, so I don't think it can be used to gain efficiency over the standard sequential approach.
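For reference, here is a minimal sketch of that chunk-and-merge idea that sidesteps the view problem by sorting index ranges in place with java.util.Arrays.sort(a, from, to) and then merging the sorted runs pairwise. The chunkedParallelSort name and the purely sequential merge phase are my own simplifications (so any speedup comes only from the chunk sorting); it mainly illustrates the shape of the approach:

import java.util.Arrays

// Sort fixed index ranges of the array in place concurrently, then merge the
// sorted runs bottom-up. On Scala 2.13+ the .par call additionally needs the
// scala-parallel-collections module.
def chunkedParallelSort(input: Array[Int], chunks: Int): Array[Int] = {
  val size = input.length
  val groupSize = size / chunks + 1
  val ranges = (0 until chunks).map { i =>
    (math.min(i * groupSize, size), math.min((i + 1) * groupSize, size))
  }

  // Sort each chunk in place; Arrays.sort(a, from, to) avoids the view problem.
  ranges.par.foreach { case (from, to) => Arrays.sort(input, from, to) }

  // Merge the sorted runs src[from, mid) and src[mid, to) into dst[from, to).
  def merge(src: Array[Int], dst: Array[Int], from: Int, mid: Int, to: Int): Unit = {
    var i = from; var j = mid; var k = from
    while (k < to) {
      if (j >= to || (i < mid && src(i) <= src(j))) { dst(k) = src(i); i += 1 }
      else { dst(k) = src(j); j += 1 }
      k += 1
    }
  }

  // Bottom-up merge passes (kept sequential for brevity), ping-ponging between buffers.
  var src = input
  var dst = new Array[Int](size)
  var width = groupSize
  while (width < size) {
    var from = 0
    while (from < size) {
      merge(src, dst, from, math.min(from + width, size), math.min(from + 2 * width, size))
      from += 2 * width
    }
    val tmp = src; src = dst; dst = tmp
    width *= 2
  }
  src // the buffer that holds the fully sorted data
}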
If I understand your situation correctly, your dataset fits in memory, so using something like Hadoop and MapReduce would be premature. What you might try, though, is Apache Spark - other than adding a dependency you wouldn't need to set up a cluster or install anything for Spark to use all the cores of your machine in a basic configuration. Its RDDs are conceptually similar to Scala's parallel collections, but with additional functionality, and they (in a way) support parallel sorting.
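For illustration, a hedged sketch of what that could look like with a purely local Spark setup; the master = "local[*]" setting and RDD.sortBy are standard Spark API, while the app name and data size are arbitrary:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

// Runs entirely on the local machine, using all available cores ("local[*]").
val conf = new SparkConf().setAppName("parallel-sort").setMaster("local[*]")
val sc = new SparkContext(conf)

val data = sc.parallelize(Seq.fill(1000000)(Random.nextInt()))
val sorted: Array[Int] = data.sortBy(identity).collect() // distributed sort, then gather on the driver

sc.stop()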

If you build your Scala project against Java 8, there is the new Arrays.parallelSort you can use:
import scala.collection.parallel.mutable.ParArray
import scala.reflect.ClassTag

def sort[T <: Comparable[T]](parArray: ParArray[T])(implicit c: ClassTag[T]): ParArray[T] = {
  val array = new Array[T](parArray.size) // or, to prevent copying, parArray.seq.array.asInstanceOf[Array[T]] might work?
  parArray.copyToArray(array)
  java.util.Arrays.parallelSort(array)
  ParArray.createFromCopy(array)
}
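For primitive element types such as Int there are dedicated overloads like Arrays.parallelSort(int[]), so a variant without the Comparable bound is possible. A small sketch (the sortIntPar name is just illustrative):

import scala.collection.parallel.mutable.ParArray

def sortIntPar(parArray: ParArray[Int]): ParArray[Int] = {
  val array = parArray.toArray           // copy out to a plain Array[Int]
  java.util.Arrays.parallelSort(array)   // sorts in place on the common ForkJoinPool
  ParArray.createFromCopy(array)
}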

There are no parallel sorting algorithms available in the Scala standard library. For this reason, the parallel collections don't provide sorted, sortBy, or sortWith methods. You will have to convert to an appropriate sequential class (e.g. with toArray) before sorting.

If your data can fit in memory, then a single-threaded in-memory sort is fast enough. If you need to load a lot of data from disk or HDFS, then you can do the sort on a distributed system such as Hadoop or Spark.

import scala.collection.immutable.TreeSet
import scala.collection.parallel.ParIterable

def parallelSort[A: Ordering](seq: ParIterable[A]): TreeSet[A] = {
  seq.aggregate[TreeSet[A]](TreeSet.empty[A])(
    (set, a) => set + a,
    (set1, set2) => set1 ++ set2)
}
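A usage sketch; note that TreeSet is a set, so duplicate elements collapse into one, which makes this a sorted-distinct rather than a full sort:

import scala.collection.parallel.mutable.ParArray

val sorted: TreeSet[Int] = parallelSort(ParArray(3, 1, 2, 2))
// TreeSet(1, 2, 3) -- the duplicate 2 is gone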

Related

Spark Dataset equivalent for scala's "collect" taking a partial function

Regular Scala collections have a nifty collect method which lets me do a filter-map operation in one pass using a partial function. Is there an equivalent operation on Spark Datasets?
I'd like it for two reasons:
syntactic simplicity
it reduces filter-map style operations to a single pass (although in spark I am guessing there are optimizations which spot these things for you)
Here is an example to show what I mean. Suppose I have a sequence of options and I want to extract and double just the defined integers (those in a Some):
val input = Seq(Some(3), None, Some(-1), None, Some(4), Some(5))
Method 1 - collect
input.collect {
  case Some(value) => value * 2
}
// List(6, -2, 8, 10)
The collect makes this quite neat syntactically and does one pass.
Method 2 - filter-map
input.filter(_.isDefined).map(_.get * 2)
I can carry this kind of pattern over to Spark because Datasets and DataFrames have analogous methods.
But I don't like this so much because isDefined and get seem like code smells to me. There's an implicit assumption that map is receiving only Somes. The compiler can't verify this. In a bigger example, that assumption would be harder for a developer to spot and the developer might swap the filter and map around for example without getting a syntax error.
Method 3 - fold* operations
input.foldRight[List[Int]](Nil) {
  case (nextOpt, acc) => nextOpt match {
    case Some(next) => next * 2 :: acc
    case None => acc
  }
}
I haven't used Spark enough to know if fold has an equivalent, so this might be a bit tangential.
Anyway, the pattern match, the fold boilerplate, and the rebuilding of the list all get jumbled together and it's hard to read.
So overall I find the collect syntax the nicest, and I'm hoping Spark has something like this.
The answers here are incorrect, at least with current versions of Spark.
RDDs do in fact have a collect method that takes a partial function and applies a filter & map to the data. This is completely different from the parameterless .collect() method. See the Spark source code RDD.scala # line 955:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
This does not materialize the data from the RDD, as opposed to the parameterless .collect() method in RDD.scala # line 923:
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
In the documentation, notice how the
def collect[U](f: PartialFunction[T, U]): RDD[U]
method does not have a warning associated with it about the data being loaded into the driver's memory:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD#collect[U](f:PartialFunction[T,U])(implicitevidence$29:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]
It's very confusing for Spark to have these overloaded methods doing completely different things.
edit: My mistake! I misread the question, we're talking about Datasets, not RDDs. Still, the accepted answer says that
"As the Spark documentation points out, however, 'this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.'"
Which is incorrect! The data is not loaded into the driver's memory when calling the partial function version of .collect() - only when calling the parameterless version. Calling .collect(partial_function) should have about the same performance as calling .filter() and .map() sequentially, as shown in the source code above.
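To make the distinction concrete, a small sketch (assuming a SparkContext sc is in scope):

val rdd = sc.parallelize(Seq(Some(3), None, Some(-1), None, Some(4), Some(5)))

// Partial-function overload: a lazy transformation that returns another RDD
val doubled = rdd.collect { case Some(v) => v * 2 }

// Equivalent filter + map chain
val doubled2 = rdd.filter(_.isDefined).map(_.get * 2)

// Only the parameterless action pulls the data into the driver
doubled.collect() // Array(6, -2, 8, 10)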
Just for the sake of completeness:
The RDD API does have such a method, so it's always an option to convert a given Dataset / DataFrame to RDD, perform the collect operation and convert back, e.g.:
val dataset = Seq(Some(1), None, Some(2)).toDS()
val dsResult = dataset.rdd.collect { case Some(i) => i * 2 }.toDS()
However, this will probably perform worse than using a map and filter on the Dataset (for the reason explained in #stefanobaghino's answer).
As for DataFrames, this particular example (using Option) is somewhat misleading, as the conversion into a DataFrame actually does the "flattening" of Options into their values (or null for None), so the equivalent expression would be:
val dataframe = Seq(Some(1), None, Some(2)).toDF("opt")
dataframe.withColumn("opt", $"opt".multiply(2)).filter(not(isnull($"opt")))
Which, I think, suffers less from your concerns of having the map operation "assume" anything about its input.
The collect method defined over RDDs and Datasets is used to materialize the data in the driver program.
Despite not having something akin to the Collections API collect method, your intuition is right: since both operations are evaluated lazily, the engine has the opportunity to optimize the operations and chain them so that they are performed with maximum locality.
For the use case you mentioned in particular I would suggest you take flatMap into consideration, which works on both RDDs and Datasets:
// Assumes the usual spark-shell environment
// sc: SparkContext, spark: SparkSession
val collection = Seq(Some(1), None, Some(2), None, Some(3))
val rdd = sc.parallelize(collection)
val dataset = spark.createDataset(rdd)
// Both operations will yield `Array(2, 4, 6)`
rdd.flatMap(_.map(_ * 2)).collect
dataset.flatMap(_.map(_ * 2)).collect
// You can also express the operation in terms of a for-comprehension
(for (option <- rdd; n <- option) yield n * 2).collect
(for (option <- dataset; n <- option) yield n * 2).collect
// The same approach is valid for traditional collections as well
collection.flatMap(_.map(_ * 2))
for (option <- collection; n <- option) yield n * 2
EDIT
As correctly pointed out in another question, RDDs actually have the collect method that transforms an RDD by applying a partial function just like it happens in normal collections. As the Spark documentation points out, however, "this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory."
I just wanted to extend stefanobaghino's answer by including an example of a for comprehension with a case class, as many use cases for this will probably involve case classes.
Also, Options are monads, which makes the accepted answer very simple in this case, as the for comprehension neatly drops the None values; that approach wouldn't extend to non-monads like case classes:
case class A(b: Boolean, i: Int, d: Double)

val collection = Seq(A(true, 3, 1.0), A(false, 10, 2.0), A(true, -1, 3.0))
val rdd = ...
val dataset = ...

// Select out and double all the 'i' values where 'b' is true:
for {
  A(b, i, _) <- dataset
  if b
} yield i * 2
You can always create your own extension method:
implicit class DatasetOps[T](ds: Dataset[T]) {
  def collectt[U](pf: PartialFunction[T, U])(implicit enc: Encoder[U]): Dataset[U] = {
    ds.flatMap(pf.lift(_))
  }
}
such that:
// val ds = Dataset(1, 2, 3)
ds.collectt { case x if x % 2 == 1 => x * 3 }
// Dataset(3, 9)
Note that I've unfortunately not been able to name it collect (thus the awful suffix t) as the signature would otherwise (I think) clash with the existing Dataset#collect method that transforms a Dataset into an Array.

What's the performance impact of converting between `DataFrame`, `RDD` and back?

While my first instinct is to use DataFrames for everything, it's just not possible -- some operations are clearly easier and / or better performing as RDD operations, not to mention certain APIs like GraphX only work on RDDs.
I seem to be spending a lot of time these days converting back and forth between DataFrames and RDDs -- so what's the performance hit? Take RDD.checkpoint -- there's no DataFrame equivalent, so what happens under the hood when I do:
val df = Seq((1,2),(3,4)).toDF("key","value")
val rdd = df.rdd.map(...)
val newDf = rdd.map(r => (r.getInt(0), r.getInt(1))).toDF("key","value")
Obviously, this is a trivially small example, but it would be great to know what happens behind the scenes in the conversion.
Let's look at df.rdd first. This is defined as:
lazy val rdd: RDD[Row] = {
  // use a local variable to make sure the map closure doesn't capture the whole DataFrame
  val schema = this.schema
  queryExecution.toRdd.mapPartitions { rows =>
    val converter = CatalystTypeConverters.createToScalaConverter(schema)
    rows.map(converter(_).asInstanceOf[Row])
  }
}
So firstly, it runs queryExecution.toRdd, which basically prepares the execution plan based on the operators used to build up the DataFrame, and computes an RDD[InternalRow] that represents the outcome of the plan.
Next, the InternalRows (which are only for internal use) of that RDD are mapped to normal Rows. For each row this entails the following:
override def toScala(row: InternalRow): Row = {
  if (row == null) {
    null
  } else {
    val ar = new Array[Any](row.numFields)
    var idx = 0
    while (idx < row.numFields) {
      ar(idx) = converters(idx).toScala(row, idx)
      idx += 1
    }
    new GenericRowWithSchema(ar, structType)
  }
}
So it loops over all elements, converts them to 'Scala' space (from Catalyst space), and creates the final row with them. toDF will pretty much do these things in reverse.
All of this will indeed have some impact on performance. How much depends on how complex these operations are compared to the things you do with the data. The bigger potential impact, however, is that Spark's Catalyst optimizer can only optimize the operations between the conversions to and from RDDs, rather than the execution plan as a whole. It would be interesting to see which operations you have trouble with; I find most things can be done using basic expressions or UDFs. Using modules that only work on RDDs is a very valid use case though!
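One way to see this boundary for yourself is to inspect the physical plan. A sketch (assumes a spark-shell session so that spark.implicits._ is available; the exact node names vary between Spark versions):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 2), (3, 4)).toDF("key", "value")
val roundTripped = df.rdd
  .map(r => (r.getInt(0), r.getInt(1) * 10)) // opaque to Catalyst
  .toDF("key", "value")
  .filter(col("value") > 2)

roundTripped.explain()
// The plan bottoms out at a "Scan ExistingRDD"-style node: everything before the
// RDD hop is invisible to the optimizer, so the filter cannot be pushed below the map.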

Reducing with a bloom filter

I would like to get a fast approximate set membership, based on a String-valued function applied to a large Spark RDD of String Vectors (~1B records). Basically the idea would be to reduce into a Bloom filter. This bloom filter could then be broadcasted to the workers for further use.
More specifically, I currently have
rdd: RDD[Vector[String]]
f: Vector[String] => String
val uniqueVals = rdd.map(f).distinct().collect()
val uv = sc.broadcast(uniqueVals)
But uniqueVals is too large to be practical, and I would like to replace it with something of smaller (and known) size, i.e. a bloom filter.
My questions:
is it possible to reduce into a Bloom filter, or do I have to collect first, and then construct it in the driver?
is there a mature Scala/Java Bloom filter implementation available that would be suitable for this?
Yes, Bloom filters can be reduced, because they have some nice properties (they are monoids). This means you can do all the aggregation operations in parallel, doing effectively just one pass over the data to construct the BloomFilter for each partition and then reduce those BloomFilters together to get a single BloomFilter that you can query for contains.
There are at least two implementations of BloomFilter in Scala and both seem mature projects (haven't actually used them in production). The first one is Breeze and the second one is Twitter's Algebird. Both contain implementations of different sketches and a lot more.
Here is an example of how to do that with Breeze:
import breeze.util.BloomFilter

val nums = List(1 to 20: _*).map(_.toString)
val rdd = sc.parallelize(nums, 5)

val bf = rdd.mapPartitions { iter =>
  val bf = BloomFilter.optimallySized[String](10000, 0.001)
  iter.foreach(i => bf += i)
  Iterator(bf)
}.reduce(_ | _)

println(bf.contains("5")) // true
println(bf.contains("31")) // false

What is the easiest and most efficient way to make a min heap in Scala?

val maxHeap = scala.collection.mutable.PriorityQueue[Int]() // gives a max-heap
What is the most concise and efficient way to use Ordering to turn a PriorityQueue into a minHeap?
You'll have to define your own Ordering:
scala> object MinOrder extends Ordering[Int] {
def compare(x:Int, y:Int) = y compare x
}
defined object MinOrder
Then use that when creating the heap:
scala> val minHeap = scala.collection.mutable.PriorityQueue.empty(MinOrder)
minHeap: scala.collection.mutable.PriorityQueue[Int] = PriorityQueue()
scala> minHeap.ord
res1: Ordering[Int] = MinOrder$@158ac84e
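As a quick check, and as a shorter alternative for Int (the Ordering[Int].reverse variant is not from the original answer, but it is plain standard-library API):

import scala.collection.mutable.PriorityQueue

val minHeap = PriorityQueue(5, 1, 3)(MinOrder)
minHeap.dequeue() // 1 -- the smallest element comes out first
minHeap.dequeue() // 3

// Equivalent shortcut: reverse the default ordering instead of writing your own
val minHeap2 = PriorityQueue.empty[Int](Ordering[Int].reverse)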
Update August 2016: you can consider the proposal chrisokasaki/scads/scala/heapTraits.scala from Chris Okasaki (chrisokasaki).
That proposal illustrates the "not-so-easy" part of a Heap:
Proof of concept for typesafe heaps with a merge operation.
Here, "typesafe" means that the interface will never allow different orderings to be mixed within the same heap.
In particular,
when adding an element to an existing heap, that insertion cannot involve an ordering different from the one used to create the existing heap, and
when merging two existing heaps, the heaps are guaranteed to have been created with the same ordering.
See its design.
val h1 = LeftistHeap.Min.empty[Int] // an empty min-heap of integers

Combination of elements

Problem:
Given a Seq seq and an Int n.
I basically want all combinations of the elements up to size n. The arrangement matters, meaning e.g. [1,2] is different from [2,1].
def combinations[T](seq: Seq[T], size: Int) = ...
Example:
combinations(List(1,2,3), 0)
//Seq(Seq())
combinations(List(1,2,3), 1)
//Seq(Seq(), Seq(1), Seq(2), Seq(3))
combinations(List(1,2,3), 2)
//Seq(Seq(), Seq(1), Seq(2), Seq(3), Seq(1,2), Seq(2,1), Seq(1,3), Seq(3,1),
//Seq(2,3), Seq(3,2))
...
What I have so far:
import scala.annotation.tailrec

def combinations[T](seq: Seq[T], size: Int) = {
  @tailrec
  def inner(seq: Seq[T], soFar: Seq[Seq[T]]): Seq[Seq[T]] = seq match {
    case head +: tail => inner(tail, soFar ++ {
      val insertList = Seq(head)
      for {
        comb <- soFar
        if comb.size < size
        index <- 0 to comb.size
      } yield comb.patch(index, insertList, 0)
    })
    case _ => soFar
  }

  inner(seq, IndexedSeq(IndexedSeq.empty))
}
What would be your approach to this problem? This method will be called a lot, so it should be as efficient as possible.
There are methods in the library like subsets or combinations (yes, I chose the same name) which return iterators. I also thought about that, but I have no idea yet how to design this lazily.
Not sure if this is efficient enough for your purpose but it's the simplest approach.
def combinations[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  (0 to size).flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
edit:
to make it lazy you can use a view
def combinations[T](seq: Seq[T], size: Int): Iterable[Seq[T]] = {
  (0 to size).view.flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
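A quick check against the examples in the question:

combinations(List(1, 2, 3), 2).toList
// List(List(), List(1), List(2), List(3), List(1, 2), List(2, 1),
//      List(1, 3), List(3, 1), List(2, 3), List(3, 2))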
From permutations theory we know that the number of permutations of K elements taken from a set of N elements is
N! / (N - K)!
(see http://en.wikipedia.org/wiki/Permutation)
Therefore, if you want to build them all, the cost is
algorithm complexity = number of permutations * cost of building each permutation
The potential optimization of the algorithm lies in minimizing the cost of building each permutation, by using a data structure that has some appending / prepending operation that runs in O(1).
You are using an IndexedSeq, which is a collection optimized for O(1) random access. Collections optimized for random access are backed by arrays, and when using such collections (this is also true for Java's ArrayList) you give up the guarantee of low-cost insertion: sometimes the array won't be big enough and the collection will have to allocate a new one and copy all the elements.
When using instead linked lists (such as scala List, which is the default implementation for Seq) you take the opposite choice: you give up constant time access for constant time insertion. In particular, scala List is a recursive data structure with constant time insertion at the front.
So if you are looking for high performance and you need the collection to be available eagerly, use a Seq.empty instead of IndexedSeq.empty and at each iteration prepend the new element at the head of the Seq. If you need something lazy, use Stream which will minimize memory occupation. Additional strategies worth exploring is to create an empty IndexedSeq for your first iteration, but not through Indexed.empty. Use instead the builder and try to provide an array which has the right size (N! / (N-K)!)