Parallel collection processing of data larger than memory size - scala

Is there a simple way to use scala parallel collections without loading a full collection into memory?
For example I have a large collection and I'd like to perform a particular operation (fold) in parallel only on a small chunk, that fits into memory, than on another chunk and so on, and finally recombine results from all chunks.
I know, that actors could be used, but it would be really nice to use par-collections.
I've written a solution, but it isn't nice:
def split[A](list: Iterable[A], chunkSize: Int): Iterable[Iterable[A]] = {
new Iterator[Iterable[A]] {
var rest = list
def hasNext = !rest.isEmpty
def next = {
val chunk = rest.take(chunkSize)
rest = rest.drop(chunkSize)
chunk
}
}.toIterable
}
def foldPar[A](acc: A)(list: Iterable[A], chunkSize: Int, combine: ((A, A) => A)): A = {
val chunks: Iterable[Iterable[A]] = split(list, chunkSize)
def combineChunk: ((A,Iterable[A]) => A) = { case (res, entries) => entries.par.fold(res)(combine) }
chunks.foldLeft(acc)(combineChunk)
}
val chunkSize = 10000000
val x = 1 to chunkSize*10
def sum: ((Int,Int) => Int) = {case (acc,n) => acc + n }
foldPar(0)(x,chunkSize,sum)

Your idea is very neat and it's a pity there is no such function available already (AFAIK).
I just rephrased your idea into a bit shorter code. First, I feel that for parallel folding it's useful to use the concept of monoid - it's a structure with an associative operation and a zero element. The associativity is important because we don't know the order in which we combine result that are computed in parallel. And the zero element is important so that we can split computations into blocks and start folding each one from the zero. There is nothing new about it though, it's just what fold for Scala's collections expects.
// The function defined by Monoid's apply must be associative
// and zero its identity element.
trait Monoid[A]
extends Function2[A,A,A]
{
val zero: A
}
Next, Scala's Iterators already have a useful method grouped(Int): GroupedIterator[Seq[A]] which slices the iterator into fixed-size sequences. It's quite similar to your split. This allows us to cut the input into fixed-size blocks and then apply Scala's parallel collection methods on them:
def parFold[A](c: Iterator[A], blockSize: Int)(implicit monoid: Monoid[A]): A =
c.grouped(blockSize).map(_.par.fold(monoid.zero)(monoid))
.fold(monoid.zero)(monoid);
We fold each block using the parallel collections framework and then (without any parallelization) combine the intermediate results.
An example:
// Example:
object SumMonoid extends Monoid[Long] {
override val zero: Long = 0;
override def apply(x: Long, y: Long) = x + y;
}
val it = Iterator.range(1, 10000001).map(_.toLong)
println(parFold(it, 100000)(SumMonoid));

Related

Queue In Scala using list in scala

We can implement a queue in java simply by using ArrayList but in case of Scala Lists are immutable so how can I implement a queue using List in Scala.Somebody give me some hint about it.
This is from Scala's immutable Queue:
Queue is implemented as a pair of Lists, one containing the in elements and the other the out elements. Elements are added to the in list and removed from the out list. When the out list runs dry, the queue is pivoted by replacing the out list by in.reverse, and in by Nil.
So:
object Queue {
def empty[A]: Queue[A] = new Queue(Nil, Nil)
}
class Queue[A] private (in: List[A], out: List[A]) {
def isEmpty: Boolean = in.isEmpty && out.isEmpty
def push(elem: A): Queue[A] = new Queue(elem :: in, out)
def pop(): (A, Queue[A]) =
out match {
case head :: tail => (head, new Queue(in, tail))
case Nil =>
val head :: tail = in.reverse // throws exception if empty
(head, new Queue(Nil, tail))
}
}
var q = Queue.empty[Int]
(1 to 10).foreach(i => q = q.push(i))
while (!q.isEmpty) { val (i, r) = q.pop(); println(i); q = r }
With immutable Lists, you have to return a new List after any modifying operation. Once you've grasped that, it's straightforward. A minimal (but inefficient) implementation where the Queue is also immutable might be:
class Queue[T](content:List[T]) {
def pop() = new Queue(content.init)
def push(element:T) = new Queue(element::content)
def peek() = content.last
override def toString() = "Queue of:" + content.toString
}
val q= new Queue(List(1)) //> q : lists.queue.Queue[Int] = Queue of:List(1)
val r = q.push(2) //> r : lists.queue.Queue[Int] = Queue of:List(2, 1)
val s = r.peek() //> s : Int = 1
val t = r.pop() //> t : lists.queue.Queue[Int] = Queue of:List(2)
If we talk about mutable Lists, they wouldn't be an efficient structure for implementing a Queue for the following reason: Adding elements to the beginning of a list works very well (takes constant time), but popping elements off the end is not efficient at all (takes longer the more elements there are in the list).
You do, however, have Arrays in Scala. Accessing any element in an array takes constant time. Unfortunately arrays are not dynamically sized, so they wouldn't make good queues. They cannot grow as your queue grows. However ArrayBuffers do grow as your array grows. So that would be a great place to start.
Also, note that Scala already has a Queue class: scala.collection.mutable.Queue.
The only way to implement a Queue with an immutable List would be to use a var. Good luck!

What is the ideal collection for incremental (with multiple passings) filtering of collection?

I've seen many questions about Scala collections and could not decide.
This question was the most useful until now.
I think the core of the question is twofold:
1) Which are the best collections for this use case?
2) Which are the recommended ways to use them?
Details:
I am implementing an algorithm that iterates over all elements in a collection
searching for the one that matches a certain criterion.
After the search, the next step is to search again with a new criterion, but without the chosen element among the possibilities.
The idea is to create a sequence with all original elements ordered by the criterion (which changes at every new selection).
The original sequence doesn't really need to be ordered, but there can be duplicates (the algorithm will only pick one at a time).
Example with a small sequence of Ints (just to simplify):
object Foo extends App {
def f(already_selected: Seq[Int])(element: Int): Double =
// something more complex happens here,
// specially something take takes 'already_selected' into account
math.sqrt(element)
//call to the algorithm
val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt), Seq()))
println("ti = " + ti)
//algorithm
def recur(collection: Seq[Int], already_selected: Seq[Int]): (Seq[Int], Seq[Int]) =
if (collection.isEmpty) (Seq(), already_selected)
else {
val selected = collection maxBy f(already_selected)
val rest = collection diff Seq(selected) //this part doesn't seem to be efficient
recur(rest, selected +: already_selected)
}
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}
Try #inline and as icn suggested How can I idiomatically "remove" a single element from a list in Scala and close the gap?:
object Foo extends App {
#inline
def f(already_selected: Seq[Int])(element: Int): Double =
// something more complex happens here,
// specially something take takes 'already_selected' into account
math.sqrt(element)
//call to the algorithm
val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()).zipWithIndex, Seq()))
println("ti = " + ti)
//algorithm
#tailrec
def recur(collection: Seq[(Int, Int)], already_selected: Seq[Int]): Seq[Int] =
if (collection.isEmpty) already_selected
else {
val (selected, i) = collection.maxBy(x => f(already_selected)(x._2))
val rest = collection.patch(i, Nil, 1) //this part doesn't seem to be efficient
recur(rest, selected +: already_selected)
}
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}

Scala View + Stream combo causing OutOfMemory Error. How do I replace it with a View?

I was looking at solving a very simple problem, Eratosthenes sieve, using idiomatic Scala, for learning purposes.
I've learned a Stream caches, so it is not so performant when determining the nth element because it's an O(n) complexity access with memoisation of data, therefore not suitable for this situation.
def primes(nums: Stream[Int]): Stream[Int] = {
Stream.cons(nums.head,
primes((nums tail) filter (x => x % nums.head != 0)))
}
def ints(n: Int): Stream[Int] = {
Stream.cons(n, ints(n + 1))
};
def nthPrime(n: Int): Int = {
val prim = primes(ints(2)).view take n toList;
return prim(n - 1);
};
The Integer stream is the problematic one. While the prime number filtering is done, JVM runs OutOfMemory. What is the correct way to achieve the same functionality without using Streams?
Basically take a view of primes from a view of ints and display the last element, without memoisation?
I have had similar cases where a stream was a good idea, but I did not need to store it's values. In order to consume the stream without storing it's values I created (what I called) ThrowAwayIterator:
class ThrowAwayIterator[T](var stream: Stream[T]) extends Iterator[T] {
def hasNext: Boolean = stream.nonEmpty
def next(): T = {
val next = stream.head
stream = stream.tail
next
}
}
Make sure that you do not store a reference to the instance of stream that is passed in.

Encoding recursive tree-creation with while loop + stacks

I'm a bit embarassed to admit this, but I seem to be pretty stumped by what should be a simple programming problem. I'm building a decision tree implementation, and have been using recursion to take a list of labeled samples, recursively split the list in half, and turn it into a tree.
Unfortunately, with deep trees I run into stack overflow errors (ha!), so my first thought was to use continuations to turn it into tail recursion. Unfortunately Scala doesn't support that kind of TCO, so the only solution is to use a trampoline. A trampoline seems kinda inefficient and I was hoping there would be some simple stack-based imperative solution to this problem, but I'm having a lot of trouble finding it.
The recursive version looks sort of like (simplified):
private def trainTree(samples: Seq[Sample], usedFeatures: Set[Int]): DTree = {
if (shouldStop(samples)) {
DTLeaf(makeProportions(samples))
} else {
val featureIdx = getSplittingFeature(samples, usedFeatures)
val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
DTBranch(
trainTree(statsWithFeature, usedFeatures + featureIdx),
trainTree(statsWithoutFeature, usedFeatures + featureIdx),
featureIdx)
}
}
So basically I'm recursively subdividing the list into two according to some feature of the data, and passing through a list of used features so I don't repeat - that's all handled in the "getSplittingFeature" function so we can ignore it. The code is really simple! Still, I'm having trouble figuring out a stack-based solution that doesn't just use closures and effectively become a trampoline. I know we'll at least have to keep around little "frames" of arguments in the stack but I would like to avoid closure calls.
I get that I should be writing out explicitly what the callstack and program counter handle for me implicitly in the recursive solution, but I'm having trouble doing that without continuations. At this point it's hardly even about efficiency, I'm just curious. So please, no need to remind me that premature optimization is the root of all evil and the trampoline-based solution will probably work just fine. I know it probably will - this is basically a puzzle for it's own sake.
Can anyone tell me what the canonical while-loop-and-stack-based solution to this sort of thing is?
UPDATE: Based on Thipor Kong's excellent solution, I've coded up a while-loops/stacks/hashtable based implementation of the algorithm that should be a direct translation of the recursive version. This is exactly what I was looking for:
FINAL UPDATE: I've used sequential integer indices, as well as putting everything back into arrays instead of maps for performance, added maxDepth support, and finally have a solution with the same performance as the recursive version (not sure about memory usage but I would guess less):
private def trainTreeNoMaxDepth(startingSamples: Seq[Sample], startingMaxDepth: Int): DTree = {
// Use arraybuffer as dense mutable int-indexed map - no IndexOutOfBoundsException, just expand to fit
type DenseIntMap[T] = ArrayBuffer[T]
def updateIntMap[#specialized T](ab: DenseIntMap[T], idx: Int, item: T, dfault: T = null.asInstanceOf[T]) = {
if (ab.length <= idx) {ab.insertAll(ab.length, Iterable.fill(idx - ab.length + 1)(dfault)) }
ab.update(idx, item)
}
var currentChildId = 0 // get childIdx or create one if it's not there already
def child(childMap: DenseIntMap[Int], heapIdx: Int) =
if (childMap.length > heapIdx && childMap(heapIdx) != -1) childMap(heapIdx)
else {currentChildId += 1; updateIntMap(childMap, heapIdx, currentChildId, -1); currentChildId }
// go down
val leftChildren, rightChildren = new DenseIntMap[Int]() // heapIdx -> childHeapIdx
val todo = Stack((startingSamples, Set.empty[Int], startingMaxDepth, 0)) // samples, usedFeatures, maxDepth, heapIdx
val branches = new Stack[(Int, Int)]() // heapIdx, featureIdx
val nodes = new DenseIntMap[DTree]() // heapIdx -> node
while (!todo.isEmpty) {
val (samples, usedFeatures, maxDepth, heapIdx) = todo.pop()
if (shouldStop(samples) || maxDepth == 0) {
updateIntMap(nodes, heapIdx, DTLeaf(makeProportions(samples)))
} else {
val featureIdx = getSplittingFeature(samples, usedFeatures)
val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
todo.push((statsWithFeature, usedFeatures + featureIdx, maxDepth - 1, child(leftChildren, heapIdx)))
todo.push((statsWithoutFeature, usedFeatures + featureIdx, maxDepth - 1, child(rightChildren, heapIdx)))
branches.push((heapIdx, featureIdx))
}
}
// go up
while (!branches.isEmpty) {
val (heapIdx, featureIdx) = branches.pop()
updateIntMap(nodes, heapIdx, DTBranch(nodes(child(leftChildren, heapIdx)), nodes(child(rightChildren, heapIdx)), featureIdx))
}
nodes(0)
}
Just store the binary tree in an array, as described on Wikipedia: For node i, the left child goes into 2*i+1 and the right child in to 2*i+2. When doing "down", you keep a collection of todos, that still have to be splitted to reach a leaf. Once you've got only leafs, to go upward (from right to left in the array) to build the decision nodes:
Update: A cleaned up version, that also supports the features stored int the branches (type parameter B) and that is more functional/fully pure and that supports sparse trees with a map as suggested by ron.
Update2-3: Make economical use of name space for node ids and abstract over type of ids to allow of large trees. Take node ids from Stream.
sealed trait DTree[A, B]
case class DTLeaf[A, B](a: A, b: B) extends DTree[A, B]
case class DTBranch[A, B](left: DTree[A, B], right: DTree[A, B], b: B) extends DTree[A, B]
def mktree[A, B, Id](a: A, b: B, split: (A, B) => Option[(A, A, B)], ids: Stream[Id]) = {
#tailrec
def goDown(todo: Seq[(A, B, Id)], branches: Seq[(Id, B, Id, Id)], leafs: Map[Id, DTree[A, B]], ids: Stream[Id]): (Seq[(Id, B, Id, Id)], Map[Id, DTree[A, B]]) =
todo match {
case Nil => (branches, leafs)
case (a, b, id) :: rest =>
split(a, b) match {
case None =>
goDown(rest, branches, leafs + (id -> DTLeaf(a, b)), ids)
case Some((left, right, b2)) =>
val leftId #:: rightId #:: idRest = ids
goDown((right, b2, rightId) +: (left, b2, leftId) +: rest, (id, b2, leftId, rightId) +: branches, leafs, idRest)
}
}
#tailrec
def goUp[A, B](branches: Seq[(Id, B, Id, Id)], nodes: Map[Id, DTree[A, B]]): Map[Id, DTree[A, B]] =
branches match {
case Nil => nodes
case (id, b, leftId, rightId) :: rest =>
goUp(rest, nodes + (id -> DTBranch(nodes(leftId), nodes(rightId), b)))
}
val rootId #:: restIds = ids
val (branches, leafs) = goDown(Seq((a, b, rootId)), Seq(), Map(), restIds)
goUp(branches, leafs)(rootId)
}
// try it out
def split(xs: Seq[Int], b: Int) =
if (xs.size > 1) {
val (left, right) = xs.splitAt(xs.size / 2)
Some((left, right, b + 1))
} else {
None
}
val tree = mktree(0 to 1000, 0, split _, Stream.from(0))
println(tree)

Nearest keys in a SortedMap

Given a key k in a SortedMap, how can I efficiently find the largest key m that is less than or equal to k, and also the smallest key n that is greater than or equal to k. Thank you.
Looking at the source code for 2.9.0, the following code seems about to be the best you can do
def getLessOrEqual[A,B](sm: SortedMap[A,B], bound: A): B = {
val key = sm.to(x).lastKey
sm(key)
}
I don't know exactly how the splitting of the RedBlack tree works, but I guess it's something like a O(log n) traversal of the tree/construction of new elements and then a balancing, presumable also O(log n). Then you need to go down the new tree again to get the last key. Unfortunately you can't retrieve the value in the same go. So you have to go down again to fetch the value.
In addition the lastKey might throw an exception and there is no similar method that returns an Option.
I'm waiting for corrections.
Edit and personal comment
The SortedMap area of the std lib seems to be a bit neglected. I'm also missing a mutable SortedMap. And looking through the sources, I noticed that there are some important methods missing (like the one the OP asks for or the ones pointed out in my answer) and also some have bad implementation, like 'last' which is defined by TraversableLike and goes through the complete tree from first to last to obtain the last element.
Edit 2
Now the question is reformulated my answer is not valid anymore (well it wasn't before anyway). I think you have to do the thing I'm describing twice for lessOrEqual and greaterOrEqual. Well you can take a shortcut if you find the equal element.
Scala's SortedSet trait has no method that will give you the closest element to some other element.
It is presently implemented with TreeSet, which is based on RedBlack. The RedBlack tree is not visible through methods on TreeSet, but the protected method tree is protected. Unfortunately, it is basically useless. You'd have to override methods returning TreeSet to return your subclass, but most of them are based on newSet, which is private.
So, in the end, you'd have to duplicate most of TreeSet. On the other hand, it isn't all that much code.
Once you have access to RedBlack, you'd have to implement something similar to RedBlack.Tree's lookup, so you'd have O(logn) performance. That's actually the same complexity of range, though it would certainly do less work.
Alternatively, you'd make a zipper for the tree, so that you could actually navigate through the set in constant time. It would be a lot more work, of course.
Using Scala 2.11.7, the following will give what you want:
scala> val set = SortedSet('a', 'f', 'j', 'z')
set: scala.collection.SortedSet[Char] = TreeSet(a, f, j, z)
scala> val beforeH = set.to('h').last
beforeH: Char = f
scala> val afterH = set.from('h').head
afterH: Char = j
Generally you should use lastOption and headOption as the specified elements may not exist. If you are looking to squeeze a little more efficiency out, you can try replacing from(...).head with keysIteratorFrom(...).head
Sadly, the Scala library only allows to make this type of query efficiently:
and also the smallest key n that is greater than or equal to k.
val n = TreeMap(...).keysIteratorFrom(k).next
You can hack this by keeping two structures, one with normal keys, and one with negated keys. Then you can use the other structure to make the second type of query.
val n = - TreeMap(...).keysIteratorFrom(-k).next
Looks like I should file a ticket to add 'fromIterator' and 'toIterator' methods to 'Sorted' trait.
Well, one option is certainly using java.util.TreeMap.
It has lowerKey and higherKey methods, which do excatly what you want.
I had a similar problem: I wanted to find the closest element to a given key in a SortedMap. I remember the answer to this question being, "You have to hack TreeSet," so when I had to implement it for a project, I found a way to wrap TreeSet without getting into its internals.
I didn't see jazmit's answer, which more closely answers the original poster's question with minimum fuss (two method calls). However, those method calls do more work than needed for this application (multiple tree traversals), and my solution provides lots of hooks where other users can modify it to their own needs.
Here it is:
import scala.collection.immutable.TreeSet
import scala.collection.SortedMap
// generalize the idea of an Ordering to metric sets
trait MetricOrdering[T] extends Ordering[T] {
def distance(x: T, y: T): Double
def compare(x: T, y: T) = {
val d = distance(x, y)
if (d > 0.0) 1
else if (d < 0.0) -1
else 0
}
}
class MetricSortedMap[A, B]
(elems: (A, B)*)
(implicit val ordering: MetricOrdering[A])
extends SortedMap[A, B] {
// while TreeSet searches for an element, keep track of the best it finds
// with *thread-safe* mutable state, of course
private val best = new java.lang.ThreadLocal[(Double, A, B)]
best.set((-1.0, null.asInstanceOf[A], null.asInstanceOf[B]))
private val ord = new MetricOrdering[(A, B)] {
def distance(x: (A, B), y: (A, B)) = {
val diff = ordering.distance(x._1, y._1)
val absdiff = Math.abs(diff)
// the "to" position is a key-null pair; the object of interest
// is the other one
if (absdiff < best.get._1)
(x, y) match {
// in practice, TreeSet always picks this first case, but that's
// insider knowledge
case ((to, null), (pos, obj)) =>
best.set((absdiff, pos, obj))
case ((pos, obj), (to, null)) =>
best.set((absdiff, pos, obj))
case _ =>
}
diff
}
}
// use a TreeSet as a backing (not TreeMap because we need to get
// the whole pair back when we query it)
private val treeSet = TreeSet[(A, B)](elems: _*)(ord)
// find the closest key and return:
// (distance to key, the key, its associated value)
def closest(to: A): (Double, A, B) = {
treeSet.headOption match {
case Some((pos, obj)) =>
best.set((ordering.distance(to, pos), pos, obj))
case None =>
throw new java.util.NoSuchElementException(
"SortedMap has no elements, and hence no closest element")
}
treeSet((to, null.asInstanceOf[B])) // called for side effects
best.get
}
// satisfy the contract (or throw UnsupportedOperationException)
def +[B1 >: B](kv: (A, B1)): SortedMap[A, B1] =
new MetricSortedMap[A, B](
elems :+ (kv._1, kv._2.asInstanceOf[B]): _*)
def -(key: A): SortedMap[A, B] =
new MetricSortedMap[A, B](elems.filter(_._1 != key): _*)
def get(key: A): Option[B] = treeSet.find(_._1 == key).map(_._2)
def iterator: Iterator[(A, B)] = treeSet.iterator
def rangeImpl(from: Option[A], until: Option[A]): SortedMap[A, B] =
new MetricSortedMap[A, B](treeSet.rangeImpl(
from.map((_, null.asInstanceOf[B])),
until.map((_, null.asInstanceOf[B]))).toSeq: _*)
}
// test it with A = Double
implicit val doubleOrdering =
new MetricOrdering[Double] {
def distance(x: Double, y: Double) = x - y
}
// and B = String
val stuff = new MetricSortedMap[Double, String](
3.3 -> "three",
1.1 -> "one",
5.5 -> "five",
4.4 -> "four",
2.2 -> "two")
println(stuff.iterator.toList)
println(stuff.closest(1.5))
println(stuff.closest(1000))
println(stuff.closest(-1000))
println(stuff.closest(3.3))
println(stuff.closest(3.4))
println(stuff.closest(3.2))
I've been doing:
val m = SortedMap(myMap.toSeq:_*)
val offsetMap = (m.toSeq zip m.keys.toSeq.drop(1)).map {
case ( (k,v),newKey) => (newKey,v)
}.toMap
When I want the results of my map off-set by one key. I'm also looking for a better way, preferably without storing an extra map.