Efficient groupwise aggregation on Scala collections

Efficient groupwise aggregation on Scala collections - scala

I often need to do something like
coll.groupBy(f(_)).mapValues(_.foldLeft(x)(g(_,_)))
What is the best way to achieve the same effect, but avoid explicitly constructing the intermediate collections with groupBy?

You could fold the initial collection over a map holding your intermediate results:
def groupFold[A,B,X](as: Iterable[A], f: A => B, init: X, g: (X,A) => X): Map[B,X] =
as.foldLeft(Map[B,X]().withDefaultValue(init)){
case (m,a) => {
val key = f(a)
m.updated(key, g(m(key),a))
}
}
You said collection and I wrote Iterable, but you have to think whether order matters in the fold in your question.
If you want efficient code, you will probably use a mutable map as in Rex' answer.

You can't really do it as a one-liner, so you should be sure you need it before writing something more elaborate like this (written from a somewhat performance-minded view since you asked for "efficient"):
final case class Var[A](var value: A) { }
def multifold[A,B,C](xs: Traversable[A])(f: A => B)(zero: C)(g: (C,A) => C) = {
import scala.collection.JavaConverters._
val m = new java.util.HashMap[B, Var[C]]
xs.foreach{ x =>
val v = {
val fx = f(x)
val op = m.get(fx)
if (op != null) op
else { val nv = Var(zero); m.put(fx, nv); nv }
}
v.value = g(v.value, x)
}
m.asScala.mapValues(_.value)
}
(Depending on your use case you may wish to pack into an immutable map instead in the last step.) Here's an example of it in action:
scala> multifold(List("salmon","herring","haddock"))(_(0))(0)(_ + _.length)
res1: scala.collection.mutable.HashMap[Char,Int] = Map(h -> 14, s -> 6)
Now, you might notice something weird here: I'm using a Java HashMap. This is because Java's HashMaps are 2-3x faster than Scala's. (You can write the equivalent thing with a Scala HashMap, but it doesn't actually make things any faster than your original.) Consequently, this operation is 2-3x faster than what you posted. But unless you're under severe memory pressure, creating the transient collections doesn't actually hurt you much.

Related

groupBy on List as LinkedHashMap instead of Map

I am processing XML using scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements, however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, however I am using groupBy on the list of nodes, which creates a Map:
For example:
def parse(n:Node): Unit =
{
val leaves:Map[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.groupBy(_.label)
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by #jwvh I am using foldLeft as a substitution for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n:Node): Unit =
{
val leaves:mutable.LinkedHashMap[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]])((m, sn) =>
{
m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
m
})
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})

To get the rough equivalent to .groupBy() in a ListMap you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were encountered.
import collection.immutable.ListMap
List('a','b','a','c').foldLeft(ListMap.empty[Char,Seq[Char]]){
case (lm,c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right) but in reverse.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))

Purely immutable solution would be quite slow. So I'd go with
import collection.mutable.{ArrayBuffer, LinkedHashMap}
implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
for (x <- seq) {
val key = f(x)
map.getOrElseUpdate(key, ArrayBuffer.empty) += x
}
map
}
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated using this type (though it can be cast to mutable.Map or to mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if really desired).

loop until a condition stands in scala

I'd like to write a generic loop until a given condition stands, in a functional way.
I've came up with the following code :
def loop[A](a: A, f: A => A, cond: A => Boolean) : A =
if (cond(a)) a else loop(f(a), f, cond)
What are other alternatives ? Is there anything in scalaz ?
[update] It may be possible to use cats and to convert A => A into Reader and afterwards use tailRecM. Any help would be appreciated.

I agree with #wheaties's comment, but since you asked for alternatives, here you go:
You could represent the loop's steps as an iterator, then navigate to the first step where cond is true using .find:
val result = Iterator.iterate(a)(f).find(cond).get
I had originally misread, and answered as if the cond was the "keep looping while true" condition, as with C-style loops. Here's my response as if that was what you asked.
val steps = Iterator.iterate(a)(f).takeWhile(cond)
If all you want is the last A value, you can use steps.toIterable.last (oddly, Iterator doesn't have .last defined). Or you could collect all of the values to a list using steps.toList.
Example:
val steps = Iterator.iterate(0)(_ + 1).takeWhile(_ < 10)
// remember that an iterator is read-once, so if you call .toList, you can't call .last
val result = steps.toIterable.last
// result == 9

From your structure, I think what you are describing is closer to dropWhile than takeWhile. What follows is 100% educational and I don't suggest that this is useful or the proper way to solve this problem. Nevertheless, you might find it useful.
If you want to be generic to any container (List, Array, Option, etc.) You will need a method to access the first element of this container (a.k.a. the head):
trait HasHead[I[_]]{
def head[X](of: I[X]): X
}
object HasHead {
implicit val listHasHead = new HasHead[List] {
def head[X](of: List[X]) = of.head
}
implicit val arrayHasHead = new HasHead[Array] {
def head[X](of: Array[X]) = of.head
}
//...
}
Here is the generic loop adapted to work with any container:
def loop[I[_], A](
a: I[A],
f: I[A] => I[A],
cond: A => Boolean)(
implicit
hh: HasHead[I]): I[A] =
if(cond(hh.head(a))) a else loop(f(a), f, cond)
Example:
loop(List(1,2,3,4,5), (_: List[Int]).tail, (_: Int) > 2)
> List(3, 4, 5)

Scala, a cross between a foldLeft and a map supporting lazy evaluation

I have a collection which I want to map to a new collection, however each resulting value is dependent on the value before it in some way.I could solve this with a leftFold
val result:List[B] = (myList:List[A]).foldLeft(C -> List.empty[B]){
case ((c, list), a) =>
..some function returning something like..
C -> (B :: list)
}
The problem here is I need to iterate through the entire list to retrieve the resultant list. Say I wanted a function that maps TraversableOnce[A] to TraversableOnce[B] and only evaluate members as I call them?
It seems to me to be a fairly conventional problem so Im wondering if there is a common approach to this. What I currently have is:
implicit class TraversableOnceEx[T](val self : TraversableOnce[T]) extends AnyVal {
def foldyMappyFunction[A, U](a:A)(func:(A,T) => (A,U)):TraversableOnce[U] = {
var currentA = a
self.map { t =>
val result = func(currentA, t)
currentA = result._1
result._2
}
}
}
As far as functional purity goes, you couldn't run it in parallel, but otherwise it seems sound.
An example would be;
Return me each element and if it is the first time that element has appeared before.
val elements:TraversableOnce[E]
val result = elements.mappyFoldyFunction(Set.empty[E]) {
(s, e) => (s + e) -> (e -> s.contains(e))
}
result:TraversableOnce[(E,Boolean)]

You might be able to make use of the State Monad. Here is your example re-written using scalaz:
import scalaz._, Scalaz._
def foldyMappy(i: Int) = State[Set[Int], (Int, Boolean)](s => (s + i, (i, s contains(i))))
val r = List(1, 2, 3, 3, 6).traverseS(foldyMappy)(Set.empty[Int])._2
//List((1,false), (2,false), (3,false), (3,true), (6,false))
println(r)

It is look like you need SeqView. Use view or view(from: Int, until: Int) methods for create a non-strict view of list.

I really don't understand your example as your contains check will always result to false.
foldLeft is different. It will result in a single value by aggregating all elements of the list.
You clearly need map (List => List).
Anyway, answering your question about laziness:
you should use Stream instead of List. Stream doesn't evaluate the tail before actually calling it.
Stream API

Scala Parallel Collections- How to return early?

I have a list of possible input Values
val inputValues = List(1,2,3,4,5)
I have a really long to compute function that gives me a result
def reallyLongFunction( input: Int ) : Option[String] = { ..... }
Using scala parallel collections, I can easily do
inputValues.par.map( reallyLongFunction( _ ) )
To get what all the results are, in parallel. The problem is, I don't really want all the results, I only want the FIRST result. As soon as one of my input is a success, I want my output, and want to move on with my life. This did a lot of extra work.
So how do I get the best of both worlds? I want to
Get the first result that returns something from my long function
Stop all my other threads from useless work.
Edit -
I solved it like a dumb java programmer by having
#volatile var done = false;
Which is set and checked inside my reallyLongFunction. This works, but does not feel very scala. Would like a better way to do this....

(Updated: no, it doesn't work, doesn't do the map)
Would it work to do something like:
inputValues.par.find({ v => reallyLongFunction(v); true })
The implementation uses this:
protected[this] class Find[U >: T](pred: T => Boolean, protected[this] val pit: IterableSplitter[T]) extends Accessor[Option[U], Find[U]] {
#volatile var result: Option[U] = None
def leaf(prev: Option[Option[U]]) = { if (!pit.isAborted) result = pit.find(pred); if (result != None) pit.abort }
protected[this] def newSubtask(p: IterableSplitter[T]) = new Find(pred, p)
override def merge(that: Find[U]) = if (this.result == None) result = that.result
}
which looks pretty similar in spirit to your #volatile except you don't have to look at it ;-)

I took interpreted your question in the same way as huynhjl, but if you just want to search and discardNones, you could do something like this to avoid the need to repeat the computation when a suitable outcome is found:
class Computation[A,B](value: A, function: A => B) {
lazy val result = function(value)
}
def f(x: Int) = { // your function here
Thread.sleep(100 - x)
if (x > 5) Some(x * 10)
else None
}
val list = List.range(1, 20) map (i => new Computation(i, f))
val found = list.par find (_.result.isDefined)
//found is Option[Computation[Int,Option[Int]]]
val result = found map (_.result.get)
//result is Option[Int]
However find for parallel collections seems to do a lot of unnecessary work (see this question), so this might not work well, with current versions of Scala at least.
Volatile flags are used in the parallel collections (take a look at the source for find, exists, and forall), so I think your idea is a good one. It's actually better if you can include the flag in the function itself. It kills referential transparency on your function (i.e. for certain inputs your function now sometimes returns None rather than Some), but since you're discarding the stopped computations, this shouldn't matter.

If you're willing to use a non-core library, I think Futures would be a good match for this task. For instance:
Akka's Futures include Futures.firstCompletedOf
Twitter's Futures include Future.select
...both of which appear to enable the functionality you're looking for.

Nearest keys in a SortedMap

Given a key k in a SortedMap, how can I efficiently find the largest key m that is less than or equal to k, and also the smallest key n that is greater than or equal to k. Thank you.

Looking at the source code for 2.9.0, the following code seems about to be the best you can do
def getLessOrEqual[A,B](sm: SortedMap[A,B], bound: A): B = {
val key = sm.to(x).lastKey
sm(key)
}
I don't know exactly how the splitting of the RedBlack tree works, but I guess it's something like a O(log n) traversal of the tree/construction of new elements and then a balancing, presumable also O(log n). Then you need to go down the new tree again to get the last key. Unfortunately you can't retrieve the value in the same go. So you have to go down again to fetch the value.
In addition the lastKey might throw an exception and there is no similar method that returns an Option.
I'm waiting for corrections.
Edit and personal comment
The SortedMap area of the std lib seems to be a bit neglected. I'm also missing a mutable SortedMap. And looking through the sources, I noticed that there are some important methods missing (like the one the OP asks for or the ones pointed out in my answer) and also some have bad implementation, like 'last' which is defined by TraversableLike and goes through the complete tree from first to last to obtain the last element.
Edit 2
Now the question is reformulated my answer is not valid anymore (well it wasn't before anyway). I think you have to do the thing I'm describing twice for lessOrEqual and greaterOrEqual. Well you can take a shortcut if you find the equal element.

Scala's SortedSet trait has no method that will give you the closest element to some other element.
It is presently implemented with TreeSet, which is based on RedBlack. The RedBlack tree is not visible through methods on TreeSet, but the protected method tree is protected. Unfortunately, it is basically useless. You'd have to override methods returning TreeSet to return your subclass, but most of them are based on newSet, which is private.
So, in the end, you'd have to duplicate most of TreeSet. On the other hand, it isn't all that much code.
Once you have access to RedBlack, you'd have to implement something similar to RedBlack.Tree's lookup, so you'd have O(logn) performance. That's actually the same complexity of range, though it would certainly do less work.
Alternatively, you'd make a zipper for the tree, so that you could actually navigate through the set in constant time. It would be a lot more work, of course.

Using Scala 2.11.7, the following will give what you want:
scala> val set = SortedSet('a', 'f', 'j', 'z')
set: scala.collection.SortedSet[Char] = TreeSet(a, f, j, z)
scala> val beforeH = set.to('h').last
beforeH: Char = f
scala> val afterH = set.from('h').head
afterH: Char = j
Generally you should use lastOption and headOption as the specified elements may not exist. If you are looking to squeeze a little more efficiency out, you can try replacing from(...).head with keysIteratorFrom(...).head

Sadly, the Scala library only allows to make this type of query efficiently:
and also the smallest key n that is greater than or equal to k.
val n = TreeMap(...).keysIteratorFrom(k).next
You can hack this by keeping two structures, one with normal keys, and one with negated keys. Then you can use the other structure to make the second type of query.
val n = - TreeMap(...).keysIteratorFrom(-k).next

Looks like I should file a ticket to add 'fromIterator' and 'toIterator' methods to 'Sorted' trait.

Well, one option is certainly using java.util.TreeMap.
It has lowerKey and higherKey methods, which do excatly what you want.

I had a similar problem: I wanted to find the closest element to a given key in a SortedMap. I remember the answer to this question being, "You have to hack TreeSet," so when I had to implement it for a project, I found a way to wrap TreeSet without getting into its internals.
I didn't see jazmit's answer, which more closely answers the original poster's question with minimum fuss (two method calls). However, those method calls do more work than needed for this application (multiple tree traversals), and my solution provides lots of hooks where other users can modify it to their own needs.
Here it is:
import scala.collection.immutable.TreeSet
import scala.collection.SortedMap
// generalize the idea of an Ordering to metric sets
trait MetricOrdering[T] extends Ordering[T] {
def distance(x: T, y: T): Double
def compare(x: T, y: T) = {
val d = distance(x, y)
if (d > 0.0) 1
else if (d < 0.0) -1
else 0
}
}
class MetricSortedMap[A, B]
(elems: (A, B)*)
(implicit val ordering: MetricOrdering[A])
extends SortedMap[A, B] {
// while TreeSet searches for an element, keep track of the best it finds
// with *thread-safe* mutable state, of course
private val best = new java.lang.ThreadLocal[(Double, A, B)]
best.set((-1.0, null.asInstanceOf[A], null.asInstanceOf[B]))
private val ord = new MetricOrdering[(A, B)] {
def distance(x: (A, B), y: (A, B)) = {
val diff = ordering.distance(x._1, y._1)
val absdiff = Math.abs(diff)
// the "to" position is a key-null pair; the object of interest
// is the other one
if (absdiff < best.get._1)
(x, y) match {
// in practice, TreeSet always picks this first case, but that's
// insider knowledge
case ((to, null), (pos, obj)) =>
best.set((absdiff, pos, obj))
case ((pos, obj), (to, null)) =>
best.set((absdiff, pos, obj))
case _ =>
}
diff
}
}
// use a TreeSet as a backing (not TreeMap because we need to get
// the whole pair back when we query it)
private val treeSet = TreeSet[(A, B)](elems: _*)(ord)
// find the closest key and return:
// (distance to key, the key, its associated value)
def closest(to: A): (Double, A, B) = {
treeSet.headOption match {
case Some((pos, obj)) =>
best.set((ordering.distance(to, pos), pos, obj))
case None =>
throw new java.util.NoSuchElementException(
"SortedMap has no elements, and hence no closest element")
}
treeSet((to, null.asInstanceOf[B])) // called for side effects
best.get
}
// satisfy the contract (or throw UnsupportedOperationException)
def +[B1 >: B](kv: (A, B1)): SortedMap[A, B1] =
new MetricSortedMap[A, B](
elems :+ (kv._1, kv._2.asInstanceOf[B]): _*)
def -(key: A): SortedMap[A, B] =
new MetricSortedMap[A, B](elems.filter(_._1 != key): _*)
def get(key: A): Option[B] = treeSet.find(_._1 == key).map(_._2)
def iterator: Iterator[(A, B)] = treeSet.iterator
def rangeImpl(from: Option[A], until: Option[A]): SortedMap[A, B] =
new MetricSortedMap[A, B](treeSet.rangeImpl(
from.map((_, null.asInstanceOf[B])),
until.map((_, null.asInstanceOf[B]))).toSeq: _*)
}
// test it with A = Double
implicit val doubleOrdering =
new MetricOrdering[Double] {
def distance(x: Double, y: Double) = x - y
}
// and B = String
val stuff = new MetricSortedMap[Double, String](
3.3 -> "three",
1.1 -> "one",
5.5 -> "five",
4.4 -> "four",
2.2 -> "two")
println(stuff.iterator.toList)
println(stuff.closest(1.5))
println(stuff.closest(1000))
println(stuff.closest(-1000))
println(stuff.closest(3.3))
println(stuff.closest(3.4))
println(stuff.closest(3.2))

I've been doing:
val m = SortedMap(myMap.toSeq:_*)
val offsetMap = (m.toSeq zip m.keys.toSeq.drop(1)).map {
case ( (k,v),newKey) => (newKey,v)
}.toMap
When I want the results of my map off-set by one key. I'm also looking for a better way, preferably without storing an extra map.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Efficient groupwise aggregation on Scala collections - scala

I often need to do something like coll.groupBy(f(_)).mapValues(_.foldLeft(x)(g(_,_))) What is the best way to achieve the same effect, but avoid explicitly constructing the intermediate collections with groupBy?

Related

groupBy on List as LinkedHashMap instead of Map

loop until a condition stands in scala

Scala, a cross between a foldLeft and a map supporting lazy evaluation

Scala Parallel Collections- How to return early?

Nearest keys in a SortedMap

Categories

Resources