How to sum a list of tuples by keys - Scala

I wrote the following code to sum by keys:
// ("A", 1) ("A", 4)
// ("B", 2) --> ("B", 2)
// ("A", 3)
def sumByKeys[A](tuples: List[(A, Long)]): List[(A, Long)] = {
  tuples.groupBy(_._1).mapValues(_.map(_._2).sum).toList
}
Is there a better way?
Update: added .toList at the end.

I guess this is the simplest immutable form without using any additional framework on top of Scala.
UPD: I actually forgot about the final toList. It makes for a totally different performance picture, because of the mapValues view return type.
You can try foldLeft, tailrec, or something mutable; they all have better performance:
import annotation.tailrec

@tailrec
final def tailSum[A](tuples: List[(A, Long)], acc: Map[A, Long] = Map.empty[A, Long]): List[(A, Long)] = tuples match {
  case (k, v) :: tail => tailSum(tail, acc + (k -> (v + acc.get(k).getOrElse(0L))))
  case Nil => acc.toList
}
def foldLeftSum[A](tuples: List[(A, Long)]) = tuples.foldLeft(Map.empty[A, Long]) {
  case (acc, (k, v)) => acc + (k -> (v + acc.get(k).getOrElse(0L)))
}.toList
def mutableSum[A](tuples: List[(A, Long)]) = {
  val m = scala.collection.mutable.Map.empty[A, Long].withDefault(_ => 0L)
  for ((k, v) <- tuples) m += (k -> (v + m(k)))
  m.toList
}
Updated performance testing is here: https://gist.github.com/baskakov/8437895. Briefly:
scala> avgTime("default", sumByKeys(tuples))
default avg time is 63 ms
scala> avgTime("tailrec", tailSum(tuples))
tailrec avg time is 48 ms
scala> avgTime("foldleft", foldLeftSum(tuples))
foldleft avg time is 45 ms
scala> avgTime("mutableSum", mutableSum(tuples))
mutableSum avg time is 41 ms
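The avgTime helper is defined in the linked gist; a minimal sketch of what such a helper might look like (the real one may differ):

def avgTime[T](label: String, body: => T, runs: Int = 10): Unit = {
  body // one warm-up run so JIT compilation skews the numbers less
  val start = System.nanoTime()
  for (_ <- 1 to runs) body
  val avgMs = (System.nanoTime() - start) / runs / 1000000
  println(s"$label avg time is $avgMs ms")
}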

The best I can think of gets you slightly better performance and saves two characters:
def sumByKeys[A](tuples: List[(A, Long)]): List[(A, Long)] = {
  tuples.groupBy(_._1).mapValues(_.unzip._2.sum).toList
}
On my machine with Bask.ws' benchmark it took 11ms instead of 13ms without the unzip.
EDIT: In fact I think the performance must be the same... I don't know where those 2 ms come from.

A solution very similar to yours:
def sumByKeys[A](tuples: List[(A, Long)]): List[(A, Long)] =
  tuples groupBy (_._1) map { case (k, v) => (k, v.map(_._2).sum) } toList
val l: List[(String, Long)] = List(("A", 1), ("B", 2), ("A", 3))
sumByKeys(l)
// result:
// List[(String, Long)] = List((A,4), (B,2))
What's interesting is that your solution uses def mapValues[C](f: (B) ⇒ C): Map[A, C], which according to the docs has "lazy" evaluation: "Transforms this map by applying a function to every retrieved value."
On the other hand, def map[B](f: (A) ⇒ B): Map[B] builds a new collection: "Builds a new collection by applying a function to all elements of this immutable map."
So depending on your needs you could be lazily evaluating a large map, or eagerly evaluating a small one.
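The difference is easy to observe with a side effect (a small illustrative sketch, assuming pre-2.13 semantics where mapValues returns a lazy view):

val m = Map("A" -> 1L, "B" -> 2L)

val viaView = m.mapValues { v => println(s"computing $v"); v * 2 }
viaView("A") // prints "computing 1"
viaView("A") // prints "computing 1" again: recomputed on every access

val eager = m.map { case (k, v) => k -> { println(s"computing $v"); v * 2 } }
// prints "computing 1" and "computing 2" once, at construction
eager("A") // no further printing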

Using reduce,
def sumByKeys[A](tuples: List[(A, Long)]): List[(A, Long)] = {
  tuples groupBy (_._1) map { _._2 reduce { (a, b) => (a._1, a._2 + b._2) } } toList
}
which is short for
def sumByKeys[A](tuples: List[(A, Long)]): List[(A, Long)] = {
  tuples groupBy (_._1) map { case (k, v) => v reduce { (a, b) => (a._1, a._2 + b._2) } } toList
}

Related

How to improve this "update" function?

Suppose I've got case class A(x: Int, s: String) and need to update a List[A] using a Map[Int, String], like this:
def update(as: List[A], map: Map[Int, String]): List[A] = ???
val as = List(A(1, "a"), A(2, "b"), A(3, "c"), A(4, "d"))
val map = Map(2 -> "b1", 4 -> "d1", 5 -> "e", 6 -> "f")
update(as, map) // List(A(1, "a"), A(2, "b1"), A(3, "c"), A(4, "d1"))
I wrote update like this:
def update(as: List[A], map: Map[Int, String]): List[A] = {
  @annotation.tailrec
  def loop(acc: List[A], rest: List[A], map: Map[Int, String]): List[A] = rest match {
    case Nil => acc
    case as => as.span(a => !map.contains(a.x)) match {
      case (xs, Nil) => xs ++ acc
      case (xs, y :: ys) => loop((y.copy(s = map(y.x)) +: xs) ++ acc, ys, map - y.x)
    }
  }
  loop(Nil, as, map).reverse
}
This function works fine, but it's suboptimal because it continues iterating over the input list even when the map is empty. Besides, it looks overcomplicated. How would you suggest improving this update function?
If you cannot make any assumptions about the List and the Map, then the best option is to just iterate the former once, in the simplest way possible; that is, using the map function.
list.map { a =>
  map
    .get(key = a.x)
    .fold(ifEmpty = a) { s =>
      a.copy(s = s)
    }
}
However, if and only if you can be sure that most of the time:
The List will be big.
The Map will be small.
The keys in the Map are a subset of the values in the List.
And all operations will be closer to the head of the List rather than the tail.
Then, you could use the following approach which should be more efficient in such cases.
def optimizedUpdate(data: List[A], updates: Map[Int, String]): List[A] = {
  @annotation.tailrec
  def loop(remaining: List[A], map: Map[Int, String], acc: List[A]): List[A] =
    if (map.isEmpty) acc reverse_::: remaining
    else remaining match {
      case a :: as =>
        map.get(key = a.x) match {
          case None =>
            loop(remaining = as, map = map, acc = a :: acc)
          case Some(s) =>
            loop(remaining = as, map = map - a.x, acc = a.copy(s = s) :: acc)
        }
      case Nil =>
        acc.reverse
    }
  loop(remaining = data, map = updates, acc = List.empty)
}
Note, however, that this code is not only longer and more difficult to understand; it is actually less efficient than the map solution (if the conditions are not met). This is because the stdlib implementation "cheats" and constructs the List by mutating its tail, instead of building it backwards and then reversing it as we did.
In any case, as with anything performance-related, the only real answer is to benchmark.
But I would go with the map solution just for clarity, or with a mutable approach if you really need speed.
You can see the code running here.
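To get a feel for that difference, a rough timing sketch along these lines could be used (the time helper here is hypothetical, and micro-benchmark numbers like these are only illustrative):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val xs = List.fill(1000000)(1)
time("stdlib map")(xs.map(_ + 1))
time("build backwards, then reverse") {
  xs.foldLeft(List.empty[Int])((acc, x) => (x + 1) :: acc).reverse
}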
How about
def update(as: List[A], map: Map[Int, String]): List[A] =
  as.foldLeft(List.empty[A]) { (agg, elem) =>
    val newA = map
      .get(elem.x)
      .map(a => elem.copy(s = a))
      .getOrElse(elem)
    newA :: agg
  }.reverse
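A quick check with the data from the question (this is the output I'd expect; worth confirming in a REPL):

val as = List(A(1, "a"), A(2, "b"), A(3, "c"), A(4, "d"))
val map = Map(2 -> "b1", 4 -> "d1", 5 -> "e", 6 -> "f")
update(as, map)
// List(A(1,a), A(2,b1), A(3,c), A(4,d1))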

Reduce RDD[Map[T, V]] by merging maps

I have an RDD of maps, where the maps are certain to have intersecting key sets. Each map may have 10,000s of entries.
I need to merge the maps, such that those with intersecting key sets are merged, but others are left distinct.
Here's what I have. I haven't tested that it works, but I know that it's slow.
def mergeOverlapping(maps: RDD[Map[Int, Int]])(implicit sc: SparkContext): RDD[Map[Int, Int]] = {
  val in: RDD[List[Map[Int, Int]]] = maps.map(List(_))
  val z = List.empty[Map[Int, Int]]
  val t: List[Map[Int, Int]] = in.fold(z) { case (l, r) =>
    (l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
      val (overlapping, distinct) = acc.partition(_.keys.exists(next.contains))
      overlapping match {
        case Nil => next :: acc
        case xs => (next :: xs).reduceLeft(merge) :: distinct
      }
    }
  }
  sc.parallelize(t)
}
def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] = {
  val keys = l.keySet ++ r.keySet
  keys.map { k =>
    (l.get(k), r.get(k)) match {
      case (Some(i), Some(j)) => k -> math.min(i, j)
      case (a, b) => k -> (a orElse b).get
    }
  }.toMap
}
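To illustrate the intended merge semantics (a key present in both maps keeps the smaller value; a key present in only one map is kept as-is), a small example:

merge(Map(1 -> 2, 3 -> 4), Map(1 -> 5, 7 -> 8))
// Map(1 -> 2, 3 -> 4, 7 -> 8)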
The problem, as far as I can tell, is that RDD#fold is merging and re-merging maps many more times than it has to.
Is there a more efficient mechanism that I could use? Is there another way I can structure my data to make it efficient?

Transform a Scala Sequence into Pairs

I have a sequence like this:
val l = Seq(1,2,3,4)
which I want to transform to List(Seq(1,2), Seq(2,3), Seq(3,4))
Here is what I tried:
def getPairs(inter: Seq[(Int, Int)]): Seq[(Int, Int)] = l match {
  case Nil => inter
  case x :: xs => getPairs(inter :+ (x, xs.head))
}
Strangely, this seems not to work. Any suggestions?
You can also just use sliding:
l.sliding(2).toList
res1: List[Seq[Int]] = List(List(1, 2), List(2, 3), List(3, 4))
OK, I got to know about the zip method:
xs zip xs.tail
Using a for comprehension, for instance as follows,
for ( (a,b) <- l zip l.drop(1) ) yield Seq(a,b)
Note l.drop(1) (in contrast to l.tail) will deliver an empty list if l is empty or has at most one item.
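A quick illustration of that edge case:

Seq.empty[Int].drop(1) // Seq() -- safe
Seq(1).drop(1)         // Seq()
// Seq.empty[Int].tail -- throws UnsupportedOperationException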
The already given answers describe well how to do this in a Scala way. However, you might also want an explanation of why your code does not work, so here it comes:
Your getPairs function expects a list of tuples as input and returns a list of tuples. But you say you want to transform a list of single values into a list of tuples. So if you call getPairs(l) you will get a type mismatch compiler error.
You would have to refactor your code to take a simple list:
def pairs(in: Seq[Int]): Seq[(Int, Int)] = {
  @annotation.tailrec
  def recursive(remaining: Seq[Int], result: Seq[(Int, Int)]): Seq[(Int, Int)] = {
    remaining match {
      case Nil => result
      case last +: Nil => result
      case head +: next +: tail => recursive(next +: tail, (head, next) +: result)
    }
  }
  recursive(in, Nil).reverse
}
and from here it's a small step to a generic function:
def pairs2[A](in: Seq[A]): Seq[(A, A)] = {
  @annotation.tailrec
  def recursive(remaining: Seq[A], result: Seq[(A, A)]): Seq[(A, A)] = {
    remaining match {
      case Nil => result
      case last +: Nil => result
      case head +: next +: tail => recursive(next +: tail, (head, next) +: result)
    }
  }
  recursive(in, Nil).reverse
}
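A quick usage check:

pairs2(Seq(1, 2, 3, 4))    // Seq((1,2), (2,3), (3,4))
pairs2(Seq("a", "b", "c")) // Seq((a,b), (b,c))
pairs2(Seq.empty[Int])     // Seq()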

With scala, can I unapply a tuple and then run a map over it?

I have some financial data gathered in a List[(Int, Double)], like this:
val snp = List((2001, -13.0), (2002, -23.4))
With this, I wrote a formula that would transform the list, through map, into another list (to demonstrate investment grade life insurance), where losses below 0 are converted to 0, and gains above 15 are converted to 15, like this:
case class EiulLimits(lower: Double, upper: Double)

def eiul(xs: Seq[(Int, Double)], limits: EiulLimits): Seq[(Int, Double)] = {
  xs.map(item => (item._1,
    if (item._2 < limits.lower) limits.lower
    else if (item._2 > limits.upper) limits.upper
    else item._2))
}
Is there any way to extract the tuple's values inside this, so I don't have to use the clunky _1 and _2 notation?
List((1,2),(3,4)).map { case (a,b) => ... }
The case keyword invokes the pattern matching/unapply logic.
Note the use of curly braces instead of parens after map
And a slower but shorter quick rewrite of your code:
case class EiulLimits(lower: Double, upper: Double) {
  // Clamps x into [lower, upper]: the middle element of the
  // sorted three-element list is the clamped value
  def apply(x: Double) = List(x, lower, upper).sorted.apply(1)
}

def eiul(xs: Seq[(Int, Double)], limits: EiulLimits) = {
  xs.map { case (a, b) => (a, limits(b)) }
}
Usage:
scala> eiul(List((1, 1.), (3, 3.), (4, 4.), (7, 7.), (9, 9.)), EiulLimits(3., 7.))
res7: Seq[(Int, Double)] = List((1,3.0), (3,3.0), (4,4.0), (7,7.0), (9,7.0))
scala> val snp = List((2001, -13.0), (2002, -23.4))
snp: List[(Int, Double)] = List((2001,-13.0), (2002,-23.4))
scala> snp.map {case (_, x) => x}
res2: List[Double] = List(-13.0, -23.4)
scala> snp.map {case (x, _) => x}
res3: List[Int] = List(2001, 2002)

In Scala, how to use Ordering[T] with List.min or List.max and keep code readable

In Scala 2.8, I had a need to call List.min and provide my own compare function to get the value based on the second element of a Tuple2. I had to write this kind of code:
val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min(new Ordering[Tuple2[String, Int]] {
  def compare(x: Tuple2[String, Int], y: Tuple2[String, Int]): Int = x._2 compare y._2
})
Is there a way to make this more readable or to create an Ordering out of an anonymous function like you can do with list.sortBy(_._2)?
In Scala 2.9, you can do list minBy { _._2 }.
C'mon guys, you made the poor questioner find "on" himself. Pretty shabby performance. You could shave a little further writing it like this:
list min Ordering[Int].on[(_,Int)](_._2)
Which is still far too noisy but that's where we are at the moment.
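As an aside, the standard library's Ordering.by (available since 2.8) cuts some of that noise; same idea as on, with the ordering inferred from the mapped type:

list min Ordering.by[(String, Int), Int](_._2) // (c,2)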
One thing you can do is use the more concise standard tuple type syntax instead of using Tuple2:
val min = list.min(new Ordering[(String, Int)] {
  def compare(x: (String, Int), y: (String, Int)): Int = x._2 compare y._2
})
Or use reduceLeft to have a more concise solution altogether:
val min = list.reduceLeft((a, b) => (if (a._2 < b._2) a else b))
Or you could sort the list by your criterion and take the first element (or the last for the max):
val min = list.sortWith((a, b) => a._2 < b._2).head
Which can be further shortened using the placeholder syntax:
val min = list.sortWith(_._2 < _._2).head
Which, as you wrote yourself, can be shortened to:
val min = list.sortBy(_._2).head
But as you suggested sortBy yourself, I'm not sure if you are looking for something different here.
The function Ordering#on witnesses the fact that Ordering is a contravariant functor. Others include Comparator, Function1, Comparable, and scalaz.Equal.
Scalaz provides a unified view of these types, so for any of them you can adapt the input with value contramap f, or with the symbolic notation value ∙ f:
scala> import scalaz._
import scalaz._
scala> import Scalaz._
import Scalaz._
scala> val ordering = implicitly[scala.Ordering[Int]] ∙ {x: (_, Int) => x._2}
ordering: scala.math.Ordering[Tuple2[_, Int]] = scala.math.Ordering$$anon$2#34df289d
scala> List(("1", 1), ("2", 2)) min ordering
res2: (java.lang.String, Int) = (1,1)
Here's the conversion from the Ordering[Int] to Ordering[(_, Int)] in more detail:
scala> scalaz.Scalaz.maContravariantImplicit[Ordering, Int](Ordering.Int).contramap { x: (_, Int) => x._2 }
res8: scala.math.Ordering[Tuple2[_, Int]] = scala.math.Ordering$$anon$2#4fa666bf
list.min(Ordering.fromLessThan[(String, Int)](_._2 < _._2))
Which is still too verbose, of course. I'd probably declare it as a val or object.
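For instance, something like this (the name bySecond is mine, purely illustrative):

val bySecond: Ordering[(String, Int)] = Ordering.fromLessThan(_._2 < _._2)

list.min(bySecond) // (c,2)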
You could always define your own implicit conversion:
implicit def funToOrdering[T, R <% Ordered[R]](f: T => R) = new Ordering[T] {
  def compare(x: T, y: T) = f(x) compare f(y)
}

val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min { t: (String, Int) => t._2 } // (c,2)
EDIT: Per @Dario's comments.
It might be more readable if the conversion weren't implicit, but instead used an "on" function:
def on[T, R <% Ordered[R]](f: T => R) = new Ordering[T] {
  def compare(x: T, y: T) = f(x) compare f(y)
}

val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min(on { t: (String, Int) => t._2 }) // (c,2)