Sequence transformations in Scala

In Scala, is there a simple way of transforming this sequence
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
into this: Seq(("a", 4), ("b", 7), ("c", 4))?
Thanks

I'm not sure if you meant to have Strings in the second ordinate of the tuple. Assuming Seq[(String, Int)], you can use groupBy to group the elements by the first ordinate:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
.groupBy(_._1)
.mapValues(_.map(_._2).sum)
.toSeq
Otherwise, you'll need an extra .toInt.
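For completeness, here is a sketch of the String case: assuming the input is Seq[(String, String)], the .toInt goes inside the inner map before the sum.

```scala
// A sketch, assuming the pairs are Seq[(String, String)]:
// convert with .toInt inside the inner map before summing.
val strPairs = Seq(("a", "1"), ("b", "2"), ("a", "3"), ("c", "4"), ("b", "5"))
val summed = strPairs
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2.toInt).sum }
// summed == Map("a" -> 4, "b" -> 7, "c" -> 4)
```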

Here is another way: unzip each group and sum the second items of the tuples.
val sq = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
sq.groupBy(_._1)
  .transform { (k, lt) => lt.unzip._2.sum }
  .toSeq
The above code in detail:
scala> sq.groupBy(_._1)
res01: scala.collection.immutable.Map[String,Seq[(String, Int)]] = Map(b -> List((b,2), (b,5)), a -> List((a,1), (a,3)), c -> List((c,4)))
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}
res02: scala.collection.immutable.Map[String,Int] = Map(b -> 7, a -> 4, c -> 4)
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}.toSeq
res03: Seq[(String, Int)] = ArrayBuffer((b,7), (a,4), (c,4))
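On Scala 2.13+ the whole group-project-sum pipeline collapses into a single call (a sketch, assuming 2.13 or later):

```scala
// groupMapReduce groups by the key, projects out the value,
// and reduces with + in one pass.
val summed = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
  .groupMapReduce(_._1)(_._2)(_ + _)
// summed == Map("a" -> 4, "b" -> 7, "c" -> 4)
```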

Related

How to create combinations of elements in a list in Scala

I have an RDD of lists of strings, like the following:
(['a','b','c'],['1','2','3','4'],['e','f'],...)
Now I want to get a list consisting of all pairwise combinations within each inner list, like the following:
(('a','b'),('a','c'),('b','c'),('1','2'),('1','3'),('1','4'),('2','3'),('2','4'),('3','4'),('e','f'),...)
How can I do that?
You can use flatMap with List.combinations:
val rdd = sc.parallelize(Seq(List("a", "b", "c"), List("1", "2", "3", "4"), List("e", "f")))
// rdd: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[0] at parallelize at <console>:24
rdd.flatMap(list => list.combinations(2)).collect()
// res1: Array[List[String]] = Array(List(a, b), List(a, c), List(b, c), List(1, 2), List(1, 3), List(1, 4), List(2, 3), List(2, 4), List(3, 4), List(e, f))
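Since combinations comes from the plain Scala collections API, the same logic can be checked without a SparkContext (a sketch on ordinary collections):

```scala
// flatMap + combinations(2) on plain lists, no Spark required.
val lists = Seq(List("a", "b", "c"), List("1", "2", "3", "4"), List("e", "f"))
val pairs = lists.flatMap(_.combinations(2))
// pairs.size == 3 + 6 + 1 == 10; pairs.head == List("a", "b")
```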

Getting first n distinct Key Tuples in Scala Spark

I have an RDD of tuples as follows:
(a, 1), (a, 2), (b,1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b, 1) (the keys are distinct). The values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .reduceByKey((v1, v2) => v1)
  .collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
As you already have an RDD of pairs, your RDD gains extra key-value functionality via org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or in short
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .groupByKey()
  .map { case (key, iter) => (key, iter.head) }
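The "first value per key" logic can be checked without a cluster using a plain foldLeft over the same pairs (a sketch; note that on a real RDD which value counts as "first" depends on partition ordering):

```scala
// Keep the first value seen for each key, in input order.
val pairs = Seq(("a", 1), ("a", 2), ("b", 1))
val firstPerKey = pairs.foldLeft(Map.empty[String, Int]) {
  case (acc, (k, v)) => if (acc.contains(k)) acc else acc.updated(k, v)
}
// firstPerKey == Map("a" -> 1, "b" -> 1)
```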
It is easy: you just need to use collectAsMap, like below (note that it keeps only one value per key and collects everything to the driver):
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)

Scala - groupBy map values to list

I have the following input data
((A, 1, 4), (A, 2, 5), (A, 3, 6))
I would like to produce the following result
(A, (1, 2, 3), (4, 5, 6))
by grouping the input by key.
What would be the correct way to do this in Scala?
((A, 1, 4), (A, 2, 5), (A, 3, 6))
If this represents a List[(String, Int, Int)], then try the following.
val l = List(("A", 1, 4), ("A", 2, 5), ("A", 3, 6), ("B", 1, 4), ("B", 2, 5), ("B", 3, 6))
l groupBy { _._1 } map {
  case (k, v) => (k, v map {
    case (_, v1, v2) => (v1, v2)
  } unzip)
}
This will result in a Map[String,(List[Int], List[Int])], i.e., a map with string keys mapped to tuples of two lists.
Map(A -> (List(1, 2, 3), List(4, 5, 6)), B -> (List(1, 2, 3), List(4, 5, 6)))
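On Scala 2.13+ the inner pattern match can be expressed with groupMap, which groups by the first element and projects the remaining pair in one step, leaving only the unzip (a sketch, assuming 2.13 or later):

```scala
// groupMap: group by _._1, project (_._2, _._3), then unzip each group.
val l = List(("A", 1, 4), ("A", 2, 5), ("A", 3, 6), ("B", 1, 4), ("B", 2, 5), ("B", 3, 6))
val m = l.groupMap(_._1)(t => (t._2, t._3)).map { case (k, v) => k -> v.unzip }
// m("A") == (List(1, 2, 3), List(4, 5, 6))
```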
Try something like this:
def takeHeads[A](lists: List[List[A]]): (List[A], List[List[A]]) =
  (lists.map(_.head), lists.map(_.tail))

def translate[A](lists: List[List[A]]): List[List[A]] =
  if (lists.flatten.isEmpty) Nil else {
    val t = takeHeads(lists)
    t._1 :: translate(t._2)
  }

yourValue.groupBy(_.head).mapValues(v => translate(v.map(_.tail)))
This produces a Map[Any,Any] when used on your value... but it should get you going in the right direction.
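Note that translate above is essentially a transpose for ragged lists; when the inner lists all have the same length, the standard library's transpose gives the same result (a sketch):

```scala
// Built-in transpose turns row lists into column lists.
val rows = List(List(1, 4), List(2, 5), List(3, 6))
val cols = rows.transpose
// cols == List(List(1, 2, 3), List(4, 5, 6))
```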

FlatmapValues on Map

Given a Seq of tuples like:
Seq(
  ("a", Set(1,2)),
  ("a", Set(2,3)),
  ("b", Set(4,6)),
  ("b", Set(5,6))
)
I would like to groupBy and then flatMap the values to obtain something like:
Map(
  b -> Set(4, 6, 5),
  a -> Set(1, 2, 3)
)
My first implementation is:
Seq(
  ("a" -> Set(1,2)),
  ("a" -> Set(2,3)),
  ("b" -> Set(4,6)),
  ("b" -> Set(5,6))
) groupBy (_._1) mapValues (_ map (_._2)) mapValues (_.flatten.toSet)
I was wondering if there was a more efficient and maybe simpler way to achieve that result.
You were on the right track, but you can simplify a bit by using a single mapValues and combining the map and flatten:
val r = Seq(
  ("a" -> Set(1,2)),
  ("a" -> Set(2,3)),
  ("b" -> Set(4,6)),
  ("b" -> Set(5,6))
).groupBy(_._1).mapValues(_.flatMap(_._2).toSet)
I actually find this a lot more readable than the foldLeft version (but note that mapValues returns a non-strict collection, which may or may not be what you want).
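On Scala 2.13+ there is also a strict one-liner: groupMapReduce with set union as the reduce step (a sketch, assuming 2.13 or later):

```scala
// Group by key, project the Set, and merge groups with union.
val merged = Seq(
  "a" -> Set(1, 2),
  "a" -> Set(2, 3),
  "b" -> Set(4, 6),
  "b" -> Set(5, 6)
).groupMapReduce(_._1)(_._2)(_ union _)
// merged == Map("a" -> Set(1, 2, 3), "b" -> Set(4, 5, 6))
```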
I would use foldLeft; I think it's more readable, and you can avoid groupBy:
val r = Seq(
  ("a", Set(1,2)),
  ("a", Set(2,3)),
  ("b", Set(4,6)),
  ("b", Set(5,6))
).foldLeft(Map[String, Set[Int]]()) {
  case (seed, (k, v)) =>
    seed.updated(k, v ++ seed.getOrElse(k, Set[Int]()))
}
@grotrianster's answer could be refined using the Semigroup binary operation |+| of Set and Map:
import scalaz.syntax.semigroup._
import scalaz.std.map._
import scalaz.std.set._
Seq(
  ("a", Set(1,2)),
  ("a", Set(2,3)),
  ("b", Set(4,6)),
  ("b", Set(5,6))
).foldLeft(Map[String, Set[Int]]()) { case (seed, (k, v)) => seed |+| Map(k -> v) }
Using reduce instead of fold:
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).map(Map(_)).reduce(_ |+| _)
Treating Set and Map as Monoids:
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).map(Map(_)).toList.suml

Scala Aggregate Projection

How can I project / transform the following
List(("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", 2.0))
to
List(("A", 4.0),("B", 4.0))
So I aggregate by the string and sum the doubles?
val x = List(("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", 2.0))
val y = x.groupBy(_._1).map { case (a,bs) => a -> bs.map(_._2).sum }
y.toList // List((A,4.0), (B,4.0))
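A single-pass alternative with foldLeft avoids building the intermediate groups that groupBy creates (a sketch):

```scala
// Accumulate the running sum per key directly in a Map.
val x = List(("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", 2.0))
val sums = x.foldLeft(Map.empty[String, Double]) {
  case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v)
}
// sums == Map("A" -> 4.0, "B" -> 4.0)
```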