Scala: combining multiple sequences

I have a couple of lists:
val aa = Seq(1,2,3,4)
val bb = Seq(Seq(2.0,3.0,4.0,5.0), Seq(1.0,2.0,3.0,4.0))
val cc = Seq("a", "B")
And want to combine them in the desired format of:
(1, 2.0, a), (2, 3.0, a), (3, 4.0, a), (4, 5.0, a), (1, 1.0, B), (2, 2.0, B), (3, 3.0, B), (4, 4.0, B)
but my combination of zip and flatMap
(aa, bb,cc).zipped.flatMap{
case (a, b,c) => {
b.map(b1 => (a,b1,c))
}
}
is only producing
(1,2.0,a), (1,3.0,a), (1,4.0,a), (1,5.0,a), (2,1.0,B), (2,2.0,B), (2,3.0,B), (2,4.0,B)
In Java I would just iterate over bb and then, in a nested loop, iterate over the values.
What do I need to change to get the data in the desired format using neat functional scala?

How about this:
for {
(bs, c) <- bb zip cc
(a, b) <- aa zip bs
} yield (a, b, c)
Produces:
List(
(1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a),
(1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B)
)
I doubt this could be made any more neat & functional.
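For reference, the for-comprehension above is roughly what the compiler rewrites into nested flatMap/map calls. A minimal sketch of that desugared form, reusing aa, bb and cc from the question (the names ZipCombine and combined are made up for illustration):

```scala
object ZipCombine {
  val aa = Seq(1, 2, 3, 4)
  val bb = Seq(Seq(2.0, 3.0, 4.0, 5.0), Seq(1.0, 2.0, 3.0, 4.0))
  val cc = Seq("a", "B")

  // Outer step: pair each inner sequence of bb with its label from cc.
  // Inner step: pair aa element-wise with that inner sequence.
  val combined: Seq[(Int, Double, String)] =
    bb.zip(cc).flatMap { case (bs, c) =>
      aa.zip(bs).map { case (a, b) => (a, b, c) }
    }
}
```

The key difference from the zipped attempt in the question is that aa is zipped with each inner sequence of bb, not with bb itself.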

Not exactly pretty to read but here is an option:
bb
.map(b => aa.zip(b)) // List(List((1,2.0), (2,3.0), (3,4.0), (4,5.0)), List((1,1.0), (2,2.0), (3,3.0), (4,4.0)))
.zip(cc) // List((List((1,2.0), (2,3.0), (3,4.0), (4,5.0)),a), (List((1,1.0), (2,2.0), (3,3.0), (4,4.0)),B))
.flatMap{ case (l, c) => l.map(t => (t._1, t._2, c)) } // List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))

Another approach using collect and map
scala> val result = bb.zip(cc).collect{
case bc => (aa.zip(bc._1).map(e => (e._1,e._2, bc._2)))
}.flatten
result: Seq[(Int, Double, String)] = List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))

Related

How to convert RDD[Array[String]] to RDD[(Int, HashMap[String, List])]?

I have input data:
time, id, counter, value
00.2, 1 , c1 , 0.2
00.2, 1 , c2 , 0.3
00.2, 1 , c1 , 0.1
and I want for every id to create a structure to store counters and values. After thinking about vectors and rejecting them, I came to this:
(id, Hashmap( (counter1, List(Values)), (Counter2, List(Values)) ))
(1, HashMap( (c1,List(0.2, 0.1)), (c2,List(0.3)) ))
The problem is that I can't convert to a HashMap inside the map transformation, and additionally I don't know whether I will be able to reduce the list by counter inside the map.
Does anyone have any idea?
My code is :
val data = inputRdd
.map(y => (y(1).toInt, mutable.HashMap(y(2), List(y(3).toDouble)))).reduceByKey(_++_)
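Off the top of my head, untested:
import scala.collection.mutable.HashMap
inputRdd
.map{ case Array(t, id, c, v) => (id.toInt, (c, v)) }
.aggregateByKey(HashMap.empty[String, List[String]])(
// getOrElse avoids a NoSuchElementException the first time a counter is seen
{ case (m, (c, v)) => { m(c) = v :: m.getOrElse(c, Nil); m } },
// merge whole lists (:::) when combining the maps of two partitions
{ case (m1, m2) => { for ((k, vs) <- m2) m1(k) = vs ::: m1.getOrElse(k, Nil); m1 } }
)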
Here's one approach:
val rdd = sc.parallelize(Seq(
("00.2", 1, "c1", 0.2),
("00.2", 1, "c2", 0.3),
("00.2", 1, "c1", 0.1)
))
rdd.
map{ case (t, i, c, v) => (i, (c, v)) }.
groupByKey.mapValues(
_.groupBy(_._1).mapValues(_.map(_._2)).map(identity)
).
collect
// res1: Array[(Int, scala.collection.immutable.Map[String,Iterable[Double]])] = Array(
// (1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
// )
Note that the final map(identity) is a remedy for the Map#mapValues not serializable problem suggested in this SO answer.
If, as you have mentioned, you have inputRdd as
//inputRdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at ....
then a simple groupBy followed by a foldLeft on the grouped values should give you the desired result:
val resultRdd = inputRdd.groupBy(_(1))
.mapValues(x => x
.foldLeft(Map.empty[String, List[String]]){(a, b) => {
if(a.keySet.contains(b(2))){
val c = a ++ Map(b(2) -> (a(b(2)) ++ List(b(3))))
c
}
else{
val c = a ++ Map(b(2) -> List(b(3)))
c
}
}}
)
//resultRdd: org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Map[String,List[String]])] = MapPartitionsRDD[3] at mapValues at ...
//(1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
Changing RDD[(String, scala.collection.immutable.Map[String,List[String]])] to RDD[(Int, HashMap[String,List[String]])] is then just a type conversion (for example, calling toInt on the key), which I will leave to you.
I hope the answer is helpful
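The grouping logic can also be tried without a Spark cluster, on plain Scala collections. A minimal sketch, assuming rows shaped like the question's input; the names GroupCounters, rows and byIdAndCounter are made up for illustration, and this is a local stand-in, not the RDD code itself:

```scala
object GroupCounters {
  // Rows as Array(time, id, counter, value), mirroring the question's input.
  val rows: Seq[Array[String]] = Seq(
    Array("00.2", "1", "c1", "0.2"),
    Array("00.2", "1", "c2", "0.3"),
    Array("00.2", "1", "c1", "0.1")
  )

  // Group by id, then by counter, converting the values to Double.
  val byIdAndCounter: Map[Int, Map[String, List[Double]]] =
    rows.groupBy(_(1).toInt).map { case (id, rs) =>
      id -> rs.groupBy(_(2)).map { case (counter, cs) =>
        counter -> cs.map(_(3).toDouble).toList
      }
    }
}
```

Here the id is converted to Int and the values to Double while grouping, so no separate conversion step is needed afterwards.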

Scala: How to merge lists by the first element of the tuple

Let's say I have a list:
[(A, a), (A, b), (A, c), (B, a), (B, d)]
How do I make that list into:
[(A, [a,b,c]), (B, [a,d])]
with a single function?
Thanks
The groupBy function allows you to achieve this:
scala> val list = List((1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'd'))
list: List[(Int, Char)] = List((1,a), (1,b), (1,c), (2,a), (2,d))
scala> list.groupBy(_._1) // grouping by the first item in the tuple
res0: scala.collection.immutable.Map[Int,List[(Int, Char)]] = Map(2 -> List((2,a), (2,d)), 1 -> List((1,a), (1,b), (1,c)))
Just doing groupBy won't give you the format you want, so I suggest you write a custom method for this.
def groupTuples[A,B](seq: Seq[(A,B)]): List[(A, List[B])] = {
seq.
groupBy(_._1).
mapValues(_.map(_._2).toList).toList
}
Then invoke it to get the desired result.
val t = Seq((1,"I"),(1,"AM"),(1, "Koby"),(2,"UP"),(2,"UP"),(2,"AND"),(2,"AWAY"))
groupTuples[Int, String](t)
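One caveat: groupBy returns a Map, and the iteration order of a Map is not specified, so the groups may come back in a different order than the keys appear in the input. If that matters, a variant along these lines keeps the keys in order of first appearance (a sketch only; GroupTuplesOrdered and groupTuplesOrdered are hypothetical names):

```scala
object GroupTuplesOrdered {
  // Same grouping as groupTuples, but the result follows the order in
  // which each key first appears in the input sequence.
  def groupTuplesOrdered[A, B](seq: Seq[(A, B)]): List[(A, List[B])] = {
    val grouped = seq.groupBy(_._1)
    seq.map(_._1).distinct.toList.map { k =>
      k -> grouped(k).map(_._2).toList
    }
  }
}
```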

scala merge option sequences

Want to merge val A = Option(Seq(1,2)) and val B = Option(Seq(3,4)) to yield a new option sequence
val C = Option(Seq(1,2,3,4))
This
val C = Option(A.getOrElse(Nil) ++ B.getOrElse(Nil)),
seems faster and more idiomatic than
val C = Option(A.toList.flatten ++ B.toList.flatten)
But is there a better way? And am I right that getOrElse is faster and lighter than toList.flatten?
What about a neat for comprehension:
val Empty = Some(Nil)
val C = for {
a <- A orElse Empty
b <- B orElse Empty
} yield a ++ b
Creates fewer intermediate options.
Or, you could just do a somewhat cumbersome pattern matching:
(A, B) match {
case (None, None) => None
case (None, sb @ Some(b)) => sb
case (sa @ Some(a), None) => sa
case (Some(a), Some(b)) => Some(a ++ b)
}
I think this at least creates fewer intermediate collections than the double flatten.
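The pattern-matching approach can be wrapped in a small generic helper. This is only a sketch (MergeOptions and merge are made-up names), and it assumes that two empty options should merge to None. The getOrElse version from the question behaves differently in that case and yields Some(Nil).

```scala
object MergeOptions {
  // Concatenate when both are defined, otherwise keep whichever is defined.
  def merge[T](a: Option[Seq[T]], b: Option[Seq[T]]): Option[Seq[T]] =
    (a, b) match {
      case (Some(x), Some(y)) => Some(x ++ y)
      case (None, None)       => None
      case _                  => a.orElse(b) // exactly one side is defined
    }
}
```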
Your first case:
// In this case getOrElse is not needed as the option is clearly not `None`.
// So, you can replace the following:
val C = Option(A.getOrElse(Nil) ++ B.getOrElse(Nil))
// By this:
val C = Option(A.get ++ B.get) // A simple concatenation of two sequences; note that get throws NoSuchElementException if either option is None.
C: Option[Seq[Int]] = Some(List(1, 2, 3, 4))
Your second case/option is wrong for multiple reasons.
val C = Option(A.toList.flatten ++ B.toList.flatten)
C: Option[List[Int]] = Some(List(1, 2, 3, 4))
It returns the incorrect type Option[List[Int]] instead of Option[Seq[Int]]
It needlessly invokes toList on A & B. You could simply add the options and invoke flatten on them.
It is not DRY and redundantly calls flatten on both A.toList & B.toList whereas it could call flatten on (A ++ B)
Instead of this, you could do this more efficiently:
val E = Option((A ++ B).flatten.toSeq)
E: Option[Seq[Int]] = Some(List(1, 2, 3, 4))
Using foldLeft
Seq(Some(List(1, 2)), None).foldLeft(List.empty[Int])(_ ++ _.getOrElse(List.empty[Int]))
result: List[Int] = List(1, 2)
Using flatten twice
Seq(Some(Seq(1, 2, 3)), Some(Seq(4, 5, 6)), None).flatten.flatten
result: Seq[Int] = List(1, 2, 3, 4, 5, 6)
Scala REPL
scala> val a = Some(Seq(1, 2, 3))
a: Some[Seq[Int]] = Some(List(1, 2, 3))
scala> val b = Some(Seq(4, 5, 6))
b: Some[Seq[Int]] = Some(List(4, 5, 6))
scala> val c = None
c: None.type = None
scala> val d = Seq(a, b, c).flatten.flatten
d: Seq[Int] = List(1, 2, 3, 4, 5, 6)

Cartesian product of two lists

Given a map where a digit is associated to several characters
scala> val conversion = Map("0" -> List("A", "B"), "1" -> List("C", "D"))
conversion: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] =
Map(0 -> List(A, B), 1 -> List(C, D))
I want to generate all possible character sequences based on a sequence of digits. Examples:
"00" -> List("AA", "AB", "BA", "BB")
"01" -> List("AC", "AD", "BC", "BD")
I can do this with for comprehensions
scala> val number = "011"
number: java.lang.String = 011
Create a sequence of possible characters per index
scala> val values = number map { case c => conversion(c.toString) }
values: scala.collection.immutable.IndexedSeq[List[java.lang.String]] =
Vector(List(A, B), List(C, D), List(C, D))
Generate all the possible character sequences
scala> for {
| a <- values(0)
| b <- values(1)
| c <- values(2)
| } yield a+b+c
res13: List[java.lang.String] = List(ACC, ACD, ADC, ADD, BCC, BCD, BDC, BDD)
Here things get ugly and it will only work for sequences of three digits. Is there any way to achieve the same result for any sequence length?
The following suggestion does not use one for-comprehension generator per digit because, as you noticed, that would tie you to a fixed length of the cartesian product. Instead it recurses over the list of lists:
scala> def cartesianProduct[T](xss: List[List[T]]): List[List[T]] = xss match {
| case Nil => List(Nil)
| case h :: t => for(xh <- h; xt <- cartesianProduct(t)) yield xh :: xt
| }
cartesianProduct: [T](xss: List[List[T]])List[List[T]]
scala> val conversion = Map('0' -> List("A", "B"), '1' -> List("C", "D"))
conversion: scala.collection.immutable.Map[Char,List[java.lang.String]] = Map(0 -> List(A, B), 1 -> List(C, D))
scala> cartesianProduct("01".map(conversion).toList)
res9: List[List[java.lang.String]] = List(List(A, C), List(A, D), List(B, C), List(B, D))
Why not tail-recursive?
Note that the recursive function above is not tail-recursive. This isn't a problem in practice: the size of the result grows exponentially with the number of non-singleton lists in xss, so xss will be short unless it contains a lot of singleton lists.
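If the explicit recursion is a concern, the same product can be written as a fold. A minimal sketch, assuming the Char-keyed conversion map from this answer; the object name Cartesian and the helper expand are made up for illustration:

```scala
object Cartesian {
  val conversion: Map[Char, List[String]] =
    Map('0' -> List("A", "B"), '1' -> List("C", "D"))

  // Cartesian product as a right fold: prepend each choice of the current
  // list to every partial combination built from the lists to its right.
  def cartesianProduct[T](xss: List[List[T]]): List[List[T]] =
    xss.foldRight(List(List.empty[T])) { (xs, acc) =>
      for (x <- xs; rest <- acc) yield x :: rest
    }

  // All character sequences for a digit string of any length.
  def expand(number: String): List[String] =
    cartesianProduct(number.toList.map(conversion)).map(_.mkString)
}
```

Calling Cartesian.expand("011") yields the same eight strings as the three-generator for-comprehension in the question.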
I could come up with this:
val conversion = Map('0' -> Seq("A", "B"), '1' -> Seq("C", "D"))
def permut(str: Seq[Char]): Seq[String] = str match {
case Seq() => Seq.empty
case Seq(c) => conversion(c)
case Seq(head, tail @ _*) =>
val t = permut(tail)
conversion(head).flatMap(pre => t.map(pre + _))
}
permut("011")
I just did that as follows and it works
def cross(a:IndexedSeq[Tree], b:IndexedSeq[Tree]) = {
a.map (p => b.map( o => (p,o))).flatten
}
Ignore the Tree type I am dealing with; it works for arbitrary collections too.
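A generic version of the same idea might look like this; it is a sketch in which the type parameters A and B stand in for Tree, and the object name Cross is made up:

```scala
object Cross {
  // Pair every element of as with every element of bs.
  def cross[A, B](as: Seq[A], bs: Seq[B]): Seq[(A, B)] =
    for (a <- as; b <- bs) yield (a, b)
}
```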

Reverse / transpose a one-to-many map in Scala

What is the best way to turn a Map[A, Set[B]] into a Map[B, Set[A]]?
For example, how do I turn a
Map(1 -> Set("a", "b"),
2 -> Set("b", "c"),
3 -> Set("c", "d"))
into a
Map("a" -> Set(1),
"b" -> Set(1, 2),
"c" -> Set(2, 3),
"d" -> Set(3))
(I'm using immutable collections only here. And my real problem has nothing to do with strings or integers. :)
With help from aioobe and Moritz:
def reverse[A, B](m: Map[A, Set[B]]) =
m.values.toSet.flatten.map(v => (v, m.keys.filter(m(_)(v)))).toMap
It's a bit more readable if you explicitly call contains:
def reverse[A, B](m: Map[A, Set[B]]) =
m.values.toSet.flatten.map(v => (v, m.keys.filter(m(_).contains(v)))).toMap
Best I've come up with so far is
val intToStrs = Map(1 -> Set("a", "b"),
2 -> Set("b", "c"),
3 -> Set("c", "d"))
def mappingFor(key: String) =
intToStrs.keys.filter(intToStrs(_) contains key).toSet
val newKeys = intToStrs.values.flatten
val inverseMap = newKeys.map(newKey => (newKey -> mappingFor(newKey))).toMap
Or another one using folds:
def reverse2[A,B](m:Map[A,Set[B]])=
m.foldLeft(Map[B,Set[A]]()){case (r,(k,s)) =>
s.foldLeft(r){case (r,e)=>
r + (e -> (r.getOrElse(e, Set()) + k))
}
}
Here's a one statement solution
originalMap
.map{case (k, v)=>v.map{v2=>(v2,k)}}
.flatten
.groupBy{_._1}
.transform {(k, v)=>v.unzip._2.toSet}
This bit rather neatly (*) produces the tuples needed to construct the reverse map
Map(1 -> Set("a", "b"),
2 -> Set("b", "c"),
3 -> Set("c", "d"))
.map{case (k, v)=>v.map{v2=>(v2,k)}}.flatten
produces
List((a,1), (b,1), (b,2), (c,2), (c,3), (d,3))
Converting it directly to a map overwrites the values corresponding to duplicate keys though
Adding .groupBy{_._1} gets this
Map(c -> List((c,2), (c,3)),
a -> List((a,1)),
d -> List((d,3)),
b -> List((b,1), (b,2)))
which is closer. To turn those lists into Sets of the second half of the pairs, adding
.transform {(k, v)=>v.unzip._2.toSet}
gives
Map(c -> Set(2, 3), a -> Set(1), d -> Set(3), b -> Set(1, 2))
QED :)
(*) YMMV
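The steps above can be wrapped into a generic function. A minimal sketch (ReverseMap is a made-up name) that replaces transform with a plain map over the grouped entries:

```scala
object ReverseMap {
  def reverse[A, B](m: Map[A, Set[B]]): Map[B, Set[A]] =
    m.toSeq
      // one (value, key) pair per element of each set
      .flatMap { case (k, vs) => vs.toSeq.map(v => (v, k)) }
      // collect the pairs that share the same value
      .groupBy(_._1)
      // keep only the original keys, as a Set
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }
}
```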
A simple, but maybe not super-elegant solution:
def reverse[A,B](m:Map[A,Set[B]])={
var r = Map[B,Set[A]]()
m.keySet foreach { k=>
m(k) foreach { e =>
r = r + (e -> (r.getOrElse(e, Set()) + k))
}
}
r
}
The easiest way I can think of is:
// unfold values to tuples (v,k)
// for all values v in the Set referenced by key k
def vk = for {
(k,vs) <- m.iterator
v <- vs.iterator
} yield (v -> k)
// fold iterator back into a map
(Map[String,Set[Int]]() /: vk) {
// alternative syntax: vk.foldLeft(Map[String,Set[Int]]()) {
case (m,(k,v)) if m contains k =>
// Map already contains a Set, so just add the value
m updated (k, m(k) + v)
case (m,(k,v)) =>
// key not in the map - wrap value in a Set and return updated map
m updated (k, Set(v))
}