Merge two tuples by key - Scala

Suppose there are two tuples like these.
1st tuple: ((A,1),(B,3),(D,5)......)
2nd tuple: ((A,3),(B,1),(E,6)......)
I need a function that merges the two into this:
((A,1,3),(B,3,1),(D,5,0),(E,0,6)......)
If the first tuple contains a key that is not in the second tuple, the missing value is set to 0, and vice versa. How could I code this function in Scala?

Let's say you get the input in the format
val tuple1: List[(String, Int)] = List(("A",1),("B",3),("D",5),("E",0))
val tuple2: List[(String, Int)] = List(("A",3),("B",1),("D",6))
You can write a merge function as
def merge(tuple1: List[(String, Int)], tuple2: List[(String, Int)]) = {
  val map1 = tuple1.toMap
  val map2 = tuple2.toMap
  map1.map { case (k, v) =>
    (k, v, map2.getOrElse(k, 0))
  }
}
On calling the function
merge(tuple1,tuple2)
You will get the output as
res0: scala.collection.immutable.Iterable[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0))
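Note that merge as written only emits keys found in tuple1 (the sample input above happens to carry "E" in tuple1 with value 0). If keys can appear in only one of the lists, here is a sketch over the union of keys that covers the "vice versa" requirement too (mergeBoth is an illustrative name):
def mergeBoth(t1: List[(String, Int)], t2: List[(String, Int)]): List[(String, Int, Int)] = {
  val (m1, m2) = (t1.toMap, t2.toMap)
  // Walk the union of keys so entries unique to either list are kept, defaulting to 0.
  (m1.keySet ++ m2.keySet).toList.sorted
    .map(k => (k, m1.getOrElse(k, 0), m2.getOrElse(k, 0)))
}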
Please let me know if that answers your question.

val t1 = List(("A",1),("B",3),("D",5),("E",0))
val t2 = List(("A",3),("B",1),("D",6),("H",5))
val t3 = t2.filter { case (k, v) => !t1.exists { case (k1, _) => k1 == k } }.map { case (k, _) => (k, 0) }
val t4 = t1.filter { case (k, v) => !t2.exists { case (k1, _) => k1 == k } }.map { case (k, _) => (k, 0) }
val t5 = (t1 ++ t3).sortBy { case (k, v) => k }
val t6 = (t2 ++ t4).sortBy { case (k, v) => k }
t5.zip(t6).map { case ((k, v1), (_, v2)) => (k, v1, v2) }
res: List[(String, Int, Int)] = List(("A", 1, 3), ("B", 3, 1), ("D", 5, 6), ("E", 0, 0), ("H", 0, 5))
In terms of what's happening here:
t3 and t4 find the keys missing from t1 and t2 respectively and pair them with a zero value.
t5 and t6 append them (t1 with t3, and t2 with t4) and sort the unified lists by key. Lastly the two lists are zipped together and transformed to the desired output.

Related

Spark Scala: faster way to groupByKey and sort RDD values [duplicate]

I have an RDD where each row has the format (key, (int, double)).
I would like to transform it into (key, ((int, double), (int, double), ...)), where the values in the new RDD are the top N pairs sorted by the double.
So far I came up with the solution below, but it is really slow and runs forever. It works fine with a smaller RDD, but now the RDD is too big:
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
I wonder if there are better and faster ways to do this?
The most efficient way is probably aggregateByKey
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ???
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
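For completeness, here is one way those TODOs could be filled in. This is only a sketch: it re-sorts and truncates instead of exploiting the arrays' sortedness, and it assumes a fixed top-N count in scope (topNCount is an illustrative name):
val topNCount = 10 // illustrative; pick your N

// Add one value, then keep only the N largest by the Double component.
def addToSortedArray(arr: Array[V], newValue: V): Array[V] =
  (newValue +: arr).sortBy(-_._2).take(topNCount)

// Merge two partial results, again keeping only the N largest.
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] =
  (arr1 ++ arr2).sortBy(-_._2).take(topNCount)

A linear merge of two already-sorted arrays would avoid the re-sort, but this keeps the sketch short.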
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the better-performing reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
  // Rearrange l so that its smallest element (by the Double) sits at the head.
  def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = l match {
    case Nil => l
    case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
      if (x._2 <= acc.head._2) x :: acc else acc :+ x
    )
  }
  // Replace the current minimum (the head) with e if e ranks higher.
  def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
    if (e._2 > l.head._2) bigHead(e :: l.tail) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0))))
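The key point is that topN caps each intermediate list at n elements, so reduceByKey only ever builds and shuffles small, bounded lists per key; that is what avoids the cost of sorting the entire RDD.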

Check if an RDD contains the same key and merge the values if so

I have an RDD[(String, Map[String, Int])]:
[("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1)), ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1)), ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1))]
I want to merge the elements with the same key, as in:
[("A", Map("acs" -> 4, "sdv" -> 4, "sfd" -> 2)), ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1))]
How can I do this in Scala?
If you define mapSum (see merge two maps and sum values):
def mapSum[T](map1: Map[T, Int], map2: Map[T, Int]): Map[T, Int] = map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
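To see what mapSum does on its own (values of shared keys are summed, everything else passes through):
mapSum(Map("a" -> 1, "b" -> 2), Map("b" -> 3, "c" -> 4))
// Map(a -> 1, b -> 5, c -> 4)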
Then you can groupBy and reduce (similar to your other question):
rdd.groupBy(_._1).map(_._2.reduce((a, b) => (a._1, mapSum(a._2, b._2)))).collect
res11: Array[(String, Map[String, Int])] = Array(
("A", Map("acs" -> 4, "sdv" -> 4, "sfd" -> 2)),
("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1))
)
An efficient approach would be to use reduceByKey to aggregate the Map (in the accumulator) by summing the values of matched keys:
val rdd = sc.parallelize(Seq(
("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("A", Map("acs"->2, "sdv"->2, "sfd"->1))
))
rdd.reduceByKey( (acc, m) =>
acc ++ m.map{ case (k, v) => (k, acc.getOrElse(k, 0) + v) }
).collect
// res1: Array[(String, scala.collection.immutable.Map[String,Int])] = Array(
// (A,Map(acs -> 4, sdv -> 4, sfd -> 2)),
// (B,Map(ass -> 2, fvv -> 2, ffd -> 1))
// )
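As with the top-N example above, reduceByKey merges the maps within each partition before anything is shuffled, so only one partially merged map per key and partition crosses the network, unlike the groupBy variant.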

How to convert RDD[Array[String]] to RDD[(Int, HashMap[String, List])]?

I have input data:
time, id, counter, value
00.2, 1 , c1 , 0.2
00.2, 1 , c2 , 0.3
00.2, 1 , c1 , 0.1
and I want for every id to create a structure to store counters and values. After thinking about vectors and rejecting them, I came to this:
(id, HashMap( (counter1, List(values)), (counter2, List(values)) ))
(1, HashMap( (c1, List(0.2, 0.1)), (c2, List(0.3)) ))
The problem is that I can't build the HashMap inside the map transformation, and additionally I don't know whether I will be able to reduce the lists by counter inside the map.
Does anyone have any idea?
My code is:
val data = inputRdd
  .map(y => (y(1).toInt, mutable.HashMap(y(2) -> List(y(3).toDouble))))
  .reduceByKey(_ ++ _)
Off the top of my head, untested:
import collection.mutable.HashMap
inputRdd
  .map { case Array(t, id, c, v) => (id.toInt, (c, v)) }
  .aggregateByKey(HashMap.empty[String, List[String]])(
    // fold one (counter, value) pair into the per-id map
    { case (m, (c, v)) => { m(c) = v :: m.getOrElse(c, Nil); m } },
    // merge the maps built on different partitions
    { case (m1, m2) => { for ((k, vs) <- m2) m1(k) = vs ::: m1.getOrElse(k, Nil); m1 } }
  )
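A quick check against the question's rows (a sketch; it assumes inputRdd holds each line as Array(time, id, counter, value) of strings):
val inputRdd = sc.parallelize(Seq(
  Array("00.2", "1", "c1", "0.2"),
  Array("00.2", "1", "c2", "0.3"),
  Array("00.2", "1", "c1", "0.1")
))
// yields something like (1, HashMap(c1 -> List(0.1, 0.2), c2 -> List(0.3)));
// the order inside each list is not guaranteed.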
Here's one approach:
val rdd = sc.parallelize(Seq(
("00.2", 1, "c1", 0.2),
("00.2", 1, "c2", 0.3),
("00.2", 1, "c1", 0.1)
))
rdd.
map{ case (t, i, c, v) => (i, (c, v)) }.
groupByKey.mapValues(
_.groupBy(_._1).mapValues(_.map(_._2)).map(identity)
).
collect
// res1: Array[(Int, scala.collection.immutable.Map[String,Iterable[Double]])] = Array(
// (1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
// )
Note that the final map(identity) is a remedy for the Map#mapValues not serializable problem suggested in this SO answer.
If, as you have mentioned, you have inputRdd as
//inputRdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at ....
Then a simple groupBy and foldLeft on the grouped values should do the trick to get the final desired result:
val resultRdd = inputRdd.groupBy(_(1))
  .mapValues(_.foldLeft(Map.empty[String, List[String]]) { (a, b) =>
    if (a.keySet.contains(b(2)))
      a ++ Map(b(2) -> (a(b(2)) ++ List(b(3))))
    else
      a ++ Map(b(2) -> List(b(3)))
  })
//resultRdd: org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Map[String,List[String]])] = MapPartitionsRDD[3] at mapValues at ...
//(1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
Changing RDD[(String, scala.collection.immutable.Map[String,List[String]])] to RDD[(Int, HashMap[String,List[String]])] is then just an element-wise conversion, which should be straightforward.
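If it helps, here is a sketch of that conversion (converted is an illustrative name; resultRdd is the value from above):
import scala.collection.mutable
val converted = resultRdd.map { case (id, m) =>
  (id.toInt, mutable.HashMap(m.toSeq: _*))
}
// converted: RDD[(Int, mutable.HashMap[String, List[String]])]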
I hope the answer is helpful

Combine two lists with same keys

Here's a quite simple request to combine two lists as follows:
scala> list1
res17: List[(Int, Double)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4))
scala> list2
res18: List[(Int, String)] = List((1,aaa), (2,bbb), (3,ccc), (4,ddd))
The desired output is as:
((aaa,0.1),(bbb,0.2),(ccc,0.3),(ddd,0.4))
I tried:
scala> (list1 ++ list2)
res23: List[(Int, Any)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4),
(1,aaa), (2,bbb), (3,ccc), (4,ddd))
But:
scala> (list1 ++ list2).groupByKey
<console>:10: error: value groupByKey is not a member of List[(Int, Any)]
Any hints? Thanks!
The method you're looking for is groupBy:
(list1 ++ list2).groupBy(_._1)
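For the lists above this yields a Map from each key to both of its pairs, roughly Map(1 -> List((1,0.1), (1,aaa)), 2 -> List((2,0.2), (2,bbb)), ...), so one more pass over the values is still needed to reach the desired shape.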
If you know that for each key you have exactly two values, you can join them:
scala> val pairs = List((1, "a1"), (2, "b1"), (1, "a2"), (2, "b2"))
pairs: List[(Int, String)] = List((1,a1), (2,b1), (1,a2), (2,b2))
scala> pairs.groupBy(_._1).values.map {
| case List((_, v1), (_, v2)) => (v1, v2)
| }
res0: Iterable[(String, String)] = List((b1,b2), (a1,a2))
Another approach using zip is possible if the two lists contain the same keys in the same order:
scala> val l1 = List((1, "a1"), (2, "b1"))
l1: List[(Int, String)] = List((1,a1), (2,b1))
scala> val l2 = List((1, "a2"), (2, "b2"))
l2: List[(Int, String)] = List((1,a2), (2,b2))
scala> l1.zip(l2).map { case ((_, v1), (_, v2)) => (v1, v2) }
res1: List[(String, String)] = List((a1,a2), (b1,b2))
Here's a quick one-liner:
scala> list2.map(_._2) zip list1.map(_._2)
res0: List[(String, Double)] = List((aaa,0.1), (bbb,0.2), (ccc,0.3), (ddd,0.4))
If you are unsure why this works then read on! I'll expand it step by step:
list2.map(<function>)
The map method iterates over each value in list2 and applies your function to it. In this case each of the values in list2 is a Tuple2 (a tuple with two values). What you want to do is access the second tuple value. To access the first tuple value use the ._1 method and to access the second tuple value use the ._2 method. Here is an example:
val myTuple = (1.0, "hello") // A Tuple2
println(myTuple._1) // prints "1.0"
println(myTuple._2) // prints "hello"
So what we want is a function literal that takes one parameter (the current value in the list) and returns the second tuple value (._2). We could have written the function literal like this:
list2.map(item => item._2)
We don't need to specify a type for item because the compiler is smart enough to infer it thanks to target typing. A really helpful shortcut is that we can just leave out item altogether and replace it with a single underscore _. So it gets simplified (or cryptified, depending on how you view it) to:
list2.map(_._2)
The other interesting part about this one liner is the zip method. All zip does is take two lists and combines them into one list just like the zipper on your favorite hoodie does!
val a = List("a", "b", "c")
val b = List(1, 2, 3)
a zip b // returns List((a,1), (b,2), (c,3))
Cheers!

Scala: using foldl to add pairs from list to a map?

I am trying to add pairs from a list to a map using foldl. I get the following error:
"missing arguments for method /: in trait TraversableOnce; follow this method with `_' if you want to treat it as a partially applied function"
code:
val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
def lstToMap(lst:List[(String,Int)], map: Map[String, Int] ) = {
(map /: lst) addToMap ( _, _)
}
def addToMap(pair: (String, Int), map: Map[String, Int]): Map[String, Int] = {
map + (pair._1 -> pair._2)
}
What is wrong?
scala> val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
pairs: List[(String, Int)] = List((a,1), (a,2), (c,3), (d,4))
scala> (Map.empty[String, Int] /: pairs)(_ + _)
res9: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
But you know, you could just do:
scala> pairs.toMap
res10: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
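Note that in both cases later pairs win when keys collide, which is why ("a", 1) is dropped and the result keeps a -> 2.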
You need to swap the parameters of addToMap and pass it in parentheses for this to work:
def addToMap( map: Map[String, Int], pair: (String, Int)): Map[String, Int] = {
map + (pair._1 -> pair._2)
}
def lstToMap(lst:List[(String,Int)], map: Map[String, Int] ) = {
(map /: lst)(addToMap)
}
missingfaktor's answer is much more concise, reusable, and Scala-like.
If you already have a collection of Tuple2s, you don't need to implement this yourself: there is already a toMap method that works only if the elements are tuples!
The full signature is:
def toMap[T, U](implicit ev: <:<[A, (T, U)]): Map[T, U]
It works by requiring an implicit A <:< (T, U) which is essentially a function that can take the element type A and cast/convert it to tuples of type (T, U). Another way of saying this is that it requires an implicit witness that A is-a (T, U). Therefore, this is completely type-safe.
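A quick illustration of that constraint (the second line would not compile):
List((1, "a"), (2, "b")).toMap   // Map(1 -> a, 2 -> b)
// List(1, 2, 3).toMap           // error: Cannot prove that Int <:< (T, U)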
Update: which is what @missingfaktor said
This is not a direct answer to the question, which is about folding correctly on the map, but I deem it important to emphasize that
a Map can be treated as a generic Traversable of pairs
and you can easily combine the two!
scala> val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
pairs: List[(String, Int)] = List((a,1), (a,2), (c,3), (d,4))
scala> Map.empty[String, Int] ++ pairs
res1: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
scala> pairs.toMap
res2: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)