Group values into lists by key - Scala

I have an RDD of pairs in Scala/Spark that looks like this:
Array[(String,Int)]= Array((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))
I need to group it by the first part of the key (A1, A2, A3) so that each key maps to the list of its numbers, like this:
List( A1:(1,10), A2:(5,5), A3:(3) )
How can I do this?

Treating it as an RDD, we can do it the following way:
scala> val x = List(("A1:B",1),("A1:A",10),("A2:C",5),("A2:E",5),("A3:D",3))
x: List[(String, Int)] = List((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))
scala> x.map( a=> (a._1.split(":"),a._2))
res1: List[(Array[String], Int)] = List((Array(A1, B),1), (Array(A1, A),10), (Array(A2, C),5), (Array(A2, E),5), (Array(A3, D),3))
scala> res1.map(a => (a._1(0),a._2))
res12: List[(String, Int)] = List((A1,1), (A1,10), (A2,5), (A2,5), (A3,3))
scala> val rdd = sc.makeRDD(res12)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:33
scala> rdd.groupByKey()
res13: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[16] at groupByKey at <console>:36
scala> res13.collect
res14: Array[(String, Iterable[Int])] = Array((A3,CompactBuffer(3)), (A1,CompactBuffer(1, 10)), (A2,CompactBuffer(5, 5)))

You can try this:
val data = Array(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3))
val grpData = data.groupBy(f => f._1.split(":")(0)).map(x => (x._1 + ":(" + x._2.map(_._2).mkString(",") + ")")).toList
println(grpData)
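For completeness, here is a minimal sketch that keeps everything as an RDD end to end, assuming an active SparkContext sc (the variable names are illustrative):
val rdd = sc.parallelize(Seq(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3)))
val grouped = rdd
  .map { case (k, v) => (k.split(":")(0), v) } // "A1:B" -> "A1"
  .groupByKey()                                // collect all values per prefix
grouped.collect()
// e.g. Array((A1,CompactBuffer(1, 10)), (A2,CompactBuffer(5, 5)), (A3,CompactBuffer(3))), partition order may vary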

Related

How to merge two lists of tuples

Suppose there are two lists of tuples like these:
1st tuple:((A,1),(B,3),(D,5)......)
2nd tuple:((A,3),(B,1),(E,6)......)
I need a function that merges them into this:
((A,1,3),(B,3,1),(D,5,0),(E,0,6)......)
If the first list contains a key that is not in the second, set the missing value to 0, and vice versa. How can I write this function in Scala?
Let's say you get the input in the format:
val tuple1: List[(String, Int)] = List(("A",1),("B",3),("D",5),("E",0))
val tuple2: List[(String, Int)] = List(("A",3),("B",1),("D",6))
You can write a merge function as
def merge(tuple1: List[(String, Int)], tuple2: List[(String, Int)]) = {
  val map1 = tuple1.toMap
  val map2 = tuple2.toMap
  // For each key in the first map, look up its value in the second, defaulting to 0.
  map1.map { case (k, v) =>
    (k, v, map2.getOrElse(k, 0))
  }
}
On calling the function
merge(tuple1,tuple2)
You will get the output as
res0: scala.collection.immutable.Iterable[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0))
Please let me know if that answers your question.
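Note that this version only emits keys present in tuple1 (the sample tuple1 already contains ("E", 0), which is why E shows up in the output). A symmetric variant, as a sketch, iterates over the union of both key sets and pads the missing side with 0:
def mergeAll(t1: List[(String, Int)], t2: List[(String, Int)]): List[(String, Int, Int)] = {
  val (m1, m2) = (t1.toMap, t2.toMap)
  // Union of the two key sets, sorted for deterministic output
  (m1.keySet ++ m2.keySet).toList.sorted.map { k =>
    (k, m1.getOrElse(k, 0), m2.getOrElse(k, 0))
  }
}
mergeAll(tuple1, tuple2) // List((A,1,3), (B,3,1), (D,5,6), (E,0,0))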
val t1= List(("A",1),("B",3),("D",5),("E",0))
val t2= List(("A",3),("B",1),("D",6),("H",5))
val t3 = t2.filter{ case (k,v) => !t1.exists{ case (k1,_) => k1==k } }.map{ case (k,_) => (k,0) }
val t4 = t1.filter{ case (k,v) => !t2.exists{case (k1,_) => k1==k} }.map{case (k,_) => (k,0)}
val t5=(t1 ++ t3).sortBy{case (k,v) => k}
val t6=(t2 ++ t4).sortBy{case (k,v) => k}
t5.zip(t6).map{case ((k,v1),(_,v2)) => (k,v1,v2) }
res: List[(String, Int, Int)] = List(("A", 1, 3), ("B", 3, 1), ("D", 5, 6), ("E", 0, 0), ("H", 0, 5))
In terms of what's happening here:
t3 and t4 find the keys missing from t1 and t2 respectively and add them with a zero value;
t5 and t6 sort the unified lists (t1 with t3, and t2 with t4). Lastly, the two sorted lists are zipped together and transformed into the desired output.

Append/concatenate two RDDs of type Set in Apache Spark

I am working with Spark RDDs. I need to append/concatenate two RDDs of type Set.
scala> var ek: RDD[Set[Int]] = sc.parallelize(Seq(Set(7)))
ek: org.apache.spark.rdd.RDD[Set[Int]] = ParallelCollectionRDD[31] at parallelize at <console>:32
scala> val vi: RDD[Set[Int]] = sc.parallelize(Seq(Set(3,5)))
vi: org.apache.spark.rdd.RDD[Set[Int]] = ParallelCollectionRDD[32] at parallelize at <console>:32
scala> val z = vi.union(ek)
z: org.apache.spark.rdd.RDD[Set[Int]] = UnionRDD[34] at union at <console>:36
scala> z.collect
res15: Array[Set[Int]] = Array(Set(3, 5), Set(7))
scala> val t = vi++ek
t: org.apache.spark.rdd.RDD[Set[Int]] = UnionRDD[40] at $plus$plus at <console>:36
scala> t.collect
res30: Array[Set[Int]] = Array(Set(3, 5), Set(7))
I have tried two operators, union and ++, but neither produces the expected result; both give:
Array(Set(3, 5), Set(7))
The expected result should be like this:
scala> val u = Set(3,5)
u: scala.collection.immutable.Set[Int] = Set(3, 5)
scala> val o = Set(7)
o: scala.collection.immutable.Set[Int] = Set(7)
scala> u.union(o)
res28: scala.collection.immutable.Set[Int] = Set(3, 5, 7)
Can anybody point me in the right direction?
You are applying the union to RDDs whose elements are whole sets, which is why the results are the complete sets rather than their elements. Parallelize the sets' elements instead:
var ek = sc.parallelize(Set(7).toSeq)
val vi = sc.parallelize(Set(3,5).toSeq)
val z = vi.union(ek)
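If what you actually want is one merged Set rather than an RDD of its elements, one option (a sketch, using the original RDD[Set[Int]] values vi and ek from the question) is to union the RDDs and fold the sets together on the driver:
val merged: Set[Int] = vi.union(ek).reduce(_ ++ _) // Set(3, 5, 7)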

How to merge two Seq[String], Seq[Double] to Seq[(String,Double)]

I have two Seqs:
one is a Seq[String] and the other a Seq[Double], e.g.
a -> ["a","b","c"] and
b-> [1,2,3]
I want to create output as
[("a",1),("b",2),("c",3)]
In my code, a.zip(b) is actually creating a Seq of the paired elements instead of creating a Map.
Can anyone suggest how to do that in Scala?
You simply need .toMap to transform the List[(String, Int)] into a Map[String, Int]:
scala> val seq1 = List("a", "b", "c")
seq1: List[String] = List(a, b, c)
scala> val seq2 = List(1, 2, 3)
seq2: List[Int] = List(1, 2, 3)
scala> seq1.zip(seq2)
res0: List[(String, Int)] = List((a,1), (b,2), (c,3))
scala> seq1.zip(seq2).toMap
res1: scala.collection.immutable.Map[String,Int] = Map(a -> 1, b -> 2, c -> 3)
Also see:
How to convert a Seq[A] to a Map[Int, A] using a value of A as the key in the map?
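If, on the other hand, the Seq of tuples from the title is what you need, zip alone already produces it (a sketch; the names and values are illustrative):
val names: Seq[String] = Seq("a", "b", "c")
val values: Seq[Double] = Seq(1.0, 2.0, 3.0)
val pairs: Seq[(String, Double)] = names.zip(values) // List((a,1.0), (b,2.0), (c,3.0))
Note that zip truncates to the shorter of the two sequences; use zipAll if you need padding instead.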

Error: type mismatch flatMap

I am new to Spark programming and Scala, and I am not able to understand the difference between map and flatMap.
I tried the code below, expecting both to work, but got an error.
scala> val b = List("1","2", "4", "5")
b: List[String] = List(1, 2, 4, 5)
scala> b.map(x => (x,1))
res2: List[(String, Int)] = List((1,1), (2,1), (4,1), (5,1))
scala> b.flatMap(x => (x,1))
<console>:28: error: type mismatch;
found : (String, Int)
required: scala.collection.GenTraversableOnce[?]
b.flatMap(x => (x,1))
As per my understanding, flatMap flattens an RDD into a collection of String/Int values.
I was thinking that in this case both should work without any error. Please let me know where I am making the mistake.
Thanks
You need to look at how these methods' signatures are defined:
def map[U: ClassTag](f: T => U): RDD[U]
map takes a function from type T to type U and returns an RDD[U].
On the other hand, flatMap:
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
Expects a function taking type T to a TraversableOnce[U] (a trait that Tuple2 does not implement) and returns an RDD[U]. Generally, you use flatMap when you want to flatten a collection of collections, i.e. if you had an RDD[List[List[Int]]] and you want to produce an RDD[List[Int]], you can flatMap it using identity.
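For instance, a minimal sketch of that flattening, assuming an active SparkContext sc:
val nested = sc.parallelize(Seq(List(List(1, 2), List(3)), List(List(4)))) // RDD[List[List[Int]]]
val flattened = nested.flatMap(identity)                                   // RDD[List[Int]]
flattened.collect() // Array(List(1, 2), List(3), List(4))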
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
The following example might be helpful.
scala> val b = List("1", "2", "4", "5")
b: List[String] = List(1, 2, 4, 5)
scala> b.map(x=>Set(x,1))
res69: List[scala.collection.immutable.Set[Any]] =
List(Set(1, 1), Set(2, 1), Set(4, 1), Set(5, 1))
scala> b.flatMap(x=>Set(x,1))
res70: List[Any] = List(1, 1, 2, 1, 4, 1, 5, 1)
scala> b.flatMap(x=>List(x,1))
res71: List[Any] = List(1, 1, 2, 1, 4, 1, 5, 1)
scala> b.flatMap(x=>List(x+1))
res75: List[String] = List(11, 21, 41, 51) // string concatenation, since x is a String
scala> val x = sc.parallelize(List("aa bb cc dd", "ee ff gg hh"), 2)
scala> val y = x.map(x => x.split(" ")) // split(" ") returns an array of words
scala> y.collect
res0: Array[Array[String]] = Array(Array(aa, bb, cc, dd), Array(ee, ff, gg, hh))
scala> val y = x.flatMap(x => x.split(" "))
scala> y.collect
res1: Array[String] = Array(aa, bb, cc, dd, ee, ff, gg, hh)
map's function returns a single U, whereas flatMap's function returns a TraversableOnce[U] (i.e. a collection):
val b = List("1", "2", "4", "5")
val mapped = b.map { input => (input, 1) }               // the function returns a tuple directly
mapped.foreach(f => println(f._1 + " " + f._2))
val flatMapped = b.flatMap { input => List((input, 1)) } // the function returns a collection, which is then flattened
flatMapped.foreach(f => println(f._1 + " " + f._2))
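Both snippets print the same four pairs; the difference is only in what the function returns. map's function returns the tuple directly, while flatMap's must return a collection (here a one-element List) whose contents are then flattened into the result.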
map does a 1-to-1 transformation, while flatMap converts a list of lists to a single list:
scala> val b = List(List(1,2,3), List(4,5,6), List(7,8,90))
b: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 90))
scala> b.map(x => (x,1))
res1: List[(List[Int], Int)] = List((List(1, 2, 3),1), (List(4, 5, 6),1), (List(7, 8, 90),1))
scala> b.flatMap(x => x)
res2: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 90)
Also, flatMap is useful for filtering out None values if you have a list of Options:
scala> val c = List(Some(1), Some(2), None, Some(3), Some(4), None)
c: List[Option[Int]] = List(Some(1), Some(2), None, Some(3), Some(4), None)
scala> c.flatMap(x => x)
res3: List[Int] = List(1, 2, 3, 4)
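As an aside, the built-in flatten does the same one-level flattening here:
c.flatten // List(1, 2, 3, 4)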

How to sum a list of tuples

Given the following list of tuples...
val list = List((1, 2), (1, 2), (1, 2))
... how do I sum all the values and obtain a single tuple like this?
(3, 6)
Using the foldLeft method. Please look at the scaladoc for more information.
scala> val list = List((1, 2), (1, 2), (1, 2))
list: List[(Int, Int)] = List((1,2), (1,2), (1,2))
scala> list.foldLeft((0, 0)) { case ((accA, accB), (a, b)) => (accA + a, accB + b) }
res0: (Int, Int) = (3,6)
Using unzip. It is not as efficient as the solution above, but perhaps more readable.
scala> list.unzip match { case (l1, l2) => (l1.sum, l2.sum) }
res1: (Int, Int) = (3,6)
Very easy: (list.map(_._1).sum, list.map(_._2).sum).
You can solve this using Monoid.combineAll from the cats library:
import cats.instances.int._ // For monoid instances for `Int`
import cats.instances.tuple._ // for Monoid instance for `Tuple2`
import cats.Monoid.combineAll
def main(args: Array[String]): Unit = {
val list = List((1, 2), (1, 2), (1, 2))
val res = combineAll(list)
println(res)
// Displays
// (3, 6)
}
You can see more about this in the cats documentation or Scala with Cats.
Answering this question while trying to understand the aggregate function in Spark:
scala> val list = List((1, 2), (1, 2), (1, 2))
list: List[(Int, Int)] = List((1,2), (1,2), (1,2))
scala> list.aggregate((0,0))((x,y)=>(x._1+y._1, x._2+y._2),(x,y)=>(x._1+y._1, x._2+y._2))
res89: (Int, Int) = (3,6)
Here is the link to the SO Q&A that helped me understand and answer this: Explain the aggregate functionality in Spark.
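The Spark RDD equivalent looks similar (a sketch, assuming an active SparkContext sc); on an RDD the second function actually matters, because it merges the per-partition results:
val rdd = sc.parallelize(List((1, 2), (1, 2), (1, 2)))
rdd.aggregate((0, 0))(
  (acc, t) => (acc._1 + t._1, acc._2 + t._2), // seqOp: fold elements within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)        // combOp: merge partition results
) // (3,6)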
Scalaz solution (suggested by Travis in an answer that was, for some reason, deleted):
import scalaz._
import Scalaz._
val list = List((1, 2), (1, 2), (1, 2))
list.suml
which outputs
res0: (Int, Int) = (3,6)
You can also use a reduce function:
val list = List((1, 2), (1, 2), (1, 2))
val res = list.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
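One caveat worth noting: reduce throws an UnsupportedOperationException on an empty list, whereas the foldLeft version above simply returns its (0, 0) zero value.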