How do I reduce tuples into a tuple of tuples in Scala - scala

I have an RDD with rows of the type
(a,(b,c,d))
(a,(e,f,g))
I am trying to reduce it by key such that it yields rows of type
(a,(b,c,d),(e,f,g)).
But I am getting an error when using this:
val redcd = mapd.reduceByKey((_,_))
How do I do it?

If you have an RDD such as
scala> mapd.foreach(println)
(a,(b,c,d))
(a,(e,f,g))
(b,(b,c,d))
Then doing
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList)
would give you
scala> redcd.foreach(println)
(b,List((b,c,d)))
(a,List((b,c,d), (e,f,g)))
Now if you want it in the format described in the question, you can do
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList.mkString(", "))
which would generate
scala> redcd.foreach(println)
(a,(b,c,d), (e,f,g))
(b,(b,c,d))
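As for the original reduceByKey((_, _)) error: reduceByKey needs a function of type (V, V) => V, so the combined result must have the same type as a single value, which a pair of tuples does not. A minimal sketch of a reduceByKey-based alternative, assuming the values are first wrapped in a List:
// Wrap each value in a List so the value type stays stable across merges,
// then concatenate the lists per key.
val listed = mapd.mapValues(v => List(v))
val redcd = listed.reduceByKey(_ ++ _)
// e.g. (a,List((b,c,d), (e,f,g)))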
I hope the answer is helpful

Related

How to use flatMap to flatten one component of a tuple

I have a tuple like (a, List(b,c,d)). I want the output to be
(a,b)
(a,c)
(a,d)
I am trying to use flatMap for this but without success. Even map is not helping in this case.
Input Data :
Chap01:Spark is an emerging technology
Chap01:You can easily learn Spark
Chap02:Hadoop is a Bigdata technology
Chap02:You can easily learn Spark and Hadoop
Code:
val rawData = sc.textFile("C:\\wc_input.txt")
val chapters = rawData.map(line => (line.split(":")(0), line.split(":")(1)))
val chapWords = chapters.flatMap(a => (a._1, a._2.split(" "))) // does not compile: flatMap needs a collection, and a Tuple2 is not one
You could map over the second element of the tuple:
val t = ('a', List('b','c','d'))
val res = t._2.map((t._1, _))
The snippet above resolves to:
res: List[(Char, Char)] = List((a,b), (a,c), (a,d))
This scenario can easily be handled by the flatMapValues method on RDDs. It operates only on the values of a pair RDD, keeping the key the same.
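For illustration, a minimal sketch of the flatMapValues approach, assuming the chapters pair RDD built in the question:
// flatMapValues flattens each value's word array while keeping the chapter key,
// producing one (chapter, word) pair per word.
val chapWords = chapters.flatMapValues(text => text.split(" "))
// e.g. (Chap01,Spark), (Chap01,is), (Chap01,an), ...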

Use combineByKey to get output as (key, iterable[values])

I am trying to transform an RDD[(key, value)] into an RDD[(key, Iterable[value])], the same as the output returned by the groupByKey method.
But as groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used:
val data= List("abc,2017-10-04,15.2",
"abc,2017-10-03,19.67",
"abc,2017-10-02,19.8",
"xyz,2017-10-09,46.9",
"xyz,2017-10-08,48.4",
"xyz,2017-10-07,87.5",
"xyz,2017-10-04,83.03",
"xyz,2017-10-03,83.41",
"pqr,2017-09-30,18.18",
"pqr,2017-09-27,18.2",
"pqr,2017-09-26,19.2",
"pqr,2017-09-25,19.47",
"abc,2017-07-19,96.60",
"abc,2017-07-18,91.68",
"abc,2017-07-17,91.55")
val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})
// re partition and sort based key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))
val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
  (t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
  val a = x._2.+:(y._2)
  (x._1, a)
}
// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
  (t1: String, t2: String) => (t1, List(t2)),
  mergeValue,
  mergeCombiners)
temp.combineByKey is giving a compile-time error, and I am not able to figure it out.
If you want output similar to what groupByKey will give you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient when compared with using groupByKey followed by an aggregation (where they give you the same result that groupByKey plus an aggregation would have given).
As the desired result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work. There is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation; it lies in the distributed architecture.
For more information regarding groupByKey and these types of optimizations, I would recommend reading more here.
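For reference, a minimal sketch under the question's setup: groupByKey yields the wanted shape directly, and if combineByKey is still preferred, its createCombiner must take a single value (passing a two-argument function there is what causes the compile error):
// groupByKey already returns RDD[(String, Iterable[(String, String)])], the wanted shape.
val grouped = temp.groupByKey()
// If combineByKey is used anyway, createCombiner must have type V => C (one argument):
val combined = temp.combineByKey(
  (v: (String, String)) => List(v),                                // createCombiner
  (acc: List[(String, String)], v: (String, String)) => v :: acc,  // mergeValue
  (a: List[(String, String)], b: List[(String, String)]) => a ++ b // mergeCombiners
)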

Spark: Using mapPartition with Scala

Let's say I have the following DataFrame:
var randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3))
val df = sc.parallelize(randomData,2).toDF()
and I have this function which will be an input for mapPartitions:
def trialIterator(row: Iterator[(String, Int)]): Iterator[(String, Int)] =
  row.toArray.tail.toIterator
And using mapPartitions:
df.mapPartitions(trialIterator)
I am getting the following error message:
Type mismatch, expected: (Iterator[Row]) => Iterator[NotInferedR], actual: (Iterator[(String, Int)]) => Iterator[(String, Int)]
I can understand that this is happening due to the input, output type of my function but how to solve this?
If you want strongly typed input, don't use Dataset[Row] (DataFrame) but Dataset[T], where T in this particular scenario is (String, Int). Also don't convert to an Array and don't blindly call tail without knowing whether the partition is empty:
def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)
randomData
  .toDS // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
or
randomData.toDF // org.apache.spark.sql.Dataset[Row]
  .as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
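As a hedged usage example (which rows are dropped depends on how the data is partitioned):
// drop(1) removes the first element of every partition, not just the first row overall.
randomData.toDS
  .mapPartitions(trialIterator _)
  .show()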
You are expecting the type Iterator[(String, Int)] while you should expect Iterator[Row]:
def trialIterator(row: Iterator[Row]): Iterator[Row] = {
  row.next() // consume the first row of the partition (throws if the partition is empty)
  row        // seems to do the same thing w/o all the conversions
}

How to iterate over records in Spark Scala?

I have a variable "myrdd" that is an Avro file with 10 records loaded through hadoopFile.
When I do
myrdd.first._1.datum.getName()
I can get the name. Problem is, I have 10 records in "myrdd". When I do:
myrdd.map(x => {println(x._1.datum.getName())})
it does not work and prints out a weird object a single time. How can I iterate over all records?
Here is a log from a session using spark-shell with a similar scenario.
Given
scala> persons
res8: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.first
res7: org.apache.spark.sql.Row = [Justin,19]
Your issue looks like
scala> persons.map(t => println(t))
res4: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[10]
so map just returns another RDD (the function is not applied immediately; it is applied "lazily" when you actually iterate over the result).
So when you materialize (using collect()) you get a "normal" collection:
scala> persons.collect()
res11: Array[org.apache.spark.sql.Row] = Array([Justin,19])
over which you can map. Note that in this case you have a side effect in the closure passed to map (the println, whose result is Unit):
scala> persons.collect().map(t => println(t))
[Justin,19]
res5: Array[Unit] = Array(())
Same result if collect is applied at the end:
scala> persons.map(t => println(t)).collect()
[Justin,19]
res19: Array[Unit] = Array(())
But if you just want to print the rows, you can simplify it by using foreach:
scala> persons.foreach(t => println(t))
[Justin,19]
As @RohanAletty has pointed out in a comment, this works for a local Spark job. If the job runs on a cluster, collect is required as well:
persons.collect().foreach(t => println(t))
Notes
The same behaviour can be observed in the Iterator class (see the sketch below).
The output of the session above has been reordered.
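A minimal illustration of that Iterator behaviour (a sketch, independent of Spark):
// Nothing is printed when map is called; the printlns run only when the iterator is consumed.
val it = Iterator(1, 2, 3).map(println) // no output yet
it.foreach(_ => ())                     // now prints 1, 2, 3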
Update
As for filtering: the placement of collect is "bad" if you apply filters after collect that could have been applied before it.
For example, these expressions give the same result:
scala> persons.filter("age > 20").collect().foreach(println)
[Michael,29]
[Andy,30]
scala> persons.collect().filter(r => r.getInt(1) > 20).foreach(println)
[Michael,29]
[Andy,30]
but the second case is worse, because that filter could have been applied before collect.
The same applies to any type of aggregation as well.
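For example, a hedged sketch of the same point for an aggregation, assuming the persons DataFrame from above:
// Counting before collect lets the cluster do the work; collecting first ships every row to the driver.
persons.filter("age > 20").count()        // aggregation happens on the cluster
persons.collect().count(_.getInt(1) > 20) // all rows are materialized on the driver first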

Scala: Gatling: Concatenation of two Maps stores the last value only and ignores all other values

I have two Maps and I want to concatenate them.
I tried almost all the examples given here: Best way to merge two maps and sum the values of same key? but it ignores all values for the key metrics and only stores the last value.
I have downloaded scalaz-full_2.9.1-6.0.3.jar and imported it with import scalaz._ but it doesn't work for me.
How can I concatenate these two maps with multiple values for the same keys?
Edit:
Now I tried
import collection.mutable.{ HashMap, MultiMap, Set }

val map = new HashMap[String, Set[String]] with MultiMap[String, String]
map.addBinding("""report_type""" , """performance""")
map.addBinding("""start_date""" ,start_date)
map.addBinding("""end_date""" , end_date)
map.addBinding("metrics" , "plays")
map.addBinding("metrics", "displays")
map.addBinding("metrics" , "video_starts")
map.addBinding("metrics" , "playthrough_25")
map.addBinding("metrics", "playthrough_50")
map.addBinding("metrics", "playthrough_75")
map.addBinding("metrics", "playthrough_100")
val map1 = new HashMap[String, Set[String]] with MultiMap[String, String]
map1.addBinding("""dimensions""" , """asset""")
map1.addBinding("""limit""" , """50""")
And tried to convert these mutable maps to an immutable type using this link, as:
val asset_query_string = map ++ map1
val asset_query_string_map = (asset_query_string map { x => (x._1, x._2.toSet) }).toMap[String, Set[String]]
But still I get
i_ui\config\config.scala:51: Cannot prove that (String, scala.collection.immutable.Set[String]) <:< (String, scala.collection.mutable.Set[String]).
11:10:13.080 [ERROR] i.g.a.ZincCompiler$ - val asset_query_string_map = (asset_query_string map { x => (x._1, x._2.toSet) }).toMap[String, Set[String]]
Your problem is not related to the concatenation but to the declaration of the metrics map. It's not possible to have multiple values for a single key in a Map. Perhaps you should look at this collection:
http://www.scala-lang.org/api/2.10.3/index.html#scala.collection.mutable.MultiMap
You can't have duplicate keys in a Map.
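A one-line illustration of that, as a sketch:
// With a plain Map the later binding for the same key wins, so earlier values are lost.
val m = Map("metrics" -> "plays", "metrics" -> "displays")
// m == Map(metrics -> displays)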
For a simple Map it is impossible to have duplicate keys; if you have duplicate keys in the map, it keeps the last one.
But you can use a MultiMap:
import collection.mutable.{ HashMap, MultiMap, Set }
val mm = new HashMap[String, Set[String]] with MultiMap[String, String]
mm.addBinding("metrics","plays")
mm.addBinding("metrics","displays")
mm.addBinding("metrics","players")
println(mm,"multimap")//(Map(metrics -> Set(players, plays, displays)),multimap)
I was able to create two MultiMaps, but then I tried to concatenate them with val final_map = map1 ++ map2
and I tried the answer given here: Mutable MultiMap to immutable Map.
But my problem was not solved; I got
config\config.scala:51: Cannot prove that (String, scala.collection.immutable.Set[String]) <:< (String, scala.collection.mutable.Set[String]).
Finally, it was solved by:
val final_map = map1 ++ map2
val asset_query_string_map = final_map.map(kv => (kv._1,kv._2.toSet)).toMap
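For illustration, the resulting immutable map should keep every binding from both multimaps (a sketch, assuming the bindings shown earlier in the question; the ordering inside each Set is unspecified):
// All bindings survive, e.g. the "metrics" key keeps its full Set of values.
println(asset_query_string_map("metrics"))
// e.g. Set(plays, displays, video_starts, playthrough_25, ...)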