How to use flatMap to flatten one component of a tuple in Scala

I have a tuple like (a, List(b, c, d)) and I want the output to be
(a,b)
(a,c)
(a,d)
I am trying to use flatMap for this but am not having any success. Even map is not helping in this case.
Input Data :
Chap01:Spark is an emerging technology
Chap01:You can easily learn Spark
Chap02:Hadoop is a Bigdata technology
Chap02:You can easily learn Spark and Hadoop
Code:
val rawData = sc.textFile("C:\\wc_input.txt")
val chapters = rawData.map(line => (line.split(":")(0), line.split(":")(1)))
val chapWords = chapters.flatMap(a => (a._1, a._2.split(" ")))

You could map over the second element of the tuple:
val t = ('a', List('b','c','d'))
val res = t._2.map((t._1, _))
The snippet above evaluates to:
res: List[(Char, Char)] = List((a,b), (a,c), (a,d))
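Applied to the RDD in the question, the same idea combined with flatMap would look like this (a sketch, assuming the chapters RDD defined in the question):
// Emit one (chapter, word) pair for every word in the chapter's text
val chapWords = chapters.flatMap { case (chap, text) => text.split(" ").map(word => (chap, word)) }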

This scenario can also be handled easily by the flatMapValues method on pair RDDs. It operates only on the values of a pair RDD, keeping the key the same.
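For instance, a minimal sketch using the chapters RDD from the question:
// flatMapValues keeps each key and pairs it with every element produced from its value
val chapWords = chapters.flatMapValues(text => text.split(" "))
// yields pairs like (Chap01,Spark), (Chap01,is), (Chap01,an), ...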

Related

Filename usage with Spark RDD

I am getting a filename in my RDD when I use:
val file = sc.wholeTextFiles("file:///c://samples//finalout.txt", 0)
However, right after I flatten it, I lose the first tuple element. How do I make sure that the filename is carried forward to my map function?
My code:
val res = file.flatMap{ e => e._2.split("\n") }.map{ line => line.split(",") }.map(elem => {
  ...I want to use filename here
})
A slightly different approach. It should all be possible with just RDDs, but I use a DataFrame-to-RDD approach here; it's a bit late in the evening.
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._ // for the $ column syntax and the tuple Encoder used by .as
val inputPath: String = "/FileStore/tables/sample_text.txt" // does work, also for a directory of course
val rdd = spark.read.text(inputPath)
  .select(input_file_name(), $"value") // attach the source file name to every line
  .as[(String, String)]
  .rdd
val rdd2 = rdd.map(line => (line._1, line._2.split("\n")))
//rdd2.collect
val rdd3 = rdd2.map(line => (line._1, line._2.flatMap(x => x.split(" ")).map(x => (x, 1))))
rdd3.collect
returns in this case:
res61: Array[(String, Array[(String, Int)])] = Array((dbfs:/FileStore/tables/sample_text.txt,Array((Hi,1), (how,1), (are,1), (you,1), (today,1), (ZZZ,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((I,1), (am,1), (fine,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((I,1), (am,1), (also,1), (tired,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((You,1), (look,1), (good,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Can,1), (I,1), (stay,1), (with,1), (you?,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Bob,1), (will,1), (pop,1), (in,1), (later,1), (ZZZ,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Oh,1), (really?,1), (Nice,,1), (cool,1))))
You can modify this accordingly, stripping out this and that; line._1 is your file name. For plain RDDs, use the case-statement (pattern-matching) approach, as sketched below.
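For instance, a hedged sketch of that style on rdd3 from above:
// Pattern matching names the tuple parts instead of using ._1 and ._2
val summaries = rdd3.map { case (fileName, wordCounts) =>
  s"$fileName has ${wordCounts.length} words"
}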
UPD
Using your approach and correcting it (as you cannot do exactly what you want), here is some illustrative code, since I am not sure what you are wishing to do:
val file = sc.wholeTextFiles("/FileStore/tables/sample_text.txt",0)
val res = file.map(line => (line._1, line._2.split("\n").flatMap(x => x.split(" "))))
  .map(elem => (elem._1, elem._2, "This is also possible"))
res.collect
returns:
res: org.apache.spark.rdd.RDD[(String, Array[String], String)] =
MapPartitionsRDD[69] at map at command-1038477736158514:15
res9: Array[(String, Array[String], String)] =
Array((dbfs:/FileStore/tables/sample_text.txt,Array(Hi, how, are, you, today, "ZZZ
", I, am, "fine
", I, am, also, "tired
", You, look, "good
", Can, I, stay, with, "you?
", Bob, will, pop, in, later, "ZZZ
", Oh, really?, Nice,, "cool
"),This is also possible))
Your approach is not possible with flatMap; you need to follow the approach shown above.

How do I reduce tuples into a tuple of tuples in Scala

I have an RDD with rows of type
(a,(b,c,d))
(a,(e,f,g))
I am trying to reduce it by key such that it yields rows of type
(a,(b,c,d),(e,f,g)).
But I am getting an error while using this:
val redcd = mapd.reduceByKey((_,_))
How do I do it?
If you have an RDD such as
scala> mapd.foreach(println)
(a,(b,c,d))
(a,(e,f,g))
(b,(b,c,d))
Then doing
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList)
would give you
scala> redcd.foreach(println)
(b,List((b,c,d)))
(a,List((b,c,d), (e,f,g)))
Now if you want it in the format explained in the question, you can do
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList.mkString(", "))
which would generate
scala> redcd.foreach(println)
(a,(b,c,d), (e,f,g))
(b,(b,c,d))
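If you prefer to avoid groupBy entirely, a hedged sketch of a reduceByKey-based alternative that first wraps each value in a List:
// Wrap each value in a one-element List, then concatenate the Lists per key
val redcd2 = mapd.mapValues(List(_)).reduceByKey(_ ++ _)
// yields (a,List((b,c,d), (e,f,g))) and (b,List((b,c,d)))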
I hope the answer is helpful.

Use combineByKey to get output as (key, iterable[values])

I am trying to transform an RDD of (key, value) into an RDD of (key, iterable[value]), the same as the output returned by the groupByKey method.
But as groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used:
val data= List("abc,2017-10-04,15.2",
"abc,2017-10-03,19.67",
"abc,2017-10-02,19.8",
"xyz,2017-10-09,46.9",
"xyz,2017-10-08,48.4",
"xyz,2017-10-07,87.5",
"xyz,2017-10-04,83.03",
"xyz,2017-10-03,83.41",
"pqr,2017-09-30,18.18",
"pqr,2017-09-27,18.2",
"pqr,2017-09-26,19.2",
"pqr,2017-09-25,19.47",
"abc,2017-07-19,96.60",
"abc,2017-07-18,91.68",
"abc,2017-07-17,91.55")
val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})
// repartition and sort based on the key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))
val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
  (t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
  val a = x._2.+:(y._2)
  (x._1, a)
}
// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
  (t1: String, t2: String) => (t1, List(t2)),
  mergeValue,
  mergeCombiners)
temp.combineByKey is giving a compile-time error that I am not able to resolve.
If you want an output similar to what groupByKey will give you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient compared to using groupByKey followed by an aggregation (giving you the same result that one of the other groupBy methods could have given).
As the wanted result is an RDD[key, iterable[value]], building the list yourself or letting groupByKey do it will result in the same amount of work. There is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation; it lies in the distributed architecture.
For more information regarding groupByKey and these kinds of optimizations, I would recommend reading more here.
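For completeness, a minimal sketch of the groupByKey route on the temp RDD from the question:
// Each key ends up with an Iterable of its (date, value) pairs
val grouped = temp.groupByKey()
// grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, String)])]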

What's the best practice to merge RDDs in Scala

I have got multiple RDDs as results and want to merge them; they are all of the same format:
RDD(id, HashMap[String, HashMap[String, Int]])
     ^      ^               ^
     |      |               |
 identity category   distribution of the category
Here is an example of that RDD:
(1001, {age={10=3,15=5,16=8, ...}})
The first String key of the HashMap[String, HashMap] is the category of the statistic, and the HashMap[String, Int] inside it is the distribution of that category. After calculating the distribution of each of the various categories, I want to merge them by identity so that I can store the results in a database. Here is what I have currently:
def mergeRDD(rdd1: RDD[(String, util.HashMap[String, Object])],
             rdd2: RDD[(String, util.HashMap[String, Object])]): RDD[(String, util.HashMap[String, Object])] = {
  val mergedRDD = rdd1.join(rdd2).map {
    case (id, (m1, m2)) => {
      m1.putAll(m2)
      (id, m1)
    }
  }
  mergedRDD
}
val mergedRDD = mergeRDD(provinceRDD, mergeRDD(mergeRDD(levelRDD, genderRDD), actionTypeRDD))
I wrote the function mergeRDD so that I can merge two RDDs at a time, but I find that function is not very elegant. As a newbie to Scala, any inspiration is appreciated.
I don't see any easy way to achieve this without hurting performance. The reason is that you are not simply merging two RDDs; rather, you want your HashMaps to hold the consolidated values after the union of the RDDs.
Now, your merge function is wrong: in its current state, join will actually perform an inner join, missing rows present in one RDD but not in the other.
The correct way would be something like:
val mergedRDD = rdd1.union(rdd2).reduceByKey { (m1, m2) =>
  m1.putAll(m2) // putAll returns void, so return the mutated map explicitly
  m1
}
You may replace the java.util.HashMap with scala.collection.immutable.Map.
From there:
val rdds = List(provinceRDD, levelRDD, genderRDD, actionTypeRDD)
val unionRDD = rdds.reduce(_ ++ _)
val mergedRDD = unionRDD.reduceByKey(_ ++ _)
This is assuming that categories don't overlap between RDDs.
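If categories can overlap, a hedged sketch of a deep merge that sums the inner distributions instead of letting one overwrite the other:
// Combine the two category maps key by key, adding up the counts
val mergedDeep = unionRDD.reduceByKey { (m1, m2) =>
  (m1.keySet ++ m2.keySet).map { cat =>
    val d1 = m1.getOrElse(cat, Map.empty[String, Int])
    val d2 = m2.getOrElse(cat, Map.empty[String, Int])
    cat -> (d1.keySet ++ d2.keySet).map(k => k -> (d1.getOrElse(k, 0) + d2.getOrElse(k, 0))).toMap
  }.toMap
}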

How to join two RDDs by key to get RDD of (String, String)?

I have two paired RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should be like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
def cogroupTest(rdd1: RDD[(String, mutable.HashSet[String])], rdd2: RDD[(String, mutable.HashSet[String])]): Unit = {
  val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
  prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) =>
    val combinedhs = value1 ++ value2
    val sstr = combinedhs.mkString("\t")
    val keypadded = key + "\t"
    s"$keypadded$sstr"
  }.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. This means that in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
It is pretty clear when you take a look at the error you get. If you want to combine the pairs, you should use the join method. If not, you should adjust the pattern to match the structure you actually get and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
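Put together, a hedged sketch of the adjusted pattern (assuming every key occurs in both RDDs, since reduce fails on an empty Iterable):
prods_per_user_co_grouped.map { case (key, (values1, values2)) =>
  // Flatten each side's Iterable of HashSets into one set, then combine both sides
  val combinedhs = values1.reduce(_ ++ _) ++ values2.reduce(_ ++ _)
  s"$key\t${combinedhs.mkString("\t")}"
}.saveAsTextFile("/scratch/rdds_joined/")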