Use combineByKey to get output as (key, iterable[values]) - scala

I am trying to transform an RDD(key, value) into an RDD(key, iterable[value]), the same output that the groupByKey method returns.
Since groupByKey is not efficient, I am trying to use combineByKey on the RDD instead, but it is not working. Below is the code used:
val data= List("abc,2017-10-04,15.2",
"abc,2017-10-03,19.67",
"abc,2017-10-02,19.8",
"xyz,2017-10-09,46.9",
"xyz,2017-10-08,48.4",
"xyz,2017-10-07,87.5",
"xyz,2017-10-04,83.03",
"xyz,2017-10-03,83.41",
"pqr,2017-09-30,18.18",
"pqr,2017-09-27,18.2",
"pqr,2017-09-26,19.2",
"pqr,2017-09-25,19.47",
"abc,2017-07-19,96.60",
"abc,2017-07-18,91.68",
"abc,2017-07-17,91.55")
val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
val row = line.split(",")
((row(0), row(1)), row(2))
})
// repartition and sort by key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))
val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
(t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
val a = x._2.+:(y._2)
(x._1, a)
}
// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
(t1: String, t2: String) => (t1, List(t2)),
mergeValue,
mergeCombiners)
temp.combineByKey gives a compile-time error, and I cannot figure out why.

If you want output like what groupByKey gives you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient than groupByKey followed by an aggregation, because they can combine values within each partition before the shuffle and so move less data.
As the wanted result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work: every value still has to be shuffled to the partition that owns its key. There is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation but the distributed architecture.
For more information regarding groupByKey and these types of optimizations, I would recommend reading more here.
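For reference, the compile error in the question most likely comes from createCombiner: combineByKey expects a function of a single value (V => C), while the question passes a two-argument function. Below is a minimal sketch of a version that type-checks, assuming a local SparkContext and a shortened copy of the sample data (both are assumptions for illustration). It does the same work as groupByKey, so it should not be expected to run faster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell, getOrCreate reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("combineByKey-sketch"))

// (key, (date, value)) pairs, shaped like `temp` in the question
val temp = sc.parallelize(Seq(
  ("abc", ("2017-10-04", "15.2")),
  ("abc", ("2017-10-03", "19.67")),
  ("xyz", ("2017-10-09", "46.9")),
  ("pqr", ("2017-09-30", "18.18")),
  ("xyz", ("2017-10-08", "48.4"))
))

type V = (String, String)

val grouped = temp.combineByKey(
  (v: V) => List(v),                   // createCombiner: takes ONE value, not two arguments
  (acc: List[V], v: V) => v :: acc,    // mergeValue: add a value to the list within a partition
  (a: List[V], b: List[V]) => a ::: b  // mergeCombiners: concatenate lists from different partitions
)

val result = grouped.collect().toMap
```

The order of values inside each list depends on partitioning, so it should not be relied upon.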

Related

How to use flatMap for flatten one component of a tuple

I have a tuple like (a, List(b, c, d)). I want output like:
(a,b)
(a,c)
(a,d)
I am trying to use flatMap for this but not getting any success. Even map is not helping in this case.
Input Data :
Chap01:Spark is an emerging technology
Chap01:You can easily learn Spark
Chap02:Hadoop is a Bigdata technology
Chap02:You can easily learn Spark and Hadoop
Code:
val rawData = sc.textFile("C:\\wc_input.txt")
val chapters = rawData.map(line => (line.split(":")(0), line.split(":")(1)))
val chapWords = chapters.flatMap(a => (a._1, a._2.split(" ")))
You could map over the second element of the tuple:
val t = ('a', List('b','c','d'))
val res = t._2.map((t._1, _))
The snippet above evaluates to:
res: List[(Char, Char)] = List((a,b), (a,c), (a,d))
This scenario can be handled easily with the flatMapValues method of a pair RDD. It operates only on the values and keeps the key unchanged.
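A minimal sketch of the flatMapValues approach, assuming a local SparkContext and two in-memory lines standing in for the input file (both are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell, getOrCreate reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("flatMapValues-sketch"))

// Assumption: two in-memory lines instead of reading the file
val rawData = sc.parallelize(Seq(
  "Chap01:Spark is an emerging technology",
  "Chap02:Hadoop is a Bigdata technology"
))

// Split each line once into (chapter, text)
val chapters = rawData.map { line =>
  val parts = line.split(":", 2)
  (parts(0), parts(1))
}

// flatMapValues keeps the key and emits one (key, word) pair per word in the value
val chapWords = chapters.flatMapValues(_.split(" ").toSeq)

val words = chapWords.collect()
```

The original attempt fails because flatMap expects the function to return a collection, whereas `(a._1, a._2.split(" "))` is a single tuple.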

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type Any (RDD[Any]) and mainly contain Maps; the keys are currently Strings (RDD[String]). I would like to convert them to type Row so I can eventually make a dataframe. Here is my rdd:
removed
Most of the rdd follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own rdd, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a dataframe you need a fixed schema. Once you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternatively:
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
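A fixed schema can also be spelled out explicitly with StructType and createDataFrame. The sketch below is only illustrative: the original data was removed, so the input records, the `libVersion` field, and the helper `toRow` are all hypothetical. It also shows one way to match on a Map: match on `Map[_, _]` first, because the key and value types are erased at runtime, and then match on the individual values.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assumption: local mode for illustration
val spark = SparkSession.builder().master("local[2]").appName("schema-sketch").getOrCreate()

// Hypothetical stand-in for the unstructured RDD[(String, Any)]
val rdd = spark.sparkContext.parallelize(Seq[(String, Any)](
  ("a", Map("libVersion" -> 1.5)),
  ("b", Map("libVersion" -> 2.0)),
  ("c", "not a map")
))

// Hypothetical helper: returns a Row when the record fits the schema, None otherwise
def toRow(rec: (String, Any)): Option[Row] = rec match {
  case (key, m: Map[_, _]) =>
    // Element types are erased, so the value is checked separately
    m.asInstanceOf[Map[Any, Any]].get("libVersion").collect {
      case v: Double => Row(key, v)
    }
  case _ => None
}

val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("libVersion", DoubleType, nullable = false)
))

// Drop records that do not fit, then attach the schema
val rows = rdd.flatMap(rec => toRow(rec).toList)
val df = spark.createDataFrame(rows, schema)
val collected = df.collect()
```

Returning Option instead of null makes it easy to drop records that do not fit, such as the keys that break the pattern.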
There is a method for converting an RDD to a DataFrame.
Use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
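Now that you have a DataFrame, you can run whatever SQL-style queries you want on it, like below:
import org.apache.spark.sql.functions.col // needed for col(...)
import spark.implicits._ // needed for toDF; assumes the SparkSession is named spark
val textFile = sc.textFile("hdfs://...")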
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Optimizing cartesian product using keys in spark

To avoid computing all possible combinations, I'm trying to group values according to a certain key, and then compute the cartesian product of the values for each key, i.e.:
Input [(k1, v1), (k1, v2), (k2, v3)]
Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
Here is the code I have tried executing:
val input = sc.textFile("data.csv")
val rdd = input.map(s=>s.split(","))
.map(s => (s(1).toString, s(2).toString))
val group_result: RDD[(String, Iterable[String])] = rdd.groupByKey()
group_result.flatMap { t =>
{
val stream1= t._2.toStream
val stream2= t._2.toStream
stream1.flatMap { src =>
stream2.par.map { trg =>
src + "," + trg
}
}
}
}
This works fine for very small files, but when the list(Iterable) is of length ~1000 the computation completely freezes.
As @zero323 said, the best way to solve this is to use the join method from PairRDDFunctions. To do so you need a pair RDD, which can be obtained with the RDD method keyBy.
You could do something like:
val rdd = sc.parallelize(Array(("k1", "v1"), ("k1", "v2"), ("k2", "v3"))).keyBy(_._1)
val result = rdd.join(rdd).map{
case (key: String, (x: Tuple2[String, String], y: Tuple2[String, String])) => (x._2, y._2)
}
result.take(20)
// res9: Array[(String, String)] = Array((v1,v1), (v1,v2), (v2,v1), (v2,v2), (v3, v3))
Here I share the notebook with the code.

what's the best practice to merge rdds in scala

I have several RDDs as results and want to merge them. They all have the same format:
RDD(id, HashMap[String, HashMap[String, Int]])
    ^             ^             ^
    |             |             |
 identity     category   distribution of the category
Here is an example of that rdd:
(1001, {age={10=3,15=5,16=8, ...}})
The String key of the outer HashMap[String, HashMap] is the category of the statistic, and the inner HashMap[String, Int] is the distribution for that category. After calculating the distribution for each of the various categories, I want to merge them by identity so that I can store the results in a database. Here is what I have so far:
def mergeRDD(rdd1: RDD[(String, util.HashMap[String, Object])],
rdd2:RDD[(String, util.HashMap[String, Object])]): RDD[(String, util.HashMap[String, Object])] = {
val mergedRDD = rdd1.join(rdd2).map{
case (id, (m1, m2)) => {
m1.putAll(m2)
(id, m1)
}
}
mergedRDD
}
val mergedRDD = mergeRDD(provinceRDD, mergeRDD(mergeRDD(levelRDD, genderRDD), actionTypeRDD))
I wrote a function mergeRDD so that I can merge two rdds at a time, but I find it not very elegant. As a newbie to Scala, I would appreciate any suggestions.
I don't see any easy way to achieve this without a performance hit.
The reason is that you are not simply merging two rdds; you want the hashmap for each key to hold the consolidated values after the union of the rdds.
Also, your merge function is wrong. In its current state, join performs an inner join, so it drops rows whose key is present in one rdd but not in the other.
The correct way would be something like:
val mergedRDD = rdd1.union(rdd2).reduceByKey {
case (m1, m2) => {
m1.putAll(m2) // putAll returns Unit, so the merged map must be returned explicitly
m1
}
}
You may replace the java.util.HashMap with scala.collection.immutable.Map
From there:
val rdds = List(provinceRDD, levelRDD, genderRDD, actionTypeRDD)
val unionRDD = rdds.reduce(_ ++ _)
val mergedRDD = unionRDD.reduceByKey(_ ++ _)
This is assuming that categories don't overlap between rdds.
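If categories can overlap between rdds, `_ ++ _` would silently let the right-hand map win. A minimal sketch of a merge function that handles overlap is shown below. It assumes that counts for the same bucket should be summed, which the question does not state; the names `Dist`, `Stats`, `mergeDist`, and `mergeStats` are hypothetical.

```scala
// Distribution of one category, e.g. Map("10" -> 3, "15" -> 5)
type Dist = Map[String, Int]
// All categories for one identity, e.g. Map("age" -> Dist)
type Stats = Map[String, Dist]

// Merge two distributions, summing the counts of buckets present in both (assumption)
def mergeDist(a: Dist, b: Dist): Dist =
  b.foldLeft(a) { case (acc, (bucket, n)) =>
    acc.updated(bucket, acc.getOrElse(bucket, 0) + n)
  }

// Merge two category maps, combining distributions when a category appears in both
def mergeStats(a: Stats, b: Stats): Stats =
  b.foldLeft(a) { case (acc, (category, dist)) =>
    acc.updated(category, mergeDist(acc.getOrElse(category, Map.empty[String, Int]), dist))
  }

val left: Stats  = Map("age" -> Map("10" -> 3, "15" -> 5))
val right: Stats = Map("age" -> Map("15" -> 2), "gender" -> Map("f" -> 4))
val merged = mergeStats(left, right)

// With Spark, this function could take the place of `_ ++ _`:
//   unionRDD.reduceByKey(mergeStats _)
```

Because both functions are associative and commutative, they are suitable as a reduceByKey argument.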

How to join two RDDs by key to get RDD of (String, String)?

I have two pair RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should look like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
def cogroupTest (rdd1: RDD [(String, mutable.HashSet[String])], rdd2: RDD [(String, mutable.HashSet[String])] ): Unit =
{
val prods_per_user_co_grouped = (rdd1).cogroup(rdd2)
prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) => {
val combinedhs = value1 ++ value2
val sstr = combinedhs.mkString("\t")
val keypadded = key + "\t"
s"$keypadded$sstr"
}
}.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. This means that in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
This is clear from the error message. If you want to combine pairs, use the join method. Otherwise, adjust the pattern to match the structure you actually get, and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
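Both options can be sketched as follows, assuming a local SparkContext and shortened sample sets; the key "999" is a hypothetical key that exists only in rdd2. Note that cogroup yields an empty Iterable for a key missing from one side, and reduce throws on an empty collection, so this sketch swaps in flatten for reduce.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell, getOrCreate reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cogroup-sketch"))

val rdd1 = sc.parallelize(Seq(
  ("332101231222", mutable.HashSet("320758", "320762"))
))
val rdd2 = sc.parallelize(Seq(
  ("332101231222", mutable.HashSet("220758", "220762")),
  ("999", mutable.HashSet("1")) // hypothetical key present only in rdd2
))

// Option 1: join keeps only keys present in both RDDs and yields plain (set, set) pairs
val joined = rdd1.join(rdd2).mapValues { case (s1, s2) => (s1 ++ s2).toSet }

// Option 2: cogroup keeps every key; each side is an Iterable of sets that may be empty
val cogrouped = rdd1.cogroup(rdd2).mapValues { case (it1, it2) =>
  (it1.flatten ++ it2.flatten).toSet
}

val joinedResult = joined.collect().toMap
val cogroupedResult = cogrouped.collect().toMap
```

The choice depends on whether keys that appear in only one RDD should survive: join drops them, cogroup keeps them.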