Spark shell using combineByKey with Object? - scala

I have created a simple dataset and want to find the average. I found a way using a tuple with the combineByKey option. The final result set looks like this: (key, (total, no. of values)).
scala> mydata.combineByKey(value => (value, 1), (acc: (Int, Int), value) => (acc._1 + value, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
res75: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[42] at combineByKey at <console>:36
scala> res75.take(10)
res77: Array[(String, (Int, Int))] = Array((FWA,(309,1)), (SMX,(62,1)), (BMI,(91,2)), (HLN,(119,1)), (SUN,(118,1)), (HYS,(52,1)), (RIC,(1156,8)), (PSE,(72,1)), (SLC,(8699,8)), (EWN,(55,1)))
Finding the average value for FWA, SMX, and so on works fine with the tuple and combineByKey approach.
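For completeness, the final averaging step over those (total, count) pairs would look something like the following sketch on top of the res75 RDD above:
val averages = res75.mapValues { case (total, count) => total.toDouble / count } // (String, Double) per key
averages.take(3) // e.g. (FWA,309.0), (SMX,62.0), (BMI,45.5) for the sample above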
I tried the same thing with an object. I created a case class fd with two fields, name and delay.
scala> case class fd(name:String,delay:Int)
defined class fd
scala> data.take(2)
res73: Array[fd] = Array(fd(DFW,11956), fd(DTW,588))
How can I use the combineByKey option on the above RDD, since it is not a key-value pair?
Please suggest how I can find the average. Also, where can I find some advanced Spark programming material for my study?
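One way to handle this (a sketch, assuming data is the RDD[fd] shown above) is to first map the case-class RDD into (name, delay) key-value pairs and then reuse the same combineByKey pattern as with the tuples; the same mapValues step as above then gives the averages:
val pairs = data.map(f => (f.name, f.delay)) // RDD[(String, Int)]
val totals = pairs.combineByKey(
  (delay: Int) => (delay, 1),                                    // createCombiner: (total, count)
  (acc: (Int, Int), delay: Int) => (acc._1 + delay, acc._2 + 1), // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
val avgDelay = totals.mapValues { case (total, count) => total.toDouble / count }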

Related

Spark RDD - CountByValue - Map type - order by key

Spark's countByValue on an RDD returns a Map, and I want to sort it by key, ascending or descending.
val s = flightsObjectRDD.map(_.dep_delay / 60 toInt).countByValue() // countByValue is an action and returns a Map
s.toSeq.sortBy(_._1)
The above code works as expected. But does countByValue itself have implicit sorting, and how can I implement it that way?
With this you leave the Big Data realm and get into Scala itself, and then into all those collection structures that are immutable, sorted, hashed, or mutable, or a combination of these. I think that is the reason for the initial -1. Nice folks out there, anyway.
Take this example: countByValue returns a Map to the driver, so it is only of interest for small amounts of data. A Map is also made of (key, value) pairs, but it is hashed and immutable, so we need to manipulate it. Here is what you can do. First, you can sort the Map on the key in ascending order.
val rdd1 = sc.parallelize(Seq(("HR",5),("RD",4),("ADMIN",5),("SALES",4),("SER",6),("MAN",8),("MAN",8),("HR",5),("HR",6),("HR",5)))
val map = rdd1.countByValue
import scala.collection.immutable.ListMap
val res1 = ListMap(map.toSeq.sortBy(_._1):_*) // ascending sort on the key part of the Map
res1: scala.collection.immutable.ListMap[(String, Int),Long] = Map((ADMIN,5) -> 1, (HR,5) -> 3, (HR,6) -> 1, (MAN,8) -> 2, (RD,4) -> 1, (SALES,4) -> 1, (SER,6) -> 1)
However, you cannot apply reverse or descending ordering on the key while staying with a hashed Map. The next best thing is the following:
val res2 = map.toList.sortBy(_._1).reverse
val res22 = map.toSeq.sortBy(_._1).reverse
res2: List[((String, Int), Long)] = List(((SER,6),1), ((SALES,4),1), ((RD,4),1), ((MAN,8),2), ((HR,6),1), ((HR,5),3), ((ADMIN,5),1))
res22: Seq[((String, Int), Long)] = ArrayBuffer(((SER,6),1), ((SALES,4),1), ((RD,4),1), ((MAN,8),2), ((HR,6),1), ((HR,5),3), ((ADMIN,5),1))
But you cannot apply .toMap to the .reverse result here, as it will re-hash and lose the sort order. So you must make a compromise.
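If a descending order is the goal, one alternative (a sketch, reusing rdd1 from above) is to reproduce countByValue with reduceByKey and let Spark sort before collecting, so only the already ordered result reaches the driver:
val countsDesc = rdd1
  .map(x => (x, 1L))            // each (dept, value) pair becomes a key
  .reduceByKey(_ + _)           // the same counts countByValue would produce
  .sortByKey(ascending = false) // distributed descending sort on the key
  .collect()                    // Array[((String, Int), Long)], already ordered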

Group By error 33: wrong number of parameters

I am kind of new to Spark and to programming, and I am not able to understand how to deal with an RDD of type rdd.RDD[(Int, Iterable[Double])] = ShuffledRDD[10] at groupByKey. I am interested in learning groupByKey in Spark, and I have a filtered RDD:
scala> p.first
res11: (Int, Double) = (1,299.98)
I got the above result. After applying groupByKey instead of reduceByKey, I now have an RDD of type (Int, Iterable[Double]), and I want to get a result like (Int, sum(Double)).
I tried this but got an error:
scala> val price = g.map((a,b) => (a, sum(b)))
<console>:33: error: wrong number of parameters; expected = 1
val price = g.map((a,b) => (a, sum(b)))
Please suggest a fix and help me understand this.
g.mapValues(_.sum), which is short for g.map { case (k, v) => (k, v.sum) }
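The error appears because map expects a function of one argument (the whole pair), so (a, b) => ... is read as two parameters. A small self-contained sketch with illustrative values:
val g = sc.parallelize(Seq((1, Iterable(299.98, 100.02)), (2, Iterable(50.0))))
val price = g.map { case (k, values) => (k, values.sum) } // pattern-match the single pair argument
val price2 = g.mapValues(_.sum)                           // equivalent, keeps the key untouched
price.collect() // Array((1,400.0), (2,50.0))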

Spark RDD map internal object to Row

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
The data I have after a few transformations and grouping based on business logic looks like this:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, (Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,([Ljava.lang.String;@62ac8444,21626890421)), (21626890421,([Ljava.lang.String;@59d80fe,21626890421)), (21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this I want to get a map that yields something like the following format:
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I really appreciate any guidance on moving forward on this.
I found an answer, but I am not sure whether it is an efficient way; any better approaches are appreciated, as this feels more like a hack.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, (Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String], String))]]
scala> val tGrouped = sGrouped.map(f => f.map(_._2).map(c => Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] = MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421], [1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
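The count is 1 because there is only one distinct key in this sample, so the groupBy produces a single group, and dropping the key leaves a single List element. A possible cleaner variant (a sketch, assuming cToP is the RDD[(String, (Array[String], String))] from the question and the same column indexes 0, 12 and 18) keeps the key, which matches the desired [String, Row[...]] shape and makes the count equal to the number of distinct keys:
import org.apache.spark.sql.Row

val keyed = cToP
  .groupByKey() // RDD[(String, Iterable[(Array[String], String)])]
  .mapValues(_.map { case (arr, _) => Row(arr(0), arr(12), arr(18)) }.toList)
// keyed: RDD[(String, List[Row])]; keyed.count() == number of distinct keys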

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in Spark. For that I need a Tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the usage of the identity function, which replicates the functionality of flatten on an RDD, and that reduceByKey() has a nifty underscore notation for conciseness.
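Putting it together, a minimal sketch of the whole pipeline under the same assumptions (a data.txt with one word per line):
val lines  = sparkContext.textFile("data.txt")
val pairs  = lines.map(s => Map(s -> 1))                 // RDD[Map[String, Int]]
val counts = pairs.flatMap(identity).reduceByKey(_ + _)  // RDD[(String, Int)]
// Simpler still, skip the Map entirely: lines.map(s => (s, 1)).reduceByKey(_ + _)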

Scala Spark - Reduce RDD by adding multiple values per key

I have a Spark RDD that is in the format of (String, (Int, Int)) and I would like to add the Int values together to create a (String, Int) map.
This is an example of an element in my RDD:
res13: (String, (Int, Int)) = (9D4669B432A0FD,(1,1))
I would like to end with an RDD of (String, Int) = (9D4669B432A0FD,2)
You should just map each value to the sum of the inner pair:
yourRdd.map(pair => (pair._1, pair._2._1 + pair._2._2))
@marios suggested the following nicer syntax in an edit:
Or if you want to make it a bit more readable:
yourRdd.map{case(str, (x1,x2)) => (str, x1+x2)}
Gabor Bakos's answer is correct if the keys are unique. But if you have multiple identical keys and you want to reduce them to unique keys, then use reduceByKey.
Example:
val data = Array(("9888wq",(1,2)),("abcd",(1,1)),("abcd",(3,2)),("9888wq",(4,2)))
val rdd= sc.parallelize(data)
val result = rdd.map(x => (x._1,(x._2._1+x._2._2))).reduceByKey((x,y) => x+y)
result.foreach(println)
Output:
(9888wq,9)
(abcd,7)