Spark RDD map internal object to Row - scala

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
After a few transformations and grouping based on business logic, the data I have looks like this:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,
([Ljava.lang.String;@62ac8444,21626890421)),
(21626890421,([Ljava.lang.String;@59d80fe,21626890421)),
(21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this I want to get a map in something like the following format:
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I really appreciate any guidance on moving forward on this.

I found an answer, but I am not sure it is efficient; it feels more like a hack, so any better approaches are appreciated.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String],
String))]]
scala> val tGrouped = sGrouped.map(f =>f.map(_._2).map(c =>
Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] =
MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421],
[1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
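For what it is worth, here is a slightly more direct sketch (assuming the same cToP and grouped RDDs as above, and the column indices 0, 12, 18 taken from the snippet). It keeps the key next to its rows with mapValues, so the result already has the [String, Row[...]] shape asked for. The count of 1 is simply the number of keys: all three sample lines share the key 21626890421, so groupBy produces a single group and the mapped RDD has a single element.

import org.apache.spark.sql.Row

// grouped: RDD[(String, Iterable[(String, (Array[String], String))])], as above
val rowsByKey: org.apache.spark.rdd.RDD[(String, List[Row])] =
  grouped.mapValues(_.toList.map { case (_, (fields, _)) => Row(fields(0), fields(12), fields(18)) })

rowsByKey.count()   // one element per key, so still 1 for this sample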

Related

Spark shell using combineByKey with Object?

I have created a simple dataset to find averages. I found a way using tuples with the combineByKey option; the final result set looks like (key, (total, number of values)).
scala> mydata.combineByKey( value => (value,1) , (acc:(Int,Int),value) => (acc._1+value,acc._2+1), (acc1:(Int,Int),acc2:(Int,Int)) => (acc1._1 + acc2._1 , acc1._2 + acc2._2))
res75: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[42] at combineByKey at <console>:36
scala> res75.take(10)
res77: Array[(String, (Int, Int))] = Array((FWA,(309,1)), (SMX,(62,1)), (BMI,(91,2)), (HLN,(119,1)), (SUN,(118,1)), (HYS,(52,1)), (RIC,(1156,8)), (PSE,(72,1)), (SLC,(8699,8)), (EWN,(55,1)))
Finding the average value for FWA, SMX, and so on works fine with tuples and the combineByKey option.
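(For reference, the averaging step on this (total, count) shape is just a per-key division; a minimal sketch using the res75 result above:)

val averages = res75.mapValues { case (total, count) => total.toDouble / count }
averages.take(3)   // e.g. Array((FWA,309.0), (SMX,62.0), (BMI,45.5)) for the data shown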
I tried the same thing with an object, creating a case class fd with two fields, name and delay.
scala> case class fd(name:String,delay:Int)
defined class fd
scala> data.take(2)
res73: Array[fd] = Array(fd(DFW,11956), fd(DTW,588))
In the above RDD, how can I use the combineByKey option, since it is not a key/value pair?
Please suggest a way to find the average. Where can I find some advanced Spark programming material for my study?
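The post does not include an answer, but one way to approach it, sketched here under the assumption that data is the RDD[fd] shown above, is to first turn the case-class records into (name, delay) pairs and then reuse the same combineByKey pattern:

// Key the records by name so combineByKey has a key/value pair to work with.
val byName: org.apache.spark.rdd.RDD[(String, Int)] = data.map(f => (f.name, f.delay))

val avgDelay = byName.combineByKey(
    (d: Int)                       => (d, 1),                     // create a (total, count) combiner
    (acc: (Int, Int), d: Int)      => (acc._1 + d, acc._2 + 1),   // fold a value into the combiner
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // merge combiners across partitions
  .mapValues { case (total, count) => total.toDouble / count }    // (name, average delay)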

Sorting and Merging Using spark-shell

I have a string array in Scala:
Array[String] = Array(apple, banana, oranges, grapes, lichi, anar)
I have converted it into a format like this:
Array[(Int, String)] = Array((5,apple), (6,banana), (7,oranges), (6,grapes), (5,lichi), (4,anar))
and I want output like this:
Array[(Int, String)] = Array((4,anar), (5,applelichi), (6,bananagrapes), (7,oranges))
That is, after sorting I want to concatenate the words that have the same key.
I have done the sorting. Here is my code:
val a = sc.parallelize(List("apple","banana","oranges","grapes","lichi","anar"))
val b = a.map(x =>(x.length,x))
val c = b.sortBy(_._2)
You can use groupByKey() to do this and then merge the lists you get with mkString. Here is a small example using what you have (a and b are the same as yours):
val c = b.groupByKey().map{case (key, list) => (key, list.toList.sorted.mkString)}.sortBy(_._1)
c.collect() foreach println
Which will give you:
(4,anar)
(5,applelichi)
(6,bananagrapes)
(7,oranges)
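As a side note, the same result can be built without groupByKey by folding the per-key lists with aggregateByKey; a functionally equivalent sketch using the same a and b as above:

// Build a List[String] per key, merge the partial lists across partitions,
// then sort each list alphabetically, concatenate it, and sort by key.
val d = b.aggregateByKey(List.empty[String])((acc, word) => word :: acc, _ ::: _)
         .mapValues(_.sorted.mkString)
         .sortByKey()

d.collect() foreach println   // (4,anar), (5,applelichi), (6,bananagrapes), (7,oranges)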

Spark Sql data from org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]

I have an org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].
How can I print or retrieve the data?
I have code like:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()
val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1)))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The above code is returning org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]
In your first step, you have an extra .toDF(). The correct version is as below:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")
In your second step, you missed .rdd, so the actual second step is:
val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))
which has the dataType you mentioned in the question:
scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25
To view the groupByData RDD you can simply use foreach:
groupByData.foreach(println)
which would give you
((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
Your third step filters the data whose day column has the value day1, and keeps only the values of the grouped RDD:
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The returned dataType for this step is:
scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27
You can use foreach as above to view the data:
filterData.foreach(println)
which would give you
CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])
You can see that the returned dataType is an RDD[Iterable[org.apache.spark.sql.Row]], so you can print the individual values using a map:
filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect
which would give you
(day1,user1,session1,100.0)
(day1,user1,session2,200.0)
if you do only
filterData.map(x => x.map(y => println(y(0), y(3)))).collect
you would get
(day1,100.0)
(day1,200.0)
I hope the answer is helpful.
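As an aside, the same grouping and filtering can be done without dropping to RDDs at all by staying in the DataFrame API. A sketch, assuming the spark-shell implicits for the $ syntax are in scope as in the transcript above:

import org.apache.spark.sql.functions.{collect_list, struct}

// Collect each group's (sessionId, purchaseTotal) pairs into an array column,
// then filter on the day value, mirroring the RDD-based groupBy/filter above.
val groupedDF = sessionsDF
  .groupBy($"day", $"userId")
  .agg(collect_list(struct($"sessionId", $"purchaseTotal")).as("sessions"))

groupedDF.filter($"day" === "day1").show(false)
// one row: day1, user1, with the two day1 sessions collected into the sessions column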

Group By error 33: wrong number of parameters

I am fairly new to Spark and to programming, and I am not able to understand how to deal with an RDD of type rdd.RDD[(Int, Iterable[Double])] = ShuffledRDD[10] at groupByKey. I am interested in learning groupByKey in Spark, and I have a filtered RDD:
scala> p.first
res11: (Int, Double) = (1,299.98)
I got the above result; after applying groupByKey instead of reduceByKey I now have an RDD of type (Int, Iterable[Double]), and I want to get a result like (Int, sum(Double)).
I have tried this but got an error:
scala> val price = g.map((a,b) => (a, sum(b)))
<console>:33: error: wrong number of parameters; expected = 1
val price = g.map((a,b) => (a, sum(b)))
Please suggest how to do this and help me understand it.
g.mapValues(_.sum), which is short for g.map { case (k, v) => (k, v.sum) }
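A minimal runnable version of that, assuming g is the RDD[(Int, Iterable[Double])] produced by groupByKey: RDD.map takes a one-argument function, which is why (a,b) => ... fails with "wrong number of parameters"; the tuple has to be destructured with a pattern match, or mapValues used so only the values are touched.

val price = g.mapValues(_.sum)                                  // RDD[(Int, Double)]

// Equivalent form with an explicit pattern match on the (key, values) tuple:
val price2 = g.map { case (key, values) => (key, values.sum) }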

Spark how to transform RDD[Seq[(String, String)]] to RDD[(String, String)]

I have a Spark RDD[Seq[(String, String)]] which contains several groups of word pairs. Now I have to save them to a file in HDFS like this (no matter which Seq they are in):
dog cat
cat mouse
mouse milk
Could someone help me with this? Thanks a lot <3
EDIT:
Thanks for your help. Here is the solution
Code
val seqTermTermRDD: RDD[Seq[(String, String)]] = ...
val termTermRDD: RDD[(String, String)] = seqTermTermRDD.flatMap(identity)
val combinedTermsRDD: RDD[String] = termTermRDD.map{ case(term1, term2) => term1 + " " + term2 }
combinedTermsRDD.saveAsTextFile(outputFile)
RDDs have a neat function called flatMap that will do exactly what you want. Think of it as a map followed by a flatten (except implemented a little more intelligently): if the function produces multiple entities, each will be added to the resulting RDD separately. (You can also use this for many other objects in Scala.)
val seqRDD = sc.parallelize(Seq(Seq(("dog","cat"),("cat","mouse"),("mouse","milk"))),1)
val tupleRDD = seqRDD.flatMap(identity)
tupleRDD.collect //Array((dog,cat), (cat,mouse), (mouse,milk))
Note that I also use the Scala identity function, because flatMap expects a function from the RDD's element type to a TraversableOnce, which a Seq is.
You can also use the mkString(sep) function (where sep is the separator) on Scala collections. Here are some examples (note that in your code you would replace the final .collect().mkString("\n") with saveAsTextFile(filepath) to save to Hadoop):
scala> val rdd = sc.parallelize(Seq( Seq(("a", "b"), ("c", "d")), Seq( ("1", "2"), ("3", "4") ) ))
rdd: org.apache.spark.rdd.RDD[Seq[(String, String)]] = ParallelCollectionRDD[6102] at parallelize at <console>:71
scala> rdd.map( _.mkString("\n")) .collect().mkString("\n")
res307: String =
(a,b)
(c,d)
(1,2)
(3,4)
scala> rdd.map( _.mkString("|")) .collect().mkString("\n")
res308: String =
(a,b)|(c,d)
(1,2)|(3,4)
scala> rdd.map( _.mkString("\n")).map(_.replace("(", "").replace(")", "").replace(",", " ")) .collect().mkString("\n")
res309: String =
a b
c d
1 2
3 4