Using KeyValueGroupedDataset cogroup in Spark - Scala

I would like to use the cogroup method on KeyValueGroupedDataset in Spark. Here is a Scala attempt:
import org.apache.spark.sql.functions._
val x1 = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS
val g1 = x1.groupByKey(_._1)
val x2 = Seq(("a", "ali"), ("b", "bob"), ("c", "celine"), ("a", "amin"), ("c", "cecile")).toDS
val g2 = x2.groupByKey(_._1)
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
But I am getting an error:
<console>:34: error: overloaded method value cogroup with alternatives:
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U], f: org.apache.spark.api.java.function.CoGroupFunction[String,(String, Int),U,R], encoder: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R] <and>
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U])(f: (String, Iterator[(String, Int)], Iterator[U]) => TraversableOnce[R])(implicit evidence$11: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R]
cannot be applied to (org.apache.spark.sql.KeyValueGroupedDataset[String,(String, String)], (Long, Iterator[(String, Int)], Iterator[(String, String)]) => Iterator[(String, Int)])
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
I am getting the same error in Java.

The cogroup variant you are trying to use is curried, so you have to pass the dataset and the function in separate argument lists. There is also a type mismatch in the key type: groupByKey(_._1) produces String keys, not Long:
g1.cogroup(g2)(
  (k: String, it1: Iterator[(String, Int)], it2: Iterator[(String, String)]) =>
    it1)
or just:
g1.cogroup(g2)((_, it1, _) => it1)
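Since the point of cogroup is usually to combine the two sides, here is a minimal sketch with the same g1 and g2 that pairs every name with every age for a key (the pairing logic itself is a hypothetical example, and it assumes spark.implicits._ is in scope, as it must be for toDS above):
val joined = g1.cogroup(g2) { (key, ages, names) =>
  val as = ages.map(_._2).toList  // materialize first: the iterators are single-pass
  names.flatMap { case (_, name) => as.map(a => (key, name, a)) }
}
joined.show()  // columns default to _1, _2, _3; row order after the shuffle is not guaranteed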
In Java, I'd use the CoGroupFunction variant:
import org.apache.spark.api.java.function.CoGroupFunction;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

g1.cogroup(
    g2,
    (CoGroupFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, Tuple2<String, Integer>>)
        (key, it1, it2) -> it1,
    Encoders.tuple(Encoders.STRING(), Encoders.INT()));
where g1 and g2 are KeyValueGroupedDataset<String, Tuple2<String, Integer>> and KeyValueGroupedDataset<String, Tuple2<String, String>>, respectively.

Related

Query related to combineByKey

For the following input => [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
after processing with combineByKey, I am expecting the output below:
Expected output => [('A', [(3, 9), (4, 16), (5, 25)]), ('B', [(1, 1), (2, 4)])]
scala> val x = sc.parallelize(Array(('B',1),('B',2),('A',3),('A',4),('A',5)))
x: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
scala> def createCombiner (element:Int) :String = (element.toString + "," + Math.pow(element,2).toInt)
createCombiner: (element: Int)String
scala> def mergeValue (accumlator:String, element:Int) : String = (accumlator + (element.toString + Math.pow(element,2).toInt))
mergeValue: (accumlator: String, element: Int)String
scala> def mergeComb (accumlator:String ,accumlator1:String):String = (accumlator + accumlator1)
mergeComb: (accumlator: String, accumlator1: String)String
scala> val combRDD = x.map(t => (t._1, (t._2))).combineByKey(createCombiner, mergeValue, mergeComb)
combRDD: org.apache.spark.rdd.RDD[(Char, String)] = ShuffledRDD[48] at combineByKey at <console>:31
scala> combRDD.collect
res39: Array[(Char, String)] = Array((A,3,94,165,25), (B,1,12,4))
I am not able to get the expected output. As I am very new to Spark, I need some input on this.
What about:
scala> val x = sc.parallelize(Array(('B',1),('B',2),('A',3),('A',4),('A',5)))
scala> def createCombiner(element:Int) : List[(Int, Int)] = List(element -> element * element)
scala> def mergeValue (accumulator: List[(Int, Int)], element:Int) : List[(Int, Int)] = accumulator ++ createCombiner(element)
scala> def mergeComb (accumulator: List[(Int, Int)], accumulator1: List[(Int, Int)]): List[(Int, Int)] = (accumulator ++ accumulator1)
scala> val combRDD = x.combineByKey(createCombiner, mergeValue, mergeComb)
scala> combRDD.collect
// res0: Array[(Char, List[(Int, Int)])] = Array((A,List((3,9), (4,16), (5,25))), (B,List((1,1), (2,4))))
// Or
scala> combRDD.mapValues(_.mkString("[", ", ", "]")).collect
res1: Array[(Char, String)] = Array((A,[(3,9), (4,16), (5,25)]), (B,[(1,1), (2,4)]))
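For comparison, the same result can be sketched with a plain groupByKey (less efficient, since every raw value is shuffled unaggregated, but it makes the intent obvious):
x.groupByKey()
  .mapValues(_.toList.map(v => (v, v * v)))
  .collect()
// Array((A,List((3,9), (4,16), (5,25))), (B,List((1,1), (2,4)))) -- order within a list may vary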

How to access a value of a Scala tuple

I have a sequence of tuples, each with a value and its square:
val fields3: Seq[(Int, Int)] = Seq((3, 9), (5, 25))
What I want to know is whether there is a way to refer to a value of the same tuple directly when I create it, without using a foreach:
val fields3: Seq[(Int, Int)] = Seq((3, 3 * 3 ), (5, 5 * 5))
my idea is something like:
val fields3: Seq[(Int, Int)] = Seq((3, _1 * _1), (5, _1 * _1)) // something like this, but it doesn't compile
You can do something like this:
Seq(2,3,4).map(i => (i, i*i))
Alternatively, you could wrap the tuple in a case class:
case class TupleInt(base: Int) {
  val tuple: (Int, Int) = (base, base * base)
}
Then you could create the sequence like this:
val fields3: Seq[(Int, Int)] = Seq(TupleInt(3), TupleInt(5)).map(_.tuple)
I would prefer the answer #geek94 gave; this is too verbose for what you want to do.
An equally valid way to express this is:
val fields3: Seq[(Int, Int)] = Seq(3, 5).map(i => i -> i*i)
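If this pattern recurs, the squaring could also be factored into a tiny helper (withSquare is a hypothetical name):
def withSquare(i: Int): (Int, Int) = i -> i * i

val fields3: Seq[(Int, Int)] = Seq(3, 5).map(withSquare)
// Seq((3,9), (5,25))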

Map and Split the data based on key in Spark Scala

How can I achieve this in Scala?
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", "ccc$$ddd$$eee")))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
After collecting, I have an Array[(String, String)]:
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below
1,aaa
1,bbb
2,ccc
2,ddd
2,eee
Thanks in advance
You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))
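Since res1 is a pair RDD, flatMapValues is an equally idiomatic option; it flattens the values while leaving the key untouched:
res1.flatMapValues(_.split("-")).collect
// Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))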

Flatmap scala [String, String, List[String]]

I have this problem: I have an RDD[(String, String, List[String])], and I would like to "flatmap" it to obtain an RDD[(String, String, String)].
e.g.:
val x: RDD[(String, String, List[String])] =
  RDD[(a, b, List("ra", "re", "ri"))]
I would like to get:
val result: RDD[(String, String, String)] =
  RDD[(a, b, ra), (a, b, re), (a, b, ri)]
Use flatMap:
val rdd = sc.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
// rdd: org.apache.spark.rdd.RDD[(String, String, List[String])] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd.flatMap{ case (x, y, z) => z.map((x, y, _)) }.collect
// res23: Array[(String, String, String)] = Array((a,b,ra), (a,b,re), (a,b,ri))
An alternative way of doing it, again using flatMap:
val rdd = sparkContext.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
rdd.flatMap(t => t._3.map(s => (t._1, t._2, s))).foreach(println)
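If the data lives in a Dataset/DataFrame rather than an RDD, the same flattening is usually expressed with explode (a sketch assuming a SparkSession named spark is in scope):
import org.apache.spark.sql.functions.explode
import spark.implicits._  // for toDF and the $ column syntax

val df = Seq(("a", "b", List("ra", "re", "ri"))).toDF("c1", "c2", "items")
df.select($"c1", $"c2", explode($"items").as("item")).show()
// +---+---+----+
// | c1| c2|item|
// +---+---+----+
// |  a|  b|  ra|
// |  a|  b|  re|
// |  a|  b|  ri|
// +---+---+----+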

Why does Spark/Scala compiler fail to find toDF on RDD[Map[Int, Int]]?

Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not one of type scala.collection.immutable.Map[Int,Int]. Why is that?
For the same reason you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source, you'll find two relevant overloads of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
As you can see, both require A to be a subtype of Product. When you call toDF on an RDD[(Map[Int,Int], Int)] it works because Tuple2 is indeed a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping Map with Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
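The resulting schema has a single map column named _1 (the default tuple field name); printSchema should show roughly:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF.printSchema
// root
//  |-- _1: map (nullable = true)
//  |    |-- key: integer
//  |    |-- value: integer (valueContainsNull = false)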
Basically, it is because there is no implicit conversion to create a DataFrame from a Map inside an RDD.
In your first example you are returning a Tuple, which is a Product, and for Product there is an implicit conversion:
rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A])
In the second example you have a Map in your RDD, for which there is no implicit conversion.
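Equivalently, a small case class works, since case classes extend Product, and it gives the column a readable name (Wrapper is a hypothetical name):
case class Wrapper(m: Map[Int, Int])

sc.parallelize(1 to 10).map(x => Wrapper(Map(x -> 0))).toDF
// org.apache.spark.sql.DataFrame = [m: map<int,int>]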