How to convert an RDD to another RDD using a case class property? - Scala

I have an RDD named other_nodes, as below:
(4,(1,true))
(22,(1,true))
(14,(1,true))
(3,(1,true))
(8,(1,true))
(18,(1,true))
I wrote a case class as below, applied it to a graph, and it gave the result I wanted:
case class nodes_properties(label:Int, isVisited:Boolean=false)
When I apply the case class to a graph, the result looks like this:
(1,nodes_properties(15,false))
(2,nodes_properties(11,false))
(3,nodes_properties(9,false))
Problem: how can I apply the case class I have defined to the other_nodes RDD, to get a result like the one below?
(4,nodes_properties(1,true))
(22,nodes_properties(1,true))
(14,nodes_properties(1,true))
(3,nodes_properties(1,true))
(8,nodes_properties(1,true))
(18,nodes_properties(1,true))

This solution might work:
scala> val data = sc.parallelize(Seq((4,(1, true)),(22,(1,true))))
data: org.apache.spark.rdd.RDD[(Int, (Int, Boolean))] = ParallelCollectionRDD[72] at parallelize at <console>:39
scala> data.take(2)
res27: Array[(Int, (Int, Boolean))] = Array((4,(1,true)), (22,(1,true)))
scala> val data1 = data.map(elem => (elem._1, nodes_properties(elem._2._1, elem._2._2)))
data1: org.apache.spark.rdd.RDD[(Int, nodes_properties)] = MapPartitionsRDD[73] at map at <console>:42
scala> data1.take(2)
res28: Array[(Int, nodes_properties)] = Array((4,nodes_properties(1,true)), (22,nodes_properties(1,true)))
EDIT
The problem is that each element in others_rdd is of type (VertexId, Any). You need to convert it to type (VertexId, (Int, Boolean)) in order for your case class to apply. The way to do this is:
val newRdd = others_rdd.map(elem => (elem._1, elem._2.asInstanceOf[(Int,Boolean)]))
After performing this conversion, you can apply the solution shown above by mapping to the nodes_properties case class.
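Putting the cast and the mapping together, a minimal sketch might look like this (assuming other_nodes has element type (VertexId, Any) as described above, and nodes_properties is the case class from the question):
// Cast each value from Any to (Int, Boolean), then wrap it in the case class
val converted = other_nodes.map { case (id, value) =>
  val (label, visited) = value.asInstanceOf[(Int, Boolean)]
  (id, nodes_properties(label, visited))
}
converted.take(6).foreach(println) // e.g. (4,nodes_properties(1,true))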
Let me know if it helps!!

Related

How to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and the keys are currently Strings (RDD[String]) and mainly contain Maps. I would like to make them of type Row so I can eventually make a DataFrame. Here is my RDD:
removed
Most of the RDD follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}

val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
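To actually build the DataFrame from that RDD[Row] you also need a matching StructType; a rough sketch, assuming three string columns (the column names here are invented for illustration, and spark is the SparkSession):
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema: adjust field names and types to the actual data
val schema = StructType(Seq(
  StructField("key", StringType, nullable = true),
  StructField("value1", StringType, nullable = true),
  StructField("value2", StringType, nullable = true)
))

val myDF = spark.createDataFrame(rddFinal, schema)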
Alternatively:
case class MyData(....) // all the fields of the schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
There is a method for converting an RDD to a DataFrame; use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want with it using SQL-style queries like below:
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Spark: Using mapPartition with Scala

Let's say I have the following DataFrame:
var randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3))
val df = sc.parallelize(randomData,2).toDF()
and I have this function, which will be the input for mapPartitions:
def trialIterator(row: Iterator[(String, Int)]): Iterator[(String, Int)] =
  row.toArray.tail.toIterator
And using mapPartitions:
df.mapPartitions(trialIterator)
I get the following error message:
Type mismatch, expected: (Iterator[Row]) => Iterator[NotInferedR], actual: (Iterator[(String, Int)]) => Iterator[(String, Int)]
I understand that this is happening due to the input and output types of my function, but how do I solve it?
If you want strongly typed input, don't use Dataset[Row] (DataFrame) but Dataset[T], where T in this particular scenario is (String, Int). Also, don't convert to an Array, and don't blindly call tail without knowing whether the partition is empty:
def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)
randomData
  .toDS // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
or
randomData.toDF // org.apache.spark.sql.Dataset[Row]
  .as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
You are expecting type Iterator[(String, Int)], while you should expect Iterator[Row]:
def trialIterator(row: Iterator[Row]): Iterator[Row] = {
  row.next() // consume the first element
  row // seems to do the same thing w/o all the conversions
}
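If the goal is still to end up with (String, Int) tuples rather than Rows, one option is to extract the fields explicitly; a sketch, assuming the first column is the String and the second the Int (the helper name dropFirstAndExtract is made up):
import org.apache.spark.sql.Row

def dropFirstAndExtract(rows: Iterator[Row]): Iterator[(String, Int)] =
  rows.drop(1).map(r => (r.getString(0), r.getInt(1))) // skip the first row, then read each column by position

// usage (needs import spark.implicits._ for the (String, Int) encoder):
// df.mapPartitions(dropFirstAndExtract _)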

Array[Byte] Spark RDD to String Spark RDD

I'm using Cloudera's SparkOnHBase module to get data from HBase.
I get an RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to the row key and a list of values, all of them represented by byte arrays.
If I save getRdd to a file, what I see is:
([B@f7e2590,[([B@22d418e2,[B@12adaf4b,[B@48cf6e81), ([B@2a5ffc7f,[B@3ba0b95,[B@2b4e651c), ([B@27d0277a,[B@52cfcf01,[B@491f7520), ([B@3042ad61,[B@6984d407,[B@f7c4db0), ([B@29d065c1,[B@30c87759,[B@39138d14), ([B@32933952,[B@5f98506e,[B@8c896ca), ([B@2923ac47,[B@65037e6a,[B@486094f5), ([B@3cd385f2,[B@62fef210,[B@4fc62b36), ([B@5b3f0f24,[B@8fb3349,[B@23e4023a), ([B@4e4e403e,[B@735bce9b,[B@10595d48), ([B@5afb2a5a,[B@1f99a960,[B@213eedd5), ([B@2a704c00,[B@328da9c4,[B@72849cc9), ([B@60518adb,[B@9736144,[B@75f6bc34)])
for each record (row key and columns).
But what I need is the String representation of each of the keys and values, or at least of the values, in order to save them to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala, and it's proving quite hard to get anywhere.
Could you please help me with that?
First, let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is what the data looks like:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
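From here, getting the flat key1,value1,value2,... layout the question asks for is one more map before saving; a sketch along these lines (the output path is just an example):
// Flatten each record to "key,value1,value2,..." and write it out as text
rdd2
  .map { case (k, vs) =>
    (k +: vs.flatMap { case (a, b, c) => Seq(a, b, c) }).mkString(",")
  }
  .saveAsTextFile("/tmp/feedback_as_strings") // example output path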
I don't know about HBase but if those Array[Byte]s are Unicode strings, something like this should work:
rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map { case (k, l) =>
  (new String(k),
   l.map { case (a, b, c) =>
     (new String(a), new String(b), new String(c))
   })
}
Sorry for bad styling and whatnot, I am not even sure it will work.

How to join two RDDs by key to get RDD of (String, String)?

I have two paired RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should look like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
def cogroupTest(rdd1: RDD[(String, mutable.HashSet[String])], rdd2: RDD[(String, mutable.HashSet[String])]): Unit = {
  val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
  prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) =>
    val combinedhs = value1 ++ value2
    val sstr = combinedhs.mkString("\t")
    val keypadded = key + "\t"
    s"$keypadded$sstr"
  }.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. It means that in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
This is pretty clear when you look at the error you get. If you want to combine pairs, you should use the join method. If not, you should adjust the pattern to match the structure you actually get, and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
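For completeness, a sketch of how the cogroup version could look once the pattern matches the Iterables (kept close to the question's code; the flattening strategy is an assumption):
prods_per_user_co_grouped.map { case (key, (left, right)) =>
  // left and right are Iterable[mutable.HashSet[String]]; flatten them before combining
  val combinedhs = (left.flatten ++ right.flatten).toSet
  s"$key\t${combinedhs.mkString("\t")}"
}.saveAsTextFile("/scratch/rdds_joined/")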

How to generate an array of tuples from a list in Scala?

I have a row of type (Int, List[String]), as below:
(22,List(B00000JCDS, B000004CSZ, B00016XN6Q, B00005LLY3, B00023B1UI))
I need to generate a tuple array or any other collection as below:
(22,B00000JCDS)
(22,B000004CSZ)
(22,B00016XN6Q)
(22,B00005LLY3)
(22,B00023B1UI)
How to generate this dataset in Scala?
scala> a._2.map((a._1,_))
res3: List[(Int, String)] = List((22,B00000JCDS), (22,B000004CSZ), (22,B00016XN6Q), (22,B00005LLY3), (22,B00023B1UI))
where a is (22,List(B00000JCDS, B000004CSZ, B00016XN6Q, B00005LLY3, B00023B1UI))
The most readable thing that comes to mind is to use a for comprehension:
scala> val g = (22,List("B00000JCDS", "B000004CSZ", "B00016XN6Q", "B00005LLY3", "B00023B1UI"))
g: (Int, List[String]) = (22,List(B00000JCDS, B000004CSZ, B00016XN6Q, B00005LLY3, B00023B1UI))
scala>
| for {
| fromList <- g._2
| } yield (g._1, fromList)
res3: List[(Int, String)] = List((22,B00000JCDS), (22,B000004CSZ), (22,B00016XN6Q), (22,B00005LLY3), (22,B00023B1UI))
If you want an array just call toArray.
Several approaches to this for a given row a, as follows,
Array.fill(a._2.size)(a._1) zip a._2
(Iterator.continually(a._1) zip a._2.iterator) toArray
Array.tabulate(a._2.size)(i => (a._1, a._2(i)) )
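If such rows live in an RDD rather than a single local value, the usual equivalent is flatMap; a small sketch (the RDD name rows is hypothetical):
// rows: RDD[(Int, List[String])] -- hypothetical RDD of such rows
val pairs = rows.flatMap { case (id, items) => items.map(item => (id, item)) }
// pairs: RDD[(Int, String)], e.g. (22,B00000JCDS), (22,B000004CSZ), ...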