Array[Byte] Spark RDD to String Spark RDD - scala

I'm using Cloudera's SparkOnHBase module to get data from HBase.
I get an RDD this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to a row key and a list of values, all of them represented as byte arrays.
If I save getRdd to a file, what I see is:
([B@f7e2590,[([B@22d418e2,[B@12adaf4b,[B@48cf6e81), ([B@2a5ffc7f,[B@3ba0b95,[B@2b4e651c), ([B@27d0277a,[B@52cfcf01,[B@491f7520), ([B@3042ad61,[B@6984d407,[B@f7c4db0), ([B@29d065c1,[B@30c87759,[B@39138d14), ([B@32933952,[B@5f98506e,[B@8c896ca), ([B@2923ac47,[B@65037e6a,[B@486094f5), ([B@3cd385f2,[B@62fef210,[B@4fc62b36), ([B@5b3f0f24,[B@8fb3349,[B@23e4023a), ([B@4e4e403e,[B@735bce9b,[B@10595d48), ([B@5afb2a5a,[B@1f99a960,[B@213eedd5), ([B@2a704c00,[B@328da9c4,[B@72849cc9), ([B@60518adb,[B@9736144,[B@75f6bc34)])
for each record (row key and columns).
But what I need is the String representation of each and every key and value, or at least of the values, so I can save it to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala, and it's been quite hard to get anywhere.
Could you please help me with that?

What you are seeing is just the default toString of a Java byte array ([B@hashCode), not its contents. First let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is how the data is:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format:
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data:
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
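Note that new String(a) decodes with the platform's default charset; if the HBase bytes are known to be UTF-8, it is safer to say so explicitly:

import java.nio.charset.StandardCharsets

// same conversion, but with an explicit charset
def b2s(a: Array[Byte]): String = new String(a, StandardCharsets.UTF_8)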
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
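If you then want the key1,value1,value2... layout from the question, you can flatten each record into a single comma-separated line before saving (the output path here is just an example):

// one line per record: key first, then all column values
rdd2.map { case (k, v) =>
  (k +: v.flatMap { case (a, b, c) => List(a, b, c) }).mkString(",")
}.saveAsTextFile("/tmp/feedback_strings")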

I don't know about HBase, but if those Array[Byte]s are Unicode strings, something like this should work:
val rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*

rdd.map { case (k, l) =>
  (new String(k, "UTF-8"),
   l.map { case (a, b, c) =>
     (new String(a, "UTF-8"), new String(b, "UTF-8"), new String(c, "UTF-8"))
   })
}
I can't test this against HBase myself, so treat it as a sketch.

Related

How to convert an RDD to another RDD using a case class property?

I have an RDD named other_nodes, as below:
(4,(1,true))
(22,(1,true))
(14,(1,true))
(3,(1,true))
(8,(1,true))
(18,(1,true))
I wrote a case class as below and applied it to a graph, and it gave the result I wanted:
case class nodes_properties(label:Int, isVisited:Boolean=false)
When I apply the case class to a graph, the result looks like this:
(1,nodes_properties(15,false))
(2,nodes_properties(11,false))
(3,nodes_properties(9,false))
Problem: how can I apply the case class I have defined to the other_nodes RDD, to get a result like the one below?
(4,nodes_properties(1,true))
(22,nodes_properties(1,true))
(14,nodes_properties(1,true))
(3,nodes_properties(1,true))
(8,nodes_properties(1,true))
(18,nodes_properties(1,true))
This solution might work:
scala> val data = sc.parallelize(Seq((4,(1, true)),(22,(1,true))))
data: org.apache.spark.rdd.RDD[(Int, (Int, Boolean))] = ParallelCollectionRDD[72] at parallelize at <console>:39
scala> data.take(2)
res27: Array[(Int, (Int, Boolean))] = Array((4,(1,true)), (22,(1,true)))
scala> val data1 = data.map(elem => (elem._1, nodes_properties(elem._2._1, elem._2._2)))
data1: org.apache.spark.rdd.RDD[(Int, nodes_properties)] = MapPartitionsRDD[73] at map at <console>:42
scala> data1.take(2)
res28: Array[(Int, nodes_properties)] = Array((4,nodes_properties(1,true)), (22,nodes_properties(1,true)))
EDIT
The problem is that each element in others_rdd is of type (VertexId, Any). You need to convert it to type (VertexId, (Int, Boolean)) in order for your case class to apply. The way to do it is:
val newRdd = others_rdd.map(elem => (elem._1, elem._2.asInstanceOf[(Int,Boolean)]))
After performing this, you can apply the solution shown above by mapping to the nodes_properties class.
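Putting both steps together, a minimal sketch (assuming, as in the question, that other_nodes has element type (VertexId, Any)):

// cast the Any payload first, then wrap it in the case class
val result = other_nodes
  .map { case (id, props) => (id, props.asInstanceOf[(Int, Boolean)]) }
  .map { case (id, (label, visited)) => (id, nodes_properties(label, visited)) }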
Let me know if it helps!!

How to convert RDD[(String, Any)] to Array[Row]?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and mainly contain Maps; the keys are currently Strings, RDD[String]. I would like to make them of type Row so I can eventually make a DataFrame. Here is my rdd:
removed
Most of the rdd follows a pattern, except for the last 4 keys. How should this be dealt with? Perhaps by splitting them into their own rdd, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
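Here getParsedRow is a placeholder for your own parsing logic; since the actual records were removed from the question, the sketch below is purely hypothetical:

// hypothetical parser: the real cases depend on the shape of your records
def getParsedRow(x: (String, Any)): Array[String] = x match {
  case (key, m: Map[_, _]) => Array(key, m.mkString(";"))
  case (key, other)        => Array(key, String.valueOf(other))
}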
Alternate
case class MyData(....) // all the fields of the Schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
There is a method for converting an RDD to a DataFrame; use it like below:
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want with it using SQL-like queries, for example:
import org.apache.spark.sql.functions.col

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

SCALA: Read a text file and create tuples from it

How can I create tuples from the existing RDD below?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you are intending to have Array[Tuple4] then you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field of the tuples:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
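(Note that foreach would be more idiomatic than map here, since we only want the println side effect; that is also why the REPL reports a useless Array[Unit].)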
Updated
If you have a variable-sized input file such as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can write a match-case pattern as
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As @eliasah pointed out, the above procedure is bad practice: the mixed tuple arities fall back to the Product iterator (hence the Array[Product with Serializable] type). As he suggested, we should know the maximum number of elements in the input data and use the following logic, where we assign default values for missing elements:
import scala.util.Try

val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as @philantrovert pointed out, if we are not in the REPL we can verify the output in the following way:
arrayTuples.foreach(println)
which results in:
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)

How to join two RDDs by key to get RDD of (String, String)?

I have two paired RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should look like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
def cogroupTest(rdd1: RDD[(String, mutable.HashSet[String])], rdd2: RDD[(String, mutable.HashSet[String])]): Unit = {
  val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
  prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) =>
    val combinedhs = value1 ++ value2
    val sstr = combinedhs.mkString("\t")
    val keypadded = key + "\t"
    s"$keypadded$sstr"
  }.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. It means that in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
It is pretty clear when you take a look at the error you get. If you want to combine pairs you should use the join method. If not, you should adjust the pattern to match the structure you actually get and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
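Put together, a minimal sketch of the adjusted map (using flatten instead of reduce, so that keys present on only one side do not fail on an empty Iterable):

prods_per_user_co_grouped.map { case (key, (sets1, sets2)) =>
  // each side is an Iterable[mutable.HashSet[String]] after cogroup
  val combinedhs = sets1.flatten.toSet ++ sets2.flatten.toSet
  s"$key\t${combinedhs.mkString("\t")}"
}.saveAsTextFile("/scratch/rdds_joined/")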

Make .csv files from arrays in Scala

I have two tuples in Scala of the following form:
val array1 = (bucket1, Seq((dateA, Amount11), (dateB, Amount12), (dateC, Amount13)))
val array2 = (bucket2, Seq((dateA, Amount21), (dateB, Amount22), (dateC, Amount23)))
What is the quickest way to make a .csv file in Scala such that:
date* is the pivot.
bucket* is the column name.
Amount* fills the table.
It needs to look something like this:
Dates______________bucket1__________bucket2
dateA______________Amount11________Amount21
dateB______________Amount12________Amount22
dateC______________Amount13________Amount23
You can make it shorter by chaining some operations, but:
scala> val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
array1: (String, Seq[(String, String)]) =
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)))
scala> val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
array2: (String, Seq[(String, String)]) =
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
// Single array to work with
scala> val arrays = List(array1, array2)
arrays: List[(String, Seq[(String, String)])] = List(
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13))),
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
)
// Split between buckets and the values
scala> val (buckets, values) = arrays.unzip
buckets: List[String] = List(bucket1, bucket2)
values: List[Seq[(String, String)]] = List(
List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)),
List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23))
)
// Format the data
// Note that this does not keep the 'dateX' order
scala> val grouped = values.flatten
.groupBy(_._1)
.map { case (date, list) => date::(list.map(_._2)) }
grouped: scala.collection.immutable.Iterable[List[String]] = List(
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join everything, and add the "Dates" column in front of the buckets
scala> val table = ("Dates"::buckets)::grouped.toList
table: List[List[String]] = List(
List(Dates, bucket1, bucket2),
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join the rows by ',' and the lines by "\n"
scala> val string = table.map(_.mkString(",")).mkString("\n")
string: String =
Dates,bucket1,bucket2
dateC,Amount13,Amount23
dateB,Amount12,Amount22
dateA,Amount11,Amount21
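To actually get a .csv file on disk, the string still has to be written out; a minimal sketch using java.nio (the buckets.csv path is just an example):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// write the CSV text as a UTF-8 encoded file
Files.write(Paths.get("buckets.csv"), string.getBytes(StandardCharsets.UTF_8))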