Spark: Extract Values from Output RDD - scala

I am new in Spark programming. I am trying to extract values from RDD as I got the below output from RDD
(CBI10006,(Some(Himanshu Vasani),None))
(CBI10004,(Some(Sonam Petro),Some(8500)))
(CBI10003,(None,Some(3000)))
And I want to extract above value to below one
(CBI10006,Himanshu Vasani,'')
(CBI10004,Sonam Petro,8500)
(CBI10003,'',3000)
And I have tried FlatMap approch as below
joined.flatMap{case(f1,f2) => (f1,(f2._1,f2._2))} but getting a below error
type mismatch;
found : (String, (Option[String], Option[String]))
required: TraversableOnce[?]
**joined.flatMap{case(f1,f2) => (f1,(f2._1,f2._2))}**

Using map():
val data = Seq(("CBI10006", (Some("Himanshu Vasani"), None)), ("CBI10004", (Some("Sonam Petro"), Some(8500))),
("CBI10003", (None, Some(3000))))
spark.sparkContext
.parallelize(data)
.map { case (x, y) => (x, y._1.getOrElse(""), y._2.getOrElse("")) }
.foreach(println)
// output:
// (CBI10006,Himanshu Vasani,)
// (CBI10004,Sonam Petro,8500)
// (CBI10003,,3000)

Related

How to convert var to List?

How to convert one var to two var List?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result should be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, input is not a valid json
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to make it valid json RDD ( as you are going to use apache spark)
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
Once you have valid json rdd, you can easily convert that to dataframe and then apply the logic you have
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
.groupBy("level")
.agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
.select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get desired output in dataframe as
+------------------------------------------------+--------------------------------------------+
|var1 |var2 |
+------------------------------------------------+--------------------------------------------+
|[WrappedArray(name1, name2), WrappedArray(add1)]|[WrappedArray(id1, id2), WrappedArray(city)]|
+------------------------------------------------+--------------------------------------------+
And you can convert the array to list if required
To get the values as in the question, you can do the following
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name1, name2), List(add1))
val second = rdd(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id1, id2), List(city))
I hope the answer is helpful
reduceByKey is the important function to achieve your required output. More explaination on step by step reduceByKey explanation
You can do the following
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
(values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))

Scala: GraphX: error: class Array takes type parameters

I am trying to build an Edge RDD for GraphX. I am reading a csv file and converting to DataFrame Then trying to convert to an Edge RDD:
val staticDataFrame = spark.
read.
option("header", true).
option("inferSchema", true).
csv("/projects/pdw/aiw_test/aiw/haris/Customers_DDSW-withDN$.csv")
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] =
staticDataFrame.select(
"dealer_customer_number",
"parent_dealer_cust_number",
"dealer_code"
).map{ (row: Array) =>
Edge((
row.getAs[Long]("dealer_customer_number"),
row.getAs[Long]("parent_dealer_cust_number"),
row("dealer_code")
))
}
But I am getting this error:
<console>:81: error: class Array takes type parameters
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] = staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").map((row: Array) => Edge((row.getAs[Long]("dealer_customer_number"), row.getAs[Long]("parent_dealer_cust_number"), row("dealer_code"))))
^
The result for
staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").take(1)
is
res3: Array[org.apache.spark.sql.Row] = Array([0000101,null,B110])
First, Array takes type parameters, so you would have to write Array[Something]. But this is probably not what you want anyway.
The dataframe is a Dataset[Row], not a Dataset[Array[_]], therefore you have to change
.map{ (row: Array) =>
to
.map{ (row: Row) =>
Or just omit the typing completely (it should be inferred):
.map{ row =>

how to convert RDD[(String, Any)] to Array(Row)?

I've got a unstructured RDD with keys and values. The values is of RDD[Any] and the keys are currently Strings, RDD[String] and mainly contain Maps. I would like to make them of type Row so I can make a dataframe eventually. Here is my rdd :
removed
Most of the rdd follows a pattern except for the last 4 keys, how should this be dealt with ? Perhaps split them into their own rdd, especially for reverseDeltas ?
Thanks
Edit
This is what I've tired so far based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
how it doesn't see to match any of those cases, how can I match on Map in scala ? I keep getting nulls back when printing out parsedRdd
To convert the RDD to a dataframe you need to have fixed schema. If you define the schema for the RDD rest is simple.
something like
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternate
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
there is a method for converting an rdd to dataframe
use it like below
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
no you have dataframe do what ever you want on it using sql queries like below
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

SCALA : Read the text file and create tuple of it

How to create a tuple from the below-existing RDD?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you are intending to have Array[Tuples4] then you can do the following
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each fields as tuples
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have variable sized input file as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can write match case pattern matching as
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As #eliasah pointed that above procedure is a bad practice which is using product iterator. As his suggestion we should know the maximum elements of the input data and use following logic where we assign default values for no elements
val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as #philantrovert pointed out, we can verify the output in the following way, if we are not using REPL
arrayTuples.foreach(println)
which results to
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)

Spark: Using mapPartition with Scala

Lets say I am having the following dataframe:
var randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3)
val df = sc.parallelize(randomData,2).toDF()
and I am having this function which will be an input for the mapPartition:
def trialIterator(row:Iterator[(String,Int)]): Iterator[(String,Int)] =
row.toArray.tail.toIterator
And using the map partition:
df.mapPartition(trialIterator)
I am having the following error message:
Type mismatch, expected (Iterator[Row]) => Iterator[NotInferedR], actual: Iterator[(String,Int) => Iterator[(String,Int)]
I can understand that this is happening due to the input, output type of my function but how to solve this?
If you want to get strongly typed input don't use Dataset[Row] (DataFrame) but Dataset[T] where T in this particular scenario is (String, Int). Also don't convert to Array and don't call blindly tail without knowing if partition is empty:
def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)
randomData
.toDS // org.apache.spark.sql.Dataset[(String, Int)]
.mapPartitions(trialIterator _)
or
randomData.toDF // org.apache.spark.sql.Dataset[Row]
.as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
.mapPartitions(trialIterator _)
You expecting type Iterator[(String,Int)] while you should expect Iterator[Row]
def trialIterator(row:Iterator[Row]): Iterator[(String,Int)] = {
row.next()
row //seems to do the same thing w/o all the conversions
}