Efficient way to convert DataFrame to RDD in Scala/Spark?

I have a DataFrame with columns [CUSTOMER_ID, itemType, eventTimeStamp, valueType], which I convert to RDD[(String, (String, String, Map[String, Int]))] by doing the following:
val tempFile = result.map { r =>
  val customerId = r.getAs[String]("CUSTOMER_ID")
  val itemType = r.getAs[String]("itemType")
  val eventTimeStamp = r.getAs[String]("eventTimeStamp")
  val valueType = r.getAs[Map[String, Int]]("valueType")
  (customerId, (itemType, eventTimeStamp, valueType))
}
Since my input is huge, this takes a long time. Is there a more efficient way to convert the DataFrame to RDD[(String, (String, String, Map[String, Int]))]?

The operation you've described is as cheap as it's going to get. Doing a few getAs and allocating a few tuples is almost free. If it's going slow, that's probably unavoidable due to your large data size (7T). Also note that Catalyst optimizations cannot be performed on RDDs, so including this kind of .map downstream of DataFrame operations will often prevent other Spark shortcuts.
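If the downstream work can stay in the Dataset API instead of dropping to an RDD, Catalyst remains in play. A minimal sketch, assuming a hypothetical case class mirroring the columns (the class name and the placeholder result value are not from the question):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class mirroring the DataFrame columns
case class CustomerEvent(CUSTOMER_ID: String,
                         itemType: String,
                         eventTimeStamp: String,
                         valueType: Map[String, Int])

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

val result: DataFrame = ??? // the DataFrame from the question

// Stays a Dataset, so Catalyst can still optimize downstream operations
val typed: Dataset[(String, (String, String, Map[String, Int]))] =
  result.as[CustomerEvent]
    .map(e => (e.CUSTOMER_ID, (e.itemType, e.eventTimeStamp, e.valueType)))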

Related

Eliminate for loops in Spark using Scala

I have a scenario in which I iterate over a list of DataFrames, perform the same type of transformation on each one using a for loop, and store the transformed DataFrames in a Map(String -> DataFrame).
for (df <- dfList) {
  // perform some transformation of the dataframe
  dfMap = dfMap + ("some_name" -> df)
}
This solution works, but only sequentially. I want to use async/parallel execution to get better performance, so that the transformations on each DataFrame run in parallel and make use of Spark's distributed processing capabilities.
Check the code below.
def logic(df: DataFrame): Map[String, DataFrame] = {
  ??? // return Map[String, DataFrame]
}

val dfa: DataFrame = ??? // DataFrame 1
val dfb: DataFrame = ??? // DataFrame 2
val dfc: DataFrame = ??? // DataFrame 3

Seq(dfa, dfb, dfc)
  .par               // process the DataFrames in parallel
  .map(logic)        // invoke the logic function for every DataFrame
  .reduce(_ ++ _)    // final result: Map("aaa" -> dfa, "bbb" -> dfb, "ccc" -> dfc)
Update
def writeToMap(a: Int, i: Int) = Map(a -> i)
def doOperation(a: Int)=writeToMap(a,a+10)
val list = Seq.range(0, 33)
list.par.map(x => doOperation(x))
val dfList: List[DataFrame] = ??? // your DataFrame list
val dfMap: Map[String, DataFrame] = dfList.map("some_name" -> _).toMap
.map pairs each DataFrame with a name.
.toMap aggregates the resulting pairs into a Map.
Note: some_name must be unique for every DataFrame, otherwise duplicate keys collapse into a single entry.
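A minimal sketch of one way to guarantee unique names, assuming a hypothetical transform function and a caller-supplied name list (none of these identifiers come from the answer above):

import org.apache.spark.sql.DataFrame

def transform(df: DataFrame): DataFrame = df   // placeholder transformation
val dfList: List[DataFrame] = ???              // your DataFrame list
val names: List[String] = List("dfa", "dfb", "dfc")

// Pair each DataFrame with its own unique name, transform in parallel,
// then collect the pairs into a Map
val dfMap: Map[String, DataFrame] =
  names.zip(dfList)
    .par
    .map { case (name, df) => name -> transform(df) }
    .seq
    .toMap

Note that .par only parallelizes the driver-side loop that builds and submits the Spark plans; the transformations themselves still execute lazily on the cluster. In Scala 2.13, parallel collections live in the separate scala-parallel-collections module.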

Use combineByKey to get output as (key, iterable[values])

I am trying to transform an RDD[(key, value)] into an RDD[(key, Iterable[value])], the same output as returned by the groupByKey method.
But since groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used:
val data= List("abc,2017-10-04,15.2",
"abc,2017-10-03,19.67",
"abc,2017-10-02,19.8",
"xyz,2017-10-09,46.9",
"xyz,2017-10-08,48.4",
"xyz,2017-10-07,87.5",
"xyz,2017-10-04,83.03",
"xyz,2017-10-03,83.41",
"pqr,2017-09-30,18.18",
"pqr,2017-09-27,18.2",
"pqr,2017-09-26,19.2",
"pqr,2017-09-25,19.47",
"abc,2017-07-19,96.60",
"abc,2017-07-18,91.68",
"abc,2017-07-17,91.55")
val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})
// repartition and sort within partitions based on the key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))
val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
(t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
val a = x._2.+:(y._2)
(x._1, a)
}
// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
(t1: String, t2: String) => (t1, List(t2)),
mergeValue,
mergeCombiners)
temp.combineByKey gives a compile-time error that I am unable to resolve.
If you want output similar to what groupByKey gives you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient when compared to using groupByKey followed by an aggregation (which gives you the same result that one of the other groupBy methods could have given directly).
As the wanted result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work. There is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation; it lies in the distributed architecture.
For more information regarding groupByKey and these types of optimizations, I would recommend reading more here.
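For reference, a minimal sketch of the groupByKey version, building on the temp RDD defined in the question (the println loop is just for illustration):

import org.apache.spark.rdd.RDD

// temp has type RDD[(String, (String, String))]
val grouped: RDD[(String, Iterable[(String, String)])] = temp.groupByKey()

grouped.collect().foreach { case (key, values) =>
  println(s"$key -> ${values.mkString(", ")}")
}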

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and the keys are currently Strings (RDD[String]) and mainly contain Maps. I would like to make them of type Row so I can eventually make a DataFrame. Here is my rdd:
removed
Most of the rdd follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own rdd, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert it to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}

val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2: RDD[Array[String]] = rdd.map(x => getParsedRow(x))
val rddFinal: RDD[Row] = rdd2.map(x => Row.fromSeq(x))
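A minimal sketch of the final step, creating the DataFrame from the Row RDD above; the column names, their count, and the SparkSession value are assumptions for illustration only:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Fixed schema matching whatever getParsedRow produces; three string columns assumed here
val schema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("field2", StringType, nullable = true),
  StructField("field3", StringType, nullable = true)
))

// rddFinal is the RDD[Row] built in the snippet above
val df = spark.createDataFrame(rddFinal, schema)
df.printSchema()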
Alternate
case class MyData(....) // all the fields of the schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert it to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
There is a method for converting an RDD to a DataFrame; use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now that you have a DataFrame, you can do whatever you want with it using SQL-like queries, as below:
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

How to convert two columns from a data frame to Map(col1, col2) in scala?

How can I convert two columns from a data frame to a Map(col1, col2) in Scala?
I tried :
val resultMap = df.select($"col1", $"col2")
  .map {
    case Row(a: String, b: String) => Map(a.asInstanceOf[String] -> b.asInstanceOf[String])
  }
But I wasn't able to get the values from this map. Is there any other way to do this?
There is no Dataset Encoder for Map[String, String]; I'm not even sure you can actually make one at all.
Here are two versions that do what you want, one unsafe and one safe. Effectively you'll need to drop down to the RDD level to do the computation:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

case class OnFrame(df: DataFrame) {
  import df.sparkSession.implicits._

  /**
   * If the input columns don't match, we'll fail at query evaluation.
   */
  def unsafeRDDMap: RDD[Map[String, String]] = {
    df.rdd.map(row => Map(row.getAs[String]("col1") -> row.getAs[String]("col2")))
  }

  /**
   * Use Dataset-to-case-class mapping.
   * If the input columns don't match, we'll fail before query evaluation.
   */
  def safeRDDMap: RDD[Map[String, String]] = {
    df
      .select($"col1" as "key", $"col2" as "value")
      .as[OnFrame.Entry]
      .rdd
      .map(_.toMap)
  }

  def unsafeMap(): Map[String, String] = {
    unsafeRDDMap.reduce(_ ++ _)
  }

  def safeMap(): Map[String, String] = {
    safeRDDMap.reduce(_ ++ _)
  }
}

object OnFrame {
  // Companion definition implied by .as[OnFrame.Entry] and .map(_.toMap) above
  case class Entry(key: String, value: String) {
    def toMap: Map[String, String] = Map(key -> value)
  }
}
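A hypothetical usage sketch; the column names col1/col2 match the question, but the sample data and the spark session value are assumptions:

import spark.implicits._

// Hypothetical two-column DataFrame
val df = Seq(("a", "1"), ("b", "2")).toDF("col1", "col2")

OnFrame(df).safeMap()   // Map("a" -> "1", "b" -> "2")
OnFrame(df).unsafeMap() // same result, but column mismatches only surface at evaluation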
If you state your goal more clearly, perhaps we could do this even more efficiently; collecting everything into a single map is a potential Spark anti-pattern, since it means your data has to fit on the driver.

removing layers from a joined RDD and giving names to the elements

I am working on a sequence of joins between RDDs, and after a few joins it gets really confusing to access each element by index. Below is one of my joined RDDs. This is just a simple example; in practice it can get much uglier.
res41: org.apache.spark.rdd.RDD[(String, ((String, Double), Double))]
Can I:
Give names to each of these elements in the RDD and then access them?
Remove the layers and get all the elements flattened as comma-separated values? I know flatMap might help, but I don't know how to use it.
Any help will be appreciated
You don't mention which programming language you are using, but in Scala you could flatten and name your fields by declaring a case class and mapping your RDD to it:
val conf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(conf)

val data = List(
  ("abc", (("x", 12.3), 23.4)),
  ("def", (("y", 22.3), 24.4)),
  ("jkl", (("z", 32.3), 25.4))
)
val rdd = sc.parallelize(data)

case class MyDataStructure(field1: String, field2: String, field3: Double, field4: Double)

val caseRDD = rdd.map {
  case (f1, ((f2, f3), f4)) => MyDataStructure(f1, f2, f3, f4)
}
caseRDD // has type RDD[MyDataStructure]
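For the second part of the question, a minimal sketch of flattening each record into a comma-separated string, using the same rdd as above (the field order is an assumption):

// Flatten each nested tuple into a single comma-separated string
val csvRDD = rdd.map {
  case (f1, ((f2, f3), f4)) => s"$f1,$f2,$f3,$f4"
}
csvRDD.collect().foreach(println)
// abc,x,12.3,23.4
// def,y,22.3,24.4
// jkl,z,32.3,25.4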