Dynamic Mapping Statement in Spark Scala

In my code I convert a DF into an RDD to run a function through a map call. With each run there is a couple of seconds of overhead, so if I run the function call like this:
var iterator = 1
val inputDF = spark.sql("select * from DF")
var columnPosition = Array("Column_4")
columnPosition = columnPosition ++ Array("Column_9")
var selectedDF = inputDF
var intrimDF = inputDF
var finalDF = inputDF
while (iterator <= columnPosition.length) {
  selectedDF = finalDF.selectExpr("foreign_key", columnPosition(iterator - 1))
  intrimDF = selectedDF.rdd
    .map(x => (x(0), action(x(1)))).toDF
    .selectExpr("_1 as some_key", "_2 as " + columnPosition(iterator - 1))
    .joinBack.changeColumnPositions.dropOriginalColumn.renameColumns // pseudocode for the join/cleanup steps
  finalDF = intrimDF
  iterator = iterator + 1
}
It runs in ~90 seconds for a large job because of the join. What I am trying to do is build it like below, to cut out the join entirely and have it be dynamic.
val inputDF = spark.sql("select * from DF")
val intrimDF = inputDF.rdd.map(x=>(x(0),x(1),x(2),action(x(3)),x(4),x(5),x(6),x(7),action(x(8))))
val columnStatement: Array[String] = ??? // build an Array with the column name changes
val finalDF = intrimDF.selectExpr(columnStatement :_*)
The issue is I can't get past the hard-coding side of the problem. Below is an example of what I want to do by dynamically setting the mapping call:
val mappingStatement = "x=>(x(0),x(1),x(2),action(x(3)),x(4),x(5),x(6),x(7),action(x(8)))"
val intrimDF = inputDF.rdd.map(mappingStatement)
Everything I have tried has failed:
1. Calling it using the Map() function
2. Setting up an Array and passing it as :_*
3. Trying to build it as a string and calling it, but it doesn't like being dynamic
Hope this makes sense!
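A string like mappingStatement can't be compiled into a lambda at runtime this way, but the same effect can be had without hard-coding by driving a single generic map with a set of column positions. This is only a sketch, untested, assuming action is an Any => Any function that preserves each column's type (otherwise the schema passed to createDataFrame has to be adjusted):

```scala
import org.apache.spark.sql.Row

// positions of the columns that action() should be applied to
val actionPositions = Set(3, 8)

val intrimRDD = inputDF.rdd.map { row =>
  Row.fromSeq(row.toSeq.zipWithIndex.map {
    case (value, i) if actionPositions.contains(i) => action(value)
    case (value, _)                                => value
  })
}
// rebuild a DataFrame, reusing the input schema
val finalDF = spark.createDataFrame(intrimRDD, inputDF.schema)
```

Adding another column is then just another entry in actionPositions, with no join and no string building.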

Related

Adding elements to ArrayBuffer in from a for loop in scala

import scala.collection.mutable.ArrayBuffer
spark.sql("set db=test_script")
spark.sql("set table=member_test")
val colDF = spark.sql("show columns from ${table} from ${db}")
var tempArray = new ArrayBuffer[String]()
var temp = ""
colDF.foreach { row => row.toSeq.foreach { col =>
  temp = "count(case when " + col + " = 'X' then 1 else NULL END) AS count" + col
  tempArray += temp
}}
println(tempArray) // getting empty array
println(temp) // getting blank string
Hi, I am new to Scala programming. I am trying to loop through a DataFrame and append the formatted String data to my ArrayBuffer.
When I put the print statement inside the loop, everything seems to be fine, whereas if I try to access the ArrayBuffer outside the loop, it's empty.
Is it something related to the scope of the variable?
I am using an ArrayBuffer because I learned that List is immutable in Scala.
Please suggest a better way if you have one.
Thanks in advance
The issue you are having is that Spark is a distributed system: copies of your buffer are sent to each executor and are not returned to the driver, which is why the buffer on the driver stays empty.
Also note that colDF is a DataFrame. This means that when you do
row => row.toSeq
the result is a Seq[Any] (which isn't good practice). A better way of doing this would be:
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
val sqlStatement = columns.map(c => s"count(case when $c = 'X' then 1 else NULL END) as count$c")
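To actually run those generated expressions, they can be assembled into one SQL statement, for example (a sketch reusing the sqlStatement array from above):

```scala
// glue the generated count expressions into a single query
val sqlText = s"select ${sqlStatement.mkString(", ")} from test_script.member_test"
val countsDF = spark.sql(sqlText)
```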
However, even better is not to use SQL at all and use Spark!
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, count, lit, when}

val selectStatement: List[Column] = columns.map { c =>
  count(when(col(c) === "X", lit(1))).as(s"count$c")
}.toList
dataFrame.select(selectStatement :_*)

Should I cache or not my unified dataframes?

I am not familiar with caching in Spark.
I need to do multiple DF unions inside a loop; each union adds a few million lines. Should I df.cache my result after each union?
var DB_List = List ("Database1", "Database2", "Database3", "Database4", "Database5", "Database6", "Database7", "Database8", "Database9", "Database10")
var df = getDF(spark, DB_List(0)) // this returns a DF.
for(i <- 1 until DB_List.length){
df = df.union(getDF(spark, DB_List(i)))
//df.cache or not?
}
//Here, I use df.repartition(1) to write resulted DF in a CSV file.
You don't need to cache the intermediate results, only the final one.
Instead of the for loop you can use reduce:
val dfs = DB_List.map(getDF(spark, _))
val result = dfs.reduce(_ union _)
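If you prefer to keep the first DataFrame as an explicit seed, the same thing can be written with foldLeft (a sketch, assuming getDF as in the question):

```scala
// seed with the first database, then union the rest one by one
val result = DB_List.tail.foldLeft(getDF(spark, DB_List.head)) { (acc, db) =>
  acc.union(getDF(spark, db))
}
result.cache() // cache only the final, fully-unioned DataFrame
```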

Can I recursively apply transformations to a Spark dataframe in scala?

Noodling around with Spark, using union to build up a suitably large test dataset. This works OK:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.union(df).union(df).count()
But I'd like to do something like this:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10){
df = df.union(df)
}
that barfs with error
<console>:27: error: reassignment to val
df = df.union(df)
^
I know this technique would work using python, but this is my first time using scala so I'm unsure of the syntax.
How can I recursively union a dataframe with itself n times?
If you use val on the dataset it becomes an immutable variable. That means you can't do any reassignments. If you change your definition to var df your code should work.
A functional approach without mutable data is:
val df = List(1,2,3,4,5).toDF
val bigDf = ( for (a <- 1 until 10) yield df ) reduce (_ union _)
The for loop will create an IndexedSeq of the specified length containing your DataFrame, and the reduce function will take the first DataFrame, union it with the second, and start again using the result.
Even shorter without the for loop:
val df = List(1,2,3,4,5).toDF
val bigDf = 1 until 10 map (_ => df) reduce (_ union _)
You could also do this with tail recursion using an arbitrary range:
#tailrec
def bigUnion(rng: Range, df: DataFrame): DataFrame = {
if (rng.isEmpty) df
else bigUnion(rng.tail, df.union(df))
}
val resultingBigDF = bigUnion(1.to(10), myDataFrame)
Please note this is untested code, based on similar things I had done.

Dropping constant columns in a csv file

I would like to drop columns which are constant in a DataFrame. Here is what I did, but I see that it takes much time, especially while writing the DataFrame into the CSV file. Please, any help to optimize the code so it takes less time?
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
val aggregations = df.drop("DateTime").columns.map(c => stddev(c).as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
val columnsToKeep: Seq[String] = (df2.first match {
case r : Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
.filter(_._1 != 0) // your special condition is in the filter
.map(_._2) // keep just the name of the column
// select columns with stddev != 0
val finalResult = df.select(columnsToKeep.head, columnsToKeep.tail : _*)
finalResult.write.option("header",true).csv("D:\\ProcessDataSet\\dataWithoutConstant\\Set _1Mud Pumps_MergedCleaned.csv")
I think there is not much room left for optimization; you are doing the right thing.
Maybe what you can try is to cache() your DataFrame df.
df is used in two separate Spark actions, so it is loaded twice.
Try :
...
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
df.cache()
val aggregations = df.drop("DateTime").columns.map(c => stddev(c).as(c))
...
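As a side note, constant columns can also be detected with countDistinct, which avoids the cast to Double and works for non-numeric columns as well. A sketch, untested:

```scala
import org.apache.spark.sql.functions.{col, countDistinct}

// one distinct-count aggregate per column
val aggs = df.columns.map(c => countDistinct(col(c)).as(c))
val counts = df.agg(aggs.head, aggs.tail: _*).first
// keep only the columns that have more than one distinct value
val toKeep = df.columns.filter(c => counts.getAs[Long](c) > 1)
val finalResult = df.select(toKeep.head, toKeep.tail: _*)
```

On very wide or large data, approx_count_distinct is a cheaper alternative with the same shape.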

Dynamically create dataframes in scala

I am new to Scala and Spark. I have a requirement to create the DataFrames dynamically by reading a file; each line of the file is a query. At the end, join all the DataFrames and store the result in a file.
I wrote the basic code below, but I am having trouble dynamically creating the DataFrames.
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql.SQLConf
import scala.io.Source
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val empFile = "/user/sri/sample2.txt"
sqlContext.load("com.databricks.spark.csv", Map("path" -> empFile, "header" -> "true")).registerTempTable("emp")
var cnt=0;
val filename ="emp.sql"
for (line <- Source.fromFile(filename).getLines)
{
println(line)
cnt += 1
//var dis: String = "emp"+cnt
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
println(dis)
//val dis = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
}
println(cnt)
exit
Please help me, suggest me if it can be done better way
What error are you getting? I assume your code won't compile, considering this line:
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
In Scala, defining the name of some construct (val "emp"+cnt in your case) programmatically is not easily possible.
In your case, you could use a collection to hold the results.
val queries = (for (line <- Source.fromFile(filename).getLines) yield {
  sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
}).toList // getLines is an Iterator, so materialise it before reusing it
val cnt = queries.length
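To then combine the collected DataFrames as the question asks, something like this should work with the same Spark 1.x API the question uses (a sketch, assuming each query returns the same columns):

```scala
// union all per-line results into one DataFrame
val combined = queries.reduce(_ unionAll _)
combined.show()
```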