Should I cache or not my unified dataframes? - scala

I am not familiar with caching in Spark.
I need to do multiple DF unions inside a loop. each union adds few million lines. Should I df.cache my result after each union?
var DB_List = List ("Database1", "Database2", "Database3", "Database4", "Database5", "Database6", "Database7", "Database8", "Database9", "Database10")
var df = getDF(spark, DB_List(0)) // this returns a DF.
for(i <- 1 until DB_List.length){
df = df.union(getDF(spark, DB_List(i)))
//df.cache or not?
}
//Here, I use df.repartition(1) to write resulted DF in a CSV file.

you don't need to cache the intermediate result but only the final one.
instead of for loop you can use fold:
val dfs = DB_List.map(getDF(spark, _))
val result = dfs.reduce(_ union _)

Related

Adding elements to ArrayBuffer in from a for loop in scala

import scala.collection.mutable.ArrayBuffer
spark.sql("set db=test_script")
spark.sql("set table=member_test")
val colDF = sql("show columns from ${table} from ${db}")
var tempArray = new ArrayBuffer[String]()
var temp
colDF.foreach { row => row.toSeq.foreach { col =>
temp = "count(case when "+ col+ " ='X' then 1 else NULL END) AS count"+ col
tempArray += temp
}}
println(tempArray) // getting empty array
println(temp) // getting blank string
Hi, I am new to scala programming. I am trying to loop through a dataframe and append the formatted String data to my ArrayBuffer.
When I put the print statement inside the for loop, everything, seems to be fine, whereas If i try to access the arrayBuffer outside the loop, its empty.
Is it something related to the scope of the variable?
I am using arrayBuffer, because I got to know that list is mutable in Scala.
Please suggest any better way if you have.
Thanks in advance
The issue you are having is that spark is a distributed system, which means copies of your buffer are sent to each executor (And not returned back to the driver), hence why it is empty.
Also note that colDF is a DataFrame. This means that when you do
row => row.toSeq
The result of this is an Array(Any) (this isn't good practice). A better way of doing this would be:
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
val sqlStatement = columns.map(c => s"count(case when $c = 'X' then 1 else NULL END) as count$c")
However, even better is not to use SQL at all and use Spark!
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
val selectStatement: List[Column] = columns.map{ c =>
count(when(col(c) === "X", lit(1)).as(s"count$c")
}.toList
dataFrame.select(selectStatement :_*)

dataframe.select, select dataframe columns from file

I am trying to create a child dataframe from parent dataframe. but I have more than 100 cols to select.
so in Select statement can I give the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
how can I send the columns names from file in all_cols
I would assume you would read file somewhere from hdfs or from shared config file? Reason for this, that on the cluster this code, would be executed on individual node etc.
In this case I would approach this with next pice of code:
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map( col(_)).toArray
val df3 = df2.select(cols :_ *)
Essentially, you just have to provide array of strings and use :_ * notation for variable number of arguments.
finally this worked for me;
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
//or
val filtered_file = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df=Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
//or
val final_df = Raw_input_data.select(filtered_file:_*)'

Check every column in a spark dataframe has a certain value

Can we check to see if every column in a spark dataframe contains a certain string(example "Y") using Spark-SQL or scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep the rows where all columns contain 'Y':
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
You can use data frame method columns to get all column's names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
var filteredDf = df.select(join5.col("*"))
for(name <- columnNames) {
filteredDf = filteredDf.filter(s"$name =='Y'")
}
or you can create a SQL query using same approach
If you want to filter every row, in which any of the columns is equal to 1 (or anything else), you can dynamically create a query like this:
cols = [col(c) == lit(1) for c in patients.columns]
query = cols[0]
for c in cols[1:]:
query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in cols)))
res.show()

Can I recursively apply transformations to a Spark dataframe in scala?

Noodling around with Spark, using union to build up a suitably large test dataset. This works OK:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.union(df).union(df).count()
But I'd like to do something like this:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10){
df = df.union(df)
}
that barfs with error
<console>:27: error: reassignment to val
df = df.union(df)
^
I know this technique would work using python, but this is my first time using scala so I'm unsure of the syntax.
How can I recursively union a dataframe with itself n times?
If you use val on the dataset it becomes an immutable variable. That means you can't do any reassignments. If you change your definition to var df your code should work.
A functional approach without mutable data is:
val df = List(1,2,3,4,5).toDF
val bigDf = ( for (a <- 1 until 10) yield df ) reduce (_ union _)
The for loop will create a IndexedSeq of the specified length containing your DataFrame and the reduce function will take the first DataFrame union it with the second and will start again using the result.
Even shorter without the for loop:
val df = List(1,2,3,4,5).toDF
val bigDf = 1 until 10 map (_ => df) reduce (_ union _)
You could also do this with tail recursion using an arbitrary range:
#tailrec
def bigUnion(rng: Range, df: DataFrame): DataFrame = {
if (rng.isEmpty) df
else bigUnion(rng.tail, df.union(df))
}
val resultingBigDF = bigUnion(1.to(10), myDataFrame)
Please note this is untested code based on a similar things I had done.

How to add corresponding Integer values in 2 different DataFrames

I have two DataFrames in my code with exact same dimensions, let's say 1,000,000 X 50. I need to add corresponding values in both dataframes. How to achieve that.
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other more elegent way?
Thanks.
Your approach is good. Another option can be two take the RDD and zip those together and then iterate over those to sum the columns and create a new dataframe using any of the original dataframe schemas.
Assuming the data types for all the columns are integer, this code snippets should work. Please note that, this has been done in spark 2.1.0.
import spark.implicits._
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100000 rows and each row has 50 columns. In case of your approach it has 51 columns, the extra one is for the ID column. In a single machine(no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds.
Where as my solution takes about 5378 milliseconds.
My assumption is the first solution takes a bit more time because of the union operation of the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}