Spark Join Single Dataframe to a Collection of Dataframes - scala

I'm struggling to figure out an elegant solution to join a single dataframe to a separate sequence of 1 to N related dataframes. Initial attempt:
val sources = program.attributes.map(attr => {
  spark.read
    .option("header", value = true)
    .schema(program.GetSchema(attr))
    .csv(s"${program.programRawHdfsDirectory}/${attr.sourceFile}")
})
val rawDf: DataFrame = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))
// Full of fail:
val fullDf: DataFrame = program.dimensions.filter(d => d.hierarchy != "RAW").reduceLeft((d1, _) => {
  val hierarchy = spark.read.parquet(d1.hierarchyLocation).where(d1.hierarchyFilter)
  rawDf.join(hierarchy, d1.hierarchyJoin)
})
fullDf.selectExpr(program.outputFields:_*).write.parquet(program.programEtlHdfsDirectory)
The reduceLeft idea doesn't work because I'm iterating through a collection of configuration objects (the dimensions property), but what I want returned from each iteration is a dataframe. The error is a type mismatch, which is not surprising.
The core of the problem is that I have 1 to N "dimension" objects that define how to load an existing hierarchy table and also how to join that table to my "raw" dataframe I created earlier.
Any idea how I might create these joins without some sort of horrible hack?
UPDATE:
I wonder if this might work? I have a common field name in each hierarchy dataframe that I'm joining to. If I renamed this common field to match the corresponding column in my "raw" dataframe, could I execute the joins in a fold without explicitly calling out the columns? Will Spark just default to the matching names?
val rawDf = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))
val hierarchies = program.dimensions.map(dim => {
  spark.read.parquet(dim.hierarchyLocation).where(dim.hierarchyFilter).withColumnRenamed("parent_hier_cd", dim.columnName)
})
val fullDf = hierarchies.foldLeft(rawDf) { (df1, df2) => df1.join(df2) }
UPDATE 2
No, that does not work. With no join expression, Spark has nothing to match on and attempts a cross join.

For my purposes, I simply needed to return a tuple when generating the collection of hierarchies:
val hierarchies = program.dimensions.map(dim => {
  val hierarchy = spark.read.parquet(dim.hierarchyLocation).where(dim.hierarchyFilter).alias(dim.hierarchy.toLowerCase)
  (dim, hierarchy)
})
Then when I fold them into rawDf, I have the metadata I need to construct the joins.
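For completeness, the fold over those tuples ends up looking roughly like this (a sketch; dim.hierarchyJoin stands in for whatever join columns the dimension metadata exposes, and the join type is left at the default):
val fullDf = hierarchies.foldLeft(rawDf) { case (acc, (dim, hierarchy)) =>
  // dim carries the metadata needed to build each join
  acc.join(hierarchy, dim.hierarchyJoin)
}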

Related

How to add columns to df with StructField Array?

I have two dataframes, and I want to add to the first one all the columns that are in the second but not in the first. I have an array of StructField columns that I want to add to the dataframe and fill with nulls.
That's the best I've come up with:
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
  val spark = df.sparkSession
  val schema = new StructType(df.schema.toArray ++ columnsToAdd)
  spark.createDataFrame(df.rdd, schema)
}
Is there any better way?
The solution I gave in the question unfortunately does not work. It crashes with java.lang.ArrayIndexOutOfBoundsException. As I understand it, even though I added the columns to the schema, they were not added to the underlying data, so Spark tries to read a field that exists in the schema but not in the actual rows.
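For the record, if I wanted to keep the createDataFrame(rdd, schema) route, I believe the rows themselves would also need to be padded with null slots, not just the schema. Something like this untested sketch (addColumnsViaRdd is just an illustrative name):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StructField, StructType}

private def addColumnsViaRdd(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
  val schema = StructType(df.schema.fields ++ columnsToAdd)
  // pad every row with one null per added column so data and schema stay in sync
  val paddedRdd = df.rdd.map(row => Row.fromSeq(row.toSeq ++ Seq.fill(columnsToAdd.length)(null)))
  df.sparkSession.createDataFrame(paddedRdd, schema)
}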
Instead, I wrote a variant that uses recursion and does what I want, although of course I would like to get rid of the null and somehow replace it with None.
import scala.annotation.tailrec
import org.apache.spark.sql.functions.lit

@tailrec
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField], indx: Int): DataFrame = {
  if (columnsToAdd.length == indx || columnsToAdd.isEmpty) df
  else {
    // add the current column as a typed null, then recurse on the next index
    val dfWithColumn = df.withColumn(columnsToAdd(indx).name, lit(null).cast(columnsToAdd(indx).dataType))
    addColumns(dfWithColumn, columnsToAdd, indx + 1)
  }
}
Also this answer helped a lot.

How to handle missing columns in spark sql

We are dealing with schema-free JSON data, and sometimes our Spark jobs fail because some of the columns we refer to in Spark SQL are not available for certain hours of the day. During those hours the job fails because the referenced column is not in the data frame. How do we handle this scenario? I have tried a UDF, but we have too many missing columns to check each one for availability. I have also tried inferring a schema on a larger data set and applying it to the data frame, expecting that missing columns would be filled with null, but the schema application fails with weird errors.
Please suggest
This worked for me: I created a function that checks all expected columns and adds any missing ones to the dataframe.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    // add the column as a typed null only if it is not already present
    if (!df.columns.contains(column)) df.withColumn(column, lit(null).cast(StringType))
    else df
  }
}

val expectedColumns = List("newcol1", "newcol2", "newcol3")
val finalDf = checkAvailableColumns(castedDateSessions, expectedColumns)
Here is an improved version of the answer @rads provided.
import scala.annotation.tailrec
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

@tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
  def addMissingField(field: String, dataType: DataType = StringType)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(dataType))

  fields match {
    case Nil =>
      df
    case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
      val parts = c.split('.')
      // only supports one level of nesting, but it can be extended
      val nestedSchema = StructType(Array(StructField(parts(1), StringType)))
      addMissingFields(cs)(addMissingField(parts(0), nestedSchema)(df))
    case c :: cs if !df.columns.contains(c.split('.')(0)) =>
      addMissingFields(cs)(addMissingField(c)(df))
    case _ :: cs =>
      addMissingFields(cs)(df)
  }
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt there is, but if there turns out to be one, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
  .builder()
  .appName("Spark SQL json example")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._
val df = spark.read.json("/path/to/json") // placeholder path to your JSON data
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto-generated schema. If it is a flat table, df.columns is enough.
Compare the found columns to the expected columns and add the missing fields like this:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DoubleType, StringType}

val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType))
  .withColumn("MissingString2", lit(null).cast(StringType))
  .withColumn("MissingDouble1", lit(0.0).cast(DoubleType))
Maybe there is a faster way to add the missing columns in one operation, instead of one by one, but the withColumns() method that does that is private.
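If the expected columns are known up front, you can also compute the difference against df.columns instead of spelling out each withColumn by hand. A rough sketch (the expected name-to-type map here is hypothetical):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, DoubleType, StringType}

// hypothetical expected schema: column name -> type
val expected: Map[String, DataType] = Map(
  "MissingString1" -> StringType,
  "MissingString2" -> StringType,
  "MissingDouble1" -> DoubleType
)

val dfWithMissing = expected
  .filter { case (name, _) => !df.columns.contains(name) } // only the columns not already present
  .foldLeft(df) { case (acc, (name, dataType)) => acc.withColumn(name, lit(null).cast(dataType)) }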
Here's a PySpark solution based on this answer. It checks against a list of names (parameterColumnsToKeepList, built from a configDf and transformed into the list of columns the dataframe should have). It assumes all missing columns are ints, but you could look the type up in configDf dynamically too. My default is null, but you could also use 0.
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType

for column in parameterColumnsToKeepList:
    if column not in processedAllParametersDf.columns:
        print('Json missing column: {0}'.format(column))
        processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

Get Json Key value from List[Row] with Scala

Let's say that I have a List[Row] such as {"name":"abc","salary":"somenumber","id":"1"},{"name":"xyz","salary":"some_number_2","id":"2"}
How do I get the JSON key-value pair with Scala? Let's assume that I want to get the value of the key "salary". Is the below one right?
val rows = List[Row] // Assuming that rows has the list of rows
for (row <- rows) {
  row.get(0).+("salary")
}
If you have a List[Row], I assume you had a DataFrame and you called collectAsList. Once you collect/collectAsList, you:
Can no longer use Spark SQL operations
Cannot run your calculations in parallel on the nodes of your cluster. At this point everything is executed in your driver.
I would recommend keeping it as a DataFrame and then doing:
val salaries = df.select("salary")
Then you can do further calculations on the salaries, show them or collect or persist them somewhere.
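For example (the output path below is just a placeholder):
salaries.show(10)                                         // inspect a few rows
salaries.write.mode("overwrite").parquet("/tmp/salaries") // or persist them somewhere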
If you choose to use a Dataset (which is like a typed DataFrame), then you could do
val salaries = dataSet.map(_.salary)
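To get such a Dataset in the first place, you can map the DataFrame onto a case class. A rough sketch, assuming a SparkSession named spark and a case class whose fields match the JSON above:
import spark.implicits._

case class Employee(name: String, salary: String, id: String)

val dataSet = df.as[Employee]        // typed view over the same columns
val salaries = dataSet.map(_.salary) // Dataset[String], still evaluated on the cluster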
Using Spray Json:
import spray.json._
import DefaultJsonProtocol._
object sprayApp extends App {
  val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""", """{"name":"xyz","salary":"some_number_2","id":"2"}""")
  val jsonAst = list.map(_.parseJson)
  for (l <- jsonAst) {
    println(l.asJsObject.getFields("salary")(0))
  }
}
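If you would rather work with typed values than dig into the AST, spray-json can also convert each object into a case class. A sketch (the case class and its fields mirror the sample JSON above):
case class Employee(name: String, salary: String, id: String)
implicit val employeeFormat: RootJsonFormat[Employee] = jsonFormat3(Employee)

// convertTo uses the implicit format derived above
val salaries = list.map(_.parseJson.convertTo[Employee].salary)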

Dropping columns by data type in Scala Spark

df1.printSchema() prints out the column names and the data type that they possess.
df1.drop($"colName") will drop columns by their name.
Is there a way to adapt this command to drop by the data-type instead?
If you are looking to drop specific columns in the dataframe based on their types, the snippet below should help. In this example, I have a dataframe with two columns, of type String and Int respectively. I drop the String field (all fields of type String would be dropped) from the schema based on its type.
import sqlContext.implicits._

val df = sc.parallelize(('a' to 'l').map(_.toString) zip (1 to 10)).toDF("c1", "c2")
val newDf = df.schema.fields
  .collect({ case x if x.dataType.typeName == "string" => x.name })
  .foldLeft(df)({ case (dframe, field) => dframe.drop(field) })
The resulting newDf is org.apache.spark.sql.DataFrame = [c2: int].
Here is a fancy way to do it in Scala:
val categoricalFeatColNames = df.schema.fields filter { _.dataType.isInstanceOf[org.apache.spark.sql.types.StringType] } map { _.name }
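drop accepts a varargs of column names, so the collected names can then be removed in one call:
val dfWithoutStrings = df.drop(categoricalFeatColNames: _*)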

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
  Dog("Rex"),
  Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no property is detected, which leads to an empty DataFrame with no columns.
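For contrast, a class that the createDataFrame(rdd, beanClass) overload can reflect on needs bean-style getters/setters. A minimal, hypothetical sketch:
import scala.beans.BeanProperty

// @BeanProperty generates getName/setName, which the bean reflection looks for
class DogBean extends Serializable {
  @BeanProperty var name: String = _
}

val rex = new DogBean
rex.setName("Rex")
val beanDF = sqlContext.createDataFrame(sc.parallelize(Seq(rex)), classOf[DogBean])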
You can create a DataFrame directly from a Seq of case class instances using toDF (with import sqlContext.implicits._ in scope), as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it'll give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame, like:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true), StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd, rddStruct)
toDF() won't work either.
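Applied to the Dog example from the question, the row-plus-schema route would look roughly like this (a sketch, using the Row/StructType imports shown above):
// build Rows from the case class instances and describe them with an explicit schema
val dogRowRDD = dogRDD.map(d => Row(d.name))
val dogSchema = StructType(Array(StructField("name", StringType, nullable = true)))
val dogDF2 = sqlContext.createDataFrame(dogRowRDD, dogSchema)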