How to handle missing columns in Spark SQL - Scala

We are dealing with schema-free JSON data, and sometimes our Spark jobs fail because some of the columns we refer to in Spark SQL are not available for certain hours of the day. During these hours the job fails since the referenced column is not present in the DataFrame. How can this scenario be handled? I have tried a UDF, but we have too many missing columns to check each one for availability. I have also tried inferring a schema on a larger data set and applying it to the DataFrame, expecting that missing columns would be filled with null, but the schema application fails with strange errors.
Please suggest
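For reference, a minimal sketch of the schema-on-read idea mentioned above, assuming a SparkSession named spark (the path and column names are illustrative): when an explicit schema is passed to the JSON reader, Spark fills any column that is absent from the data with null instead of failing.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative schema and path; replace with your own.
val fullSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true)
))
// Columns declared in fullSchema but missing from the JSON come back as null.
val df = spark.read.schema(fullSchema).json("hdfs:///path/to/hourly/json")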

This worked for me. I created a function that checks all the expected columns and adds any that are missing to the DataFrame:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    // add the column as a null StringType value if it is not already present
    if (!df.columns.contains(column)) df.withColumn(column, lit(null).cast(StringType))
    else df
  }
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)

Here is an improved version of the answer @rads provided, which also handles (one level of) nested fields:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

@tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
  def addMissingField(field: String, dataType: DataType = StringType)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(dataType))

  fields match {
    case Nil =>
      df
    case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
      val parts = c.split('.')
      // only supports one level of nesting, but it can be extended
      val nestedSchema = StructType(Array(StructField(parts(1), StringType)))
      addMissingFields(cs)(addMissingField(parts(0), nestedSchema)(df))
    case c :: cs if !df.columns.contains(c.split('.')(0)) =>
      addMissingFields(cs)(addMissingField(c)(df))
    case _ :: cs =>
      addMissingFields(cs)(df)
  }
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there are any performance issues. I doubt there are, but if anything comes up I'll update this post.

Here are the steps to add missing columns:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL json example")
  .master("local[1]")
  .getOrCreate()

import spark.implicits._

val df = spark.read.json(jsonPath)  // jsonPath: location of your JSON input
val schema = df.schema
val columns = df.columns            // enough for flat tables
You can traverse the auto-generated schema; if it is a flat table, df.columns is enough.
Compare the found columns to the expected columns and add the missing fields like this:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DoubleType, StringType}

val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType))
  .withColumn("MissingString2", lit(null).cast(StringType))
  .withColumn("MissingDouble1", lit(0.0).cast(DoubleType))
Maybe there is a faster way to add the missing columns in one operation instead of one by one, but the withColumns() method that would do that is private.
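As a workaround, here is a minimal sketch of a single-pass alternative using select (assuming expectedColumns is a List[String] of the required column names):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

// Keep the existing columns and append a null literal for every expected
// column that is absent, all in one select.
val existing: Seq[Column] = df.columns.toSeq.map(col)
val missing: Seq[Column] = expectedColumns
  .filterNot(df.columns.contains(_))
  .map(name => lit(null).cast(StringType).as(name))
val dfWithAllColumns = df.select(existing ++ missing: _*)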

Here's a PySpark solution based on this answer. It checks against a list of names (parameterColumnsToKeepList, derived from a configDf that lists the columns the DataFrame should have). It assumes all missing columns are ints, but you could look the types up in configDf dynamically too. My default is null, but you could also use 0.
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType

for column in parameterColumnsToKeepList:
    if column not in processedAllParametersDf.columns:
        print('Json missing column: {0}'.format(column))
        processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

Related

How to add columns to df with StructField Array?

I have two DataFrames, and I want to add to the first of them all the columns that are in the second but not in the first. I have an array of StructField columns that I want to add to the DataFrame and fill with nulls.
That's the best I've come up with:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
  val spark = df.sparkSession
  val schema = new StructType(df.schema.toArray ++ columnsToAdd)
  spark.createDataFrame(df.rdd, schema)
}
Is there any better way?
My solution that I gave in the question unfortunately does not work; it crashes with java.lang.ArrayIndexOutOfBoundsException. As I understand it, even though I added the columns to the schema, they were not added to the underlying data, so Spark tries to access a field that exists in the schema but not in the actual rows.
I wrote the following variant; it uses recursion and does what I want, although of course I would like to move away from null and somehow replace it with None.
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructField

@tailrec
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField], indx: Int): DataFrame = {
  if (columnsToAdd.isEmpty || indx == columnsToAdd.length) df
  else {
    // add the current field as a null column cast to its declared type
    val dfWithColumn = df.withColumn(columnsToAdd(indx).name, lit(null).cast(columnsToAdd(indx).dataType))
    addColumns(dfWithColumn, columnsToAdd, indx + 1)
  }
}
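For reference, here is a minimal non-recursive sketch of the same idea using foldLeft (same imports as above):
private def addColumnsFold(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame =
  columnsToAdd.foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, lit(null).cast(field.dataType))
  }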
Also this answer helped a lot.

Spark SQL UDF with multiple arg does not work as expected

I am working in Scala. I want to extract certain fields from a DataFrame column that holds a JSON string. I also have a collection of StructTypes that tells me which schema to extract for a particular row; I need to extract a different schema for different rows. The collection contains many schemas, but I need to fetch only one, based on another column (in the same row) of the DataFrame. Here is my code.
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
.....

private def getSchemaUDF: UserDefinedFunction = udf(getSchema _)

private def getSchema(field: String, schemas: Map[String, StructType]): StructType = {
  val schema = schemas.filter(x => x._1 == field)
  require(schema.size == 1, s"Multiple schemas found for ${field}")
  schema.head._2
}
But I get this error in this line:
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
Error:(127, 83) type mismatch;
found : Map[String,org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.Column
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
Can someone please tell me how I can fix this, or how else I can achieve it?
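One hedged sketch of how to avoid the type mismatch itself: UDF arguments must be Columns, so a plain Scala value like the Map has to be captured in the UDF's closure rather than passed as an argument. Note this only removes the compile error; from_json still expects a literal schema, so a truly per-row schema needs a different approach.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.StructType

// Capture the schema map in the closure; the UDF itself only takes Columns.
// It returns the schema as a JSON string, since a UDF cannot return a StructType.
def getSchemaJsonUDF(schemas: Map[String, StructType]): UserDefinedFunction =
  udf((field: String) => schemas(field).json)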

calling a scala method passing each row of a dataframe as input

I have a DataFrame with two columns in it, created by importing a .txt file.
Sample file content:
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a DataFrame by importing the above file:
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say, like this:
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
Now I need to take each row of the DataFrame and call getFormattedName, passing the two columns of that row as arguments.
I tried this, among many other things, but it did not work out:
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema that can be represented by a DataFrame, and DataFrames have built-in support for reading CSV input.
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on the Dataset, you can create a UDF (User Defined Function):
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"column" syntax

def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try:
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that the rows of a DataFrame have their own type, Row, not a tuple, so you need to use the Row accessors (getString, getAs, and so on) to refer to their elements.

Spark Join Single Dataframe to a Collection of Dataframes

I'm struggling to figure out an elegant solution to join a single dataframe to a separate sequence of 1 to N related dataframes. Initial attempt:
val sources = program.attributes.map(attr => {
  spark.read
    .option("header", value = true)
    .schema(program.GetSchema(attr))
    .csv(s"${program.programRawHdfsDirectory}/${attr.sourceFile}")
})

val rawDf: DataFrame = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))

// Full of fail:
val fullDf: DataFrame = program.dimensions.filter(d => d.hierarchy != "RAW").reduceLeft((d1, _) => {
  val hierarchy = spark.read.parquet(d1.hierarchyLocation).where(d1.hierarchyFilter)
  rawDf.join(hierarchy, d1.hierarchyJoin)
})

fullDf.selectExpr(program.outputFields:_*).write.parquet(program.programEtlHdfsDirectory)
The reduceLeft idea doesn't work because I'm iterating through a collection of configuration objects (the dimensions property), but what I want returned from each iteration is a dataframe. The error is a type mismatch, which is not surprising.
The core of the problem is that I have 1 to N "dimension" objects that define how to load an existing hierarchy table and also how to join that table to my "raw" dataframe I created earlier.
Any idea how I might create these joins without some sort of horrible hack?
UPDATE:
I wonder if this might work: I have a common field name in each hierarchy DataFrame that I'm joining to. If I renamed this common field to match the corresponding column in my "raw" DataFrame, could I execute the joins in a fold without explicitly calling out the columns? Will Spark just default to the matching names?
val rawDf = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))

val hierarchies = program.dimensions.map(dim => {
  spark.read.parquet(dim.hierarchyLocation)
    .where(dim.hierarchyFilter)
    .withColumnRenamed("parent_hier_cd", dim.columnName)
})

val fullDf = hierarchies.foldLeft(rawDf) { (df1, df2) => df1.join(df2) }
UPDATE 2
No, that does not work. Spark attempts a cross join.
For my purposes, I simply needed to return a tuple when generating the collection of hierarchies:
val hierarchies = program.dimensions.map(dim => {
  val hierarchy = spark.read.parquet(dim.hierarchyLocation)
    .where(dim.hierarchyFilter)
    .alias(dim.hierarchy.toLowerCase)
  (dim, hierarchy)
})
Then when I fold them into rawDf, I have the metadata I need to construct the joins.
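A minimal sketch of what that fold could look like, assuming each dim exposes its join columns as dim.hierarchyJoin (the field used in the earlier snippet):
val fullDf = hierarchies.foldLeft(rawDf) { case (acc, (dim, hierarchy)) =>
  acc.join(hierarchy, dim.hierarchyJoin)
}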

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
  Dog("Rex"),
  Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is:
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
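For illustration only, a hedged sketch of a bean-style class that the classOf[...] overload could introspect (DogBean is a made-up name, not part of the question):
import scala.beans.BeanProperty

// @BeanProperty generates the getName/setName pair that the Java-beans
// overload of createDataFrame relies on.
class DogBean(@BeanProperty var name: String) extends Serializable {
  def this() = this(null) // Java beans conventionally have a no-arg constructor
}

val beanDF = sqlContext.createDataFrame(sc.parallelize(Seq(new DogBean("Rex"))), classOf[DogBean])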
You can create a DataFrame directly from a Seq of case class instances using toDF (with import spark.implicits._ in scope) as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it'll give a ClassNotFoundException for the case class you defined. Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// 'data' is assumed here to be a collection of indexed records (e.g. arrays of values)
val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }
val rddStruct = new StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd, rddStruct)
toDF() won't work either.