How to add columns to df with StructField Array? - scala

I have two dataframes, and I want to add to the first all the columns that exist in the second but not in the first. I have an array of the StructField columns that I want to add to the dataframe and fill with nulls.
This is the best I've come up with:
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
  val spark = df.sparkSession
  val schema = new StructType(df.schema.toArray ++ columnsToAdd)
  spark.createDataFrame(df.rdd, schema)
}
Is there any better way?

Unfortunately, the solution I gave in the question does not work. It crashes with a java.lang.ArrayIndexOutOfBoundsException. As I understand it, even though I added the columns to the schema, they were not added to the dataframe itself, so Spark tries to access a field that exists in the schema but not in the actual data.
I wrote the following variant, which uses recursion and does what I want, although I would of course like to avoid using null and somehow replace it with None.
import scala.annotation.tailrec

@tailrec
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField], indx: Int): DataFrame = {
  if (columnsToAdd.length == indx || columnsToAdd.isEmpty) df
  else {
    val dfWithColumn = df.withColumn(columnsToAdd(indx).name, lit(null).cast(columnsToAdd(indx).dataType))
    addColumns(dfWithColumn, columnsToAdd, indx + 1)
  }
}
Also this answer helped a lot.
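For comparison, the same idea can be written without explicit recursion as a foldLeft over the missing StructFields; this is just a sketch using the same Spark imports as above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructField

// Fold each missing field into the DataFrame as a typed null column.
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame =
  columnsToAdd.foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, lit(null).cast(field.dataType))
  }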

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I could convert it, but I am trying to create a generic method that accepts any schema, so I cannot add a map that does the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
The method appears to have the correct signature, but when it runs, it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via JDBC (I need to make a specific select within PostgreSQL). Is there a more performant way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset; you can see the definition of def columns in the Dataset source.
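Putting the two pieces together, a minimal sketch might look like this; the JDBC url, table name and connection properties are placeholders, not values from the question:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Project the JDBC read down to the columns of the case-class Dataset.
def readWithYearColumns(spark: SparkSession, yearsDS: Dataset[Year]): DataFrame = {
  val desiredColumns: Array[String] = yearsDS.columns
  spark.read
    .jdbc("jdbc:postgresql://host:5432/db", "public.some_table", new Properties())
    .select(desiredColumns.head, desiredColumns.tail: _*)
}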
Maybe you have a DataFrame, not a Dataset.
Try using as to transform the DataFrame into a Dataset, like this:
import spark.implicits._

val year = Year(1, 1, 1)
val years = Array(year, year).toList

val df = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]

println(df.collect().toList)

Spark SQL UDF with multiple arg does not work as expected

I am working in Scala. I want to extract certain fields from a DataFrame column that contains a JSON string. I also have a collection of StructTypes that tells me which schema to extract for a particular row; I need to extract a different schema for different rows. The collection contains many schemas, but I only need to fetch one, based on another column in the same row of the DataFrame. Here is my code:
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
.....
private def getSchemaUDF: UserDefinedFunction = udf(getSchema _)

private def getSchema(field: String, schemas: Map[String, StructType]): StructType = {
  val schema = schemas.filter(x => x._1 == field)
  require(schema.size == 1, s"Multiple schemas found for ${field}")
  schema.head._2
}
But I get this error in this line:
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
Error:(127, 83) type mismatch;
found : Map[String,org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.Column
df = df.withColumn("data", from_json(col("data"), getSchemaUDF(col("name"), events)))
Can someone please tell me how to fix this, or how else I can achieve it?
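For context, the mismatch arises because events is a plain Scala Map[String, StructType], while every argument passed to a UDF call like getSchemaUDF(col("name"), events) must be a Column. One possible workaround, assuming the distinct name values and their schemas can be resolved on the driver, is to parse each slice of the DataFrame with its own schema; the helper below is only an illustrative sketch, not the original author's code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Apply a different from_json schema per "name" value by filtering the frame
// once per schema. Returns one parsed DataFrame per name, since rows parsed
// with different schemas cannot share a single struct column.
def parsePerName(df: DataFrame, schemas: Map[String, StructType]): Map[String, DataFrame] =
  schemas.map { case (name, schema) =>
    val parsed = df.filter(col("name") === name)
      .withColumn("data", from_json(col("data"), schema))
    name -> parsed
  }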

calling a scala method passing each row of a dataframe as input

I have a dataframe with two columns in it, created by importing a .txt file.
Sample file content:
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a dataframe by importing the above file:
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say, like this:
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
Now what I need to do is take each row of the dataframe and call getFormattedName, passing the two column values of that row as arguments.
I tried this, among many other things, but it did not work:
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
Your data has a structured schema, so it can be represented as a dataframe, and DataFrames support reading CSV input directly.
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)
))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on dataset, you can create a UDF(User Defined Function).
def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}

val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that the rows of a dataframe have their own type (Row) rather than being a tuple, so you need to use the appropriate accessors to refer to their elements.

How to handle missing columns in spark sql

We are dealing with schema-free JSON data, and sometimes our Spark jobs fail because some of the columns we refer to in Spark SQL are not available for certain hours of the day. During those hours the job fails because the referenced column is not present in the data frame. How do I handle this scenario? I have tried a UDF, but there are too many missing columns to check each one for availability. I have also tried inferring a schema from a larger data set and applying it to the data frame, expecting that missing columns would be filled with null, but the schema application fails with strange errors.
Please suggest a solution.
This worked for me. I created a function that checks all expected columns and adds any that are missing to the dataframe:
def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    if (!df.columns.contains(column)) {
      df.withColumn(column, lit(null).cast(StringType))
    } else {
      df
    }
  }
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)
Here is an improved version of the answer @rads provided:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

@tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
  def addMissingField(field: String, dataType: DataType = StringType)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(dataType))

  fields match {
    case Nil =>
      df
    case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
      val parts = c.split('.')
      // it only supports one level of nesting, but it can be extended
      val schema = StructType(Array(StructField(parts(1), StringType)))
      addMissingFields(cs)(addMissingField(parts(0), schema)(df))
    case c :: cs if !df.columns.contains(c.split('.')(0)) =>
      addMissingFields(cs)(addMissingField(c)(df))
    case _ :: cs =>
      addMissingFields(cs)(df)
  }
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt it. But if there was any, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
  .builder()
  .appName("Spark SQL json example")
  .master("local[1]")
  .getOrCreate()

import spark.implicits._

val df = spark.read.json
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto-generated schema; if it is a flat table, just use df.columns.
Compare the found columns to the expected columns and add the missing fields like this:
val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType) )
.withColumn("MissingString2", lit(null).cast(StringType) )
.withColumn("MissingDouble1", lit(0.0).cast(DoubleType) )
There may be a faster way to add the missing columns in one operation instead of one by one, but the withColumns() method that does this is private.
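As a sketch of doing it in one operation, a single select can add all of the example columns at once (same column names as above, nothing new assumed):
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{DoubleType, StringType}

// Add all missing columns in one select instead of chained withColumn calls.
val dataframe2 = df.select(
  col("*"),
  lit(null).cast(StringType).as("MissingString1"),
  lit(null).cast(StringType).as("MissingString2"),
  lit(0.0).cast(DoubleType).as("MissingDouble1")
)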
Here's a PySpark solution based on this answer. It checks a list of names (from a configDf, transformed into the list of columns it should have, parameterColumnsToKeepList). This assumes all missing columns are ints, but you could also look the types up in configDf dynamically. My default is null, but you could also use 0.
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType

for column in parameterColumnsToKeepList:
    if column not in processedAllParametersDf.columns:
        print('Json missing column: {0}'.format(column))
        processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

Spark Join Single Dataframe to a Collection of Dataframes

I'm struggling to figure out an elegant solution to join a single dataframe to a separate sequence of 1 to N related dataframes. Initial attempt:
val sources = program.attributes.map(attr => {
  spark.read
    .option("header", value = true)
    .schema(program.GetSchema(attr))
    .csv(s"${program.programRawHdfsDirectory}/${attr.sourceFile}")
})
val rawDf: DataFrame = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))

// Full of fail:
val fullDf: DataFrame = program.dimensions.filter(d => d.hierarchy != "RAW").reduceLeft((d1, _) => {
  val hierarchy = spark.read.parquet(d1.hierarchyLocation).where(d1.hierarchyFilter)
  rawDf.join(hierarchy, d1.hierarchyJoin)
})

fullDf.selectExpr(program.outputFields: _*).write.parquet(program.programEtlHdfsDirectory)
The reduceLeft idea doesn't work because I'm iterating through a collection of configuration objects (the dimensions property), but what I want returned from each iteration is a dataframe. The error is a type mismatch, which is not surprising.
The core of the problem is that I have 1 to N "dimension" objects that define how to load an existing hierarchy table and also how to join that table to my "raw" dataframe I created earlier.
Any idea how I might create these joins without some sort of horrible hack?
UPDATE:
I wonder if this might work? I have a common field name in each hierarchy dataframe that I'm joining to. If I renamed this common field to match the corresponding column in my "raw" dataframe, could I execute the joins in a fold without explicitly calling out the columns? Will Spark just default to the matching names?
val rawDf = sources.reduce((df1, df2) => df1.join(df2, program.dimensionFields, "full"))
val hierarchies = program.dimensions.map(dim => {
  spark.read.parquet(dim.hierarchyLocation).where(dim.hierarchyFilter).withColumnRenamed("parent_hier_cd", dim.columnName)
})
val fullDf = hierarchies.foldLeft(rawDf) { (df1, df2) => df1.join(df2) }
UPDATE 2
No, that does not work. Spark attempts a cross join.
For my purposes, I simply needed to return a tuple when generating the collection of hierarchies:
val hierarchies = program.dimensions.map(dim => {
  val hierarchy = spark.read.parquet(dim.hierarchyLocation).where(dim.hierarchyFilter).alias(dim.hierarchy.toLowerCase)
  (dim, hierarchy)
})
Then when I fold them into rawDf, I have the metadata I need to construct the joins.
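A minimal sketch of that fold, assuming dim.hierarchyJoin carries the join condition as in the earlier snippet:
// Fold each (dimension config, hierarchy dataframe) pair into the raw dataframe,
// using the dimension's own join definition.
val fullDf = hierarchies.foldLeft(rawDf) { case (acc, (dim, hierarchy)) =>
  acc.join(hierarchy, dim.hierarchyJoin)
}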