Create new column with function in Spark Dataframe - scala

I'm trying to figure out the new DataFrame API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a DataFrame with two columns, "ID" and "Amt". As a generic example, say I want to return a new column called "code" that returns a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then the function stops compiling because it wants a Boolean in the if statement.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.

Let's say you have an "Amt" column in your schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column.
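If you prefer not to split the lambda and the udf call into two values, the same idea can be written inline. A minimal sketch, reusing the "Amt" column and cut-off from the question:
import org.apache.spark.sql.functions.{col, udf}
// Inline variant of the answer above; column name and threshold are taken from the question.
val withCode = myDF.withColumn("Code", udf((amt: Int) => if (amt < 100) "little" else "big").apply(col("Amt")))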

We should avoid defining udf functions as much as possible because of the overhead of serializing and deserializing column data.
You can achieve the same result with the built-in when function, as below:
import org.apache.spark.sql.functions.when

val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))

Another way of doing this:
You can create any function, but according to the above error you should define the function as a udf value.
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))

Related

PySpark - WithColumn

I am new to PySpark and Databricks. I know some of the basics, but now I am having a hard time understanding one expression.
# Calculating workdays using workdaycal()
dfx = (df.withColumn('DAYS_OUTSTANDING_WORKDAYS_1', (workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"), col("CURRENT_DATE_TIME_GMT"))).cast(DoubleType()))
         .withColumn('TAT_RESOLVED_WORKDAYS_1', (workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"), col("LAST_RESOLVED_DATE_TIME_GMT_1"))).cast(DoubleType()))
         .withColumn('TAT_CLOSED_WORKDAYS_1', (workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"), col("CLOSED_DATE_TIME_GMT_1"))).cast(DoubleType()))
      )
In the above code, I am unable to figure out what workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"), col("CURRENT_DATE_TIME_GMT")) does.
workdaycal is a user-defined function which looks like this:
def get_bizday(sdate, edate, holidays_list):
    h1, m1, s1 = sdate.hour, sdate.minute, sdate.second
    # further code in the function, which returns a float value at the end
    return float(val)

def workdaycal(holidays):
    return udf(lambda l, e: get_bizday(l, e, holidays))
holidays_list is a list of dates which I am passing to workdaycal.
Could anyone help me figure out what this expression does?
I have simplified your use case to reproduce the results. In the simplified form, the UDF returns a string representation of its input arguments.
At present, your problem statement is as follows:
workdaycal returns a "udf".
workdaycal(holidays_list)(col(...), col(...)) first calls workdaycal(holidays_list), which returns a "udf"; this "udf" is then called with the two col(...) arguments, while holidays_list is captured by the lambda inside workdaycal.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
def get_bizday(sdate, edate, holidays_list):
    return str(f"sdate={sdate}, edate={edate}, holidays_list={holidays_list}")

def workdaycal(holidays):
    return udf(lambda l, e: get_bizday(l, e, holidays))
holidays_list = ["Saturday", "Sunday"]
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", (workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"),col("CURRENT_DATE_TIME_GMT"))).cast(StringType()))
You can simplify this into a direct call to get_bizday, but there is a problem: holidays_list is not a Spark type (it is a Python list).
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
@udf(returnType=StringType())
def get_bizday(sdate, edate, holidays_list):
    return str(f"sdate={sdate}, edate={edate}, holidays_list={holidays_list}")
holidays_list = ["Saturday", "Sunday"]
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", get_bizday(col("OPEN_DATE_TIME_GMT"), col("CURRENT_DATE_TIME_GMT"), holidays_list))
This results in an error complaining that holidays_list should be one of the accepted Spark types:
TypeError: Invalid argument, not a string or column: ['Saturday', 'Sunday'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
You can solve this by removing holidays_list from the parameter list, broadcasting it, and accessing it in the "udf" via holidays_list_broadcast.value. A broadcast variable is a read-only shared value that is cached and available on all nodes in the cluster.
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
holidays_list_broadcast = spark.sparkContext.broadcast(holidays_list)
@udf(returnType=StringType())
def get_bizday(sdate, edate):
    return str(f"sdate={sdate}, edate={edate}, holidays_list={holidays_list_broadcast.value}")
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", get_bizday(col("OPEN_DATE_TIME_GMT"),col("CURRENT_DATE_TIME_GMT")))
Now you are calling get_bizday() in a simpler, more readable way.

Matching Column name from Csv file in spark scala

I want to take the headers (column names) from my CSV file and then match them against my existing headers.
I am using the below code:
val cc = sparksession.read.csv(filepath).take(1)
It's giving me a value like:
Array([id,name,salary])
and I have created one more static schema, which gives me a value like this:
val ss=Array("id","name","salary")
and then I'm trying to compare column name using if condition:
if(cc==ss){
println("matched")
} else{
println("not matched")
}
I guess that due to the [] and () mismatch it always goes to the else part. Is there any other way to compare these values without considering the [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
println("matched")
} else {
println("not matched")
}
If the order is relevant, then the comparison can be done as follows (you can't compare Arrays directly here, but Seq works):
cc.toSeq == ss.toSeq
or use a deep array comparison:
cc.deep == ss.deep
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.toArray.
Then you could compare using cc.deep == ss.deep.
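Putting the two suggestions together, a minimal end-to-end sketch (column names assumed from the question, order-sensitive check) could look like this:
// Read the file with its header, then compare the column names against the expected ones.
val expected = Array("id", "name", "salary")
val actual = sparksession.read.option("header", true).csv(filepath).columns

if (actual.sameElements(expected)) println("matched") else println("not matched")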
The below code worked for me:
val cc = spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as a String: [id,name,salary].
Then I created one static schema as:
val ss = "[id,name,salary]"
and then wrote the if/else conditions.

Spark Scala Dataset Validations using UDF and its Performance

I'm new to Spark Scala. I have implemented a solution for Dataset validation across multiple columns using a UDF, rather than going through the individual columns in a for loop. But I don't know how this works faster, and I have to explain why it is the better solution.
The columns for data validation will be received at run time, so we cannot hard-code the column names in the code. Also, the comments column needs to be updated with the column name whenever a column value fails validation.
Old code:
def doValidate(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
  var ValidDF: Dataset[Row] = data
  var i: Int = 0
  for (s <- columnArray) {
    var list = validValueArrays(i).split(",")
    ValidDF = ValidDF.withColumn("comments", when(ValidDF.col(s).isin(list: _*), concat(lit(col("comments")), lit(" Error: Invalid Records in: "), lit(s))).otherwise(col("comments")))
    i = i + 1
  }
  return ValidDF;
}
New code:
def validateColumnValues(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
  var ValidDF: Dataset[Row] = data
  var checkValues = udf((row: Row, comment: String) => {
    var newComment = comment
    for (s: Int <- 0 to row.length - 1) {
      var value = row.get(s)
      var list = validValueArrays(s).split(",")
      if (!list.contains(value)) {
        newComment = newComment + " Error:Invalid Records in: " + columnArray(s) + ";"
      }
    }
    newComment
  })
  ValidDF = ValidDF.withColumn("comments", checkValues(struct(columnArray.head, columnArray.tail: _*), col("comments")))
  return ValidDF;
}
columnArray --> will have the list of columns.
validValueArrays --> will have the valid values corresponding to the column array positions. Multiple valid values are comma-separated.
I want to know which one is better, or whether there is another, better approach. When I tested it, the new code looked better. Also, what is the difference between these two approaches? I have read that a UDF is a black box for Spark, so will the UDF affect performance in this case?
I needed to correct a closing bracket before running it: one '}' has to be removed where you return ValidDF. I still get a runtime analysis error.
It is better to avoid UDFs, as a UDF implies deserializing the data to process it in plain Scala and then reserializing it. However, if your requirement cannot be achieved using built-in SQL functions, then you have to go for a UDF, but you must make sure you review the Spark UI for the performance and the execution plan.
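For illustration, here is a hedged sketch (not from the answer above) of the same validation using only built-in functions and a foldLeft, flagging values that are not in the valid list as the newer UDF version does, so Catalyst can still optimize the whole expression:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, concat, lit, when}

def validateWithBuiltins(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] =
  columnArray.zip(validValueArrays).foldLeft(data) { case (df, (colName, valid)) =>
    val validList = valid.split(",")
    // Append an error message to "comments" when the value is not in the valid list.
    df.withColumn(
      "comments",
      when(!col(colName).isin(validList: _*),
        concat(col("comments"), lit(s" Error: Invalid Records in: $colName;")))
        .otherwise(col("comments")))
  }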

Spark UDF as function parameter, UDF is not in function scope

I have a few UDFs that I'd like to pass along as a function argument along with data frames.
One way to do this might be to create the UDF within the function, but that would create and destroy several instances of the UDF without reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
val df = inputDF1
  .withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
  .withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
  df.withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return a primitive type or a user defined case class type. Any thoughts/ pointers would be much appreciated. Thanks.
Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.UserDefinedFunction
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
  df.withColumn("new_col", func(col("col1")))
}
The Scala REPL is quite helpful for showing the type of the values you initialize.
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function that you pass into the udf wrapper has an Any return type (which will be the case if the function can return either a primitive or a user-defined case class), the udf call will fail with an exception like so:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
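On that last point, a hedged workaround (not from the answer) is to give the UDF a concrete return type instead of Any: Spark can encode a case class result as a struct column, so a UDF that always returns a case class avoids the Any problem. A minimal sketch with made-up names; the UserDefinedFunction import path follows the answer above (in newer Spark versions it lives in org.apache.spark.sql.expressions):
import org.apache.spark.sql.{DataFrame, UserDefinedFunction}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical result type: the UDF returns a case class, encoded as a struct column.
case class Flag(code: Int, label: String)

val lkpStructUDF = udf { (i: Int) => if (i > 0) Flag(1, "positive") else Flag(0, "non-positive") }

// Passing it around works the same way as in the answer above ("c1" is the column name from the question).
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame =
  df.withColumn("new_col", func(col("c1")))

val withFlags = appendCols(inputDF1, lkpStructUDF)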

Scala iterator on pattern match

I need help iterating this piece of code, written in Spark Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a DataFrame, it casts the column if there is a pattern match; otherwise it selects the field as-is.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);
val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a DataFrame, an Array[String] of columns (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same DataFrame with the right cast at the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    // Backticks are needed so this matches the value of colName instead of binding a new variable.
    case `colName` => df(colName).cast(typeName).as(colName)
    case other => df(other)
  }
  df.select(colsCasted: _*)
}
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  colsWithTypes.foldLeft(df) { case (newDf, (colName, typeName)) => castOnce(newDf, colName, typeName) }
}
When you have a function that you just need to apply many times to the same thing, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the dataframe each time and then applying the next pair to the resulting dataframe, and so on.
Based on your edit, I filled in the function above. I don't have a compiler to hand, so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the DataFrame, it looks the type up in the map.
If an entry exists, we cast the column to that new type. If the column is not in the map, we keep the column from the DataFrame unchanged.
We then select these columns from the DataFrame.
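A short, hedged usage sketch for the map-based version (column names reused from the question; the target types are illustrative):
// Cast id_vehicle to int and id_size to double, leaving the other columns untouched.
val df3 = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))
df3.printSchema()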