Spark UDF as function parameter, UDF is not in function scope - scala

I have a few UDFs that I'd like to pass along as a function argument along with data frames.
One way to do this might be to create the UDF within the function, but that would create a new instance of the UDF on every call instead of reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
val df = inputDF1
  .withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
  .withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
  df.withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return a primitive type or a user defined case class type. Any thoughts/ pointers would be much appreciated. Thanks.

Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.UserDefinedFunction
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
  df.withColumn("new_col", func(col("col1")))
}
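A quick usage sketch (the enriched name is just for illustration, and inputDF1 from the question is assumed to have the col1 column the function references):
// pass the UDF in as a plain value and apply it inside the function
val enriched = appendCols(inputDF1, lkpUDF)
enriched.show()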
The Scala REPL is quite helpful for showing the types of the values you initialize.
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function that you pass into the udf wrapper has a return type of Any (which will be the case if it can return either a primitive or a user-defined case class), creating the UDF will fail with an exception like this:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
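One way around this (a sketch, not from the original answer; it assumes each UDF can commit to a single concrete return type) is to define separate UDFs, since Spark can derive a schema for a concrete case class but not for Any:
import org.apache.spark.sql.functions.udf

// hypothetical case class, used only for illustration
case class Flag(code: Int, label: String)

// a concrete primitive return type works: Spark infers IntegerType
val lkpIntUDF = udf { (i: Int) => if (i > 0) 1 else 0 }

// a concrete case class return type also works: Spark infers a struct schema for Flag
val lkpStructUDF = udf { (i: Int) => if (i > 0) Flag(1, "positive") else Flag(0, "non-positive") }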

Related

Chained Dataframe transformations in Scala with arguments and conditions

I have a dataframe in Scala for which I want to add transformations and filters depending on conditions passed as arguments to the function.
For example, I'm trying to do something like this:
val lst_conditions = List("condition1","condition2",..., "conditionN")
for (condition_string <- lst_conditions) {
  var new_df = df.transform(FilterOrNot(condition_string))
}
But the way I'm defining the function below doesn't work:
def FilterOrNot(c: String)(df: DataFrame): DataFrame = {
  if (c == "condition1") df.filter($"price" >= $"avg_price")
  else if (c == "condition2") df.filter($"price" >= $"median_price")
  // If the condition is different do nothing.
}
The error I get is:
<console>:73: error: type mismatch;
found : Unit
required: org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
else if ...
^
How can I implement this?
I think a little more info about why the function below doesn't work, or what it isn't doing, would be useful.
The one thing I would maybe recommend adding is a final default to your custom transformation as follows:
def FilterOrNot(c: String)(df: DataFrame): DataFrame = {
  if (c == "condition1") df.filter($"price" >= $"avg_price")
  else if (c == "condition2") df.filter($"price" >= $"median_price")
  else df
}
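With that default in place, the loop from the question can also be written without the var by folding the condition list over the DataFrame. A minimal sketch, assuming the same df, lst_conditions and FilterOrNot as above:
// apply FilterOrNot once per condition, threading each result into the next step
val filteredDf = lst_conditions.foldLeft(df) { (currentDf, condition_string) =>
  currentDf.transform(FilterOrNot(condition_string))
}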

Creating a list of StructFields from data frame

I need to ultimately build a schema from a CSV. I can read the CSV into data frame, and I've got a case class defined.
case class metadata_class(colname: String, datatype: String, length: Option[Int], precision: Option[Int])
val foo = spark.read.format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .schema(Encoders.product[metadata_class].schema)
  .load("/path/to/file")
  .as[metadata_class]
  .toDF()
Now I'm trying to iterate through that data frame and build a list of StructFields. My current effort:
val sList: List[StructField] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  })
That gives me a type mismatch:
found : Unit
required: org.apache.spark.sql.types.StructField
for (m <- foo.as[metadata_class].collect) {
^
What am I doing wrong here? Or am I not even close?
It's not usual to use a for loop like this in Scala. A for loop (without yield) has return type Unit, so in your code the resulting value of sList would be List[Unit]:
val sList: List[Unit] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  }
)
but you declared sList as List[StructField], which is the cause of the compile error.
I suggest you use the map function instead of a for loop to iterate over the metadata_class objects and create StructFields from them:
val structFields: List[StructField] = foo.as[metadata_class]
  .collect
  .map(m => StructField(m.colname, getType(m.datatype)))
  .toList
This way you get a List[StructField].
In Scala every statement is an expression with a return type; a for loop without yield is no exception, and its return type is Unit.
read more about statements/expressions:
statement vs expression in scala
statements and expressions in scala
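For completeness (this is not part of the original answer), a for comprehension with yield is itself an expression that produces a collection rather than Unit, so the loop shape from the question can be kept as well. A sketch assuming the same foo DataFrame and getType helper:
// yield turns the comprehension into an expression that builds a collection
val sList: List[StructField] = (
  for (m <- foo.as[metadata_class].collect)
    yield StructField(m.colname, getType(m.datatype))
).toList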

Scala iterator on pattern match

I need help iterating this piece of code written in Spark-Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a dataframe, the function casts the column if there is a pattern match, otherwise it selects all fields unchanged.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark)
val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a dataframe, an Array[String] of column names (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same dataframe with the right cast applied at the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    // backticks make this case match the value of colName instead of binding a new variable
    case `colName` => df(colName).cast(typeName).as(colName)
    case other => df(other)
  }
  df.select(colsCasted: _*)
}

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  // foldLeft threads the DataFrame (the accumulator) through each (name, type) pair
  colsWithTypes.foldLeft(df)((newDf, cAndType) => castOnce(newDf, cAndType._1, cAndType._2))
}
When you have a function that you just need to apply many times to the same thing, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the dataframe each time and passing the resulting dataframe on to the next pair, and so on.
Based on your edit I filled in the function above. I don't have a compiler so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
// alternative to castMany above; a function name is added here because the original snippet omitted it
def castManyViaMap(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the dataframe, it looks the type up in the map.
If a type exists, we cast the column to that new type. If the column does not appear in the map, we keep the column from the DataFrame as it is.
We then select these columns from the DataFrame.
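A usage sketch (the column names and type names here are placeholders for illustration, reusing the df from the question):
// cast id_vehicle to int and id_size to double; every other column passes through unchanged
val castedDf = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))
castedDf.printSchema()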

Convert java.util.Map to Scala List[NewObject]

I have a java.util.Map[String, MyObject] and want to create a Scala List[MyNewObject] consisting of all entries of the map, built from some specific values of each entry.
I found a way but, well, this is really ugly:
val result = ListBuffer[MyNewObject]()
myJavaUtilMap.entrySet().forEach(
  (es: Entry[String, MyObject]) => {
    result += MyNewObject(es.getKey(), es.getValue().getMyParameter); println("Aa")
  }
)
How can I get rid of the println("Aa")? Just deleting it does not help, because forEach needs a Consumer, but the += operation yields a list.
Is there a more elegant way to convert the java.util.Map to a List[MyNewObject]?
Scala has conversions that give you all the nice methods of the Scala collection API on Java collections:
import collection.JavaConversions._
val result = myJavaUtilMap.map {
  case (k, v) => MyNewObject(k, v.getMyParameter)
}.toList
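Note that JavaConversions has since been deprecated; on Scala 2.13+ the explicit converters do the same job. A minimal sketch, assuming the same MyObject/MyNewObject types from the question:
import scala.jdk.CollectionConverters._

val result: List[MyNewObject] = myJavaUtilMap.asScala.toList.map {
  case (k, v) => MyNewObject(k, v.getMyParameter)
}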
By the way: to define a function which returns Unit, you can explicitly specify the return type:
val f = (x: Int) => x: Unit

Create new column with function in Spark Dataframe

I'm trying to figure out the new dataframe API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a dataframe with 2 columns, "ID" and "Amt". As a generic example, say I want to return a new column called "code" that returns a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then I start getting errors compiling the function because it wants a Boolean in the if statement.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.
Let's say you have an "Amt" column in your schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column.
We should avoid defining udf functions as much as possible, due to the overhead of serializing and deserializing the column data.
You can achieve the same with the simple built-in when function in Spark, as below:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))
Another way of doing this:
You can create any function, but according to the above error, you should define the function as a variable wrapped in udf.
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))