Chained DataFrame transformations in Scala with arguments and conditions

I have a dataframe in Scala for which I want to add transformations and filters depending on conditions passed as arguments to the function.
For example, I'm trying to do something like this:
val lst_conditions = List("condition1", "condition2", ..., "conditionN")
for (condition_string <- lst_conditions) {
  var new_df = df.transform(FilterOrNot(condition_string))
}
But the way I'm defining the function below doesn't work:
def FilterOrNot(c: String)(df: DataFrame): DataFrame = {
  if (c == "condition1") df.filter($"price" >= $"avg_price")
  else if (c == "condition2") df.filter($"price" >= $"median_price")
  // If the condition is different, do nothing.
}
The error I get is:
<console>:73: error: type mismatch;
found : Unit
required: org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
else if ...
^
How can I implement this?

A little more info about why the function doesn't work helps here: since there is no final else, the expression falls through to an implicit else () whenever neither condition matches, so its type is Unit instead of the declared DataFrame return type, which is exactly what the compiler reports.
The one thing I would recommend adding is a final default to your custom transformation, as follows:
def FilterOrNot(c: String)(df: DataFrame): DataFrame = {
  if (c == "condition1") df.filter($"price" >= $"avg_price")
  else if (c == "condition2") df.filter($"price" >= $"median_price")
  else df
}
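As a side note, the loop in the question transforms the original df on every iteration, and new_df is only visible inside the loop body, so nothing is actually chained. If the intent is to apply all conditions in sequence, a foldLeft threads each result into the next step. A minimal sketch reusing lst_conditions and FilterOrNot from above (finalDf is just an illustrative name):
// Apply each condition in turn, feeding the filtered result into the next transform.
val finalDf = lst_conditions.foldLeft(df) { (accDf, conditionString) =>
  accDf.transform(FilterOrNot(conditionString))
}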

Related

How do I loop over an array in a UDF and return multiple values

I'm new to Scala and UDFs. I would like to write a UDF that accepts 3 parameters from DataFrame columns (one of them is an array), loops over the array, parses it, and returns a case class that will be used afterwards. Here's my code, roughly:
case class NewFeatures(dd: Boolean, zz: String)

val resultUdf = udf((arrays: Option[Row], jsonData: String, placement: Int) => {
  for (item <- arrays) {
    val aa = item.getAs[Long]("aa")
    val bb = item.getAs[Long]("bb")
    breakable {
      if (aa <= 0 || bb <= 0) break
    }
    val cc = item.getAs[Long]("cc")
    val dd = cc > 0
    val jsonData = item.getAs[String]("json_data")
    val jsonDataObject = JSON.parseFull(jsonData).asInstanceOf[Map[String, Any]]
    var zz = jsonDataObject.getOrElse("zz", "").toString
    NewFeatures(dd, zz)
  }
})
When I run it, I get this exception:
java.lang.UnsupportedOperationException: Schema for type Unit is not supported
How should I modify the above UDF?
First of all, try to use better names for your variables; for instance, "arrays" here is actually of type Option[Row]. The expression for (item <- arrays) {...} is essentially a .map call on that Option. When you map over an Option, you supply a function that takes a Row and returns a value of some type (roughly: def map[V](f: Row => V): Option[V]; what you want here is def map(f: Row => NewFeatures): Option[NewFeatures]). Because you break out of that map in some circumstances, the compiler cannot guarantee that the function inside map always returns a NewFeatures instance, so its result type is inferred as Unit (it only yields a value in some cases, not all).
What you want to do could be rewritten as something similar to this:
val funcName: (Option[Row], String, Int) => Option[NewFeatures] =
  (rowOpt, jsonData, placement) => rowOpt.filter(
    /* your break condition */
  ).map { row => // if it passes the filter predicate =>
    // fetch data from row, create a new instance
  }
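For illustration, here is one way that outline might be filled in, following the question's column names ("aa", "bb", "cc", "json_data") and its use of JSON.parseFull; the name extractFeatures is a placeholder and this is an untested sketch:
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSON

case class NewFeatures(dd: Boolean, zz: String)

val extractFeatures: (Option[Row], String, Int) => Option[NewFeatures] =
  (rowOpt, jsonData, placement) =>
    rowOpt
      .filter(row => row.getAs[Long]("aa") > 0 && row.getAs[Long]("bb") > 0) // replaces the old breakable/break
      .map { row =>
        val dd = row.getAs[Long]("cc") > 0
        val zz = JSON.parseFull(row.getAs[String]("json_data")) match {
          case Some(m: Map[String, Any] @unchecked) => m.getOrElse("zz", "").toString
          case _ => ""
        }
        NewFeatures(dd, zz)
      }
// It can then be wrapped as before: val resultUdf = udf(extractFeatures)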

try/catch not working when using a tail-recursive function

I'm building a tail-recursive function that reads multiple HDFS paths and merges all of them into a single dataframe. The function works perfectly as long as all the paths exist; if not, it fails and does not finish joining the data from the paths that do exist. To solve this I have tried to handle the error with try/catch, but without success.
The error says: could not optimize @tailrec annotated method loop: it contains a recursive call not in tail position
My function is :
def getRangeData(toOdate: String, numMonths: Int, pathRoot: String, ColumnsTable: List[String]): DataFrame = {
  val dataFrameNull = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
    StructType((ColumnsTable :+ "odate").map(columnName => StructField(columnName, StringType, true)))))
  val rangePeriod = getRangeDate(numMonths, toOdate)

  @tailrec
  def unionRangeData(rangePeriod: List[LocalDate], pathRoot: String, df: DataFrame = dataFrameNull): DataFrame = {
    try {
      if (rangePeriod.isEmpty) {
        df
      }
      else {
        val month = "%02d".format(rangePeriod.head.getMonthValue)
        val year = rangePeriod.head.getYear
        val odate = rangePeriod.head.toString
        val path = s"${pathRoot}/partition_data_year_id=${year}/partition_data_month_id=${month}"
        val columns = ColumnsTable.map(columnName => trim(col(columnName)).as(columnName))
        val dfTemporal = spark.read.parquet(path).select(columns: _*).withColumn("odate", lit(odate).cast("date"))
        unionRangeData(rangePeriod.tail, pathRoot, df.union(dfTemporal))
      }
    } catch {
      case e: Exception =>
        logger.error("path not exist")
        dataFrameNull
    }
  }

  unionRangeData(rangePeriod, pathRoot)
}
def getRangeDate(numMonths: Int, toOdate: String, listDate: List[LocalDate] = List()): List[LocalDate] = {
  if (numMonths == 0) {
    listDate
  }
  else {
    getRangeDate(numMonths - 1, toOdate, LocalDate.parse(toOdate).plusMonths(1).minusMonths(numMonths) :: listDate)
  }
}
In advance, thank you very much for your help.
I would suggest you remove the try-catch construct entirely from the function and use it instead at the call site at the bottom of getRangeData.
Alternatively you can also use scala.util.Try to wrap the call: Try(unionRangeData(rangePeriod, pathRoot)), and use one of its combinators to perform your logging or provide a default value in the error case.
Related post which explains why the Scala compiler cannot perform tail call optimization inside try-catch:
Why won't Scala optimize tail call with try/catch?
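A minimal sketch of that suggestion, combining both options: unionRangeData stays free of try/catch so @tailrec holds, and failures are handled at the call site with scala.util.Try (this reuses rangePeriod, pathRoot, logger and dataFrameNull from the code above):
import scala.util.{Failure, Success, Try}

val result: DataFrame = Try(unionRangeData(rangePeriod, pathRoot)) match {
  case Success(df) => df // all paths were read and unioned
  case Failure(e) =>
    logger.error(s"path not exist: ${e.getMessage}")
    dataFrameNull // default value in the error case
}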

Does the following create a new object on every executor for every Row, in Spark?

I have the following example
class AddOne() { val adding = 2; def getVal = 1 + adding }
val a = List(1, 2, 3).toDF
a.filter(col("value") === new AddOne().getVal).show()
Will this create a new object (AddOne) on every executor for every row/data point?
No, it will be created only once on the driver.
Here is the simplified code of the === method:
def === (other: Any): Column = {
  val right = Literal.create(other)
  EqualTo(expr, right)
}
Here expr is your col("value"), which is resolved against the actual column values at runtime, and right is a foldable literal: new AddOne().getVal is evaluated once on the driver and its result (3) is embedded in the plan.
If you have any doubts, use df.explain(true); it will help you understand what is going to be executed.
In your case:
== Parsed Logical Plan ==
'Filter ('value = 3)
+- LocalRelation [value#1]
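For reference, a plan like this is what a call such as the following prints (a sketch; the exact output can vary by Spark version):
a.filter(col("value") === new AddOne().getVal).explain(true)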

Can I have a condition inside of a where or a filter?

I have a dataframe with many columns, and to explain the situation, let's say, there is a column with letters in it from a-z. I also have a list, which includes some specific letters.
val testerList = List("a","k")
The dataframe has to be filtered to only include entries with the specified letters from the list. This is very straightforward:
val resultDF = df.where($"column".isin(testerList:_*))
The problem is that the list is given to this function as a parameter and it can be empty. That situation could be handled like this (resultDF is declared beforehand as an empty DataFrame):
if (!(testerList.isEmpty)) {
  resultDF = df.where(some other stuff has to be filtered away)
    .where($"column".isin(testerList:_*))
} else {
  resultDF = df.where(some other stuff has to be filtered away)
}
Is there a way to do this more simply, something like this:
val resultDF = df.where(some other stuff has to be filtered away)
  .where((!(testerList.isEmpty)) && $"column".isin(testerList:_*))
This one throws an error though:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
.where( (!(testerList.isEmpty)) && (($"agent_account_homepage").isin(testerList:_*)))
^
So, thanks a lot for any kind of ideas for a solution!! :)
What about this?
val filtered1 = df.where(some other stuff has to be filtered away)
val resultDF = if (testerList.isEmpty)
  filtered1
else
  filtered1.where($"column".isin(testerList:_*))
Or if you don't want filtered1 to be available below and perhaps unintentionally used, it can be declared inside a block initializing resultDF:
val resultDF = {
  val filtered1 = df.where(some other stuff has to be filtered away)
  if (testerList.isEmpty) filtered1 else filtered1.where($"column".isin(testerList:_*))
}
Or, if you change the order:
val resultDF = (if (testerList.isEmpty)
    df
  else
    df.where($"column".isin(testerList:_*))
  ).where(some other stuff has to be filtered away)
Essentially, what Spark expects to receive in where is a plain Column object. This means you can extract all your complicated where logic into a separate function:
def testerFilter(testerList: List[String]): Column = testerList match {
  // of course, you have to replace ??? with real conditions;
  // just append them by joining with "and"
  case Nil => $"column".isNotNull and ???
  case tl => $"column".isin(tl: _*) and ???
}
And then you just use it like:
df.where(testerFilter(testerList))
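For illustration, a filled-in sketch of that function, where otherCond stands in for "some other stuff has to be filtered away" (otherCond and the status column are made up for the example):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical placeholder for the rest of the filtering logic.
val otherCond: Column = col("status") === "active"

def testerFilter(testerList: List[String]): Column = testerList match {
  case Nil => col("column").isNotNull and otherCond
  case tl => col("column").isin(tl: _*) and otherCond
}

val resultDF = df.where(testerFilter(testerList))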
The solution I use now uses SQL code inside the where clause:
var testerList = s""""""
var cond = testerList.isEmpty().toString
testerList = if (cond == "true") "''" else testerList
val resultDF = df.where(some other stuff has to be filtered away)
  .where("('" + cond + "' = 'true') or (agent_account_homepage in (" + testerList + "))")
What do you think?

Spark UDF as function parameter, UDF is not in function scope

I have a few UDFs that I'd like to pass along as function arguments along with data frames.
One way to do this might be to create the UDF within the function, but that would create and destroy several instances of the UDF without reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}

val df = inputDF1
  .withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
  .withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}

def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
  df
    .withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return a primitive type or a user-defined case class type. Any thoughts/pointers would be much appreciated. Thanks.
Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.UserDefinedFunction
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
  df.withColumn("new_col", func(col("c1")))
}
The Scala REPL is quite helpful for showing the types of the values you initialize.
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function you pass into the udf wrapper has an Any return type (which will be the case if it can return either a primitive or a user-defined case class), creating the UDF will fail at runtime with an exception like this:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
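For the case-class scenario, a minimal sketch of the alternative the answer implies: give the underlying function a concrete return type instead of Any, so a schema can be derived (the Lookup case class and its labels are made up for the example):
import org.apache.spark.sql.functions.udf

case class Lookup(flag: Int, label: String) // concrete, schema-derivable return type

val lkpStructUDF = udf { (i: Int) =>
  if (i > 0) Lookup(1, "positive") else Lookup(0, "non-positive")
}

// Passed around exactly like the primitive-returning version:
val withStruct = appendCols(inputDF1, lkpStructUDF)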