Use withColumn with an external function - Scala

I have data in a DataFrame with the columns below. The file format is CSV and all column datatypes are String:
employeeid,pexpense,cexpense
Now I need to create a new DataFrame with an additional column called expense, calculated from the pexpense and cexpense columns.
The tricky part is that the calculation is not a UDF I wrote myself; it's an external function imported from a Java library, which takes primitive types as arguments - in this case pexpense and cexpense - to compute the value for the new column.
This is the function signature, which comes from the external Java jar:
public class MyJava
{
    public Double calculateExpense(Double pexpense, Double cexpense) {
        // calculation
    }
}
So how can I invoke that external function to create the new calculated column? Can I register it as a UDF in my Spark application?

You can create a UDF from the external method similar to the following (illustrated using the Scala REPL):
// From a Linux shell prompt:
vi MyJava.java
public class MyJava {
    public Double calculateExpense(Double pexpense, Double cexpense) {
        return pexpense + cexpense;
    }
}
:wq
javac MyJava.java
jar -cvf MyJava.jar MyJava.class
spark-shell --jars /path/to/jar/MyJava.jar
// From within the Spark shell
val df = Seq(
  ("1", "1.0", "2.0"), ("2", "3.0", "4.0")
).toDF("employeeid", "pexpense", "cexpense")

val myJava = new MyJava
val myJavaUdf = udf(myJava.calculateExpense _)

val df2 = df.withColumn("totalexpense", myJavaUdf($"pexpense", $"cexpense"))
df2.show
+----------+--------+--------+------------+
|employeeid|pexpense|cexpense|totalexpense|
+----------+--------+--------+------------+
| 1| 1.0| 2.0| 3.0|
| 2| 3.0| 4.0| 7.0|
+----------+--------+--------+------------+
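One caveat worth hedging on: because udf wraps an instance method here, the myJava instance is captured in the closure and must be serializable when the job runs on a cluster. If it is not, a small sketch like the following, which constructs MyJava inside the lambda (assuming the no-arg constructor shown above), avoids capturing it:
import org.apache.spark.sql.functions.udf

// Sketch: build the helper inside the lambda so nothing external is captured.
val myJavaUdf2 = udf((p: java.lang.Double, c: java.lang.Double) => new MyJava().calculateExpense(p, c))
val df3 = df.withColumn("totalexpense", myJavaUdf2($"pexpense", $"cexpense"))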

You can simply "wrap" the given method in a UDF by passing it as an argument to the udf function in org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val myUdf = udf((new MyJava).calculateExpense _) // the method lives on a MyJava instance
val newDF = df.withColumn("expense", myUdf($"pexpense", $"cexpense"))
This assumes pexpense and cexpense columns are both Doubles.
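If, as in the question, pexpense and cexpense were read from CSV as strings, an explicit cast keeps that assumption visible (a small sketch; newDFFromStrings is just an illustrative name):
val newDFFromStrings = df.withColumn(
  "expense",
  myUdf($"pexpense".cast("double"), $"cexpense".cast("double")) // cast the String columns before applying the UDF
)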

Below is an example that sums two columns:
val somme = udf((a: Int, b: Int) => a + b)
val df_new = df.select(
  col("employeeid"),
  col("pexpense"),
  col("cexpense"),
  somme(col("pexpense"), col("cexpense")) as "expense")
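For a plain sum a UDF is not needed at all; built-in column arithmetic covers it (a sketch, casting the String columns from the question first):
import org.apache.spark.sql.functions.col

val df_sum = df.withColumn(
  "expense",
  col("pexpense").cast("double") + col("cexpense").cast("double") // native column arithmetic, no UDF
)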

Related

How to wrap multiple sql functions into a UDF in Spark?

I am working with Spark 2.3.2.
On one column within my DataFrame I am applying many spark.sql.functions sequentially. How can I wrap this sequence of functions into a user-defined function (UDF) to make it reusable?
Here is my example, focusing on the single column "columnName". First I create my test data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructType}

val testSchema = new StructType()
  .add("columnName", new StructType()
    .add("2020-11", LongType)
    .add("2020-12", LongType)
  )
val testRow = Seq(Row(Row(1L, 2L)))
val testRDD = spark.sparkContext.parallelize(testRow)
val testDF = spark.createDataFrame(testRDD, testSchema)
testDF.printSchema()
/*
root
 |-- columnName: struct (nullable = true)
 |    |-- 2020-11: long (nullable = true)
 |    |-- 2020-12: long (nullable = true)
*/
testDF.show(false)
/*
+----------+
|columnName|
+----------+
|[1, 2] |
+----------+
*/
And here is the sequence of applied Spark SQL functions (just as an example):
val testResult = testDF
.select(explode(split(regexp_replace(to_json(col("columnName")), "[\"{}]", ""), ",")).as("result"))
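For reference, that sequence produces the following (the values follow from applying to_json, regexp_replace and split to the struct above):
testResult.show(false)
/*
+---------+
|result   |
+---------+
|2020-11:1|
|2020-12:2|
+---------+
*/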
I am failing to create a UDF "myUDF" such that I get the same result when calling
val testResultWithUDF = testDF.select(myUDF(col("columnName")))
This is what I "would like" to do:
def parseAndExplode(spalte: Column): Column = {
  explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
val myUDF = udf(parseAndExplode _)
testDF.withColumn("udf_result", myUDF(col("columnName"))).show(false)
but it is throwing an Exception:
Schema for type org.apache.spark.sql.Column is not supported
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
I also tried using a Row as the input parameter, but then failed again when trying to apply the built-in SQL functions.
There is no need to use a UDF here. explode, split and most other functions from org.apache.spark.sql.functions already return an object of type Column.
def parseAndExplode(spalte: Column): Column = {
  explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
testDF.withColumn("udf_result", parseAndExplode('columnName)).show(false)
prints
+----------+----------+
|columnName|udf_result|
+----------+----------+
|[1, 2] |2020-11:1 |
|[1, 2] |2020-12:2 |
+----------+----------+
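Since parseAndExplode just builds a Column expression, it can be reused wherever a Column is expected, for example to reproduce the original testResult (a sketch using the names from the question):
val testResultWithFunc = testDF.select(parseAndExplode(col("columnName")).as("result"))
testResultWithFunc.show(false) // same two rows as above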

Pass Spark SQL function name as parameter in Scala

I am trying to pass a Spark SQL function name to my defined function in Scala.
I am trying to get same functionality as:
myDf.agg(max($"myColumn"))
my attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(myParameter($"myColumn"))
}
Obviously it shouldn't work, since I'm providing a String, but I am unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide the SQL function name (and it could be any other aggregate function) as a parameter when calling my function.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String, String], which allows you to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(Map("myColumn" -> myParameter))
}
example:
val df = Seq(1.0,2.0,3.0).toDF("myColumn")
myFunc(df,"avg")
.show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
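If you need more than one aggregation, note that a Map can hold only one entry per column; the (String, String) varargs overload of agg does not have that restriction (a sketch on the same df):
df.agg("myColumn" -> "avg", "myColumn" -> "max").show()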
Try this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}
import spark.implicits._ // for toDF; already in scope in spark-shell

val df = Seq((1, 2, 12), (2, 1, 21), (1, 5, 10), (5, 3, 9), (2, 5, 4)).toDF("a", "b", "c")

def myFunc(df: DataFrame, f: Column): DataFrame = {
  df.agg(f)
}

myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!
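If you would rather pass the function itself, as in myFunc(anyDf, max), a sketch that takes a Column => Column also works, because the aggregate functions in org.apache.spark.sql.functions all return a Column:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, max}
import spark.implicits._ // assuming a spark-shell style session, for toDF

// Sketch: accept any aggregation expressed as Column => Column.
def myAggFunc(df: DataFrame, f: Column => Column): DataFrame =
  df.agg(f(col("myColumn")))

myAggFunc(Seq(1.0, 2.0, 3.0).toDF("myColumn"), c => max(c)).show()
// myAggFunc(..., c => avg(c)) works the same way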

spark convert dataframe to dataset using case class with option fields

I have the following case class:
case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}
And the following json:
{ "name": "bemjamin", "age" : 1 }
When I try to transform my dataframe into a dataset:
spark.read.json("example.json")
.as[Person].show()
It shows me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'lastname' given input columns: [age, name];
My question is: If my schema is my case class and it defines that the lastname is optional, shouldn't the as() do the conversion?
I can easily fix this using a .map but I would like to know if there is another cleaner alternative to this.
We have one more option to solve the above issue. There are 2 steps required:
1. Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).
2. Provide a schema argument instead of depending on schema inference. You can, for example, use the Spark SQL Encoder:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
You can update the code as below:
val schema = Encoders.product[Person].schema
val df = spark.read
  .schema(schema)
  .json("/Users/../Desktop/example.json")
  .as[Person]
df.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
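Because lastname comes through as an Option[String] in the resulting Dataset[Person], the missing value can then be handled safely, for example (a sketch, assuming spark.implicits._ is in scope for the String encoder):
df.map(p => p.lastname.getOrElse("unknown")).show() // df is the Dataset[Person] built above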
When you perform spark.read.json("example.json").as[Person].show(), it basically reads the dataframe as
FileScan json [age#6L,name#7]
and then tries to apply the encoder for the Person object, hence the AnalysisException, since it cannot find lastname in your json file.
Either you could hint to Spark that lastname is optional by supplying some data that has a lastname, or try this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema: StructType = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]
val x = spark.read
  .schema(schema)
  .json("src/main/resources/json/x.json")
  .as[Person]
x.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
Hope it helps.

How to pass dataset column value to a function while using spark filter with scala?

I have an action dataset which consists of a user id and an action type:
+-------+------+
|user_id|  type|
+-------+------+
|     11|SEARCH|
|     11|DETAIL|
|     12|SEARCH|
+-------+------+
I want to filter the actions that belong to users who have at least one SEARCH action.
So I created a Bloom filter with the user ids that have a SEARCH action.
Then I tried to filter all actions depending on the Bloom filter's user membership:
import org.apache.spark.util.sketch.BloomFilter

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect

val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))

df.filter(bloomFilter.mightContainString($"user_id"))
But the code gives an exception
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how I can pass the column value to the BloomFilter.mightContainString method.
Create the filter (built from the SEARCH rows, since those are the users you want to keep):
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.filter($"type" === "SEARCH").stat.bloomFilter("user_id", expectedNumItems, fpp)
Use udf for filtering:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting it to the driver.
You can do something like this,
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
searchers.foreach(bloomFilter.putString(_)) // populate the filter with the SEARCH user ids
At this point, I'll mention that collect is not a good idea. Next you can do something like:
import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)
val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))
df.filter(filterUDF($"user_id"))
You can remove the broadcasting if the bloomFilter instance is serializable.
Hope this helps, Cheers.

How to write "like '%ABC%' " in Spark [duplicate]

How do I write a SQL statement in Spark to achieve the same result as the following statement:
SELECT * FROM table t WHERE t.a LIKE '%'||t.b||'%';
Thanks.
spark.sql.Column provides a like method, but as of now (Spark 1.6.0 / 2.0.0) it works only with string literals. Still, you can use raw SQL:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // Make sure you use HiveContext
import sqlContext.implicits._ // Optional, just to be able to use toDF
val df = Seq(("foo", "bar"), ("foobar", "foo"), ("foobar", "bar")).toDF("a", "b")
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE a LIKE CONCAT('%', b, '%')")
// +------+---+
// | a| b|
// +------+---+
// |foobar|foo|
// |foobar|bar|
// +------+---+
or expr / selectExpr:
df.selectExpr("a like CONCAT('%', b, '%')")
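To actually filter the rows rather than just project the boolean, the same SQL fragment can go through expr (a sketch):
import org.apache.spark.sql.functions.expr

df.where(expr("a LIKE CONCAT('%', b, '%')")).show()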
In Spark 1.5 it will require HiveContext. If for some reason a Hive context is not an option, you can use a custom udf:
import org.apache.spark.sql.functions.udf
val simple_like = udf((s: String, p: String) => s.contains(p))
df.where(simple_like($"a", $"b"))
val regex_like = udf((s: String, p: String) =>
new scala.util.matching.Regex(p).findFirstIn(s).nonEmpty)
df.where(regex_like($"a", $"b"))
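As a side note, on more recent Spark versions Column.contains also accepts another Column, so a plain substring match can skip the udf entirely (worth verifying against your version):
df.where($"a".contains($"b")).show()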