How to call a UDF with multiple arguments (currying) in Spark SQL? - Scala

How do I call the below UDF with multiple arguments (currying) on a Spark DataFrame, as shown below?
Read the file and get a List[String]:
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF:
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
Call the UDF on the DataFrame below:
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
Here I am missing the List[String] argument, and I am really not sure how I should pass this argument.

I can make the following assumptions about your requirement based on your question:
a] the UDF should accept a parameter other than a DataFrame column
b] the UDF should take multiple columns as parameters
Let's say you want to concatenate values from all columns along with a specified parameter. Here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String](
  (value1: Int, value2: String, value3: String) =>
    value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
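If you would rather keep the curried getVal from the question, here is a sketch of the same pattern applied to it (assuming Udfnc.getVal(id: Int, s1: String, s2: String)(lst: List[String]): String exists as described in the question): bind the List[String] when registering the UDF, so only the three columns are passed at call time.
// Sketch only: Udfnc.getVal is assumed to have the curried signature from the question.
val data: List[String] = sc.textFile("file.csv").flatMap(_.split("\n")).collect.toList
// Bind the non-column argument here; the resulting UDF only takes column values.
val getValue = udf((id: Int, s1: String, s2: String) => Udfnc.getVal(id, s1, s2)(data))
df.withColumn("value", getValue(df("id"), df("string1"), df("string2"))).show()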

Defining a UDF with multiple parameters:
val enrichUDF: UserDefinedFunction = udf((jsonData: String, id: Long) => {
  val lastOccurence = jsonData.lastIndexOf('}')
  val sid = ",\"site_refresh_stats_id\":" + id + " }]"
  val enrichedJson = jsonData.patch(lastOccurence, sid, sid.length)
  enrichedJson
})
Calling the UDF on an existing DataFrame:
val enrichedDF = EXISTING_DF
  .withColumn("enriched_column",
    enrichUDF(col("jsonData"), col("id")))
An import statement is also required:
import org.apache.spark.sql.expressions.UserDefinedFunction
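A minimal end-to-end sketch of how this UDF might be applied (the toy DataFrame, its values, and its column names are assumptions here, and spark.implicits._ is assumed to be in scope):
import org.apache.spark.sql.functions.col
// Hypothetical toy data just to exercise enrichUDF; column names match the snippet above.
val toyDF = Seq(("""[{"a":1}]""", 42L)).toDF("jsonData", "id")
toyDF.withColumn("enriched_column", enrichUDF(col("jsonData"), col("id"))).show(false)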

Related

Create a new column in Spark DataFrame using UDF

I have a UDF as below -
val myUdf = udf((col_abc: String, col_xyz: String) => {
  array(
    struct(
      lit("x").alias("col1"),
      col(col_abc).alias("col2"),
      col(col_xyz).alias("col3")
    )
  )
})
Now, I want to use this in a function as below -
def myfunc(): Column = {
  val myvariable = myUdf($"col_abc", $"col_xyz")
  myvariable
}
And then use this function to create a new column in my DataFrame
val df = df.withColumn("new_col", myfunc())
In summary, I want my column "new_col" to be of array type with values like [[x, x, x]].
I am getting the below error. What am I doing wrong here?
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
Two ways.
1. Don't use a UDF, because you're only using pure Spark functions:
val myUdf = (col_abc: String, col_xyz: String) => {
  array(
    struct(
      lit("x").alias("col1"),
      col(col_abc).alias("col2"),
      col(col_xyz).alias("col3")
    )
  )
}
def myfunc(): Column = {
  val myvariable = myUdf("col_abc", "col_xyz")
  myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
2. Use a UDF which takes in strings and returns a Seq of a case class:
case class cols (col1: String, col2: String, col3: String)
val myUdf = udf((col_abc: String, col_xyz: String) => Seq(cols("x", col_abc, col_xyz)))
def myfunc(): Column = {
  val myvariable = myUdf($"col_abc", $"col_xyz")
  myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
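To double-check that new_col really is an array of structs in either approach, a quick sketch is to inspect the schema (exact nullability flags may differ):
df.withColumn("new_col", myfunc()).printSchema()
// new_col should show up as array<struct<col1:string,col2:string,col3:string>>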
If you want to pass in Columns to the function, here's an example:
val myUdf = (col_abc: Column, col_xyz: Column) => {
  array(
    struct(
      lit("x").alias("col1"),
      col_abc.alias("col2"),
      col_xyz.alias("col3")
    )
  )
}
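A possible way to call that Column-based variant (a sketch; since this myUdf is a plain Scala function over Columns rather than a registered UDF, it can be used directly in withColumn):
import org.apache.spark.sql.functions.col
df.withColumn("new_col", myUdf(col("col_abc"), col("col_xyz"))).show(false)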

DataFrame to Array of Jsons

I have a dataframe as below
+-------------+-------------+-------------+
| columnName1 | columnName2 | columnName3 |
+-------------+-------------+-------------+
| 001 | 002 | 003 |
+-------------+-------------+-------------+
| 004 | 005 | 006 |
+-------------+-------------+-------------+
I want to convert it to JSON in the expected format below.
EXPECTED FORMAT
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]
Thanks in Advance
I have tried this with the play-json APIs:
val ColumnsNames: Seq[String] = DF.columns.toSeq
val result = DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](ColumnsNames)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10)
Now it is coming in this format:
["[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}]","[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]"]
But I need it in this expected format with play-json with encoders:
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]
I am facing this issue:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map { row =>
Basically, I am converting Array[String] to Array[Array[JsValue]].
The above exception is thrown because Spark does not have an encoder for JsValue to serialize/deserialize it in a DataFrame. See Spark custom encoders for that. However, instead of returning a JsValue, its toString can be returned inside the DataFrame map operation.
One approach could be:
1. Convert each row of the DataFrame to a compatible JSON string, e.g. [{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}]
2. Collect the DataFrame as an array/list and mkString it using a "," delimiter
3. Enclose the above string inside "[]"
Note: the code below uses collect(), which could choke the Spark driver.
//CSV
c1,c2,c3
001,002,003
004,005,006
//Code
import play.api.libs.json.Json

val df = spark.read.option("header", "true").csv("array_json.csv")
val allColumns = df.schema.map(s => s.name)
//Import Spark implicits Encoder
import spark.implicits._
val sdf = df.map(row => {
  val kv = row.getValuesMap[String](allColumns)
  Json.toJson(kv.map(x => {
    List(
      "key" -> x._1,
      "value" -> x._2
    ).toMap
  })).toString()
})
val dfString = sdf.collect().mkString(",")
val op = s"[$dfString]"
println(op)
Output:
[[{"key":"c1","value":"001"},{"key":"c2","value":"002"},{"key":"c3","value":"003"}],[{"key":"c1","value":"004"},{"key":"c2","value":"005"},{"key":"c3","value":"006"}]]
Another approach without RDD:
import spark.implicits._
val df = List((1, 2, 3), (11, 12, 13), (21, 22, 23)).toDF("A", "B", "C")
val fKeyValue = (name: String) =>
  struct(lit(name).as("key"), col(name).as("value"))
val lstCol = df.columns.foldLeft(List[Column]())((a, b) => fKeyValue(b) :: a)
val dsJson = df
  .select(collect_list(array(lstCol: _*)).as("obj"))
  .toJSON
import play.api.libs.json._
val json: JsValue = Json.parse(dsJson.first())
val arrJson = json \ "obj"
println(arrJson)
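Note that json \ "obj" returns a play-json JsLookupResult rather than a bare JsValue, so printing the clean JSON array would look something like this sketch:
// .get unwraps the JsLookupResult; use .toOption or .validate for safer handling.
println((json \ "obj").get)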
val ColumnsNames: Seq[String] = DF.columns.toSeq
val result = Json.parse(DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](ColumnsNames)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10).mkString("[", ", ", "]"))
gives
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]

How to call a method based on a JSON object in Scala Spark?

I have two functions like below:
def method1(ip: String, r: Double, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}
def method2(ip: String, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "StockCode")
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}
I want to call these methods based on a JSON object parameter.
For example, if my input JSON is like below:
{"name":"method1","ip":"Or.csv","r":1.0,"op":"oppath"}
it has to call method1 with "Or.csv", 1.0, "oppath" as parameters, i.e. in the JSON object, name indicates the method name and the remaining fields are the parameters.
Please help me with this.
First we need to read the JSON through Spark into a DataFrame:
val df = sqlContext.read.json("path to the json file")
which should give you a DataFrame like:
scala> df.show()
+------+-------+------+---+
| ip| name| op| r|
+------+-------+------+---+
|Or.csv|method1|oppath|1.0|
+------+-------+------+---+
Next
scala> def method1(ip:String,r:Double,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method1: (ip: String, r: Double, op: String)Unit
Next
scala> def method2(ip:String,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "StockCode")
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method2: (ip: String, op: String)Unit
Next
scala>df.withColumn("methodCalling",when($"name" === "method1",method1(df.first().getString(1),df.first().getDouble(2),df.first().getString(3))).otherwise(when($"name" === "method2", method2(df.first().getString(1),df.first().getString(2)))))
It will call method1 or method2 based on the JSON object.
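An alternative sketch, not from the answer above: since method1 and method2 return Unit and launch their own Spark jobs, the dispatch can also simply happen on the driver by reading the config row and pattern matching on name (assuming one JSON object per file):
// Hypothetical driver-side dispatch; field access by name avoids relying on column order.
val cfg = sqlContext.read.json("path to the json file").first()
cfg.getAs[String]("name") match {
  case "method1" => method1(cfg.getAs[String]("ip"), cfg.getAs[Double]("r"), cfg.getAs[String]("op"))
  case "method2" => method2(cfg.getAs[String]("ip"), cfg.getAs[String]("op"))
  case other     => throw new IllegalArgumentException(s"Unknown method name: $other")
}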

How to use functions.explode to flatten elements in a DataFrame

I've made this piece of code:
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session: SparkSession): DataFrame = {
  val newPanda = RawPanda(1, "M1B 5K7", "giant", true, Array(0.1, 0.1))
  val pandaPlace = PandaPlace("torronto", Array(newPanda))
  val df = session.createDataFrame(Seq(pandaPlace))
  df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo2 = df2.explode(df2("pandas")) {
  case Row(pandas: Seq[Row]) =>
    pandas.map {
      case Row(
        id: Long,
        zip: String,
        pt: String,
        happy: Boolean,
        attrs: Seq[Double]) => RawPanda(id, zip, pt, happy, attrs.toArray)
    }
}
pandaInfo2.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode function as I used it is deprecated, so I would like to recalculate the pandaInfo2 DataFrame using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But then when I do :
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed to use flatMap or functions.explode.
How could I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo2)?
I've seen this post and this other one but none of them helped me.
Calling select with the explode function returns a DataFrame where the array pandas is "broken up" into individual records; then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated "route":
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXX} substring with the VALUE string in that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.", process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works but is not very optimized, I think. I actually do recursive joins on the initial DataFrame to replace the NUMs with ENDs. Here is the code:
case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")

    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))

    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")

    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.
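A different sketch, not from the answer above: if the CODE -> VALUE table is small enough to collect, one way to avoid touching df inside the UDF (the cause of the NullPointerException) is to collect it into a local Map first and close over that, or broadcast it; cyclic {NUM...} references would still need extra handling:
import org.apache.spark.sql.functions.udf

// Collect the lookup table to the driver; the Map is serialized with the UDF closure.
val codeToValue: Map[Int, String] = df.collect().map(r => r.getInt(0) -> r.getString(1)).toMap

val resolve = udf { (ln: String) =>
  val pattern = """\{NUM\.(\d+)\}""".r
  var out = ln
  var m = pattern.findFirstMatchIn(out)
  // Keep substituting while a known {NUM.XXXX} reference remains.
  while (m.isDefined && codeToValue.contains(m.get.group(1).toInt)) {
    val hit = m.get
    out = out.substring(0, hit.start) + codeToValue(hit.group(1).toInt) + out.substring(hit.end)
    m = pattern.findFirstMatchIn(out)
  }
  out
}

val resolvedDF = df.withColumn("VALUE", resolve($"VALUE"))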