How to call a UDF with multiple arguments (currying) in Spark SQL? - Scala

How do I call the below UDF with multiple arguments (currying) on a Spark DataFrame, as shown below?
Read the file and get a List[String]:
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF:
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
Call the UDF on the DataFrame below:
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
Here I am missing the List[String] argument, and I am really not sure how I should pass this argument.

I can make the following assumptions about your requirement based on your question:
a] the UDF should accept a parameter other than a DataFrame column
b] the UDF should take multiple columns as parameters
Let's say you want to concatenate values from all columns along with a specified parameter. Here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String](
  (value1: Int, value2: String, value3: String) =>
    value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
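If you would rather keep the curried getVal from the question, here is a sketch of the same pattern applied to it (assuming Udfnc.getVal(id: Int, s1: String, s2: String)(lst: List[String]): String exists as described in the question): bind the List[String] when registering the UDF, so only the three columns are passed at call time.
// Sketch only: Udfnc.getVal is assumed to have the curried signature from the question.
val data: List[String] = sc.textFile("file.csv").flatMap(_.split("\n")).collect.toList
// Bind the non-column argument here; the resulting UDF only takes column values.
val getValue = udf((id: Int, s1: String, s2: String) => Udfnc.getVal(id, s1, s2)(data))
df.withColumn("value", getValue(df("id"), df("string1"), df("string2"))).show()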

Defining a UDF with multiple parameters:
val enrichUDF: UserDefinedFunction = udf((jsonData: String, id: Long) => {
  val lastOccurence = jsonData.lastIndexOf('}')
  val sid = ",\"site_refresh_stats_id\":" + id + " }]"
  val enrichedJson = jsonData.patch(lastOccurence, sid, sid.length)
  enrichedJson
})
Calling the UDF on an existing DataFrame:
val enrichedDF = EXISTING_DF
  .withColumn("enriched_column",
    enrichUDF(col("jsonData"), col("id")))
An import statement is also required:
import org.apache.spark.sql.expressions.UserDefinedFunction
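A minimal end-to-end sketch of how this UDF might be applied (the toy DataFrame, its values, and its column names are assumptions here, and spark.implicits._ is assumed to be in scope):
import org.apache.spark.sql.functions.col
// Hypothetical toy data just to exercise enrichUDF; column names match the snippet above.
val toyDF = Seq(("""[{"a":1}]""", 42L)).toDF("jsonData", "id")
toyDF.withColumn("enriched_column", enrichUDF(col("jsonData"), col("id"))).show(false)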

Related

Create a new column in Spark DataFrame using UDF

I have a UDF as below -
val myUdf = udf((col_abc: String, col_xyz: String) => {
  array(
    struct(
      lit("x").alias("col1"),
      col(col_abc).alias("col2"),
      col(col_xyz).alias("col3")
    )
  )
})
Now, I want to use this in a function as below -
def myfunc(): Column = {
  val myvariable = myUdf($"col_abc", $"col_xyz")
  myvariable
}
And then use this function to create a new column in my DataFrame
val df = df.withColumn("new_col", myfunc())
In summary, I want my column "new_col" to be of array type with values like [[x, x, x]].
I am getting the below error. What am I doing wrong here?
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
Two ways.
1. Don't use a UDF, because you're only using pure Spark functions:
val myUdf = (col_abc: String, col_xyz: String) => {
  array(
    struct(
      lit("x").alias("col1"),
      col(col_abc).alias("col2"),
      col(col_xyz).alias("col3")
    )
  )
}
def myfunc(): Column = {
  val myvariable = myUdf("col_abc", "col_xyz")
  myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
2. Use a UDF which takes in strings and returns a Seq of a case class:
case class cols (col1: String, col2: String, col3: String)
val myUdf = udf((col_abc: String, col_xyz: String) => Seq(cols("x", col_abc, col_xyz)))
def myfunc(): Column = {
  val myvariable = myUdf($"col_abc", $"col_xyz")
  myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
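To double-check that new_col really is an array of structs in either approach, a quick sketch is to inspect the schema (exact nullability flags may differ):
df.withColumn("new_col", myfunc()).printSchema()
// new_col should show up as array<struct<col1:string,col2:string,col3:string>>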
If you want to pass in Columns to the function, here's an example:
val myUdf = (col_abc: Column, col_xyz: Column) => {
  array(
    struct(
      lit("x").alias("col1"),
      col_abc.alias("col2"),
      col_xyz.alias("col3")
    )
  )
}
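A possible way to call that Column-based variant (a sketch; since this myUdf is a plain Scala function over Columns rather than a registered UDF, it can be used directly in withColumn):
import org.apache.spark.sql.functions.col
df.withColumn("new_col", myUdf(col("col_abc"), col("col_xyz"))).show(false)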

DataFrame to Array of Jsons

I have a dataframe as below
+-------------+-------------+-------------+
| columnName1 | columnName2 | columnName3 |
+-------------+-------------+-------------+
| 001 | 002 | 003 |
+-------------+-------------+-------------+
| 004 | 005 | 006 |
+-------------+-------------+-------------+
I want to convert it to JSON in the expected format below.
EXPECTED FORMAT
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]
Thanks in Advance
I have tried this with the play-json APIs:
val ColumnsNames: Seq[String] = DF.columns.toSeq
val result = DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](ColumnsNames)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10)
Now it is coming in this format:
["[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}]","[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]"]
But I need it in this expected format with play-json with encoders:
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]
I am facing this issue:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map { row =>
Basically, I am converting Array[String] to Array[Array[JsValue]].
The above exception is thrown because Spark does not have an encoder for JsValue to serialize/deserialize it in a DataFrame. See Spark custom encoders for that. However, instead of returning a JsValue, its toString can be returned inside the DataFrame map operation.
One approach could be:
1. Convert each row of the DataFrame to a compatible JSON string, e.g. [{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}]
2. Collect the DataFrame as an array/list and mkString it using a "," delimiter
3. Enclose the above string inside "[]"
Note: the code below uses collect(), which could choke the Spark driver.
//CSV
c1,c2,c3
001,002,003
004,005,006
//Code
import play.api.libs.json.Json

val df = spark.read.option("header", "true").csv("array_json.csv")
val allColumns = df.schema.map(s => s.name)
//Import Spark implicits Encoder
import spark.implicits._
val sdf = df.map(row => {
  val kv = row.getValuesMap[String](allColumns)
  Json.toJson(kv.map(x => {
    List(
      "key" -> x._1,
      "value" -> x._2
    ).toMap
  })).toString()
})
val dfString = sdf.collect().mkString(",")
val op = s"[$dfString]"
println(op)
Output:
[[{"key":"c1","value":"001"},{"key":"c2","value":"002"},{"key":"c3","value":"003"}],[{"key":"c1","value":"004"},{"key":"c2","value":"005"},{"key":"c3","value":"006"}]]
Another approach without RDD:
import spark.implicits._
val df = List((1, 2, 3), (11, 12, 13), (21, 22, 23)).toDF("A", "B", "C")
val fKeyValue = (name: String) =>
  struct(lit(name).as("key"), col(name).as("value"))
val lstCol = df.columns.foldLeft(List[Column]())((a, b) => fKeyValue(b) :: a)
val dsJson = df
  .select(collect_list(array(lstCol: _*)).as("obj"))
  .toJSON
import play.api.libs.json._
val json: JsValue = Json.parse(dsJson.first())
val arrJson = json \ "obj"
println(arrJson)
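Note that json \ "obj" returns a play-json JsLookupResult rather than a bare JsValue, so printing the clean JSON array would look something like this sketch:
// .get unwraps the JsLookupResult; use .toOption or .validate for safer handling.
println((json \ "obj").get)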
val ColumnsNames: Seq[String] = DF.columns.toSeq
val result = Json.parse(DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](ColumnsNames)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10).mkString("[", ", ", "]"))
gives
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]

How to call a method based on a JSON object in Scala Spark?

I have two functions like below:
def method1(ip: String, r: Double, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}
def method2(ip: String, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "StockCode")
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}
I want to call these methods based on a JSON object parameter.
For example, if my input JSON is like below:
{"name":"method1","ip":"Or.csv","r":1.0,"op":"oppath"}
it has to call method1 with "Or.csv", 1.0, "oppath" as parameters, i.e. in the JSON object, name indicates the method name and the remaining fields are the parameters.
Please help me with this.
First we need to read the JSON through Spark into a DataFrame:
val df = sqlContext.read.json("path to the json file")
which should give you a DataFrame like:
scala> df.show()
+------+-------+------+---+
| ip| name| op| r|
+------+-------+------+---+
|Or.csv|method1|oppath|1.0|
+------+-------+------+---+
Next
scala> def method1(ip:String,r:Double,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method1: (ip: String, r: Double, op: String)Unit
Next
scala> def method2(ip:String,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "StockCode")
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method2: (ip: String, op: String)Unit
Next
scala>df.withColumn("methodCalling",when($"name" === "method1",method1(df.first().getString(1),df.first().getDouble(2),df.first().getString(3))).otherwise(when($"name" === "method2", method2(df.first().getString(1),df.first().getString(2)))))
It will call method1 or method2 based on the JSON object.
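An alternative sketch, not from the answer above: since method1 and method2 return Unit and launch their own Spark jobs, the dispatch can also simply happen on the driver by reading the config row and pattern matching on name (assuming one JSON object per file):
// Hypothetical driver-side dispatch; field access by name avoids relying on column order.
val cfg = sqlContext.read.json("path to the json file").first()
cfg.getAs[String]("name") match {
  case "method1" => method1(cfg.getAs[String]("ip"), cfg.getAs[Double]("r"), cfg.getAs[String]("op"))
  case "method2" => method2(cfg.getAs[String]("ip"), cfg.getAs[String]("op"))
  case other     => throw new IllegalArgumentException(s"Unknown method name: $other")
}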

How to use functions.explode to flatten elements in a DataFrame

I've made this piece of code:
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session: SparkSession): DataFrame = {
  val newPanda = RawPanda(1, "M1B 5K7", "giant", true, Array(0.1, 0.1))
  val pandaPlace = PandaPlace("torronto", Array(newPanda))
  val df = session.createDataFrame(Seq(pandaPlace))
  df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo2 = df2.explode(df2("pandas")) {
  case Row(pandas: Seq[Row]) =>
    pandas.map {
      case Row(
        id: Long,
        zip: String,
        pt: String,
        happy: Boolean,
        attrs: Seq[Double]) => RawPanda(id, zip, pt, happy, attrs.toArray)
    }
}
pandaInfo2.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode function as I used it is deprecated, so I would like to recalculate the pandaInfo2 DataFrame using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But then when I do :
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed to use flatMap or functions.explode.
How could I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo2)?
I've seen this post and this other one but none of them helped me.
Calling select with the explode function returns a DataFrame where the array pandas is "broken up" into individual records; then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated "route":
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXX} substring with the VALUE string in that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.", process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works but is not very optimized, I think. I actually do recursive joins on the initial DataFrame to replace the NUMs with ENDs. Here is the code:
case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")

    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))

    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")

    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.
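A different sketch, not from the answer above: if the CODE -> VALUE table is small enough to collect, one way to avoid touching df inside the UDF (the cause of the NullPointerException) is to collect it into a local Map first and close over that, or broadcast it; cyclic {NUM...} references would still need extra handling:
import org.apache.spark.sql.functions.udf

// Collect the lookup table to the driver; the Map is serialized with the UDF closure.
val codeToValue: Map[Int, String] = df.collect().map(r => r.getInt(0) -> r.getString(1)).toMap

val resolve = udf { (ln: String) =>
  val pattern = """\{NUM\.(\d+)\}""".r
  var out = ln
  var m = pattern.findFirstMatchIn(out)
  // Keep substituting while a known {NUM.XXXX} reference remains.
  while (m.isDefined && codeToValue.contains(m.get.group(1).toInt)) {
    val hit = m.get
    out = out.substring(0, hit.start) + codeToValue(hit.group(1).toInt) + out.substring(hit.end)
    m = pattern.findFirstMatchIn(out)
  }
  out
}

val resolvedDF = df.withColumn("VALUE", resolve($"VALUE"))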