Create a new column in Spark DataFrame using UDF - scala

I have a UDF as below -
val myUdf = udf((col_abc: String, col_xyz: String) => {
array(
struct(
lit("x").alias("col1"),
col(col_abc).alias("col2"),
col(col_xyz).alias("col3")
)
)
})
Now, I want to use this in a function as below -
def myfunc(): Column = {
val myvariable = myUdf($"col_abc", $"col_xyz")
myvariable
}
And then use this function to create a new column in my DataFrame
val newDf = df.withColumn("new_col", myfunc())
In summary, I want my column "new_col" to be of type array, with values such as [[x, x, x]]
I am getting the below error. What am I doing wrong here?
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported

Two ways.
Don't use a UDF because you're using pure Spark functions:
val myUdf = ((col_abc: String, col_xyz: String) => {
array(
struct(
lit("x").alias("col1"),
col(col_abc).alias("col2"),
col(col_xyz).alias("col3")
)
)
}
)
def myfunc(): Column = {
val myvariable = myUdf("col_abc", "col_xyz")
myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
Use a UDF which takes in strings and returns a Seq of case class:
case class cols (col1: String, col2: String, col3: String)
val myUdf = udf((col_abc: String, col_xyz: String) => Seq(cols("x", col_abc, col_xyz)))
def myfunc(): Column = {
val myvariable = myUdf($"col_abc", $"col_xyz")
myvariable
}
df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz| new_col|
+-------+-------+---------------+
| abc| xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
If you want to pass in Columns to the function, here's an example:
val myUdf = ((col_abc: Column, col_xyz: Column) => {
array(
struct(
lit("x").alias("col1"),
col_abc.alias("col2"),
col_xyz.alias("col3")
)
)
}
)

Related

How to input and output a Seq of an object to a function in Scala

I want to parse a column to get split values using a Seq of objects
case class RawData(rawId: String, rawData: String)
case class SplitData(
rawId: String,
rawData: String,
split1: Option[Int],
split2: Option[String],
split3: Option[String],
split4: Option[String]
)
def rawDataParser(unparsedRawData: Seq[RawData]): Seq[SplitData] = {
unparsedRawData.map(rawData => {
val split = rawData.rawData.split(", ")
// Build a SplitData from the raw record; split(0).toInt assumes the first field is numeric
SplitData(
rawId = rawData.rawId,
rawData = rawData.rawData,
split1 = Some(split(0).toInt),
split2 = Some(split(1)),
split3 = Some(split(2)),
split4 = Some(split(3))
)
})
}
val rawDataDF = Seq[(String, String)](
("001", "Split1, Split2, Split3, Split4"),
("002", "Split1, Split2, Split3, Split4")
).toDF("rawId", "rawData") // column names must match the RawData fields for .as[RawData]
val rawDataDS: Dataset[RawData] = rawDataDF.as[RawData]
I need to use the rawDataParser function to parse my rawData. However, the parameter to the function is of type Seq, and I am not sure how to convert rawDataDS into an input for the function. Any guidance on solving this is appreciated.
Each Dataset is divided into partitions. You can use mapPartitions with a mapping Iterator[T] => Iterator[U] to convert a Dataset[T] into a Dataset[U].
So you can pass your parser function (named addressParser in the code below) as the argument to mapPartitions.
val rawAddressDataDS =
spark.read
.option("header", "true")
.csv(csvFilePath)
.as[AddressRawData]
val addressDataDS =
rawAddressDataDS
.map { rad =>
AddressData(
addressId = rad.addressId,
address = rad.address,
number = None,
road = None,
city = None,
country = None
)
}
.mapPartitions { unparsedAddresses =>
addressParser(unparsedAddresses.toSeq).toIterator
}
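The shape of the function passed to mapPartitions can be tried out on plain Scala iterators, without a Spark session. This is only a sketch: it assumes the RawData and SplitData case classes from the question and a version of rawDataParser that returns Seq[SplitData]. The name parsePartition is hypothetical, and mapping a non-numeric first field to None is an assumption about the desired behaviour.

```scala
import scala.util.Try

object MapPartitionsSketch {
  case class RawData(rawId: String, rawData: String)
  case class SplitData(
    rawId: String,
    rawData: String,
    split1: Option[Int],
    split2: Option[String],
    split3: Option[String],
    split4: Option[String]
  )

  // Seq-based parser, as in the question. Missing fields and a
  // non-numeric first field become None (an assumption, see above).
  def rawDataParser(unparsed: Seq[RawData]): Seq[SplitData] =
    unparsed.map { raw =>
      val parts = raw.rawData.split(", ")
      SplitData(
        rawId = raw.rawId,
        rawData = raw.rawData,
        split1 = parts.lift(0).flatMap(s => Try(s.toInt).toOption),
        split2 = parts.lift(1),
        split3 = parts.lift(2),
        split4 = parts.lift(3)
      )
    }

  // Adapter with the Iterator[T] => Iterator[U] shape that mapPartitions expects.
  def parsePartition(rows: Iterator[RawData]): Iterator[SplitData] =
    rawDataParser(rows.toSeq).iterator
}
```

With a SparkSession's implicits in scope, the adapter could then be passed as rawDataDS.mapPartitions(MapPartitionsSketch.parsePartition _). Note that toSeq may hold a whole partition in memory, depending on the Scala version.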

How to call a UDF with multiple arguments (currying) in Spark SQL?

How do I call the UDF below, which takes multiple arguments (currying), on a Spark DataFrame?
Read the file and get a List[String]
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
Call the UDF on the DataFrame below
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
Here I am missing the List[String] argument, and I am not sure how I should pass it.
Based on your question, I am making the following assumptions about your requirement:
a] The UDF should accept a parameter other than a DataFrame column
b] The UDF should take multiple columns as parameters
Let's say you want to concatenate the values from all columns along with the specified parameter. Here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String]((value1: Int, value2: String, value3: String) => value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
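The key idea in the answer above is that the extra List[String] is captured by a closure, so the function finally wrapped in udf only sees column values. The following plain-Scala sketch shows that closure on its own, without Spark; the names CurriedConcat and concatWith are hypothetical and chosen only for illustration.

```scala
object CurriedConcat {
  // First step: supply the extra, non-column argument.
  // Result: an ordinary three-argument function, which is the shape udf(...) wraps.
  def concatWith(strList: List[String]): (Int, String, String) => String =
    (value1, value2, value3) =>
      value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_")
}
```

Calling CurriedConcat.concatWith(List("dummy1", "dummy2"))(1, "r1c1", "r1c2") yields the same string as the first row of new_col above.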
Defining a UDF with multiple parameters:
val enrichUDF: UserDefinedFunction = udf((jsonData: String, id: Long) => {
val lastOccurence = jsonData.lastIndexOf('}')
val sid = ",\"site_refresh_stats_id\":" + id+ " }]"
val enrichedJson = jsonData.patch(lastOccurence, sid, sid.length)
enrichedJson
})
Applying the UDF to an existing DataFrame:
val enrichedDF = EXISTING_DF
.withColumn("enriched_column",
enrichUDF(col("jsonData")
, col("id")))
The following import is also required:
import org.apache.spark.sql.expressions.UserDefinedFunction
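To see what the patch call inside enrichUDF does, the same string logic can be run outside Spark. This is a minimal sketch that assumes the input is a JSON array ending in "}]"; the names PatchSketch and appendId are hypothetical.

```scala
object PatchSketch {
  // Replaces everything from the last '}' onwards with a suffix
  // that adds the id field and re-closes the object and the array.
  def appendId(jsonData: String, id: Long): String = {
    val lastBrace = jsonData.lastIndexOf('}')
    val suffix = ",\"site_refresh_stats_id\":" + id + " }]"
    // patch(from, other, replaced): drop up to `replaced` characters
    // starting at `from` and put `other` in their place.
    jsonData.patch(lastBrace, suffix, suffix.length)
  }
}
```

For the input [{"a":1}] and id 42 this produces [{"a":1,"site_refresh_stats_id":42 }]. Inputs without a closing brace are not handled by this sketch.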

How to use the functions.explode to flatten element in dataFrame

I've made this piece of code :
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session:SparkSession):DataFrame = {
val newPanda = RawPanda(1,"M1B 5K7", "giant", true, Array(0.1, 0.1))
val pandaPlace = PandaPlace("torronto", Array(newPanda))
val df =session.createDataFrame(Seq(pandaPlace))
df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo = df2.explode(df2("pandas")) {
case Row(pandas: Seq[Row]) =>
pandas.map{
case (Row(
id: Long,
zip: String,
pt: String,
happy: Boolean,
attrs: Seq[Double])) => RawPanda(id, zip, pt , happy, attrs.toArray)
}
}
pandaInfo.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode method as I used it is deprecated, so I would like to recalculate the pandaInfo DataFrame using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But when I do:
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed with flatMap or functions.explode.
How can I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo)?
I've seen this post and this other one but none of them helped me.
Calling select with the explode function returns a DataFrame in which the pandas array is "broken up" into individual records. Then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated path:
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)
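The deprecation warning also mentions flatMap. The shape of that alternative can be sketched on plain Scala collections, using the case classes from the question; PandaRow and flatten are hypothetical names, and this version drops the nested pandas column rather than keeping it alongside the flattened fields.

```scala
object FlattenSketch {
  case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
  case class PandaPlace(name: String, pandas: Array[RawPanda])

  // Hypothetical flattened row: the place name plus the fields of one panda.
  case class PandaRow(name: String, id: Long, zip: String, pt: String,
                      happy: Boolean, attributes: Seq[Double])

  // One output row per (place, panda) pair, which is what explode + select produces.
  def flatten(places: Seq[PandaPlace]): Seq[PandaRow] =
    places.flatMap { place =>
      place.pandas.toSeq.map { p =>
        PandaRow(place.name, p.id, p.zip, p.pt, p.happy, p.attributes.toSeq)
      }
    }
}
```

On a typed Dataset[PandaPlace], the same lambda body could in principle be passed to flatMap, assuming the session implicits provide an encoder for the row type.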

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
(1,"{NUM.0002}*{NUM.0003}"),
(2,"{NUM.0004}+{NUM.0003}"),
(3,"END(6)"),
(4,"END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string in that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
var newln = ln
while(newln contains "{NUM."){
var num = newln.slice(newln.indexOf("{")+5, newln.indexOf("}")).toInt
var new_value = df.where($"CODE" === num).head.getAs[String](1)
newln = newln.replace(newln.slice(newln.indexOf("{"),newln.indexOf("}")+1), new_value)
}
newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something that works, though I think it is not very optimized. It performs recursive joins on the initial DataFrame to replace the NUM placeholders with END values. Here is the code:
case class Data(code: Long, value: String)
def main(args: Array[String]): Unit = {
val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
val data = Seq(
Data(1,"{NUM.0002}*{NUM.0003}"),
Data(2,"{NUM.0004}+{NUM.0003}"),
Data(3,"END(6)"),
Data(4,"END(4)"),
Data(5,"{NUM.0002}")
)
val initialDF = sparkSession.createDataFrame(data)
val endDF = initialDF.filter(!(col("value") contains "{NUM"))
val numDF = initialDF.filter(col("value") contains "{NUM")
val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
resultDF.show(false)
}
val parseNumUdf = udf((value: String) => {
if (value.contains("{NUM")) {
val regex = """.*?\{NUM\.(\d+)\}.*""".r
value match {
case regex(code) => code.toLong
}
} else {
-1L
}
})
val replaceUdf = udf((value: String, replacement: String) => {
val regex = """\{NUM\.(\d+)\}""".r
regex.replaceFirstIn(value, replacement)
})
def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
if (currentDF.count() == 0) {
currentDF
} else {
val numDFWithCode = currentDF
.withColumn("num_code", parseNumUdf(col("value")))
.withColumnRenamed("code", "code_original")
.withColumnRenamed("value", "value_original")
val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
val endDF = nextDF.filter(!(col("value") contains "{NUM"))
val numDF = nextDF.filter(col("value") contains "{NUM")
endDF.union(replaceNumByEnd(initialDF, numDF))
}
}
If you need more explanation, don't hesitate to ask.
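As an alternative to repeated joins, the substitution logic itself can be written as a pure function over a lookup map. This is a different technique from the one in the answer and only a sketch: it assumes the CODE/VALUE table is small enough to collect into a Map[Int, String] on the driver, and the names PlaceholderResolver, resolve and maxSteps are hypothetical.

```scala
import scala.annotation.tailrec

object PlaceholderResolver {
  // Matches placeholders such as {NUM.0002} and captures the digits.
  private val Placeholder = """\{NUM\.(\d+)\}""".r

  // Replaces placeholders, leftmost first, until none remain.
  // maxSteps is a safety limit against cyclic references.
  def resolve(value: String, lookup: Map[Int, String], maxSteps: Int = 1000): String = {
    @tailrec
    def loop(current: String, steps: Int): String =
      Placeholder.findFirstMatchIn(current) match {
        case Some(m) =>
          require(steps < maxSteps, s"too many substitutions, possible cycle in: $value")
          // Throws NoSuchElementException if the referenced code is unknown.
          val replacement = lookup(m.group(1).toInt)
          loop(current.substring(0, m.start) + replacement + current.substring(m.end), steps + 1)
        case None => current
      }
    loop(value, 0)
  }
}
```

Because the function only touches a plain Map, it avoids referencing a DataFrame from inside a UDF, which the question suspects is the source of the NullPointerException. With the sample data, resolving the value of code 1 gives END(4)+END(6)*END(6), matching the expected output in the question.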

Better way to convert a string field into timestamp in Spark

I have a CSV in which a field is datetime in a specific format. I cannot import it directly in my Dataframe because it needs to be a timestamp. So I import it as string and convert it into a Timestamp like this
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x:Any) : Timestamp = {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
if (x.toString() == "")
return null
else {
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime());
return t
}
}
def convert(row : Row) : Row = {
val d1 = getTimestamp(row(3))
return Row(row(0),row(1),row(2),d1)
}
Is there a better, more concise way to do this, with the Dataframe API or spark-sql? The above method requires the creation of an RDD and to give the schema for the Dataframe again.
Spark >= 2.2
Since Spark 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions that were introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse the strings and cast the result to a timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
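You should be able to use these with expr and HiveContext.
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (using null is not considered good practice):
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.{Failure, Success, Try}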
def getTimestamp(s: String) : Option[Timestamp] = s match {
case "" => None
case _ => {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
Please note that I assume you know the Row element types beforehand (if you read from a CSV file, all of them are String), which is why I use a proper type like String rather than Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row.getString(3)).orNull))
I had ISO 8601 timestamps in my dataset and needed to convert them to "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}
object DateUtils extends Serializable {
def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}
sqlContext.udf.register("formatTimeStamp", (isoTimestamp : String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
You can then use the UDF in your Spark SQL query.
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow across the rows of an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
new Iterator[Row] {
val row = new GenericMutableRow(4)
var current: Array[String] = _
def hasNext = iter.hasNext
def next() = {
current = iter.next()
row(0) = current(0)
row(1) = current(1)
row(2) = current(2)
val ts = getTimestamp(current(3))
if(ts != null) {
row.update(3, ts)
} else {
row.setNullAt(3)
}
row
}
}
}
You should still use a schema to generate the DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
Examples of using GenericMutableRow inside an iterator implementation can be found in Aggregate Operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._
val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df : DataFrame = new CsvParser().withDelimiter('|')
.withInferSchema(true)
.withParseMode("DROPMALFORMED")
.csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp, and then casting back as a string. I hope this helps for anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
newDf.withColumn(col, conversionFunc)
})