Better way to convert a string field into timestamp in Spark - scala

I have a CSV in which one field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp, so I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row

def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString() == "")
    return null
  else {
    val d = format.parse(x.toString())
    val t = new Timestamp(d.getTime())
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this with the DataFrame API or Spark SQL? The above method requires creating an RDD and specifying the schema for the DataFrame again.

Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
you can use unix_timestamp to parse the strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see, it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and HiveContext.
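For completeness, a minimal sketch of how that could look (an illustration, not tested against old releases), assuming a HiveContext is in scope and reusing the df defined above; unix_timestamp here is Hive's UDF reached through selectExpr, and the double cast mirrors the SPARK-11724 workaround:
// Sketch only: Hive's unix_timestamp UDF via selectExpr on a HiveContext-backed DataFrame.
val withTs = df.selectExpr(
  "id",
  "dts",
  "cast(cast(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') as double) as timestamp) as ts"
)
withTs.show()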

I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (using null is not considered good practice):
import scala.util.{Try, Success, Failure}

def getTimestamp(s: String): Option[Timestamp] = s match {
  case "" => None
  case _ =>
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
}
Please notice I assume you know the Row element types beforehand (if you read it from a CSV file, they are all String), which is why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row.getString(3))))

I have ISO 8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}

object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}

sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) =>
  DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
Then you can just use the UDF in your Spark SQL query.
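For illustration only (the table name events and the column eventTimestamp are placeholders, not names from the original data), calling the registered UDF could look like this:
// Placeholder table/column names; the registered UDF is called like any other SQL function.
df.registerTempTable("events")
val dates = sqlContext.sql(
  "SELECT formatTimeStamp(eventTimestamp) AS event_date FROM events")
dates.show()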

Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)

I would like to move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow across the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/csv-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
  new Iterator[Row] {
    val row = new GenericMutableRow(4)
    var current: Array[String] = _

    def hasNext = iter.hasNext

    def next() = {
      current = iter.next()
      row(0) = current(0)
      row(1) = current(1)
      row(2) = current(2)
      val ts = getTimestamp(current(3))
      if (ts != null) {
        row.update(3, ts)
      } else {
        row.setNullAt(3)
      }
      row
    }
  }
}
And you should still use a schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
The usage of GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
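Note that tableSchema is not defined in this answer; a minimal sketch of what it might look like for the four columns here, with assumed column names, would be:
import org.apache.spark.sql.types._

// Assumed column names; only the fourth field is the parsed timestamp.
val tableSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true),
  StructField("ts", TimestampType, nullable = true)
))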

I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._

val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df: DataFrame = new CsvParser()
  .withDelimiter('|')
  .withInferSchema(true)
  .withParseMode("DROPMALFORMED")
  .csvRdd(sqlContext, rdd)
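For newer Spark versions (2.0+), the built-in CSV reader offers the same kind of inference; a rough sketch, assuming a SparkSession named spark and a pipe-delimited file with a header:
// Built-in CSV source; timestampFormat tells schema inference how to recognize the timestamp column.
val df = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
  .option("mode", "DROPMALFORMED")
  .csv("csvfile.csv")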

I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp and then casting back as a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df) { (newDf, col) =>
  val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
  newDf.withColumn(col, conversionFunc)
}
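Here cols is assumed to be the list of string column names that should be converted; for example (the names below are hypothetical):
// Assumed: the string columns that should become timestamps.
val cols = Seq("created_at", "updated_at")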

Related

Store Schema of Read File Into csv file in spark scala

I am reading a CSV file into a DataFrame with the inferSchema option enabled, using the command below.
df2 = spark.read.options(Map("inferSchema"->"true","header"->"true")).csv("s3://Bucket-Name/Fun/Map/file.csv")
df2.printSchema()
Output:
root
|-- CC|Fun|Head|Country|SendType: string (nullable = true)
Now I would like to store only the above output into a CSV file, containing just these column names and their data types, like below:
column_name,datatype
CC,string
Fun,string
Head,string
Country,string
SendType,string
I tried writing this to a CSV using the option below, but it writes the file with the entire data.
df2.coalesce(1).write.format("csv").mode("append").save("schema.csv")
Use df.schema.fields to get the fields and their data types.
Check the code below.
scala> val schema = df.schema.fields.map(field => (field.name,field.dataType.typeName)).toList.toDF("column_name","datatype")
schema: org.apache.spark.sql.DataFrame = [column_name: string, datatype: string]
scala> schema.show(false)
+---------------+--------+
|column_name |datatype|
+---------------+--------+
|applicationName|string |
|id |string |
|requestId |string |
|version |long |
+---------------+--------+
scala> schema.write.format("csv").save("/tmp/schema")
Try something like the code below; use coalesce(1) and .option("header", "true") to output with a header.
import java.io.FileWriter

object SparkSchema {

  def main(args: Array[String]): Unit = {

    val fw = new FileWriter("src/main/resources/csv.schema", true)
    fw.write("column_name,datatype\n")

    val spark = Constant.getSparkSess
    import spark.implicits._

    val df = List(("", "", "", 1L)).toDF("applicationName", "id", "requestId", "version")

    val columnList: List[(String, String)] = df.schema.fields
      .map(field => (field.name, field.dataType.typeName))
      .toList

    try {
      val outString = columnList.map(col => col._1 + "," + col._2).mkString("\n")
      fw.write(outString)
    } finally fw.close()

    val newColumnList: List[(String, String)] = List(("newColumn", "integer"))
    val finalColList = columnList ++ newColumnList
    writeToS3("s3://bucket/newFileName.csv", finalColList)
  }

  def writeToS3(s3FileNameWithpath: String, finalColList: List[(String, String)]): Unit = {
    val outString = finalColList.map(col => col._1 + "," + col._2).mkString("\n")

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._

    val conf = new Configuration()
    conf.set("fs.s3a.access.key", "YOUR ACCESS KEY")
    conf.set("fs.s3a.secret.key", "YOUR SECRET KEY")

    val dest = new Path(s3FileNameWithpath)
    val fs = dest.getFileSystem(conf)
    val out = fs.create(dest, true)
    out.write(outString.getBytes)
    out.close()
  }
}
An alternative to @QuickSilver's and @Srinivas' solutions, both of which should work, is to use the DDL representation of the schema. With df.schema.toDDL you get:
CC STRING, fun STRING, Head STRING, Country STRING, SendType STRING
which is the string representation of the schema. Then you can split and replace as shown next:
import java.io.PrintWriter

val schema = df.schema.toDDL.split(",")
// Array[String] = Array(`CC` STRING, `fun` STRING, `Head` STRING, `Country` STRING, `SendType` STRING)

val writer = new PrintWriter("/tmp/schema.csv")
writer.write("column_name,datatype\n")
schema.foreach { r => writer.write(r.replace(" ", ",") + "\n") }
writer.close()
To write to S3 you can use the Hadoop API, as QuickSilver already implemented, or a third-party library such as MinIO:
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
minioClient.putObject("YOUR_BUCKET","schema.csv", "/tmp/schema.csv", null)
Or, even better, by generating a string, storing it in a buffer and then sending it via an InputStream to S3:
import java.io.ByteArrayInputStream
import io.minio.{MinioClient, PutObjectOptions}

val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")

val schema = df.schema.toDDL.split(",")
val schemaBuffer = new StringBuilder
schemaBuffer ++= "column_name,datatype\n"
schema.foreach { r => schemaBuffer ++= r.replace(" ", ",") + "\n" }

val inputStream = new ByteArrayInputStream(schemaBuffer.toString.getBytes("UTF-8"))
minioClient.putObject("YOUR_BUCKET", "schema.csv", inputStream, new PutObjectOptions(inputStream.available(), -1))
inputStream.close()
#PySpark
df_schema = spark.createDataFrame([(i.name, str(i.dataType)) for i in df.schema.fields], ['column_name', 'datatype'])
df_schema.show()
This creates a new DataFrame containing the schema (column names and data types) of the existing DataFrame.
Use case:
Useful when you want to create a table with the schema of the DataFrame and you cannot use the code below, because the PySpark user may not be authorized to execute DDL commands on the database.
df.createOrReplaceTempView("tmp_output_table")
spark.sql("""drop table if exists schema.output_table""")
spark.sql("""create table schema.output_table as select * from tmp_output_table""")
In PySpark you can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes. Follow this link for more details: pyspark.sql.DataFrame.dtypes
Having said that, try using the code below:
data = df.dtypes
cols = ["col_name", "datatype"]
df = spark.createDataFrame(data=data,schema=cols)
df.show()

Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe

The problem is: I have a dataset where a column has two or more types of date format.
In general I select all values as String type and then use to_date to parse the date.
But I don't know how to parse a column having two or more types of date formats.
val DF= Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
Output from the above command:
null
null
null
0019-04-13
0019-04-14
0019-04-15
0019-04-16
The code I have written above is not working as intended: I provided only the format MM/dd/yyyy, and for the formats that were not provided I am getting null as the output.
So I am seeking help to parse the file with the different date formats.
If possible, kindly also share some tutorial or notes on dealing with date formats.
Please note: I am using Scala with the Spark framework.
Thanks in advance.
Check the EDIT section later in this answer to use Column functions instead of a UDF, for performance benefits.
Well, let's do it the try-catch way: try a column conversion against each format and keep the successful value.
You may have to provide all possible formats from outside as a parameter, or keep a master list of all possible formats somewhere in the code itself.
Here is a possible solution. (Instead of SimpleDateFormat, which sometimes has issues with timestamps beyond millisecond precision, I use the newer java.time.format.DateTimeFormatter.)
Create a toTimestamp function, which accepts the string to convert to a timestamp and all possible formats:
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import scala.util.Try

def toTimestamp(date: String, tsformats: Seq[String]): Option[java.sql.Timestamp] = {
  val out = (for (tsft <- tsformats) yield {
    val formatter = new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern(tsft)
      .toFormatter()
    if (Try(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter))).isSuccess)
      Option(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter)))
    else None
  }).filter(_.isDefined)
  if (out.isEmpty) None else out.head
}
Create a UDF on top of it (this UDF takes a Seq of format strings as a parameter):
def UtoTimestamp(tsformats: Seq[String]) = org.apache.spark.sql.functions.udf((date: String) => toTimestamp(date, tsformats))
And now, simply use it in your Spark code. Here's the test with your data:
val DF = Seq(("02-04-2020 08:02"), ("03-04-2020 10:02"), ("04-04-2020 09:00"), ("04/13/19 9:12"), ("04/14/19 2:13"), ("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
val tsformats = Seq("MM-dd-yyyy HH:mm", "MM/dd/yy H:mm")
DF.select(UtoTimestamp(tsformats)('DOB)).show
And here is the output -
+-------------------+
| UDF(DOB)|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
The cherry on top would be to avoid having to write UtoTimestamp(colname) for many columns in your dataframe.
Let's write a function which accepts a DataFrame, a list of all timestamp columns, and all possible formats in which your source data may have encoded timestamps.
It'd parse all the timestamp columns for you, trying each format:
def WithTimestampParsed(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
  val colSelector = df.columns.map { c =>
    if (tsCols.contains(c)) UtoTimestamp(tsformats)(col(c)).alias(c)
    else col(c)
  }
  df.select(colSelector: _*)
}
Use it like this -
// You can pass as many column names in a sequence to be parsed
WithTimestampParsed(DF, Seq("DOB"), tsformats).show
Output -
+-------------------+
| DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
EDIT -
I looked at the latest Spark code, and they are also using java.time._ utilities now to parse dates and timestamps, which enables handling precision beyond milliseconds. Earlier these functions were based on SimpleDateFormat (I wasn't relying on Spark's to_timestamp earlier due to this limitation).
So with the to_date and to_timestamp functions being reliable now, let's use them instead of having to write a UDF. Let's write a function which operates on Columns:
import org.apache.spark.sql.functions.{coalesce, to_timestamp}

def to_timestamp_simple(col: org.apache.spark.sql.Column, formats: Seq[String]): org.apache.spark.sql.Column = {
  coalesce(formats.map(fmt => to_timestamp(col, fmt)): _*)
}
and with this, WithTimestampParsed would look like:
def WithTimestampParsedSimple(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
  val colSelector = df.columns.map { c =>
    if (tsCols.contains(c)) to_timestamp_simple(col(c), tsformats).alias(c)
    else col(c)
  }
  df.select(colSelector: _*)
}
And use it like -
DF.select(to_timestamp_simple('DOB,tsformats)).show
//OR
WithTimestampParsedSimple(DF, Seq("DOB"), tsformats).show
Output looks like -
+---------------------------------------------------------------------------------------+
|coalesce(to_timestamp(`DOB`, 'MM-dd-yyyy HH:mm'), to_timestamp(`DOB`, 'MM/dd/yy H:mm'))|
+---------------------------------------------------------------------------------------+
| 2020-02-04 08:02:00|
| 2020-03-04 10:02:00|
| 2020-04-04 09:00:00|
| 2019-04-13 09:12:00|
| 2019-04-14 02:13:00|
| 2019-04-15 10:14:00|
| 2019-04-16 05:15:00|
+---------------------------------------------------------------------------------------+
+-------------------+
| DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
I put together some code that may help you in some way.
I tried this:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.sql.Date
import java.util.GregorianCalendar

object DateFormats {

  val spark = SparkSession
    .builder()
    .appName("Multiline")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "Multiline")         // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      import spark.implicits._
      val DF = Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"),("04/16/19 5:15")).toDF("DOB")
      import org.apache.spark.sql.functions.{to_date, to_timestamp}
      val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
      DOBDF.show()

      // todo: my code below
      DF
        .rdd
        .map(r => {
          if (r.toString.contains("-")) {
            val dat = r.toString.substring(1, 11).split("-")
            val calendar = new GregorianCalendar(dat(2).toInt, dat(1).toInt - 1, dat(0).toInt)
            (r.toString, new Date(calendar.getTimeInMillis))
          } else {
            val dat = r.toString.substring(1, 9).split("/")
            val calendar = new GregorianCalendar(dat(2).toInt + 2000, dat(0).toInt - 1, dat(1).toInt)
            (r.toString, new Date(calendar.getTimeInMillis))
          }
        })
        .toDF("DOB", "DATE")
        .show()

      // To have the opportunity to view the web console of Spark: http://localhost:4040/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped.")
      spark.stop()
      println("SparkSession stopped.")
    }
  }
}
+------------------+----------+
| DOB| DATE|
+------------------+----------+
|[02-04-2020 08:02]|2020-04-02|
|[03-04-2020 10:02]|2020-04-03|
|[04-04-2020 09:00]|2020-04-04|
| [04/13/19 9:12]|2019-04-13|
| [04/14/19 2:13]|2019-04-14|
| [04/15/19 10:14]|2019-04-15|
| [04/16/19 5:15]|2019-04-16|
+------------------+----------+
We can use the coalesce function as mentioned in the accepted answer. On each format mismatch, to_date returns null, which makes coalesce move to the next format in the list.
But if, with to_date, you have issues parsing the correct year component of a date in yy format (in the date 7-Apr-50, whether you want 50 to be parsed as 1950 or 2050), refer to this Stack Overflow post.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, to_date}

// Reference: https://spark.apache.org/docs/3.0.0/sql-ref-datetime-pattern.html
val parsedDateCol: Column = coalesce(
  // Four letters of M look for the full name of the month
  to_date(col("original_date"), "MMMM, yyyy"),
  to_date(col("original_date"), "dd-MMM-yy"),
  to_date(col("original_date"), "yyyy-MM-dd"),
  to_date(col("original_date"), "d-MMM-yy")
)

// I have used a dummy dataframe name.
dataframeWithDateCol
  .select(parsedDateCol.as("parsed_date"))
  .show()

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.", process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works but is not very optimized, I think. I actually do recursive joins on the initial DataFrame to replace the NUMs by ENDs. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()

  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")
    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")
    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.

Spark scala udf error for if else

I am trying to define a Spark Scala UDF with the function getTime below, but I am getting the error "error: illegal start of declaration". What might be wrong in the syntax? I want to return the date, and also, if there is a parse exception, to send some string as an error instead of returning null.
def getTime=udf((x:String) : java.sql.Timestamp => {
if (x.toString() == "") return null
else { val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime()); return t
}})
Thank you!
The return type for the udf is derived and should not be specified. Change the first line of code to:
def getTime = udf((x: String) => {
  // your code
})
This should get rid of the error.
The following is fully working code, written in a functional style and making use of Scala constructs:
val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]
import java.text.SimpleDateFormat
import java.sql.Timestamp
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
// ********HERE is the udf completely re-written: **********
val f = udf((input: String) => {
Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})
val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
The following is output:
+-------------------+--------------------+
| value| parsedTimestamp|
+-------------------+--------------------+
| | null|
| null| null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
//your code here
}
You can easily do it in this way:
def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
  if (timeAsString.isEmpty)
    null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    val date = format.parse(timeAsString)
    val time = new Timestamp(date.getTime())
    time
  }
}

val getTimeUdf = udf(getTimeFunction _)
Then use this getTimeUdf accordingly.
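For illustration only (the DataFrame df and the column name eventTime below are placeholders, not part of the question), applying the UDF could look like this:
import org.apache.spark.sql.functions.col

// "eventTime" is a hypothetical string column holding "yyyy-MM-dd HH:mm:ss" values.
val withTimestamps = df.withColumn("eventTimestamp", getTimeUdf(col("eventTime")))
withTimestamps.printSchema()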

spark map partitions to fill nan values

I want to fill nan values in Spark using the last known good observation - see: Spark / Scala: fill nan with last good observation
My current solution uses window functions to accomplish the task (a simplified sketch is below). But this is not great, as all the values are mapped into a single partition.
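Roughly, that window-based version looks like the following (a simplified illustration, not my exact code; rowId stands in for whatever ordering column is used):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// No partitionBy, so Spark moves the whole dataset into a single partition.
val w = Window.orderBy("rowId").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filled = recordsDF.withColumn("foo", last("foo", ignoreNulls = true).over(w))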
Something like val imputed: RDD[FooBar] = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) => fill(i, iter) } should work a lot better. But strangely, my fill function is not executed. What is wrong with my code?
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| null| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
Here is the full example code:
import java.sql.Date
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class FooBar(foo: Date, bar: String)

object WindowFunctionExample extends App {

  Logger.getLogger("org").setLevel(Level.WARN)

  val conf: SparkConf = new SparkConf()
    .setAppName("foo")
    .setMaster("local[*]")

  val spark: SparkSession = SparkSession
    .builder()
    .config(conf)
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  val myDff = Seq(("2016-01-01", "first"), ("2016-01-02", "second"),
    ("2016-wrongFormat", "noValidFormat"),
    ("2016-01-04", "lastAssumingSameDate"))

  val recordsDF = myDff
    .toDF("foo", "bar")
    .withColumn("foo", 'foo.cast("Date"))
    .as[FooBar]
  recordsDF.show

  def notMissing(row: FooBar): Boolean = {
    row.foo != null
  }

  val toCarry = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) =>
    Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption))
  }.collectAsMap

  println("###################### carry ")
  println(toCarry)
  println(toCarry.foreach(println))
  println("###################### carry ")

  val toCarryBd = spark.sparkContext.broadcast(toCarry)

  def fill(i: Int, iter: Iterator[FooBar]): Iterator[FooBar] = {
    var lastNotNullRow: FooBar = toCarryBd.value(i).get
    iter.map(row => {
      if (!notMissing(row))1
        FooBar(lastNotNullRow.foo, row.bar)
      else {
        lastNotNullRow = row
        row
      }
    })
  }

  // The algorithm does not step into the for loop for filling the null values. Strange
  val imputed: RDD[FooBar] = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) => fill(i, iter) }

  val imputedDF = imputed.toDS()

  println(imputedDF.orderBy($"foo").collect.toList)
  imputedDF.show
  spark.stop
}
edit
I fixed the code as outlined in the comment. But toCarryBd contains None values. How can this happen, given that I explicitly filtered for
def notMissing(row: FooBar): Boolean = {row.foo != null}
iter.filter(notMissing(_)).toSeq.lastOption
non-None values?
(2,None)
(5,None)
(4,None)
(7,Some(FooBar(2016-01-04,lastAssumingSameDate)))
(1,Some(FooBar(2016-01-01,first)))
(3,Some(FooBar(2016-01-02,second)))
(6,None)
(0,None)
This leads to NoSuchElementException: None.get when trying to access toCarryBd.
Firstly, if your foo field can be null, I would recommend creating the case class as:
case class FooBar(foo: Option[Date], bar: String)
Then, you can rewrite your notMissing function to something like:
def notMissing(row: Option[FooBar]): Boolean = row.isDefined && row.get.foo.isDefined