Spark UDF thread safety - Scala

I'm using Spark to take a dataframe containing a column of dates, and create 3 new columns containing the time in days, weeks, and months between the date in the column and today.
My concern is around the use of SimpleDateFormat, which isn't thread safe. Ordinarily without Spark this would be fine since it's a local variable, but with Spark's lazy evaluation, is sharing a single SimpleDateFormat instance over multiple UDFs likely to cause an issue?
def calcTimeDifference(...) {
  val sdf = new SimpleDateFormat(dateFormat)

  val dayDifference = udf { (x: String) =>
    math.abs(Days.daysBetween(new DateTime(sdf.parse(x)), presentDate).getDays)
  }
  output = output.withColumn("days", dayDifference(myCol))

  val weekDifference = udf { (x: String) =>
    math.abs(Weeks.weeksBetween(new DateTime(sdf.parse(x)), presentDate).getWeeks)
  }
  output = output.withColumn("weeks", weekDifference(myCol))

  val monthDifference = udf { (x: String) =>
    math.abs(Months.monthsBetween(new DateTime(sdf.parse(x)), presentDate).getMonths)
  }
  output = output.withColumn("months", monthDifference(myCol))
}

I don't think it's safe: as we know, SimpleDateFormat is not thread-safe. So when you need a SimpleDateFormat in Spark, I prefer this approach:
import java.text.SimpleDateFormat
import java.util.TimeZone

/**
 * Thread-safe SimpleDateFormat for Spark.
 */
object ThreadSafeFormat extends ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd:H")
    // If you need UTC times, set the UTC time zone explicitly.
    // (Note: new SimpleTimeZone(SimpleTimeZone.UTC_TIME, "UTC") is subtly
    // wrong here, because that constructor's first argument is a raw offset
    // in milliseconds, not a mode constant.)
    dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"))
    dateFormat
  }
}
Then call ThreadSafeFormat.get() wherever you need a thread-safe SimpleDateFormat: each thread gets its own instance.
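For example, applied to the day-difference UDF from the question above (a sketch; presentDate, myCol, and output are assumed from the question's surrounding code):
import org.apache.spark.sql.functions.udf
import org.joda.time.{DateTime, Days}

// Each executor thread lazily builds and reuses its own SimpleDateFormat,
// so concurrent tasks never share a formatter instance.
val dayDifference = udf { (x: String) =>
  math.abs(Days.daysBetween(new DateTime(ThreadSafeFormat.get().parse(x)), presentDate).getDays)
}
output = output.withColumn("days", dayDifference(myCol))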

Related

Spark scala UDF in DataFrames is not working

I have defined a function to convert epoch time to CET and am using it, wrapped as a UDF, on a Spark DataFrame. It throws an error and does not let me use it. Please find my code below.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit

def convertNanoEpochToDateTime(
    d: Long,
    f: String = "dd/MM/yyyy HH:mm:ss.SSS",
    z: String = "CET",
    msPrecision: Int = 9
): String = {
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
The Spark DataFrame below uses the UDF defined above to convert epoch time to CET:
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I get an error when I run this code. Any idea how I am supposed to proceed in this scenario?
The Spark optimizer is telling you that your function is not a Function1, that is, not a function that accepts a single parameter: yours has four input parameters. And although you may think that Scala lets you call that function with only one argument because the other three have default values, Catalyst does not work this way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
    f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
  convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
Also, try to define the SimpleDateFormat as a static, transient value outside the function.
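One way to read that last suggestion, as a sketch (the Formats holder and its field name are illustrative, not from the original answer):
import java.text.SimpleDateFormat
import java.util.TimeZone

object Formats extends Serializable {
  // @transient keeps the formatter out of any serialized closure, and lazy
  // makes each executor JVM build its own on first use. SimpleDateFormat is
  // still not thread-safe, so combine this with the ThreadLocal pattern
  // shown earlier if tasks run concurrently within one executor.
  @transient lazy val sdf: SimpleDateFormat = {
    val f = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss.SSS")
    f.setTimeZone(TimeZone.getTimeZone("CET"))
    f
  }
}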
I found what the error was due to and resolved it. The problem is that when I wrap the Scala function as a UDF, it expects four parameters, but I was passing only one. I removed three parameters from the function and moved those values inside it, since they are constants. Now I call the function from the Spark DataFrame with only one parameter, and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit

def convertNanoEpochToDateTime(d: Long): String = {
  val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
  val z: String = "CET"
  val msPrecision: Int = 9
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}

val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))

Subtract Months from YYYYMM date in Scala

I am trying to subtract months from a date in YYYYMM format.
import java.text.SimpleDateFormat
val date = 202012
val dt_format = new SimpleDateFormat("YYYYMM")
val formattedDate = dt_format.format(date)
new DateTime(formattedDate).minusMonths(3).toDate();
Expected output:
202012 - 3 months = 202009,
202012 - 14 months = 201910
But it did not work as expected. Please help!
Among standard date/time types YearMonth seems to be the most appropriate for the given use case.
import java.time.format.DateTimeFormatter
import java.time.YearMonth
val format = DateTimeFormatter.ofPattern("yyyyMM")
YearMonth.parse("197001", format).minusMonths(13) // 1968-12
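If you need the result back as a yyyyMM string rather than a YearMonth, the same formatter can format the result, matching the question's expected output:
YearMonth.parse("202012", format).minusMonths(3).format(format)   // "202009"
YearMonth.parse("202012", format).minusMonths(14).format(format)  // "201910"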
This solution uses the functionality in java.time, available since Java 8. I would have preferred a solution that did not require adjusting the input so that it could be (forcefully) parsed into a LocalDate (so that plusMonths could be used), but at least it works. A simple regex could probably get the job done. ;-)
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val inFmt = DateTimeFormatter.ofPattern("yyyyMMdd")
val outFmt = DateTimeFormatter.ofPattern("yyyyMM")
def plusMonths(string: String, months: Int): String =
  LocalDate.parse(s"${string}01", inFmt).plusMonths(months).format(outFmt)
assert(plusMonths("202012", -3) == "202009")
assert(plusMonths("202012", -14) == "201910")
You can play around with this code here on Scastie.

How to take data from several parquet files at once?

I need your help because I am new to the Spark framework.
I have a folder with a lot of parquet files. The names of these files all follow the same format, DD-MM-YYYY: for example '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try the following code, the application hangs. It seems the application scans all the files in the folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
  .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to get the data for a period as fast as possible.
I think it would be good to divide the period into days, then read the files separately and union them, like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018")
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018")
// note: `final` is a reserved word in Scala, so the result needs another name
val combined = mf1.union(mf2).distinct()
dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!
@y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")

def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)

val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems the intermediateDf DataFrame is empty at the start. How do I fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}

val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption; you'll then get an Option[DataFrame], which is None if the input date range is empty. Also, if some input directories/files may be missing, you'd want to check for their existence before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
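As a side note, DataFrameReader.parquet accepts multiple paths as varargs, so once you have the list of existing paths you can skip the union entirely. A sketch reusing fs and formatter from the snippet above:
val paths: Seq[String] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .toSeq
// Spark reads all the listed directories as a single DataFrame
val data: DataFrame = spark.read.parquet(paths: _*)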
While I haven't tested this piece of code, it should work (perhaps with slight modification):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

// return the number of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

// return the sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data for the given date range from the given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get the date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)

  // read the data for the from-date (needed because the schemas of all
  // DataFrames must match before they can be unioned)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))

  // read and union the remaining DataFrames (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }

  unionDf
}
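A hypothetical call site for the sketch above, assuming an implicit SparkSession and the folder from the question:
implicit val spark: SparkSession = SparkSession.builder().getOrCreate()

val df: DataFrame = readDataForDateRange(
  "/PATH_TO_THE_FOLDER",
  DateTime.parse("2018-10-01"),
  DateTime.parse("2018-10-05")
)
df.show()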
Reference: How to calculate 'n' days interval date in functional style?

java.text.ParseException: Unparseable date: "Some(2014-05-14T14:40:25.950)"

I need to fetch the date from a file.
Below is my Spark program:
import org.apache.spark.sql.SparkSession
import scala.xml.XML
import java.text.SimpleDateFormat

object Active6Month {
  def main(args: Array[String]) {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss.SSS")
    val format1 = new SimpleDateFormat("yyyy-MM")
    val spark = SparkSession.builder.appName("Active6Months").master("local").getOrCreate()
    val data = spark.read.textFile("D:\\BGH\\StackOverFlow\\Posts.xml").rdd

    val date = data.filter { line =>
      line.toString().trim().startsWith("<row")
    }.filter { line =>
      line.contains("PostTypeId=\"1\"")
    }.map { line =>
      val xml = XML.loadString(line)
      val closedDate = format1.format(format.parse(xml.attribute("ClosedDate").toString())).toString()
      (closedDate, 1)
    }.reduceByKey(_ + _)

    date.foreach(println)
    spark.stop
  }
}
And I am getting this error:
java.text.ParseException: Unparseable date: "Some(2014-05-14T14:40:25.950)"
The format of the date in the file is correct, i.e.:
CreationDate="2014-05-13T23:58:30.457"
But the error shows the string "Some" attached to it.
And my other question is: why does the same thing work in the code below?
val date = data.filter { line =>
  line.toString().trim().startsWith("<row")
}.filter { line =>
  line.contains("PostTypeId=\"1\"")
}.flatMap { line =>
  val xml = XML.loadString(line)
  xml.attribute("ClosedDate")
}.map { line =>
  (format1.format(format.parse(line.toString())).toString(), 1)
}.reduceByKey(_ + _)
My guess is that xml.attribute("ClosedDate").toString() is actually returning a string with "Some" attached to it. Have you debugged it to make sure?
Maybe you shouldn't use toString(); instead, get the attribute value with the proper method.
Or you can do it the "ugly" way and include "Some" in the pattern:
val format = new SimpleDateFormat("'Some('yyyy-MM-dd'T'hh:mm:ss.SSS')'")
Your second approach works because (and this is a guess, since I don't code in Scala) the xml.attribute("ClosedDate") method probably returns an Option-like object, and calling toString() on that object returns the string with "Some" attached to it (why? ask the API authors). But when you flatMap over this object, the line variable is bound to the correct value (without the "Some" part).
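For completeness, here is a sketch of the first suggestion: extract the attribute text instead of stringifying the Option, using the \@ lookup from scala.xml (which returns an empty string when the attribute is absent). format and format1 are the formatters from the question's code:
// xml.attribute(...) returns an Option, so calling toString on it yields
// "Some(...)". Pull out the raw attribute text instead:
val closedDate: String = xml \@ "ClosedDate"
if (closedDate.nonEmpty)
  println(format1.format(format.parse(closedDate)))  // e.g. "2014-05"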

Joda Time: Convert UTC to local

I want to convert a Joda Time UTC DateTime object to local time.
Here's a laborious way to do it which seems to work. But there must be a better way.
Here's the code (in Scala) without surrounding declarations:
val dtUTC = new DateTime("2010-10-28T04:00")
println("dtUTC = " + dtUTC)
val dtLocal = timestampLocal(dtUTC)
println("local = " + dtLocal)
def timestampLocal(dtUTC: DateTime): String = {
  // This is a laborious way to convert from UTC to local. There must be a better way.
  val instantUTC = dtUTC.getMillis
  val localDateTimeZone = DateTimeZone.getDefault
  val instantLocal = localDateTimeZone.convertUTCToLocal(instantUTC)
  val dtLocal = new DateTime(instantLocal)
  dtLocal.toString
}
Here's the output:
dtUTC = 2010-10-28T04:00:00.000+11:00
local = 2010-10-28T15:00:00.000+11:00
Here's what I use on a current project.
val marketCentreTime = timeInAnotherTimezone.withZone(DateTimeZone.forID("Australia/Melbourne"))
Does that help?
EDIT:
Here's something that takes a time in the current TZ and converts to Brisbane time. You can use the same principle.
Welcome to Scala version 2.8.0.final (Java HotSpot(TM) Client VM, Java 1.6.0_21).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.joda.time._
import org.joda.time._
scala> def timestampBrisbane(date: DateTime): String = {
| date.withZone(DateTimeZone.forID("Australia/Brisbane")).toString
| }
timestampBrisbane: (date: org.joda.time.DateTime)String
scala> val date = new DateTime
date: org.joda.time.DateTime = 2010-10-28T16:22:03.481+11:00
scala> val dateBrisbane = timestampBrisbane(date)
dateBrisbane: String = 2010-10-28T15:22:03.481+10:00
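Applying the same principle to the original question, timestampLocal collapses to a one-liner (a sketch using the JVM's default time zone):
import org.joda.time.{DateTime, DateTimeZone}

// withZone keeps the same instant and only changes the zone used for display
def timestampLocal(dtUTC: DateTime): String =
  dtUTC.withZone(DateTimeZone.getDefault).toString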