Date diff of ISO 8601 timestamp strings in Spark - Scala

I have two datetime strings in ISO 8601 format:
2017-05-30T09:15:06.050298Z
2017-05-30T09:15:06.054939Z
I want the time difference between the above two strings using Scala in a Spark environment.

Since you said in the comments that you're using Joda-Time, here's an answer using it.
You mentioned that you're calling daysBetween, but both dates fall on the same day, so the result will always be zero. To get the difference between the dates with millisecond precision, just subtract the millis values of the two DateTime objects:
import org.joda.time.DateTime
val s1 = "2017-05-30T09:15:06.050298Z"
val s2 = "2017-05-30T09:15:06.054939Z"
val diffInMillis = DateTime.parse(s2).getMillis() - DateTime.parse(s1).getMillis()
diffInMillis will be 4: the first date's fraction-of-second is 050298 and the second's is 054939, but Joda's DateTime has millisecond precision, so the last 3 digits are discarded. You can check that by doing:
println(DateTime.parse(s1))
println(DateTime.parse(s2))
This will output:
2017-05-30T09:15:06.050Z
2017-05-30T09:15:06.054Z
As you can see, the difference between the dates is 4 milliseconds.
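If you prefer not to subtract the millis values by hand, the same result can be obtained with Joda-Time's Duration class; a minimal sketch, reusing s1 and s2 from above:
import org.joda.time.{DateTime, Duration}

// the duration between two instants, still limited to millisecond precision
val duration = new Duration(DateTime.parse(s1), DateTime.parse(s2))
println(duration.getMillis) // 4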
New Java Date/Time API
Joda-Time is in maintenance mode and is being replaced by the new APIs, so I don't recommend starting a new project with it. Even Joda-Time's own website says: "Note that Joda-Time is considered to be a largely “finished” project. No major enhancements are planned. If using Java SE 8, please migrate to java.time (JSR-310)."
If you have the new java.time API available (JDK >= 1.8), you can also use it. If the java.time classes are not available (JDK <= 1.7), you can try scala-time, which is based on ThreeTen Backport, a great backport of Java 8's new date/time classes.
The code below works for both. The only difference is the package name (java.time in Java 8, org.threeten.bp in ThreeTen Backport and Android's ThreeTenABP); the class and method names are the same.
Unlike Joda-Time, this API has nanosecond precision, so you can get the difference between the dates in nanoseconds.
import java.time.Instant
import java.time.temporal.ChronoUnit
val s1 = "2017-05-30T09:15:06.050298Z"
val s2 = "2017-05-30T09:15:06.054939Z"
// difference in nanoseconds
val diffInNanos = ChronoUnit.NANOS.between(Instant.parse(s1), Instant.parse(s2))
The value of diffInNanos is 4641000. If you want this value in milliseconds instead, you can divide it by 1,000,000 or use ChronoUnit.MILLIS instead of ChronoUnit.NANOS.
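A quick sketch of both options, using the same s1 and s2 as above:
import java.time.{Duration, Instant}
import java.time.temporal.ChronoUnit

// the same difference, expressed directly in milliseconds
val diffMillis = ChronoUnit.MILLIS.between(Instant.parse(s1), Instant.parse(s2)) // 4
// or compute a Duration once and read it in whichever unit you need
val dur = Duration.between(Instant.parse(s1), Instant.parse(s2))
println(dur.toNanos)  // 4641000
println(dur.toMillis) // 4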

Nanoseconds with Java 8's LocalDateTime
As Spark's built-in functions do not support date diffs with sub-second precision, we need to create a UDF for millis or nanos.
Date time related imports
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
Create a UDF to compute the date diff in nanoseconds
spark.udf.register("date_diff_nano", (d1: String, d2: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.n'Z'")
  val dt1 = LocalDateTime.parse(d1, dtFormatter)
  val dt2 = LocalDateTime.parse(d2, dtFormatter)
  dt1.getLong(ChronoField.NANO_OF_DAY) - dt2.getLong(ChronoField.NANO_OF_DAY)
})
Check: help in building DateTimeFormatter pattern
By changing ChronoField.NANO_OF_DAY to ChronoField.MICRO_OF_DAY in the UDF's last line, we can get the date diff in microseconds as well, as in the sketch below.
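A minimal sketch of that microsecond variant (same formatter as above; the registered name date_diff_micro is just illustrative):
spark.udf.register("date_diff_micro", (d1: String, d2: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.n'Z'")
  val dt1 = LocalDateTime.parse(d1, dtFormatter)
  val dt2 = LocalDateTime.parse(d2, dtFormatter)
  // MICRO_OF_DAY instead of NANO_OF_DAY yields the difference in microseconds
  dt1.getLong(ChronoField.MICRO_OF_DAY) - dt2.getLong(ChronoField.MICRO_OF_DAY)
})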
Now, use the UDF on any DataFrame/Dataset object.
import spark.implicits._ // to use $-notation on columns
import org.apache.spark.sql.functions.callUDF
// create the dataframe df
val df = ...
val resultDf = df.withColumn("date_diff", callUDF("date_diff_nano", $"dt1", $"dt2"))
Here dt1 and dt2 are the datetime (string) columns in df.
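Because the UDF is registered with spark.udf.register, it can also be called from Spark SQL; a small sketch (the temp view name events is just an example):
df.createOrReplaceTempView("events")
val resultSqlDf = spark.sql("SELECT dt1, dt2, date_diff_nano(dt1, dt2) AS date_diff FROM events")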
Seconds diff with unix_timestamp of Spark SQL
Use Spark SQL's predefined unix_timestamp(date, format) function to convert a date to Unix time in seconds (note that the underlying Java SimpleDateFormat can only parse up to millisecond precision); then you can compute the date diff in seconds using unix_timestamp.
import org.apache.spark.sql.functions.unix_timestamp
val resultDf = df.withColumn("date_diff_sec",
(unix_timestamp($"dt1"), unix_timestamp($"dt2")))
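If dt1 and dt2 are raw ISO 8601 strings like those in the question rather than timestamp columns, pass an explicit format so the parse succeeds; a sketch assuming Spark 2.x's SimpleDateFormat-style patterns (anything finer than seconds is dropped here anyway):
val resultDfFmt = df.withColumn("date_diff_sec",
  unix_timestamp($"dt1", "yyyy-MM-dd'T'HH:mm:ss") - unix_timestamp($"dt2", "yyyy-MM-dd'T'HH:mm:ss"))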
Days diff between two dates using datediff
It accepts datetime values in the following formats:
java.sql.Timestamp
java.sql.Date
String in 'yyyy-MM-dd' format
String in 'yyyy-MM-dd HH:mm:ss' format
import org.apache.spark.sql.functions.datediff
val resultDf = df.withColumn("date_diff_days", datediff($"dt1", $"dt2"))

You can use the JAXB XML date parser, since the input must adhere to ISO 8601:
val t1 = javax.xml.bind.DatatypeConverter.parseDateTime("2017-05-30T09:15:06.050298Z")
val t2 = javax.xml.bind.DatatypeConverter.parseDateTime("2017-05-30T09:15:06.054939Z")
val diff = t2.getTimeInMillis - t1.getTimeInMillis

Related

Convert date to another format Scala Spark

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values.
I am trying with this function:
def to_date_(col: Column,
formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date",to_date_(readeve.col(("Date")))).show(125)
And for the first type of date I get nulls too.
What am I doing wrong? (I'm new to Scala and Spark.)
Scala version: 2.11.7
Spark version: 2.4.3
Try the code below. Note that an hour like 17 needs HH (24-hour clock), not hh. Also use to_timestamp instead of to_date because you want to keep the time.
val pr_date = readeve.withColumn(
  "Date",
  coalesce(
    date_format(to_timestamp(col("Date"),"dd-MMM-yyyy HH:mm:ss"),"dd/MM/yyyy HH:mm"),
    date_format(to_timestamp(col("Date"),"dd/MM/yyyy HH:mm"),"dd/MM/yyyy HH:mm")
  )
)
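A quick sanity check on the two sample values from the question (a sketch; it assumes a SparkSession named spark for the implicits import):
import spark.implicits._
import org.apache.spark.sql.functions._

val sample = Seq("13-Dec-2019 17:10:00", "11/02/2020 17:33").toDF("Date")
sample.withColumn(
  "Date",
  coalesce(
    date_format(to_timestamp(col("Date"), "dd-MMM-yyyy HH:mm:ss"), "dd/MM/yyyy HH:mm"),
    date_format(to_timestamp(col("Date"), "dd/MM/yyyy HH:mm"), "dd/MM/yyyy HH:mm")
  )
).show(false)
// expected: 13/12/2019 17:10 and 11/02/2020 17:33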

Convert timestamp column from UTC to EST in spark scala

I have a column of timestamp type in a Spark dataframe, with a date format like '2019-06-13T11:39:10.244Z'.
My goal is to convert this column to EST time (subtracting 4 hours) while keeping the same format.
I tried the from_utc_timestamp API, but it seems to convert the UTC time to my local timezone (+5:30) and add that to the timestamp, then subtract 4 hours from it. I tried to use Joda-Time, but for some reason it adds 33 days to the EST time.
input = 2019-06-13T11:39:10.244Z
using from_utc_timestamp api:
val tDf = df.withColumn("newTimeCol", to_utc_timestamp(col("timeCol"), "America/New_York"))
output = 2019-06-13T13:09:10.244Z+5:30
using Joda time package:
val coder : (String => String) = (arg: String) => {
new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-mm-dd'T'HH:mm:s.SS'Z'")}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
output = 2019-39-13T07:39:10.244Z
desired output = 2019-06-13T07:39:10.244Z
Kindly advise how I should proceed. Thanks in advance.
There is a typo in your format string when creating the output.
Your format string should be yyyy-MM-dd'T'HH:mm:s.SS'Z' but it is yyyy-mm-dd'T'HH:mm:s.SS'Z'.
mm is the format char for minutes, while MM is the format char for months. You can check all the format chars here.
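For completeness, here's the asker's UDF with that one character fixed (a sketch; you may also want ss and SSS instead of s.SS to get zero-padded seconds and full milliseconds):
import org.apache.spark.sql.functions.{col, udf}
import org.joda.time.{DateTime, DateTimeZone}

val coder: (String => String) = (arg: String) => {
  // MM (month) instead of mm (minute) in the output pattern
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-MM-dd'T'HH:mm:s.SS'Z'")
}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))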

How to get the week start date in Scala

I wrote the code below to get the Monday date for the date passed. Basically, I created a UDF that takes a date and returns its Monday date.
import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.sql.functions.udf

def calculate_weekstartUDF = udf((pro_rtc: String) => {
  val df = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
  val cal = Calendar.getInstance()
  cal.setTime(df)
  cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
  // get this Monday's date (returned as java.sql.Date so Spark can handle it)
  new java.sql.Date(cal.getTimeInMillis)
})
Using the above UDF in the code below:
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
Is there any better way to achieve this?
Try this approach, using the date_sub and next_day functions in Spark.
Explanation:
date_sub(
  next_day('dt,"monday"), // get the next Monday's date
  7)                      // subtract a week from that date
Example:
val df = Seq(("2019-08-06")).toDF("dt")
import org.apache.spark.sql.functions._
df.withColumn("week_strt_day",date_sub(next_day('dt,"monday"),7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
You could use the Java 8 Date API:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{TemporalField, WeekFields}
import java.util.Locale
def calculate_weekstartUDF =
  (pro_rtc: String) => {
    val localDate = LocalDate.parse(pro_rtc) // by default, parses a string in ISO yyyy-MM-dd format
    val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
    localDate.`with`(dayOfWeekField, 1)
  }
Of course, specify something other than Locale.getDefault if you want to use another Locale.
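To plug this into the DataFrame from the question, a sketch (reusing the imports above and assuming the question's flattendedJSON / pro_rtc names; java.sql.Date is used as the return type because Spark 2.x UDFs can't return a LocalDate directly):
import org.apache.spark.sql.functions.{col, udf}

val weekStartUDF = udf((pro_rtc: String) => {
  val localDate = LocalDate.parse(pro_rtc)
  val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
  // move to day-of-week 1 of the current week for the given Locale
  java.sql.Date.valueOf(localDate.`with`(dayOfWeekField, 1))
})
flattendedJSON.withColumn("weekstartdate", weekStartUDF(col("pro_rtc"))).show()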
tl;dr
LocalDate
.parse( "2019-01-23" )
.with(
TemporalAdjusters.previous( DayOfWeek.MONDAY )
)
.toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, predefining seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
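For example, in Scala syntax (backticks around with are needed because with is a Scala keyword):
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// previousOrSame returns the date itself when it is already a Monday
val monday = LocalDate.parse("2019-01-21").`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
// monday is 2019-01-21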
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way, you're also utilizing Spark's query engine and avoiding the latency of a UDF.
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.

How to select a 13-digit timestamp column from a parquet file, convert it to a date, and store it as a data frame?

Since I am a newbie to Apache Spark and Scala, I want to do the following:
-Read a specific column from a parquet file (13-digit timestamp).
-Convert the timestamp to an ordinary date format (yyyy-MM-dd HH:mm:ss).
-Store the result as another column in the dataset.
I can read the timestamp using the following code
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TEST_APP").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlcon = new SQLContext(sc)
    val Testdata = sqlcon.read.parquet("D:\\TestData.parquet")
    val data_eve_type_end = Testdata.select(Testdata.col("heading.timestamp")).where(Testdata.col("status").equalTo("Success")).toDF("13digitTime")
  }
}
I tried to convert the timestamp using the reference link below:
https://stackoverflow.com/a/54354790/9493078
But it isn't working for me. I don't actually know whether I fetched the data into a dataframe correctly or not. Anyway, the output is a table with the column name 13digitTime and values that are 13-digit numbers.
When I try the code from the link mentioned above, it shows this error:
WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '(`13digitTime` / 1000000)' due to data type mismatch:
I am expecting a data frame with 2 columns, one containing the 13-digit timestamp and the other containing the time converted from the 13-digit value to the general date format (yyyy-MM-dd HH:mm:ss).
I would kindly appreciate a solution. Thanks in advance.
sqlcon.read.parquet will return a dataframe itself. All you need to do is add a new column using the withColumn method. This should work:
val data_eve_type_end = Testdata.withColumn("13digitTime", from_unixtime($"heading.timestamp"))
I updated my code like this, in which the 13-digit Unix time is converted to 10 digits by dividing by 1000 and cast to a timestamp:
val date_conv = data_eve_type_end.select(
  col("timestamp_value").as("UNIX TIME"),
  from_unixtime(col("timestamp_value") / 1000).cast("timestamp").as("GENERAL TIME"))
and the output is like:
+-------------+-------------------+
| UNIX TIME| GENERAL TIME|
+-------------+-------------------+
|1551552902793| 2019-03-0 6:55:02|

How to group by an epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date          Count of records
02-10-2017    4
04-10-2017    3
03-10-2017    5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the below exception.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to a date data type. This can be done using the from_unixtime built-in function. Then it's just groupBy and agg function calls, using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The above answer uses a udf function, which should be avoided as much as possible, since a udf is a black box and requires serialization and deserialization of columns.
Updated
Thanks to @philantrovert for his suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.