Extract Dates and compare them from string in Scala - scala

I am trying to extract dates from a string and compare them. I am new to Scala. The string: Some(Date: Tue, 14 Aug 2018 20:57:42 GMT)Some(Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT). I wish to compare Date and Last-Modified.

Extract the Dates if working with Option
There are several Scala wrappers for the Java Time API but the example below just uses the Java API directly.
val someDate: Option[String] = Some("Date: Tue, 14 Aug 2018 20:57:42 GMT")
val someLastMod: Option[String] = Some("Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT")
Then we extract the meaningful date substrings, i.e. we remove the "Date: " prefix.
val dateStr = someDate.get.split("^[\\w\\-]+:")(1).trim
val lastModStr = someLastMod.get.split("^[\\w\\-]+:")(1).trim
Note that the above uses get, which assumes you will always have a Some and never a None; on a None it throws a NoSuchElementException. Read up on working with Option in Scala if this point is unclear.
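If a None is possible, the same extraction can be written without get. This is a minimal sketch, assuming the header name itself never contains a colon; headerValue is a hypothetical helper, not part of any library:

```scala
val someDate: Option[String] = Some("Date: Tue, 14 Aug 2018 20:57:42 GMT")
val someLastMod: Option[String] = Some("Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT")

// Hypothetical helper: split once at the first colon and keep the value part.
def headerValue(header: String): Option[String] =
  header.split(":", 2) match {
    case Array(_, value) => Some(value.trim)
    case _               => None // no colon found
  }

// Defined only if both Options are present and both headers are well-formed.
val dateStrs: Option[(String, String)] = for {
  d <- someDate.flatMap(headerValue)
  m <- someLastMod.flatMap(headerValue)
} yield (d, m)
```

A None on either side simply yields None for the pair instead of throwing.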
Extract the Dates if working with a String
val data = "Some(Date: Tue, 14 Aug 2018 20:57:42 GMT)Some(Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT)"
First we extract just the date strings we are interested in. The following expression uses split to create an array of strings, filters out any empty strings, and finally maps over what's left, using take to drop the trailing parenthesis ).
val dates = data.split("Some\\([\\w\\-]+:*\\s").filter(_.nonEmpty).map(_.take(29))
// dates: Array[String] = Array(Tue, 14 Aug 2018 20:57:42 GMT, Tue, 14 Aug 2018 20:57:24 GMT)
Now we extract each date string from the array.
val dateStr = dates(0)
val lastModStr = dates(1)
Now use the Java Time API to do comparisons.
First you need to import the java.time packages.
import java.time._
import java.time.format._
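Now create a formatter matching your date-time pattern to convert the strings to LocalDateTime instances. Note that LocalDateTime drops the zone, which is harmless here only because both values are in GMT, and that ofPattern uses the default locale, so text such as Tue and Aug parses reliably only under an English locale.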
val formatter = DateTimeFormatter.ofPattern("EEE, d MMM yyyy HH:mm:ss z")
val date = LocalDateTime.parse(dateStr, formatter)
val lastMod = LocalDateTime.parse(lastModStr, formatter)
Do some comparisons using the LocalDateTime API.
date.isBefore(lastMod)
date.isAfter(lastMod)
Check out the LocalDateTime docs for more ways to compare them.
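Since these strings look like HTTP-style dates, the built-in DateTimeFormatter.RFC_1123_DATE_TIME formatter may be a simpler fit than a hand-written pattern. The sketch below assumes the input always follows that format; it keeps the zone by parsing to ZonedDateTime and also measures the gap between the two values:

```scala
import java.time.{Duration, ZonedDateTime}
import java.time.format.DateTimeFormatter

val dateStr = "Tue, 14 Aug 2018 20:57:42 GMT"
val lastModStr = "Tue, 14 Aug 2018 20:57:24 GMT"

// RFC_1123_DATE_TIME uses fixed English day and month names, independent of the default locale.
val date = ZonedDateTime.parse(dateStr, DateTimeFormatter.RFC_1123_DATE_TIME)
val lastMod = ZonedDateTime.parse(lastModStr, DateTimeFormatter.RFC_1123_DATE_TIME)

val dateIsLater = date.isAfter(lastMod)
// Elapsed time from Last-Modified to Date.
val gap = Duration.between(lastMod, date)
```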
Consider this
Will the dates always follow the same pattern? If not, you need to decide how to handle the other patterns; otherwise parsing fails at runtime with a DateTimeParseException. The DateTimeFormatter docs have more detail.
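One possible way to cope with several patterns is to try a list of formatters in order and wrap each attempt in Try. The formatter list and the parseDate helper below are assumptions for illustration, to be adjusted to whatever formats your data actually contains:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Candidate formats, tried in order (an assumed list; adjust to your data).
val formatters = List(
  DateTimeFormatter.RFC_1123_DATE_TIME,
  DateTimeFormatter.ISO_OFFSET_DATE_TIME
)

// Hypothetical helper: the first successful parse wins, None if nothing matches.
def parseDate(s: String): Option[ZonedDateTime] =
  formatters.iterator
    .map(f => Try(ZonedDateTime.parse(s, f)).toOption)
    .collectFirst { case Some(parsed) => parsed }
```

Because the result is an Option, a string that matches none of the formats shows up as None rather than as an exception.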

Are you really trying to parse data that looks like this?
val badString = "Some(Date: Tue, 14 Aug 2018 20:57:42 GMT)Some(Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT)"
Whoever thought that might be a reasonable way to represent data should go back to school (grade school). But it can be done. First let's try to segregate the data elements we're interested in, and remove some of the cruft along the way.
val inArray :Array[String] = badString.split("Some[^:]+: ")
//inArray: Array[String] = Array("", "Tue, 14 Aug 2018 20:57:42 GMT)", "Tue, 14 Aug 2018 20:57:24 GMT)")
Next we need to describe the date/time format that we're dealing with. Note that we have to account for a trailing paren ) in the data.
import java.time.format.DateTimeFormatter
val dtFormatter = DateTimeFormatter.ofPattern("E, dd MMM yyyy HH:mm:ss z)")
Now we can turn all the good data into Java LocalDateTime elements. Any Array elements that don't match the DateTimeFormatter pattern are removed.
import util.Try
import java.time.LocalDateTime
val dates :Array[LocalDateTime] = inArray.flatMap{ dateStr =>
  Try(LocalDateTime.parse(dateStr.trim, dtFormatter)).toOption
}
So now you can extract the dates, if any, from the dates array and compare them using the LocalDateTime API.
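As an alternative sketch of the whole pipeline, a regex with capture groups can keep each header name next to its value, so the dates are looked up by name rather than by array position. HeaderPattern and the use of RFC_1123_DATE_TIME are assumptions here, chosen because the sample values look like HTTP dates:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

val badString = "Some(Date: Tue, 14 Aug 2018 20:57:42 GMT)Some(Last-Modified: Tue, 14 Aug 2018 20:57:24 GMT)"

// Group 1 is the header name, group 2 is everything up to the closing paren.
val HeaderPattern = """Some\(([\w-]+): ([^)]+)\)""".r

// Entries whose value does not parse are silently dropped.
val parsed: Map[String, ZonedDateTime] =
  HeaderPattern.findAllMatchIn(badString).flatMap { m =>
    Try(ZonedDateTime.parse(m.group(2), DateTimeFormatter.RFC_1123_DATE_TIME))
      .toOption
      .map(dt => m.group(1) -> dt)
  }.toMap

// None if either header is missing or failed to parse.
val dateIsAfterLastMod: Option[Boolean] = for {
  date    <- parsed.get("Date")
  lastMod <- parsed.get("Last-Modified")
} yield date.isAfter(lastMod)
```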

Related

Scala: Parse timestamp using spark 3.1.2

I have an Excel reader whose results I put into Spark DataFrames. I have problems parsing the timestamps.
I have timestamps as strings like Wed Dec 08 10:49:59 CET 2021. I was using spark-sql version 2.4.5 and everything worked fine until I recently updated to version 3.1.2.
Please find some minimal code below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val oldfmt: String = "E MMM dd HH:mm:ss z yyyy"
val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), fmt = oldfmt))
ttdf.show()
Running this code with Spark version 2.4.5 works as expected and produces the following output:
+--------------------+-------------------+
| theTimestampColumn| parsedTime|
+--------------------+-------------------+
|Wed Dec 08 20:49:...|2021-12-08 20:49:59|
+--------------------+-------------------+
Now, executing the same code, just with spark version 3.1.2, results in the following error:
Exception in thread "main" org.apache.spark.SparkUpgradeException:
You may get a different result due to the upgrading of Spark 3.0:
Fail to recognize 'E MMM dd HH:mm:ss z yyyy' pattern in the DateTimeFormatter.
1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
This page doesn't help me any further. I can't find any mistake in my format string.
The symbol E represents the day-of-week as text like Tue; Tuesday.
The symbol M represents the month-of-year like 7; 07; Jul; July. The symbols H,m,s,y are hours, minutes, seconds or years, respectively. The symbol z denotes the time-zone name like Pacific Standard Time; PST.
Am I missing something obvious here?
Any help will be really appreciated. Thank you in advance.
You can use E only for datetime formatting and not for parsing, as stated in datetime pattern documentation:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
If you want to apply behavior of Spark version <3.0, you can set spark.sql.legacy.timeParserPolicy option to LEGACY:
sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
And if you don't want to change spark configuration, you can remove the characters representing day with substr SQL function:
import org.apache.spark.sql.functions.{col, to_timestamp, expr}
// toDF requires the session implicits, e.g. import sparkSession.implicits._
val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val fmt: String = "MMM dd HH:mm:ss z yyyy"
val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  // substr is 1-based: start at position 5 to skip the leading "Wed "
  .withColumn("preparedTimestamp", expr("substr(theTimestampColumn, 5, length(theTimestampColumn))"))
  .withColumn("parsedTime", to_timestamp(col("preparedTimestamp"), fmt = fmt))
  .drop("preparedTimestamp")
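One way to sanity-check the shortened pattern before handing it to Spark is to try it with plain java.time, which Spark 3's parser builds on. Treat this only as a rough check under that assumption, since Spark accepts a narrower set of pattern letters than java.time does. Locale.ENGLISH is pinned here so that Dec parses regardless of the machine's default locale:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

val ts = "Wed Dec 08 20:49:59 CET 2021"
val fmt = "MMM dd HH:mm:ss z yyyy"

// Same effect as substr(theTimestampColumn, 5, ...): drop the leading "Wed ".
val prepared = ts.substring(4)

// Throws DateTimeParseException if the pattern and the text disagree.
val parsedTime = ZonedDateTime.parse(prepared, DateTimeFormatter.ofPattern(fmt, Locale.ENGLISH))
```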

How to convert date from one format to another? Need to get date to "Tue Apr 3 22:10:06 2018" format

How to convert "2018-04-03 22:10:06" to "Tue Apr 3 22:10:06 2018"? Obviously not those specific dates but that format.
I found this solution:
How to convert date format from dd/MM/YYYY to YYYY-MM-dd in swift
but I am unable to get it to the exact format.
So, I just threw this into Playgrounds
let inFormat = DateFormatter()
inFormat.dateFormat = "yyyy-MM-dd HH:mm:ss"
let date = inFormat.date(from: "2018-04-03 22:10:06")
let outFormat = DateFormatter()
outFormat.dateFormat = "E MMM d HH:mm:ss yyyy"
outFormat.string(from: date!)
It eventually outputs Tue Apr 3 22:10:06 2018
The formats were initially taken from nsdateformatter.com
This, of course, will use the current system TimeZone and Locale, so you may want to do some more investigation into that.

Use moment to check if date is from 2 minutes ago

I tried this but I can't make it work.
I have an array of dates like this: ['5/15/2017, 2:59:06 PM', '5/15/2017, 2:59:16 PM', ...]
And I want to filter it and get the dates from the last 2 minutes.
This is what I'm doing:
const twoMinutesAgo = moment().subtract(2, 'minutes')
myArray.filter(stat => moment(stat.date).isAfter(twoMinutesAgo))
but this is returning always true.
I saw that twoMinutesAgo is Mon May 15 2017 14:57:09 GMT-0700 (PDT)
while moment(stat.date) is Mon Mar 05 2018 14:59:06 GMT-0800 (PST)
But I'm not sure if that has something to do with it. Any ideas what I'm doing wrong?
Part of the issue is that you're only looking at the hour of the day, but you're actually comparing absolute dates.
The other issue you have is moment isn't correctly parsing your date strings. To rectify this, you can provide a format string to instruct moment on how to parse apart the important bits:
const PARSE_FORMAT = 'M/D/YYYY, h:mm:ss A';
const twoMinutesAgo = moment().subtract(2, 'minutes')
myArray.filter(stat => moment(stat.date, PARSE_FORMAT).isAfter(twoMinutesAgo));

How to assign a static date to a variable in Scala

We can assign a string, an integer, a long, etc. to a variable like this:
var str:String="This is String"
var inte:Int=1
Is something similar possible for a date, like this?
var dat:Date=new Date(22/05/2013)
The output, however, is
Thu Jan 01 05:30:00 IST 1970
How do I assign a static date to a variable?
scala> 22/05/2013
res0: Int = 0
22/05/2013 is integer division, which evaluates to 0, so you are calling the Date constructor with 0. That argument is the number of milliseconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT. So you are getting the epoch itself, displayed in IST, which is 5:30 ahead of GMT.
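The arithmetic can be checked step by step. In this small sketch the literal is written as 5 rather than 05 to keep it unambiguous across Scala versions, and Asia/Kolkata is assumed to be the zone behind the IST label in the output:

```scala
import java.time.{Instant, ZoneId}
import java.util.Date

// Integer division: 22 / 5 is 4, and 4 / 2013 is 0.
val millis = 22 / 5 / 2013

// Date(long) takes milliseconds since the epoch, so 0 is the epoch itself.
val epochDate = new Date(millis.toLong)

// The same instant viewed in IST (UTC+05:30) explains the 05:30:00 in the output.
val inIst = Instant.ofEpochMilli(epochDate.getTime).atZone(ZoneId.of("Asia/Kolkata"))
```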
You should use DateFormat.parse since all other Date constructors are deprecated.
From the question, I couldn't guess what you are trying to achieve.
Perhaps this is what you are looking for:
import java.util.Date
import java.text.SimpleDateFormat
val format = new SimpleDateFormat("dd/MM/yyyy")
var date = format.parse("22/05/2013")
// date : java.util.Date = Wed May 22 00:00:00 IST 2013
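If Java 8 or later is available, java.time.LocalDate may be a cleaner fit for a fixed calendar date, since it carries no time of day or time zone. A minimal sketch, offered as an alternative to the SimpleDateFormat approach rather than a replacement for it:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// A fixed calendar date built directly from year, month and day.
val fixed: LocalDate = LocalDate.of(2013, 5, 22)

// The same date parsed from the dd/MM/yyyy text used above.
val parsed: LocalDate = LocalDate.parse("22/05/2013", DateTimeFormatter.ofPattern("dd/MM/yyyy"))
```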

Conversion between date formats

I have a spreadsheet which we use to log our work. I have read the date column from the sheet into an array so that I can perform actions/calculations on the values. This did not work for me: I noticed in the debugger that the array holding the dates contains values in different formats. For instance, some display as "16/11/2012" and some as (new Date(1355184000000)).
Can someone point a way to convert them all to a unified format so that I can work with them?
Thanks
How do you want your dates to be shown? Do you want them to be date objects or strings? The value you show in your code, new Date(1355184000000), corresponds to Mon Dec 10 16:00:00 PST 2012.
You can check that by using Logger.log(new Date(1355184000000))
On the contrary "16/11/2012" is most probably not a date but a string...(note : strange that you use a "day/month/year" sequence since I saw on your profile you are in UK, I thought you'd use mm/dd/yyyy instead).
Since you said you need to do some calculations on these items, I guess they should all be converted to date objects for use in your script.
I'd suggest you look at some documentation on the Date object to see exactly how this should be done without generating errors. Don't forget that dates in JavaScript always carry both a date and a time, down to hours, minutes, seconds and milliseconds. The integer value you saw is the number of milliseconds since January 1, 1970 ;-)
You could also search for dates in this forum and find quite a lot of interesting information.
Here is a small function to illustrate :
function playWithTime(){
  Logger.log('ref date = '+new Date(0))
  var example = "june 30, 2013 23:59:00"
  Logger.log(example+' = '+ new Date("june 30, 2013 23:59:00"))
  Logger.log(example+' = '+ new Date("june 30, 2013 23:59:00").getTime()+' mS')
}
It will show this in the Logger :
ref date = Thu Jan 01 1970 01:00:00 GMT+0100 (CET)
june 30, 2013 23:59:00 = Sun Jun 30 2013 23:59:00 GMT+0200 (CEST)
june 30, 2013 23:59:00 = 1372629540000 mS
btw, note that the Logger returns values in different ways, depending on the daylight saving in your timezone... I'm in Belgium and june is in 'summer' time (CEST). It can also be shown in PDT or PST which is the timezone of Google servers. You can't rely on the logger to be constant (!!) but that's another story ;-)
EDIT : If your date strings are in the form dd/mm/yyyy then you should probably reorder it like in this code :
function playWithTime2(){
  Logger.log('original date string in UK format = 16/11/2012')
  var d = "16/11/2012".split('/');
  var d_ordered=d[1]+'/'+d[0]+'/'+d[2]
  Logger.log('becomes '+d_ordered+' = '+new Date(d_ordered))
}
Which returns
original date string in UK format = 16/11/2012
becomes 11/16/2012 = Fri Nov 16 2012 00:00:00 GMT+0100 (CET)