How to get week start date in Scala

I wrote the code below to get the Monday date for the date passed in. Basically, I created a UDF that takes a date and returns its Monday:
import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.sql.functions.udf

def calculate_weekstartUDF = udf((pro_rtc: String) => {
  val parsed = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
  val cal = Calendar.getInstance()
  cal.setTime(parsed)
  cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
  // return this Monday's date (java.sql.Date so Spark can infer the result type)
  new java.sql.Date(cal.getTime().getTime)
})
I'm using the above UDF in the code below:
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
Is there any better way to achieve this?

Try this approach using the date_sub and next_day functions in Spark.
Explanation:
date_sub(
  next_day('dt, "monday"), // get the next Monday
  7)                       // subtract a week from that date
Example:
import org.apache.spark.sql.functions._
val df = Seq(("2019-08-06")).toDF("dt")
df.withColumn("week_strt_day", date_sub(next_day('dt, "monday"), 7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
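Another built-in option (assuming Spark 2.3+, where date_trunc supports the "week" field) is to truncate straight to the Monday of the week, with no next_day/date_sub arithmetic:
import org.apache.spark.sql.functions.{col, date_trunc, to_date}
val df = Seq(("2019-08-06")).toDF("dt")
// date_trunc("week", ...) returns a timestamp at Monday 00:00 of that week; to_date keeps just the date
df.withColumn("week_strt_day", to_date(date_trunc("week", col("dt")))).show()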

You could use the Java 8 Date API:
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

def calculate_weekstartUDF =
  (pro_rtc: String) => {
    val localDate = LocalDate.parse(pro_rtc) // by default parses a string in ISO yyyy-MM-dd format
    val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
    localDate.`with`(dayOfWeekField, 1)
  }
Of course, specify something other than Locale.getDefault if you want to use another Locale.
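To actually run it on the DataFrame from the question, wrap the function in udf(). A minimal sketch (my addition, not the answerer's code), returning java.sql.Date so Spark can infer the result type; WeekFields.ISO makes Monday day 1 regardless of the default locale:
import org.apache.spark.sql.functions.{col, udf}

val weekStart = udf { (pro_rtc: String) =>
  val localDate = LocalDate.parse(pro_rtc)
  java.sql.Date.valueOf(localDate.`with`(WeekFields.ISO.dayOfWeek(), 1))
}

flattendedJSON.withColumn("weekstartdate", weekStart(col("pro_rtc"))).show()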

tl;dr
LocalDate
    .parse( "2019-01-23" )
    .with(
        TemporalAdjusters.previous( DayOfWeek.MONDAY )
    )
    .toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, which predefines seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
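For completeness, here is a possible Scala translation of the Java example above (my sketch, not the answerer's code), using previousOrSame so a Monday input stays on that Monday:
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

val previousMonday =
  LocalDate
    .parse("2019-01-23")
    .`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY)) // 2019-01-21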

Try this:
In my example, pro_rtc is a Unix timestamp in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way you're also utilizing Spark's query engine and avoiding the latency of a UDF.
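A small usage sketch (hypothetical data, assuming a SparkSession named spark; pro_rtc holds Unix time in seconds as in the answer):
import org.apache.spark.sql.functions.{col, expr, from_unixtime, to_date}
import spark.implicits._

// 1565049600 is 2019-08-06 00:00:00 UTC, a Tuesday (the resulting date follows the session time zone)
val dataFrame = Seq(1565049600L).toDF("pro_rtc")
dataFrame
  .withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
  .withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
  .show()
// dayofweek(Tuesday) = 3, so date_sub(Date, 1) gives Monday, 2019-08-05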

The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.
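If you'd rather avoid the extra dependency, a rough built-in-functions equivalent of the Wednesday example above (my approximation, not spark-daria's actual implementation) looks like this:
import org.apache.spark.sql.functions.{col, date_sub, dayofweek, next_day, when}

// end_of_week: the date itself if it is already a Wednesday (dayofweek == 4), otherwise the next Wednesday
val endOfWeekCol = when(dayofweek(col("some_date")) === 4, col("some_date"))
  .otherwise(next_day(col("some_date"), "Wed"))

df
  .withColumn("end_of_week", endOfWeekCol)
  .withColumn("beginning_of_week", date_sub(col("end_of_week"), 6))
  .show()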

Related

Spark date format issue

I have observed weird behavior in Spark date formatting. I need to convert a two-digit year (yy) to a four-digit year (yyyy), and after conversion the year should always be 20yy.
I tried the code below, but it fails for years after 2040.
import org.apache.spark.sql.functions._
val df= Seq(("06/03/35"),("07/24/40"), ("11/15/43"), ("12/15/12"), ("11/15/20"), ("12/12/22")).toDF("Date")
df.withColumn("newdate", from_unixtime(unix_timestamp($"Date", "mm/dd/yy"), "mm/dd/yyyy")).show
+--------+----------+
| Date| newdate|
+--------+----------+
|06/03/35|06/03/2035|
|07/24/40|07/24/2040|
|11/15/43|11/15/1943| // here the year was prefixed with 19
|12/15/12|12/15/2012|
|11/15/20|11/15/2020|
|12/12/22|12/12/2022|
+--------+----------+
Why this behavior? Is there any date utility function I can use directly, without prepending "20" to the string date?
Parsing 2-digit year strings is subject to some relative interpretation that is documented in the SimpleDateFormat docs:
For parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to be within 80 years before and 20 years after the time the SimpleDateFormat instance is created. For example, using a pattern of "MM/dd/yy" and a SimpleDateFormat instance created on Jan 1, 1997, the string "01/11/12" would be interpreted as Jan 11, 2012 while the string "05/04/64" would be interpreted as May 4, 1964.
So, 2043 being more than 20 years away, the parser uses 1943 as documented.
Here's one approach that uses a UDF that explicitly calls set2DigitYearStart on a SimpleDateFormat object before parsing the date (I picked 1980 just as an example):
import java.sql.Date
import java.text.SimpleDateFormat
import java.util.Calendar

def parseDate(date: String, pattern: String): Date = {
  val format = new SimpleDateFormat(pattern)
  val cal = Calendar.getInstance()
  cal.set(Calendar.YEAR, 1980)
  val beginning = cal.getTime()
  format.set2DigitYearStart(beginning)
  new Date(format.parse(date).getTime)
}
And then:
import org.apache.spark.sql.functions.{lit, udf}

val custom_to_date = udf(parseDate _)
df.withColumn("newdate", custom_to_date($"Date", lit("mm/dd/yy"))).show(false)
+--------+----------+
|Date |newdate |
+--------+----------+
|06/03/35|2035-01-03|
|07/24/40|2040-01-24|
|11/15/43|2043-01-15|
|12/15/12|2012-01-15|
|11/15/20|2020-01-15|
|12/12/22|2022-01-12|
+--------+----------+
Knowing your data, you can decide which value to pick for the set2DigitYearStart() parameter.
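Alternatively, here is a java.time-based sketch of the same idea (my assumptions: the data really is month/day/year, so the pattern uses MM for month, and every two-digit year should land in 2000-2099):
import java.sql.Date
import java.time.LocalDate
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField
import org.apache.spark.sql.functions.udf

val to20yy = udf { (s: String) =>
  // formatter built inside the lambda so the closure stays serializable
  val fmt = new DateTimeFormatterBuilder()
    .appendPattern("MM/dd/")
    .appendValueReduced(ChronoField.YEAR, 2, 2, 2000) // base 2000: "43" -> 2043, "12" -> 2012
    .toFormatter()
  Date.valueOf(LocalDate.parse(s, fmt))
}

df.withColumn("newdate", to20yy($"Date")).show(false)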

spark scala how can I calculate days since 1970-01-01

Looking for scala code to replicate https://www.epochconverter.com/seconds-days-since-y0
I have a Spark streaming job reading Avro messages. The message has a column of type int that holds days since 1970-01-01. I want to convert that to a date.
dataFrame.select(from_avro(col("Value"), valueRegistryConfig) as 'value)
.select("value.*")
.withColumn("start_date",'start_date)
start_date holds an integer value like 18022, i.e. days since 1970-01-01. I want to convert this value to a date:
18022 -> Sun May 05 2019
Use 1970-01-01 as the base date and pass the number of days to the date_add function.
This gives you a date that is one day ahead, so subtract 1.
Something like this:
import org.apache.spark.sql.functions.{col, expr}

val dataDF = Seq(("1970-01-01", 18091), ("1970-01-01", 18021), ("1970-01-01", 18022)).toDF("date", "num")
dataDF.select(
  col("date"),
  expr("date_add(date, num - 1)").as("date_add")
).show(10, false)
+----------+----------+
|date |date_add |
+----------+----------+
|1970-01-01|2019-07-13|
|1970-01-01|2019-05-04|
|1970-01-01|2019-05-05|
+----------+----------+
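If you'd rather not carry a dummy date column, the epoch start can be a literal inside expr (same num - 1 adjustment as above; the column name is hypothetical):
import org.apache.spark.sql.functions.expr

val daysDF = Seq(18091, 18021, 18022).toDF("num")
daysDF.withColumn("as_date", expr("date_add('1970-01-01', num - 1)")).show(false)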

day of week date format string java inside spark

val df = Seq("2019-07-30", "2019-08-01").toDF
val dd = df.withColumn("value", to_date('value))
dd.show(false)
According to the docs (https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html),
F is the format letter for "day of week in month". And
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "F")).show(false)
+----------+--------+----------+
|value |dow |dow_number|
+----------+--------+----------+
|2019-07-30|Tuesday |5 |
|2019-08-01|Thursday|1 |
+----------+--------+----------+
gives only the day of the week in the month, not the day of the week.
Which format string gives me the day of the week as a number/integer?
Obviously, I could use: http://www.java2s.com/Tutorials/Java/Data_Type_How_to/Date/Get_day_of_week_int_value_and_String_value.htm
But I do not want to go for a UDF; I want to use the Catalyst-optimized date_format. So which date format string gives me the desired result?
As mentioned in the comments, you are looking for the "u" format.
Also, from Spark 2.3.0 you might want to use the dayofweek function, which is faster (see the dayofweek documentation).
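For example, a minimal sketch with dayofweek (note the different numbering: it returns 1 = Sunday through 7 = Saturday, whereas "u" gives 1 = Monday through 7 = Sunday):
import org.apache.spark.sql.functions.dayofweek

dd.withColumn("dow_number", dayofweek('value)).show(false)
// 2019-07-30 (Tuesday)  -> 3
// 2019-08-01 (Thursday) -> 5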
Your code is correct; instead of "F" just use "u", like below:
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "u")).show(false)

Scala date format

I have a data_date that comes in yyyyMMdd format:
beginDate = Some(LocalDate.of(startYearMonthDay(0), startYearMonthDay(1),
startYearMonthDay(2)))
var Date = beginDate.get
.......
val data_date = Date.toString().replace("-", "")
This will give me a result of '20180202'
However, I need the result to be 201802 (yyyyMM) for my use case. I don't want to change the value of beginDate, just the data_date value. How do I do that? Is there a split function I can use?
Thanks!
It's not clear from the code snippet that you're using Spark, but the tags imply that, so I'll give an answer using Spark built-in functions. Suppose your DataFrame is called df with date column my_date_column. Then, you can simply use date_format
scala> import org.apache.spark.sql.functions.date_format
import org.apache.spark.sql.functions.date_format
scala> df.withColumn("my_new_date_column", date_format($"my_date_column", "yyyyMM")).
     |   select($"my_new_date_column").limit(1).show
// for example:
+------------------+
|my_new_date_column|
+------------------+
| 201808|
+------------------+
The way to do it with DateTimeFormatter:
val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = Date.format(formatter)
I recommend reading through the DateTimeFormatter docs so you can format the date the way you want.
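For reference, a self-contained sketch (the date value is hypothetical; Date in the snippet above is the LocalDate from the question):
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = LocalDate.of(2018, 2, 2).format(formatter) // "201802"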
You can do this by only taking the first 6 characters of the resulting string.
i.e.
val s = "20180202"
s.substring(0, 6) // returns "201802"

Date diff of ISO 8601 timestemp strings in Spark

I have two datetime strings in ISO 8601 format:
2017-05-30T09:15:06.050298Z
2017-05-30T09:15:06.054939Z
I want the time difference between the above two strings, using Scala in a Spark environment.
As you said in the comments you're using Joda-Time, here's an answer using it.
You said that you're calling daysBetween. But both dates are in the same day, so the result will always be zero. To get the difference between the dates with millisecond precision, just subtract the millis value from both DateTime objects:
import org.joda.time.DateTime
val s1 = "2017-05-30T09:15:06.050298Z"
val s2 = "2017-05-30T09:15:06.054939Z"
val diffInMillis = DateTime.parse(s2).getMillis() - DateTime.parse(s1).getMillis()
The diffInMillis will be 4 - the first date's fraction-of-second is 050298 and the second's is 054939, but joda's DateTime has milliseconds precision, so the last 3 digits are discarded. You can check that by doing:
println(DateTime.parse(s1))
println(DateTime.parse(s2))
This will output:
2017-05-30T09:15:06.050Z
2017-05-30T09:15:06.054Z
As you can see, the difference between the dates is 4 milliseconds.
New Java Date/Time API
Joda-Time is in maintenance mode and is being replaced by the new APIs, so I don't recommend starting a new project with it. Even Joda's website says: "Note that Joda-Time is considered to be a largely “finished” project. No major enhancements are planned. If using Java SE 8, please migrate to java.time (JSR-310)."
If you have the new java.time API available (JDK >= 1.8), you can also use it. If the java.time classes are not available (JDK <= 1.7), you can try the scala time library, which is based on the ThreeTen Backport, a great backport of Java 8's new date/time classes.
The code below works for both.
The only difference is the package names (in Java 8 is java.time and in ThreeTen Backport (or Android's ThreeTenABP) is org.threeten.bp), but the classes and methods names are the same.
The difference is that this API has nanoseconds precision, so you can get the difference between the dates in nanoseconds.
import java.time.Instant
import java.time.temporal.ChronoUnit
val s1 = "2017-05-30T09:15:06.050298Z"
val s2 = "2017-05-30T09:15:06.054939Z"
// difference in nanoseconds
val diffInNanos = ChronoUnit.NANOS.between(Instant.parse(s1), Instant.parse(s2))
The value of diffInNanos is 4641000. If you still want this value in milliseconds, you can divide it by 1000000, or use ChronoUnit.MILLIS instead of ChronoUnit.NANOS.
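The same difference can also be read off a java.time.Duration, which is a small convenience over ChronoUnit (a sketch reusing the strings above):
import java.time.{Duration, Instant}

val duration = Duration.between(Instant.parse(s1), Instant.parse(s2))
duration.toNanos  // 4641000
duration.toMillis // 4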
Nanoseconds with LocalDateTime of Java 8
As Spark does not support date diffs with a precision finer than seconds, we need to create a UDF for millis or nanos.
Date/time-related imports:
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
Create a UDF that diffs two dates in nanoseconds:
spark.udf.register("date_diff_nano", (d1: String, d2: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'")
  val dt1 = LocalDateTime.parse(d1, dtFormatter)
  val dt2 = LocalDateTime.parse(d2, dtFormatter)
  dt1.getLong(ChronoField.NANO_OF_DAY) - dt2.getLong(ChronoField.NANO_OF_DAY)
})
See the DateTimeFormatter documentation for help building the pattern.
By changing ChronoField.NANO_OF_DAY to ChronoField.MICRO_OF_DAY in the UDF's last line, we can get the date diff in microseconds as well.
Now, use the registered UDF on any DataFrame/Dataset:
import org.apache.spark.sql.functions.callUDF
import spark.implicits._ // to use $-notation on columns

// create the dataframe df
val df = ...
val resultDf = df.withColumn("date_diff", callUDF("date_diff_nano", $"dt1", $"dt2"))
Here dt1 and dt2 are the datetime string columns in df.
Seconds diff with Spark SQL's unix_timestamp
Use the predefined unix_timestamp(date, format) function to convert a datetime string to epoch seconds (Java's SimpleDateFormat supports parsing only up to milliseconds), then the diff is a simple subtraction:
import org.apache.spark.sql.functions.unix_timestamp
val resultDf = df.withColumn("date_diff_sec",
  unix_timestamp($"dt1") - unix_timestamp($"dt2"))
Days diff between two dates using datediff
It accepts datetime values in the following formats:
java.sql.Timestamp
java.sql.Date
String in 'yyyy-MM-dd' format
String in 'yyyy-MM-dd HH:mm:ss' format
import org.apache.spark.sql.functions.datediff
val resultDf = df.withColumn("date_diff_days", datediff($"dt1", $"dt2"))
You can use the XML datatype parser, since the input must adhere to ISO 8601:
val t1 = javax.xml.bind.DatatypeConverter.parseDateTime("2017-05-30T09:15:06.050298Z")
val t2 = javax.xml.bind.DatatypeConverter.parseDateTime("2017-05-30T09:15:06.054939Z")
val diff = t2.getTimeInMillis - t1.getTimeInMillis