spark scala how can I calculate days since 1970-01-01 - scala

Looking for scala code to replicate https://www.epochconverter.com/seconds-days-since-y0
I have a spark streaming job reading the avro message. The message has a column of type int and holds Days Since 1970-01-01. I want to convert that to date.
dataFrame.select(from_avro(col("Value"), valueRegistryConfig) as 'value)
.select("value.*")
.withColumn("start_date",'start_date)
start_date is holding an integer value like 18022 i.e Days Since 1970-01-01. I want to convert this value to a date
18022 - > Sun May 05 2019

Use default date as 1970-01-01 and pass number of days to date_add function.
This will give you date but will be 1 day additional so you do minus 1.
Something like this:
var dataDF = Seq(("1970-01-01",18091),("1970-01-01",18021),("1970-01-01",18022)).toDF("date","num")
dataDF.select(
col("date"),
expr("date_add(date,num-1)").as("date_add")).show(10,false)
+----------+----------+
|date |date_add |
+----------+----------+
|1970-01-01|2019-07-13|
|1970-01-01|2019-05-04|
|1970-01-01|2019-05-05|
+----------+----------+

Related

to_date in selectexpr of pyspark is truncating date time to year by default. how to avoid this?

I have a requirement where I derive the year to be loaded and then have to load the first day and last day of that year in date format to a table.
Here is what I'm doing:
boy = str(nxt_yr)+'01-01'
eoy = str(nxt_yr)+'12-31'
df_final = df_demo.selectExpr("to_date('{}','yyyy-MM-dd') as strt_dt".format(boy),"to_date('{}','yyyy-MM-dd') as end_dt".format(eoy))
spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
df_final.show(1)
this is giving me 2023-01-01 in both the fields in date datatype.
Is this expected behavior and if yes is there any workaround?
Note: I tried hardcoding the date as 2022-11-30 also in the code but still received the beginning of the year in the output.
Its working as expected , additionally you are missing a - within your dates you create for conversions
nxt_yr = 2022
boy = str(nxt_yr)+'-01-01'
# |
# /|\
# |
# |
eoy = str(nxt_yr)+'-12-31'
sql.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
sql.sql(f"""
SELECT
to_date('{boy}','yyyy-MM-dd') as strt_dt
,to_date('{eoy}','yyyy-MM-dd') as end_dt
"""
).show()
+----------+----------+
| strt_dt| end_dt|
+----------+----------+
|2022-01-01|2022-12-31|
+----------+----------+

Changing date format in Spark returns incorrect result

I am trying to convert a string type date from a csv file to date format first and then to convert that to a particularly expected date format. While doing so, for a row (for the first time) I saw the date format change is changing the year value.
scala> df1.filter($"pt" === 2720).select("`date`").show()
+----------+
| date|
+----------+
|24/08/2019|
|30/12/2019|
+----------+
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), YYYY-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2020-12-30|
+------------------------------------------------------+
As you can see above, in the above, the two rows of data has 24/08/2019 and 30/12/2019 respectively, but after explicit type casting and date format change, it gives 2019-08-24 (which is correct) and 2020-12-30 (incorrect and unexpected).
Why does this problem occur and how can this be best avoided?
I solved this issue by changing the capital YYYY to yyyy in the expected date format parameter.
So, instead of
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
I am now doing
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
This is because, as per this Java's SimpleDateFormat, the capital Y is parsed as week year where as small letter y is parsed as year.
So, now, when I ran with small y in the year's field, I get the expected result:
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), yyyy-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2019-12-30|
+------------------------------------------------------+

How to get week start date in scala

I wrote the below code to get the Monday date for the date passed, Basically created an udf to pass a date and get it's monday date
def calculate_weekstartUDF = udf((pro_rtc:String)=>{
val df = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
val cal = Calendar.getInstance()
cal.setTime(df)
cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
//Get this Monday date
val Period=cal.getTime()
})
Using the above UDF in below code
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
is there any better way to achieve this.
Try with this approach using date_sub,next_day functions in spark.
Explanation:
date_sub(
next_day('dt,"monday"), //get next monday date
7)) //substract week from the date
Example:
val df =Seq(("2019-08-06")).toDF("dt")
import org.apache.spark.sql.functions._
df.withColumn("week_strt_day",date_sub(next_day('dt,"monday"),7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
You could use the Java 8 Date API :
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{TemporalField, WeekFields}
import java.util.Locale
def calculate_weekstartUDF =
(pro_rtc:String)=>{
val localDate = LocalDate.parse(pro_rtc); // By default parses a string in YYYY-MM-DD format.
val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
localDate.`with`(dayOfWeekField, 1)
}
Of course, specify other thing than Locale.getDefault if you want to use another Locale.
tl;dr
LocalDate
.parse( "2019-01-23" )
.with(
TemporalAdjusters.previous( DayOfWeek.MONDAY )
)
.toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, predefining seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way, you're also utilizing Spark's query engine and avoiding UDF's latency
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.

day of week date format string java inside spark

val df = Seq("2019-07-30", "2019-08-01").toDF
val dd = df.withColumn("value", to_date('value))
dd.show(false)
according to the docs https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
F is the format string if I need to see the day of the week in month. And
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "F")).show(false)
+----------+--------+----------+
|value |dow |dow_number|
+----------+--------+----------+
|2019-07-30|Tuesday |5 |
|2019-08-01|Thursday|1 |
+----------+--------+----------+
gives only the day of the week in the month, not the day of the week.
Which format string gives me the day of the week as a number /integer?
Obviously, I could use: http://www.java2s.com/Tutorials/Java/Data_Type_How_to/Date/Get_day_of_week_int_value_and_String_value.htm
But do not want to go for a UDF / want to use the catalyst optimized date_format. So which date format string gives me the desired result?
As mentionned in the comments, you are looking for the "u" format.
Also, from spark 2.3.0 you might want to use dayofweek method, which is faster dayofweek documentation
your code is correct instead of "F" just use "u" like below
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "F")).show(false)

convert date to integer scala spark

I have a dataframe, that contain, 2 columns of date start_date and finish_date; and I created a new column to add the moyen between the 2 dates.
+-----+--------+-------+---------+-----+--------------------+-------------------
start_date| finish_date| moyen_date|
+-----+--------+-------+---------+-----+--------------------+-------------------
2010-11-03 15:56:... |2010-11-03 17:43:...| 0|
2010-11-03 17:43:... |2010-11-05 13:21:...| 2|
2010-11-05 13:21:... |2010-11-05 14:08:...| 0|
2010-11-05 14:08:... |2010-11-05 14:08:...| 0|
+-----+--------+-------+---------+-----+--------------------+-------------------
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
But I want to convert start_date and finish_date to integer, knowing that each column contain date + time.
Someone can help me please. ?
Thank you
Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of milliseconds since epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts" , unix_timestamp($"ts").show(false)
+---------------------+----------+
|ts |unix_ts |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to timestamp format of your choice, you can use from_unixtime which also takes an optional timestamp format as a parameter. You are using to_date, that's why you're only getting the date and not the time.
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts |unix_ts |from_utime |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string though. To convert it to timestamp, you can simple use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.