I have observed odd behavior in Spark date formatting. I need to convert dates with a two-digit year (yy) to a four-digit year (yyyy), and after the conversion the year should always be 20yy.
I tried the following, but it fails for years after 2040.
import org.apache.spark.sql.functions._
val df= Seq(("06/03/35"),("07/24/40"), ("11/15/43"), ("12/15/12"), ("11/15/20"), ("12/12/22")).toDF("Date")
df.withColumn("newdate", from_unixtime(unix_timestamp($"Date", "mm/dd/yy"), "mm/dd/yyyy")).show
+--------+----------+
|    Date|   newdate|
+--------+----------+
|06/03/35|06/03/2035|
|07/24/40|07/24/2040|
|11/15/43|11/15/1943| // Here the year is prefixed with 19
|12/15/12|12/15/2012|
|11/15/20|11/15/2020|
|12/12/22|12/12/2022|
+--------+----------+
Why does this happen? Is there a date utility function I can use directly, without manually prepending 20 to the date string?
Parsing 2-digit year strings is subject to some relative interpretation that is documented in the SimpleDateFormat docs:
For parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to be within 80 years before and 20 years after the time the SimpleDateFormat instance is created. For example, using a pattern of "MM/dd/yy" and a SimpleDateFormat instance created on Jan 1, 1997, the string "01/11/12" would be interpreted as Jan 11, 2012 while the string "05/04/64" would be interpreted as May 4, 1964.
So, because 2043 is more than 20 years after the time the parser is created, the parser picks 1943, as documented.
Here's one approach that uses a UDF that explicitly calls set2DigitYearStart on a SimpleDateFormat object before parsing the date (I picked 1980 just as an example):
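import java.sql.Date
import java.text.SimpleDateFormat
import java.util.Calendar

def parseDate(date: String, pattern: String): Date = {
  val format = new SimpleDateFormat(pattern)
  // Start the 100-year window for two-digit years in 1980 (example value)
  val cal = Calendar.getInstance()
  cal.set(Calendar.YEAR, 1980)
  format.set2DigitYearStart(cal.getTime)
  // Wrap the result in java.sql.Date so that Spark maps it to a date column
  new Date(format.parse(date).getTime)
}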
And then:
val custom_to_date = udf(parseDate _)
// MM is the month specifier; lowercase mm would be read as minutes
df.withColumn("newdate", custom_to_date($"Date", lit("MM/dd/yy"))).show(false)
+--------+----------+
|Date    |newdate   |
+--------+----------+
|06/03/35|2035-06-03|
|07/24/40|2040-07-24|
|11/15/43|2043-11-15|
|12/15/12|2012-12-15|
|11/15/20|2020-11-15|
|12/12/22|2022-12-12|
+--------+----------+
Knowing your data, you will know which start value to pass to set2DigitYearStart().
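The window logic can be tried outside Spark. The following is a minimal Java sketch, not the answer's own code: the class name TwoDigitYearWindow and the helper parse are made up for illustration, and the time zone is pinned to UTC only to keep the result machine-independent. It shows how the chosen start year decides the century of a two-digit year; a start of 2000 always yields 20yy.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class TwoDigitYearWindow {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Hypothetical helper: parses MM/dd/yy using a 100-year window that
    // starts on January 1 of windowStartYear, and returns yyyy-MM-dd.
    static String parse(String text, int windowStartYear) {
        SimpleDateFormat in = new SimpleDateFormat("MM/dd/yy", Locale.US);
        in.setTimeZone(UTC);

        Calendar start = new GregorianCalendar(UTC, Locale.US);
        start.clear();
        start.set(windowStartYear, Calendar.JANUARY, 1);
        in.set2DigitYearStart(start.getTime());

        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        out.setTimeZone(UTC);
        try {
            return out.format(in.parse(text));
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("11/15/43", 1900)); // 1943-11-15
        System.out.println(parse("11/15/43", 1980)); // 2043-11-15
        System.out.println(parse("12/15/12", 2000)); // 2012-12-15
    }
}
```

With a start of 2000 every two-digit year lands in 2000 to 2099, which matches the "always 20yy" requirement in the question.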
I am trying to convert a string date from a CSV file to a date type and then to a specific output format. For one row, I noticed that the format change also changes the year.
scala> df1.filter($"pt" === 2720).select("`date`").show()
+----------+
| date|
+----------+
|24/08/2019|
|30/12/2019|
+----------+
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), YYYY-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2020-12-30|
+------------------------------------------------------+
As shown above, the two rows contain 24/08/2019 and 30/12/2019, but after the conversion and format change the results are 2019-08-24 (correct) and 2020-12-30 (incorrect and unexpected).
Why does this problem occur and how can this be best avoided?
I solved this issue by changing the capital YYYY to yyyy in the expected date format parameter.
So, instead of
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
I am now doing
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
This is because, per Java's SimpleDateFormat documentation, capital Y is the week year, whereas lowercase y is the calendar year. December 30, 2019 already belongs to the first week of week year 2020, which is why it was rendered as 2020-12-30.
Now that I use lowercase y for the year field, I get the expected result:
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), yyyy-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2019-12-30|
+------------------------------------------------------+
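The difference between the two letters can be reproduced with SimpleDateFormat alone. This is a small Java sketch under stated assumptions: the class WeekYearDemo and the helper reformat are hypothetical names, and Locale.US and UTC are fixed so that the week rules and the result do not depend on the machine.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class WeekYearDemo {
    // Hypothetical helper: reads a dd/MM/yyyy string and writes it back
    // using outPattern.
    static String reformat(String text, String outPattern) {
        SimpleDateFormat in = new SimpleDateFormat("dd/MM/yyyy", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat(outPattern, Locale.US);
        // Pin the zone so the result does not depend on local settings.
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return out.format(in.parse(text));
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("30/12/2019", "YYYY-MM-dd")); // 2020-12-30 (week year)
        System.out.println(reformat("30/12/2019", "yyyy-MM-dd")); // 2019-12-30 (calendar year)
        System.out.println(reformat("24/08/2019", "YYYY-MM-dd")); // 2019-08-24
    }
}
```

Only dates close to a year boundary differ, which is why the problem shows up on a single row.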
date_format renders the timestamp 2020-01-27 00:00:00 as 2020-01-27 12:00:00 instead of 2020-01-27 00:00:00:
import spark.sqlContext.implicits._
import java.sql.Timestamp
import org.apache.spark.sql.functions.{date_format, date_trunc, typedLit}
scala> val stamp = typedLit(new Timestamp(1580105949000L))
stamp: org.apache.spark.sql.Column = TIMESTAMP('2020-01-27 00:19:09.0')
scala> var df_test = Seq(5).toDF("seq").select(
| stamp.as("unixtime"),
| date_trunc("HOUR", stamp).as("date_trunc"),
| date_format(date_trunc("HOUR", stamp), "yyyy-MM-dd hh:mm:ss").as("hour")
| )
df_test: org.apache.spark.sql.DataFrame = [unixtime: timestamp, date_trunc: timestamp ... 1 more field]
scala> df_test.show
+-------------------+-------------------+-------------------+
| unixtime| date_trunc| hour|
+-------------------+-------------------+-------------------+
|2020-01-27 00:19:09|2020-01-27 00:00:00|2020-01-27 12:00:00|
+-------------------+-------------------+-------------------+
Your pattern should be yyyy-MM-dd HH:mm:ss.
date_format, according to its documentation, uses specifiers supported by java.text.SimpleDateFormat:
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
See SimpleDateFormat for valid date and time format patterns.
The specifiers are listed in the java.text.SimpleDateFormat Javadoc.
hh is used for "Hour in am/pm (1-12)". You're looking for the hour in day specifier, which is HH.
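To see the two specifiers side by side, here is a minimal Java sketch. The class HourPatternDemo and the helper format are illustrative names, not part of Spark, and UTC is fixed only to make the output reproducible.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class HourPatternDemo {
    // Hypothetical helper: parses a 24-hour timestamp and renders it
    // with the given output pattern.
    static String format(String timestamp, String outPattern) {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat(outPattern, Locale.US);
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return out.format(in.parse(timestamp));
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // hh is the 12-hour clock (1-12), so midnight is printed as 12.
        System.out.println(format("2020-01-27 00:00:00", "yyyy-MM-dd hh:mm:ss")); // 2020-01-27 12:00:00
        // HH is the 24-hour clock (0-23), so midnight is printed as 00.
        System.out.println(format("2020-01-27 00:00:00", "yyyy-MM-dd HH:mm:ss")); // 2020-01-27 00:00:00
    }
}
```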
I have some columns with dates from source files that look like 4/23/19.
The 4 is the month, the 23 is the day, and the 19 stands for 2019.
How do I convert this to a timestamp in pyspark?
So far
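from pyspark.sql.functions import coalesce, to_timestamp

def ParseDateFromFormats(col, formats):
    # Try each format in turn and keep the first one that parses
    return coalesce(*[to_timestamp(col, f) for f in formats])
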
df2 = df2.withColumn("_" + field.columnName, ParseDateFromFormats(df2[field.columnName], ["dd/MM/yyyy hh:mm", "dd/MM/yyyy", "dd-MMM-yy"]).cast(field.simpleTypeName))
There doesn't seem to be a date format that would work
Your code probably fails because the days and months are reversed in your formats: the data is month/day/year, but every pattern starts with dd.
This works:
from pyspark.sql.functions import to_date
time_df = spark.createDataFrame([('4/23/19',)], ['dt'])
time_df.withColumn('proper_date', to_date('dt', 'MM/dd/yy')).show()
+-------+-----------+
| dt|proper_date|
+-------+-----------+
|4/23/19| 2019-04-23|
+-------+-----------+
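The effect of swapping the day and month fields can be checked with a strict SimpleDateFormat. This is only a Java sketch: StrictParseDemo and parses are made-up names, and Spark's own parser may react differently to a mismatch (for example by returning null instead of failing).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

final class StrictParseDemo {
    // Hypothetical helper: returns true when text matches pattern under
    // strict (non-lenient) parsing, false otherwise.
    static boolean parses(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        // Without this, month 23 would silently roll over into a later year.
        format.setLenient(false);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("4/23/19", "MM/dd/yy")); // true: month 4, day 23
        System.out.println(parses("4/23/19", "dd/MM/yy")); // false: there is no month 23
    }
}
```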
I wrote the code below to get the Monday of the week for a given date. It is a UDF that takes a date string and returns that week's Monday:
def calculate_weekstartUDF = udf((pro_rtc: String) => {
  val date = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
  val cal = Calendar.getInstance()
  cal.setTime(date)
  // Move to the Monday of the date's week (week boundaries follow the default locale)
  cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
  // Return the Monday as java.sql.Date so the UDF yields a value Spark can store
  new java.sql.Date(cal.getTimeInMillis)
})
I use the UDF like this:
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
Is there a better way to achieve this?
Try this approach, which uses Spark's built-in date_sub and next_day functions.
Explanation:
date_sub(
  next_day('dt, "monday"), // first Monday strictly after dt
  7)                       // subtract one week from that date
Example:
val df =Seq(("2019-08-06")).toDF("dt")
import org.apache.spark.sql.functions._
df.withColumn("week_strt_day",date_sub(next_day('dt,"monday"),7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
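The same arithmetic can be checked outside Spark with java.time. This is a Java sketch with hypothetical names (WeekStart, mondayOfWeek); it assumes that next_day returns the first matching day strictly after the input, which is what TemporalAdjusters.next does here.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

final class WeekStart {
    // Hypothetical helper mirroring date_sub(next_day(dt, "monday"), 7):
    // jump to the first Monday strictly after the date, then go back a week.
    static LocalDate mondayOfWeek(LocalDate date) {
        return date.with(TemporalAdjusters.next(DayOfWeek.MONDAY)).minusDays(7);
    }

    public static void main(String[] args) {
        System.out.println(mondayOfWeek(LocalDate.parse("2019-08-06"))); // 2019-08-05 (from a Tuesday)
        System.out.println(mondayOfWeek(LocalDate.parse("2019-08-05"))); // 2019-08-05 (a Monday stays put)
        System.out.println(mondayOfWeek(LocalDate.parse("2019-08-11"))); // 2019-08-05 (from a Sunday)
    }
}
```

Because the jump is strictly forward, a Monday maps to itself and a Sunday maps to the Monday six days earlier, so weeks run Monday to Sunday.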
You could use the Java 8 date API (java.time):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.WeekFields
import java.util.Locale
def calculate_weekstartUDF =
  (pro_rtc: String) => {
    val localDate = LocalDate.parse(pro_rtc) // Parses ISO 8601 (yyyy-MM-dd) by default.
    val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
    localDate.`with`(dayOfWeekField, 1) // Day 1 is the locale's first day of the week.
  }
Note that day 1 depends on the locale: for Locale.US it is Sunday, for example. Pass a locale whose week starts on Monday, or use WeekFields.ISO, if you always want Monday.
tl;dr
LocalDate
    .parse( "2019-01-23" )
    .with(
        TemporalAdjusters.previous( DayOfWeek.MONDAY )
    )
    .toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, predefining seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
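The difference between the two adjusters only shows when the input is already a Monday. A short Java sketch (the class name MondayAdjusters and its two helpers are illustrative):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

final class MondayAdjusters {
    // Always moves to an earlier date, even when the input is a Monday.
    static LocalDate previousMonday(LocalDate date) {
        return date.with(TemporalAdjusters.previous(DayOfWeek.MONDAY));
    }

    // Keeps the input when it is already a Monday.
    static LocalDate previousOrSameMonday(LocalDate date) {
        return date.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
    }

    public static void main(String[] args) {
        LocalDate monday = LocalDate.parse("2019-01-21");
        System.out.println(previousMonday(monday));       // 2019-01-14
        System.out.println(previousOrSameMonday(monday)); // 2019-01-21
    }
}
```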
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
  .withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
  .withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way, you also use Spark's query engine and avoid the overhead of a UDF.
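The expression relies on Spark's dayofweek numbering, where 1 is Sunday and 7 is Saturday. The Java sketch below mimics that arithmetic with java.time to show which Monday each weekday maps to; SparkStyleMonday and its helpers are hypothetical names, and the numbering is re-created by hand rather than taken from Spark.

```java
import java.time.LocalDate;

final class SparkStyleMonday {
    // Re-creates Spark's dayofweek numbering (1 = Sunday ... 7 = Saturday)
    // from java.time's numbering (1 = Monday ... 7 = Sunday).
    static int sparkDayOfWeek(LocalDate date) {
        return date.getDayOfWeek().getValue() % 7 + 1;
    }

    // Mirrors date_sub(Date, dayofweek(Date) - 2).
    static LocalDate monday(LocalDate date) {
        return date.minusDays(sparkDayOfWeek(date) - 2);
    }

    public static void main(String[] args) {
        System.out.println(monday(LocalDate.parse("2019-08-06"))); // 2019-08-05 (Tuesday goes back one day)
        System.out.println(monday(LocalDate.parse("2019-08-10"))); // 2019-08-05 (Saturday goes back five days)
        System.out.println(monday(LocalDate.parse("2019-08-04"))); // 2019-08-05 (Sunday moves forward one day)
    }
}
```

Note the Sunday case: the offset becomes -1, so a Sunday is assigned to the following Monday rather than the preceding one.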
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.
val df = Seq("2019-07-30", "2019-08-01").toDF
val dd = df.withColumn("value", to_date('value))
dd.show(false)
According to the docs (https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html), F is the format letter for the day of week in month. And
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "F")).show(false)
+----------+--------+----------+
|value |dow |dow_number|
+----------+--------+----------+
|2019-07-30|Tuesday |5 |
|2019-08-01|Thursday|1 |
+----------+--------+----------+
gives only the day of the week in the month, not the day of the week.
Which format string gives me the day of the week as a number /integer?
Obviously, I could use: http://www.java2s.com/Tutorials/Java/Data_Type_How_to/Date/Get_day_of_week_int_value_and_String_value.htm
But I do not want to use a UDF; I want to use the Catalyst-optimized date_format. So which date format string gives me the desired result?
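As mentioned in the comments, you are looking for the "u" format.
Also, from Spark 2.3.0 on, you might prefer the dayofweek function, which is faster. Note that dayofweek numbers the days from 1 (Sunday) to 7 (Saturday), whereas "u" numbers them from 1 (Monday) to 7 (Sunday).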
Your code is correct; just use "u" instead of "F", like this:
dd.withColumn("dow", date_format('value, "EEEE")).withColumn("dow_number", date_format('value, "u")).show(false)
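For reference, the three pattern letters can be compared directly with SimpleDateFormat. This Java sketch uses illustrative names (DayNumberDemo, format) and fixes Locale.US and UTC so the output is reproducible; it shows the underlying Java behavior, not Spark itself.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class DayNumberDemo {
    // Hypothetical helper: parses a yyyy-MM-dd string and renders it
    // with the given pattern.
    static String format(String isoDate, String outPattern) {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat(outPattern, Locale.US);
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return out.format(in.parse(isoDate));
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format("2019-07-30", "EEEE")); // Tuesday
        System.out.println(format("2019-07-30", "F"));    // 5 (fifth Tuesday of the month)
        System.out.println(format("2019-07-30", "u"));    // 2 (1 = Monday ... 7 = Sunday)
        System.out.println(format("2019-08-01", "u"));    // 4 (Thursday)
    }
}
```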