Spark 2.2 extracting date not working from unix_timestamp - pyspark

In Spark 2.2, extracting a date from unix_timestamp is not working.
Input Data:
+-------------------------+
|UPDATE_TS |
+-------------------------+
|26NOV2009:03:27:01.154410|
|24DEC2012:00:47:46.805710|
|02MAY2013:00:45:33.233844|
|21NOV2014:00:33:39.350140|
|10DEC2013:00:30:30.532446|
+-------------------------+
I tried the following approaches, but the output I'm getting is null.
Queries tried:
Spark sql
sqlContext.sql("select from_unixtime(unix_timestamp(UPDATE_TS,'ddMMMyyyy:HH:MM:SS.ssssss'), 'yyyy') as new_date from df_vendor_tab").show()
DSL:
df_id.withColumn('part_date', from_unixtime(unix_timestamp(df_id.UPDATE_TS, "ddMMMyyyy:HH:MM:SS.sss"), "yyyy"))
expected output:
2009
2012
2013
2014
2013

You're using an incorrect format string. Capital M is for month; lowercase m is for minute. Likewise, lowercase s is for second, while capital S is for the fraction of a second.
The following would work:
from pyspark.sql.functions import from_unixtime, unix_timestamp

df_id.withColumn(
    'part_date',
    from_unixtime(unix_timestamp(df_id.UPDATE_TS, "ddMMMyyyy:HH:mm:ss.SSSSSS"), "yyyy")
).show(truncate=False)
#+-------------------------+---------+
#|UPDATE_TS |part_date|
#+-------------------------+---------+
#|26NOV2009:03:27:01.154410|2009 |
#|24DEC2012:00:47:46.805710|2012 |
#|02MAY2013:00:45:33.233844|2013 |
#|21NOV2014:00:33:39.350140|2014 |
#|10DEC2013:00:30:30.532446|2013 |
#+-------------------------+---------+
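The month-vs-minute distinction isn't Spark-specific; Python's own strptime draws the same lines, so you can sanity-check the parse outside Spark. A minimal plain-Python sketch (not PySpark):

```python
from datetime import datetime

# %b = abbreviated month name, %M = minute, %S = second, %f = microsecond —
# the same month/minute/second distinctions as Spark's MMM/mm/ss pattern letters
ts = datetime.strptime("26NOV2009:03:27:01.154410", "%d%b%Y:%H:%M:%S.%f")
print(ts.year)  # 2009
```

The year extracted matches the expected output for the first row.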

Simple spark-sql is working fine with unix_timestamp and from_unixtime
sqlContext.sql("select from_unixtime(unix_timestamp('26NOV2009:03:27:01.154410', 'ddMMMyyyy'), 'yyyy')").show()
Output:
+----+
| _c0|
+----+
|2009|
+----+
Since you are only looking to extract the year, I have not considered the hour, minute, and second fields.

Related

How to convert short date D-MMM-yyyy using PySpark

Why does only the January date convert correctly with the code below?
from pyspark.sql.functions import to_date, col

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02
D is day-of-year.
The first one works because day 2 of the year does fall in January, but day 5 of the year does not fall in November.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
It will work for both.
However, you need d, which is the day of the month, so use d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
from pyspark.sql.functions import to_date, col

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
| date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+
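The day-of-year vs day-of-month distinction can also be sanity-checked outside Spark with Python's strptime, where %j plays the role of Spark's D and %d the role of d (plain Python, not PySpark):

```python
from datetime import datetime

# %j = day of the year (Spark's "D"): day 2 of 2021 is 2 January
doy = datetime.strptime("002-2021", "%j-%Y")
print(doy.date())  # 2021-01-02

# %d = day of the month (Spark's "d"): this is what dd-MMM-yyyy relies on
dom = datetime.strptime("05-Nov-2000", "%d-%b-%Y")
print(dom.date())  # 2000-11-05
```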

Convert String to Timestamp in Spark (Hive) and the datetime is invalid

I'm trying to convert a String to a timestamp, but in my time zone, the last Sunday of March between 2:00 AM and 3:00 AM does not exist, so the conversion returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| 2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour to times between 2:00 AM and 3:00 AM on those days, but I don't know how to implement this.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp,
to_timestamp and to_date will fail if the specified datetime pattern
is invalid. In Spark 3.0 or earlier, they result in NULL. (1)
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500| null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we convert the date string ourselves, we can obtain this:
df.withColumn("ts", concat(
    substring('ts, 0, 4), lit("-"),
    substring('ts, 5, 2), lit("-"),
    substring('ts, 7, 5), lit(":"),
    substring('ts, 12, 2), lit(":"),
    substring('ts, 14, 2)))
  .withColumn("time", to_utc_timestamp('ts, "UTC"))
  .show
+-------------------+-------------------+
| ts| time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
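For intuition outside Spark: the "missing" hour is the DST spring-forward gap, and Python's zoneinfo normalizes a nonexistent wall-clock time the same way. A small sketch, assuming the asker's zone is Europe/Madrid (an assumption; any zone that springs forward on 2022-03-27 behaves alike):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Clocks jump from 02:00 to 03:00 on 2022-03-27, so 02:15 never exists
madrid = ZoneInfo("Europe/Madrid")
nonexistent = datetime(2022, 3, 27, 2, 15, tzinfo=madrid)

# Round-tripping through UTC lands on the valid side of the gap: 03:15
normalized = nonexistent.astimezone(ZoneInfo("UTC")).astimezone(madrid)
print(normalized)  # 2022-03-27 03:15:00+02:00
```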
I found two different solutions; in both you have to change the string format to yyyy-MM-dd HH:mm:ss first:
select cast(concat(substr('20220327021000', 0, 4), '-',
                   substr('20220327021000', 5, 2), '-',
                   substr('20220327021000', 7, 2), ' ',
                   substr('20220327021000', 9, 2), ':',
                   substr('20220327021000', 11, 2), ':',
                   substr('20220327021000', 13, 2)) as timestamp);
select cast(
    regexp_replace('20220327031005',
      '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$', '$1-$2-$3 $4:$5:$6'
    ) as timestamp);
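The regexp_replace rewrite can be checked in plain Python with the same regex: group the 14 digits and rebuild them as yyyy-MM-dd HH:mm:ss before casting.

```python
import re

# Six capture groups: year, month, day, hour, minute, second
formatted = re.sub(r"^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$",
                   r"\1-\2-\3 \4:\5:\6", "20220327031005")
print(formatted)  # 2022-03-27 03:10:05
```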

How to calculate standard deviation over a range of dates when there are dates missing in pyspark 2.2.0

I have a pyspark df in which I am using a combination of window + udf functions to calculate standard deviation over historical business dates. The challenge is that my df is missing dates on which there was no transaction. How do I calculate the std dev as if those missing dates were present, without adding them as additional rows to my df (which would risk the df growing out of memory)?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

windowSpec = Window.partitionBy("ID").orderBy("date")

# UDF to calculate the difference between two dates in business days
workdaysUDF = F.udf(
    lambda date1, date2: int(np.busday_count(date2, date1))
    if (date1 is not None and date2 is not None) else None,
    IntegerType())

# business-day offset of each row from the first date in its partition
df = df.withColumn("date_dif", workdaysUDF(F.col("Date"), F.first(F.col("Date")).over(windowSpec)))

windowval = lambda days: Window.partitionBy("ID").orderBy("date_dif").rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("amount").over(windowval(6))).drop("date_dif")
Desired output, where the dates missing between 24 and 29 March are treated as having an amount of 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am only showing std dev for a single date for illustration, there would be value for each row as I am using a rolling windows function.
Any help would be greatly appreciated.
PS: The PySpark version at my enterprise is 2.2.0, so I do not have the flexibility to change it.
Thanks,
VSG
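As a sanity check on the numbers in the question (plain Python, not an in-Spark solution): treating the three missing days in the 24–29 March range as zero-amount entries alongside the three real amounts reproduces the desired 4915.96, while the three real amounts alone give the current 2886.751.

```python
from statistics import stdev

observed = [10000, 5000, 10000]      # amounts present on 24, 26 and 29 March
zero_filled = observed + [0, 0, 0]   # missing days in the range, taken as 0

print(round(stdev(observed), 3))     # 2886.751 — the current output
print(round(stdev(zero_filled), 2))  # 4915.96  — the desired output
```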

Convert date time to unix_timestamp Scala

I need to get the minimum value from the Spark data frame and transform it.
Currently, I just get this value and transform it using DateTime; however, I need the result in unix_timestamp format. How can I convert the DateTime to a unix_timestamp using either Scala functions or Spark functions?
Here is my current code which for now returns DateTime:
val minHour = new DateTime(df.agg(min($"event_ts"))
    .as[Timestamp].collect().head)
  .minusDays(5)
  .withTimeAtStartOfDay()
I tried using Spark functions as well, but I was not able to shift the timestamp to the start of the day (which can be achieved with DateTime's withTimeAtStartOfDay function):
val minHour = new DateTime(df.agg(min($"event_ts").alias("min_ts"))
.select(unix_timestamp(date_sub($"min_ts", 5)))
.as[Long].collect().head)
date_sub will cast your timestamp to a date, so the time will be automatically shifted to the start of day.
df.show
+-------------------+----------+
| event_ts|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
df.agg(
  min($"event_ts").alias("min_ts")
).select(
  unix_timestamp(date_sub($"min_ts", 5)).alias("min_ts_unix")
).withColumn(
  "min_ts", $"min_ts_unix".cast("timestamp")
).show
+-----------+-------------------+
|min_ts_unix| min_ts|
+-----------+-------------------+
| 1493164800|2017-04-26 00:00:00|
+-----------+-------------------+
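The same "subtract 5 days, snap to start of day, take the epoch" pipeline can be verified in plain Python, using the table's minimum event_ts and assuming a UTC session time zone:

```python
from datetime import datetime, timedelta, timezone

# Minimum event_ts from the table above, assumed to be in UTC
min_ts = datetime(2017, 5, 1, 0, 22, 1, tzinfo=timezone.utc)

shifted = min_ts - timedelta(days=5)  # analogous to date_sub($"min_ts", 5)
start_of_day = shifted.replace(hour=0, minute=0, second=0, microsecond=0)
print(int(start_of_day.timestamp()))  # 1493164800, matching min_ts_unix
```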

substract current date with another date in dataframe scala

First of all, thank you for taking the time to read my question :)
My question is the following: in Spark with Scala, I have a dataframe containing a string with a date in the format dd/MM/yyyy HH:mm, for example df:
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example:
df.withColumn("difference", currentDate - unix_timestamp(col(date)))
+----------------+------------+
|date | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col(date)))
but got this error:
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currentTimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col(date)))
and
val current = unix_timestamp(current_timestamp())
but the column "difference" is null.
Thanks
You have to use the correct format string for unix_timestamp (MM is month; mm would be minute):
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or, with a recent version:
to_timestamp(col("date"), "dd/MM/yyyy HH:mm") - current_timestamp()
to get an Interval column.
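The arithmetic the answer describes — parse with a day-first pattern, take epoch seconds, subtract from now — can be sketched in plain Python (times assumed UTC purely for illustration):

```python
from datetime import datetime, timezone

# %d/%m/%Y %H:%M mirrors a day-first pattern like Spark's dd/MM/yyyy HH:mm
parsed = datetime.strptime("8/11/2017 15:00", "%d/%m/%Y %H:%M").replace(tzinfo=timezone.utc)

# Seconds elapsed since 8 Nov 2017 15:00 UTC; positive, since the date is in the past
difference = int(datetime.now(timezone.utc).timestamp() - parsed.timestamp())
```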