Why does only the Jan date convert correctly when I try the code below?
from pyspark.sql.functions import col, to_date

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02
D is the day of the year.
The Jan date parses because day-of-year 2 does fall in January, but day-of-year 5 does not fall in November, so the Nov date fails.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
it will work for both.
However, what you need is d, the day of the month, so use the pattern d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
from pyspark.sql.functions import col, to_date

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
| date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+
I'm trying to convert a String to a timestamp, but in my time zone the hour between 2:00 AM and 3:00 AM on the last Sunday of March does not exist (the daylight saving switch), so the conversion returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| 2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour to times between 2:00 AM and 3:00 AM on those days, but I don't know how to implement this.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp,
to_timestamp and to_date will fail if the specified datetime pattern
is invalid. In Spark 3.0 or earlier, they result in NULL. (1)
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically shifts the non-existent time forward by one hour in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500| null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp handles the gap the same way to_timestamp does in later versions. The only problem is that it does not accept a custom date format. Still, if we reformat the string ourselves, we obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
| ts| time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
I found two different solutions; in both of them you have to change the String format to yyyy-MM-dd HH:mm:ss first:
select cast(CONCAT(SUBSTR('20220327021000',0,4),'-',
            SUBSTR('20220327021000',5,2),'-',
            SUBSTR('20220327021000',7,2),' ',
            SUBSTR('20220327021000',9,2),':',
            SUBSTR('20220327021000',11,2),':',
            SUBSTR('20220327021000',13,2)) as timestamp);
select
cast(
regexp_replace("20220327031005",
'^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$','$1-$2-$3 $4:$5:$6'
) as timestamp);
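For reference, the same regexp_replace rewrite can also be expressed with the DataFrame API. This is only a minimal sketch, assuming the ts column from the dataframe above (the original yyyyMMdd HHmmss strings with a space) and that spark.implicits._ is in scope:

import org.apache.spark.sql.functions.regexp_replace

// rewrite "20220327 021500" into "2022-03-27 02:15:00" and let the cast to
// timestamp resolve the non-existent local time, as in the SQL solutions above
df.withColumn("time",
    regexp_replace($"ts",
      "^(\\d{4})(\\d{2})(\\d{2}) (\\d{2})(\\d{2})(\\d{2})$",
      "$1-$2-$3 $4:$5:$6"
    ).cast("timestamp"))
  .show(false)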
I have a PySpark df in which I am using a combination of window + UDF functions to calculate the standard deviation over historical business dates. The challenge is that my df is missing dates on which there is no transaction. How do I calculate the std dev including these missing dates, without adding them as additional rows to my df, so that the df does not grow and run out of memory?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("ID").orderBy("Date")

# UDF to calculate the difference between two dates in business days
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType())

# business-day offset of each row from the first transaction of its ID
df = df.withColumn("date_dif", workdaysUDF(F.col("Date"), F.first(F.col("Date")).over(windowSpec)))

# rolling window over the last `days` business-day offsets, current row included
windowval = lambda days: Window.partitionBy("ID").orderBy("date_dif").rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("amount").over(windowval(6))).drop("date_dif")
Desired output, where the dates missing between 24 and 29 March are treated as values of 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am only showing the std dev for a single date for illustration; there would be a value for each row since I am using a rolling window function.
Any help would be greatly appreciated.
PS: The PySpark version is 2.2.0 at my enterprise, so I do not have the flexibility to change the version.
Thanks,
VSG
I need to get the minimum value from the Spark data frame and transform it.
Currently, I just get this value and transform it using DateTime; however, I need the result in unix_timestamp format. So how can I convert the DateTime to a unix_timestamp, using either Scala functions or Spark functions?
Here is my current code, which for now returns a DateTime:
val minHour = new DateTime(
    df.agg(min($"event_ts")).as[Timestamp].collect().head)
  .minusDays(5)
  .withTimeAtStartOfDay()
I tried using Spark functions as well, but I was not able to shift the timestamp to the start of the day (which can be achieved with DateTime's withTimeAtStartOfDay function):
val minHour = new DateTime(df.agg(min($"event_ts").alias("min_ts"))
.select(unix_timestamp(date_sub($"min_ts", 5)))
.as[Long].collect().head)
date_sub will cast your timestamp to a date, so the time will be automatically shifted to the start of day.
df.show
+-------------------+----------+
| event_ts|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
df.agg(
min($"event_ts").alias("min_ts")
).select(
unix_timestamp(date_sub($"min_ts", 5)).alias("min_ts_unix")
).withColumn(
"min_ts", $"min_ts_unix".cast("timestamp")
).show
+-----------+-------------------+
|min_ts_unix| min_ts|
+-----------+-------------------+
| 1493164800|2017-04-26 00:00:00|
+-----------+-------------------+
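If you then need the value back on the driver as a plain Scala Long (which is what the original code does with DateTime), here is a minimal sketch, assuming the same df as above and that spark.implicits._ is in scope:

import org.apache.spark.sql.functions.{date_sub, min, unix_timestamp}

// aggregate, go back 5 days (date_sub also truncates to the start of the day),
// convert to unix seconds and collect the single value
val minHour: Long = df
  .agg(min($"event_ts").alias("min_ts"))
  .select(unix_timestamp(date_sub($"min_ts", 5)).alias("min_ts_unix"))
  .as[Long]
  .collect()
  .head
// 1493164800 for the sample data above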
First of all, thank you for taking the time to read my question :)
My question is the following: in Spark with Scala, I have a dataframe that contains a string column with a date in the format dd/MM/yyyy HH:mm, for example df:
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example:
df.withColumn("difference", currentDate - unix_timestamp(col("date")))
+----------------+------------+
|date | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col("date")))
but I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currentTimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col("date")))
and
val current = unix_timestamp(current_timestamp())
but the "difference" column is null.
Thanks
You have to use the correct format for unix_timestamp (MM is the month; lowercase mm means minutes):
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or, with a recent version:
to_timestamp(col("date"), "dd/MM/yyyy HH:mm") - current_timestamp()
to get an interval column.
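Putting it together on the example dataframe, a minimal sketch (the actual difference values depend on when you run it; on Spark 3's stricter parser, single-digit days may need the pattern d/M/yyyy H:mm instead):

import org.apache.spark.sql.functions.{col, current_timestamp, unix_timestamp}

// difference in seconds between now and the parsed date column
df.withColumn("difference",
    current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
  .show(false)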