Spark Scala - calculating dynamic timestamp interval

I have a dataframe with a timestamp column (timestamp type) called "maxTmstmp" and another column with hours, represented as integers, called "WindowHours". I would like to dynamically subtract the hours column from the timestamp column to get the lower timestamp.
My data and desired effect ("minTmstmp" column):
+-----------+-------------------+-------------------+
|WindowHours| maxTmstmp| minTmstmp|
| | |(maxTmstmp - Hours)|
+-----------+-------------------+-------------------+
| 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
| 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
| 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
| 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
+-----------+-------------------+-------------------+
root
|-- WindowHours: integer (nullable = true)
|-- maxTmstmp: timestamp (nullable = true)
I have already found a solution with an hours interval expression, but it isn't dynamic. The code below doesn't work as intended:
standards
  .withColumn("minTmstmp", $"maxTmstmp" - expr("INTERVAL 10 HOURS"))
  .show()
I am working with Spark 2.4 and Scala.

One simple way would be to convert maxTmstmp to unix time, subtract the value of WindowHours (in seconds) from it, and convert the result back with from_unixtime, as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, Timestamp.valueOf("2016-01-01 23:00:00")),
  (2, Timestamp.valueOf("2016-03-01 12:00:00")),
  (8, Timestamp.valueOf("2016-03-05 20:00:00")),
  (24, Timestamp.valueOf("2016-04-12 11:00:00"))
).toDF("WindowHours", "maxTmstmp")

// subtract WindowHours (converted to seconds) from the epoch seconds of maxTmstmp
df.withColumn("minTmstmp",
  from_unixtime(unix_timestamp($"maxTmstmp") - ($"WindowHours" * 3600))
).show
// +-----------+-------------------+-------------------+
// |WindowHours| maxTmstmp| minTmstmp|
// +-----------+-------------------+-------------------+
// | 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
// | 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
// | 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
// | 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
// +-----------+-------------------+-------------------+
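Note that from_unixtime returns a formatted string column. If you'd rather keep minTmstmp as an actual timestamp type, one alternative sketch (using the same df as above) is to cast to epoch seconds and back:
// cast the timestamp to epoch seconds, subtract WindowHours in seconds,
// and cast back so minTmstmp stays a timestamp column
df.withColumn(
  "minTmstmp",
  ($"maxTmstmp".cast("long") - $"WindowHours" * 3600).cast("timestamp")
).show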

Related

newly created column shows null values in pyspark dataframe

I want to add a column calculating the time difference between two timestamp values. In order to do that I first add a column with the current datetime, which is defined as current_datetime here:
import datetime
#define current datetime
now = datetime.datetime.now()
#Getting Current date and time
current_datetime=now.strftime("%Y-%m-%d %H:%M:%S")
print(now)
Then I want to add current_datetime as a column value to the df and calculate the diff:
import pyspark.sql.functions as F
productsDF = productsDF\
    .withColumn('current_time', when(col('Quantity')>1, current_datetime))\
    .withColumn('time_diff',\
        (F.unix_timestamp(F.to_timestamp(F.col('current_time')))) -
        (F.unix_timestamp(F.to_timestamp(F.col('Created_datetime'))))/F.lit(3600)
    )
The output however is only null values.
productsDF.select('current_time','Created_datetime','time_diff').show()
+------------+-------------------+---------+
|current_time| Created_datetime|time_diff|
+------------+-------------------+---------+
| null|2019-10-12 17:09:18| null|
| null|2019-12-03 07:02:07| null|
| null|2020-01-16 23:10:08| null|
| null|2020-01-21 15:38:39| null|
| null|2020-01-21 15:14:55| null|
the new columns are created with type string and double:
|-- current_time: string (nullable = true)
|-- diff: double (nullable = true)
|-- time_diff: double (nullable = true)
I tried creating the column with string and literal values just to test, but the output is always null. What am I missing?
To fill a column with current_datetime, you are missing the lit() function:
from pyspark.sql.functions import lit

current_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
productsDF = productsDF.withColumn("current_time", lit(current_datetime))
For calculating the time difference between the two timestamp columns, you can do:
productsDF.withColumn('time_diff',(F.unix_timestamp('current_time') -
F.unix_timestamp('Created_datetime'))/3600).show()
EDIT:
For time difference in hours, days, months, and years, you can do:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn("time_diff_days", datediff(col("current_time"),col("Created_datetime")))\
.withColumn("time_diff_months", months_between(col("current_time"),col("Created_datetime")))\
.withColumn("time_diff_years", year(col("current_time")) - year(col("Created_datetime"))).show()
+-------------------+-------------------+------------------+--------------+----------------+---------------+
| Created_datetime| current_time| time_diff_hours|time_diff_days|time_diff_months|time_diff_years|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 369| 12.07743093| 1|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335| 317| 10.38135529| 1|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 273| 8.94031549| 0|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
If you want EXACT time differences, then:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn('time_diff_days',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24))\
.withColumn('time_diff_years',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24*365)).show()
+-------------------+-------------------+------------------+------------------+------------------+
| Created_datetime| current_time| time_diff_hours| time_diff_days| time_diff_years|
+-------------------+-------------------+------------------+------------------+------------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 368.4003587962963|1.0093160514967021|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|316.78034722222225|0.8678913622526636|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 272.1081134259259|0.7455016806189751|
+-------------------+-------------------+------------------+------------------+------------------+
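Since the rest of this page uses the Scala API, the same hour difference would look roughly like this in Scala (a sketch; the column names current_time and Created_datetime are taken from the question above):
import org.apache.spark.sql.functions._
import spark.implicits._

df.withColumn(
  "time_diff_hours",
  (unix_timestamp($"current_time") - unix_timestamp($"Created_datetime")) / 3600
).show()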

How to filter Date Columns and store them as numbers in Data Frames using Scala

I have a dataframe (dateds1) which looks like below,
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|
I need help in converting it; my output should look like below:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 19950916 | 20080909 |20090209 |20170909 |
| 19940920 | 20080910 |19990505 |20160930 |
| 19930924 | 20160629 |20031207 |20280213 |
| 19920928 | 20070624 |20040605 |20190924 |
| 19911003 | 20110707 |20110707 |20200330 |
| 19901007 | 20090209 |20090209 |20110313 |
| 19891011 | 19990505 |19990505 |20210313 |
I tried using filter, but I was only able to handle one case at a time, when the dates are in either YYYY/MM/DD or YYYY-MM-DD 00:00:00 format and the number of columns is fixed. Can someone please help me figure it out for both formats and when the number of columns is dynamic (they might increase or decrease)?
They should be converted from the Date datatype to Integer or Long in the format YYYYMMDD.
Note: The records in this Dataframe are either in YYYY/MM/DD or YYYY-MM-DD 00:00:00 format.
Any help is appreciated. Thanks
To do that conversion dynamically you'll have to iterate through all columns and perform different operations depending on the column type.
Here's an example:
import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val originalDf = Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30")),
  (Timestamp.valueOf("2016-07-30 00:00:00"), Date.valueOf("2016-10-30"))
).toDF("ts_value", "date_value")
Original table details:
> originalDf.show
+-------------------+----------+
| ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+
> originalDf.printSchema
root
|-- ts_value: timestamp (nullable = true)
|-- date_value: date (nullable = true)
Example of conversion operation:
val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
  val data_type = df.schema(name).dataType
  if (data_type == DateType)
    // date columns: format as yyyyMMdd and cast to integer
    df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
  else if (data_type == TimestampType)
    // timestamp columns: assemble the yyyyMMdd number from year, month and day
    df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
  else
    df
})
New table details:
newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930| 20160930|
|20160730| 20161030|
+--------+----------+
newDf.printSchema
root
|-- ts_value: integer (nullable = true)
|-- date_value: integer (nullable = true)
If you don't want to perform this operation on all columns, you can manually specify the columns by changing
val newDf = originalDf.columns.foldLeft ...
to
val newDf = Seq("col1_name","col2_name", ... ).foldLeft ...
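For instance, with the two example columns above, that variant could look like this (a sketch; substitute your own column names in the Seq):
val newDf = Seq("ts_value", "date_value").foldLeft(originalDf) { (df, name) =>
  df.schema(name).dataType match {
    case DateType      => df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
    case TimestampType => df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
    case _             => df  // leave any other column type untouched
  }
}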
Hope this helps!

Spark fails to convert String to TIMESTAMP

I have a Hive table that contains a String column; this is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
When I try to import the table into a Spark Scala DF, transforming the String into a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (a column with the String values), so Spark is importing the column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMPs column.
What is the problem?
Trim your extra 0s. Then,
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
the result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
I think you should use the following format, yyyy-MM-dd HH:mm:ss.SSSSSSSSS, for this type of data: 2019-05-07 00:03:53.837000000.
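A minimal sketch of that suggestion (the column name dt is assumed from the table above, and how the fractional-second pattern is interpreted depends on the Spark version's datetime parser):
import org.apache.spark.sql.functions.to_timestamp

df.withColumn("dt_ts", to_timestamp($"dt", "yyyy-MM-dd HH:mm:ss.SSSSSSSSS")).show(false)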

Convert from timestamp to specific date in pyspark

I would like to convert the timestamp in a specific column into a specific date format.
Here is my input :
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to set the minutes and seconds to 0 even if they aren't 0.
For example, if I have this :
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is:
from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I have NULL.
Update
Inspired by #Tony Pellerin's answer, I realize you can go directly to the :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:MM:SS"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:07:00|
#+----------+-------------------+
Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}:\d{2}" means match a literal : followed by exactly 2 digits, twice (the minutes and the seconds).
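If you'd rather end up with an actual timestamp truncated to the hour instead of a formatted string, Spark 2.3+ also offers date_trunc; a minimal sketch (written here with the Scala API used elsewhere on this page, assuming a DataFrame df with the same integer timestamp column):
import org.apache.spark.sql.functions._
import spark.implicits._

df.withColumn("date", date_trunc("hour", $"timestamp".cast("timestamp"))).show(false)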
Maybe you could use the datetime library to convert timestamps to the format you want. You should also use user-defined functions to work with Spark DF columns. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
Hope this helps.

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp| checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: obtain a new column with the day difference between the checkIn date and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
1. convert the timestamp to a date (this is where I'm stuck)
2. subtract one date from the other
3. use some function to extract the day difference (I have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
val df = Seq(
  (1521710892, "2018-05-19"),
  (1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")

df.withColumn("tsDate", from_unixtime($"timestamp")).
  withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
  show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
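If only the day count is needed, the intermediate tsDate column can be skipped; a sketch based on the same df as above:
df.withColumn("daysDiff", datediff(from_unixtime($"timestamp").cast("date"), $"checkIn")).show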