How can I operate with an interval type variable? - pyspark

I have two timestamp columns ('tpep_pickup_datetime' and 'tpep_dropoff_datetime'), and when I calculate the difference between them, I get an interval column.
from pyspark.sql.functions import col, to_timestamp

yellowcab = yellowcab \
    .withColumn('tpep_pickup_datetime', to_timestamp('tpep_pickup_datetime', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('tpep_dropoff_datetime', to_timestamp('tpep_dropoff_datetime', 'yyyy-MM-dd HH:mm:ss'))
yellowcab = yellowcab \
    .withColumn('total_time', col('tpep_dropoff_datetime') - col('tpep_pickup_datetime'))
The resulting 'total_time' column is an interval.
I want to transform the 'total_time' column into an 'int' with the duration in seconds.
I have tried extracting the hours and minutes from the interval and multiplying them to get seconds, but I have not been able to make it work.

Cast the interval to int:
from pyspark.sql import functions as f

data = [['2020-08-01 00:02:53', '2020-08-01 00:28:54']]
df = spark.createDataFrame(data, ['t1', 't2']) \
    .withColumn('t1', f.to_timestamp('t1', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('t2', f.to_timestamp('t2', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('interval', (f.col('t2') - f.col('t1')).cast('int'))
df.show()
+-------------------+-------------------+--------+
| t1| t2|interval|
+-------------------+-------------------+--------+
|2020-08-01 00:02:53|2020-08-01 00:28:54| 1561|
+-------------------+-------------------+--------+
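Applied back to the original DataFrame, a minimal sketch (assuming the two columns were already parsed with to_timestamp as above) could look like this; the unix_timestamp difference is an alternative that avoids relying on the interval cast, and 'total_time_alt' is just an illustrative column name:
from pyspark.sql.functions import col, unix_timestamp

# trip duration in whole seconds
yellowcab = yellowcab \
    .withColumn('total_time', (col('tpep_dropoff_datetime') - col('tpep_pickup_datetime')).cast('int')) \
    .withColumn('total_time_alt', unix_timestamp('tpep_dropoff_datetime') - unix_timestamp('tpep_pickup_datetime'))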

Related

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without the values changing?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# Parse the timestamp string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# Cast the TimestampType column to a string for display
df.withColumn("timestamp_string", to_timestamp("timestamp").cast("string")) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour changes from 15 to 8 and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time, since you have +00:00 in your data.
Try passing the format to the to_timestamp() function.
Example:
from pyspark.sql.functions import col, to_timestamp

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
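Another option, rather than naming your local zone, is to set the Spark session time zone to UTC so parsed timestamps are interpreted and displayed in UTC; a minimal sketch, assuming you want UTC everywhere:
from pyspark.sql.functions import col, to_timestamp

# parse and display timestamps in UTC for this session
spark.conf.set('spark.sql.session.timeZone', 'UTC')

df.withColumn('timestamp', to_timestamp('input_timestamp')) \
    .withColumn('timestamp_string', col('timestamp').cast('string')) \
    .show(truncate=False)
With the session time zone set to UTC, the parsed value should stay at 15:06:16 instead of shifting to local time.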

PySpark get value of dataframe column with max date

I need to create a new column in a pyspark dataframe using a column value from the row of the max date over a window. Given the dataframe below, I need to set a new column called max_adj_factor on each record for each assetId based on the adjustment factor of the most recent date.
+----------------+-------+----------+-----+
|adjustmentFactor|assetId| date| nav|
+----------------+-------+----------+-----+
|9.96288362069999|4000123|2019-12-20| 18.5|
|9.96288362069999|4000123|2019-12-23|18.67|
|9.96288362069999|4000123|2019-12-24| 18.6|
|9.96288362069999|4000123|2019-12-26|18.57|
|10.0449181987999|4000123|2019-12-27|18.46|
|10.0449181987999|4000123|2019-12-30|18.41|
|10.0449181987999|4000123|2019-12-31|18.34|
|10.0449181987999|4000123|2020-01-02|18.77|
|10.0449181987999|4000123|2020-01-03|19.07|
|10.0449181987999|4000123|2020-01-06|19.16|
|10.0449181987999|4000123|2020-01-07| 19.2|
+----------------+-------+----------+-----+
You can use max_by over a Window:
df.withColumn("max_adj_factor", \
F.expr("max_by(adjustmentFactor, date)") \
.over(Window.partitionBy("assetId"))) \
.show()
Output:
+----------------+-------+----------+-----+----------------+
|adjustmentFactor|assetId| date| nav| max_adj_factor|
+----------------+-------+----------+-----+----------------+
|9.96288362069999|4000123|2019-12-20| 18.5|10.0449181987999|
|9.96288362069999|4000123|2019-12-23|18.67|10.0449181987999|
|9.96288362069999|4000123|2019-12-24| 18.6|10.0449181987999|
|9.96288362069999|4000123|2019-12-26|18.57|10.0449181987999|
|10.0449181987999|4000123|2019-12-27|18.46|10.0449181987999|
|10.0449181987999|4000123|2019-12-30|18.41|10.0449181987999|
|10.0449181987999|4000123|2019-12-31|18.34|10.0449181987999|
|10.0449181987999|4000123|2020-01-02|18.77|10.0449181987999|
|10.0449181987999|4000123|2020-01-03|19.07|10.0449181987999|
|10.0449181987999|4000123|2020-01-06|19.16|10.0449181987999|
|10.0449181987999|4000123|2020-01-07| 19.2|10.0449181987999|
+----------------+-------+----------+-----+----------------+
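If your Spark version does not support max_by, a sketch of the same idea using first over a descending-ordered window (same column names as in the question):
from pyspark.sql import functions as F
from pyspark.sql import Window

# frame over the whole partition, ordered by date descending,
# so first() picks the adjustment factor of the most recent date
w = Window.partitionBy("assetId") \
    .orderBy(F.col("date").desc()) \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("max_adj_factor", F.first("adjustmentFactor").over(w)).show()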

Spark DataFrame convert milliseconds timestamp column in string format to human readable time with milliseconds

I have a Spark DataFrame with a timestamp column that holds milliseconds since the epoch. The column is a string. I now want to transform the column into a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example I found transforms the timestamp into a human-readable time without the milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671 |
+------------------+
What I want to get:
+------------------+------------------------+
|epoch_time_seconds|human_date |
+------------------+------------------------+
|1614088453671 |23-02-2021 13:54:13.671 |
+------------------+------------------------+
The part before the milliseconds can be obtained with date_format over from_unixtime, while the milliseconds can be obtained with a modulo. Combine the two using format_string.
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%s",
    date_format(
      from_unixtime(col("epoch_time_seconds") / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    col("epoch_time_seconds") % 1000
  )
)
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date |
+------------------+-----------------------+
|1614088453671 |23-02-2021 13:54:13.671|
+------------------+-----------------------+
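The same approach can be written in PySpark; a sketch assuming the same column name, with %03d used so that e.g. 5 ms prints as .005 rather than .5:
from pyspark.sql import functions as F

df2 = df.withColumn(
    "human_date",
    F.format_string(
        "%s.%03d",
        F.date_format(F.from_unixtime(F.col("epoch_time_seconds") / 1000), "dd-MM-yyyy HH:mm:ss"),
        (F.col("epoch_time_seconds") % 1000).cast("int")
    )
)
df2.show(truncate=False)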

spark sql datediff in days

I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
+-------------------+-------------------+--------+
| maxModifiedDate| CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28| null|
+-------------------+-------------------+--------+
Check this out: I used to_timestamp to convert the strings into timestamps and the datediff function to calculate the difference in days.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
df.select("maxModifiedDate","CurrentTimeStamp",F.datediff( F.to_timestamp("CurrentTimeStamp", format= 'MM/dd/yyyy'), F.to_timestamp("maxModifiedDate", format= 'MM/dd/yyyy')).alias("datediff")).show()
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using Spark SQL:
spark.sql("select maxModifiedDate,CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy')) as datediff from table ").show()
date_format is meant for changing the display format of a timestamp; instead, use to_date(col, 'fmt'), unix_timestamp + from_unixtime, or to_timestamp together with datediff.
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate,CurrentTimeStamp,datediff(to_date(maxModifiedDate, 'MM/dd/yyyy'),to_date(CurrentTimeStamp,'MM/dd/yyyy')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
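The unix_timestamp route mentioned above could look like this (a sketch assuming the same two string columns and a 24-hour HH format):
from pyspark.sql import functions as F

# parse both strings to seconds since the epoch and divide by 86400 to get days
df.withColumn(
    "daysDiff",
    ((F.unix_timestamp("CurrentTimeStamp", "MM/dd/yyyy HH:mm:ss")
      - F.unix_timestamp("maxModifiedDate", "MM/dd/yyyy HH:mm:ss")) / 86400).cast("int")
).show()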
I think you could try computing the difference yourself, since datediff() only gives the difference in days between dates, not between datetimes.
I suggest something like this, casting your datetimes to long:
from pyspark.sql.functions import col

diff_datetime = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff_datetime)
Or, casting to timestamp in SQL:
SELECT datediff(to_timestamp(end_date), to_timestamp(start_date))
With the cast-to-long approach the difference comes out in seconds; divide the result to change the unit (by 60 for minutes, by 60*60 for hours, and so on).
Alternatively, if you want to use that function, you have to cast your datetime column to a date column (without hours, minutes and seconds) using to_date() and then apply datediff().
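A short sketch of that last suggestion, using hypothetical start_time and end_time timestamp columns:
from pyspark.sql import functions as F

# truncate both timestamps to dates, then count the days between them
df = df.withColumn("days_diff", F.datediff(F.to_date("end_time"), F.to_date("start_time")))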

pyspark- generating date sequence

I am trying to generate a date sequence
from pyspark.sql import functions as F
df1 = df.withColumn("start_dt", F.to_date(F.col("start_date"), "yyyy-mm-dd")) \
    .withColumn("end_dt", F.to_date(F.col("end_date"), "yyyy-mm-dd"))
df1.select("start_dt", "end_dt").show()
print("type(start_dt)", type("start_dt"))
print("type(end_dt)", type("end_dt"))
df2 = df1.withColumn("lineoffdate", F.expr("""sequence(start_dt,end_dt,1)"""))
Below is the output
+----------+----------+
|start_date|  end_date|
+----------+----------+
|2020-02-01|2020-03-21|
+----------+----------+
type(start_dt) <class 'str'>
type(end_dt) <class 'str'>
cannot resolve 'sequence(start_dt, end_dt, 1)' due to data type mismatch: sequence only supports integral, timestamp or date types; line 1 pos 0;
Even after converting start_dt and end_dt to date or timestamp, I see that the type of the column is still str, and I get the above-mentioned error when generating the date sequence.
You are correct in saying that it should work with date or timestamp (calendar types); the only mistake is that you passed the "step" of sequence as an integer, when it should be a calendar interval (like interval 1 day):
df.withColumn("start_date",F.to_date("start_date")) \
.withColumn("end_date", F.to_date("end_date")) \
.withColumn(
"lineofdate",
F.expr("""sequence(start_date,end_date,interval 1 day)""") \
) \
.show()
# output:
# +----------+----------+--------------------+
# |start_date| end_date| lineofdate|
# +----------+----------+--------------------+
# |2020-02-01|2020-03-21|[2020-02-01, 2020...|
# +----------+----------+--------------------+
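If you need one row per generated date rather than an array column, a follow-up sketch using explode on the answer above:
from pyspark.sql import functions as F

# explode the date array into one row per date
df.withColumn("start_date", F.to_date("start_date")) \
    .withColumn("end_date", F.to_date("end_date")) \
    .withColumn("date", F.explode(F.expr("sequence(start_date, end_date, interval 1 day)"))) \
    .show()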