Newly created column shows null values in a PySpark DataFrame

I want to add a column calculating the time difference between two timestamp values. To do that, I first add a column with the current datetime, which is defined as current_datetime here:
import datetime

# define current datetime
now = datetime.datetime.now()
# get the current date and time as a formatted string
current_datetime = now.strftime("%Y-%m-%d %H:%M:%S")
print(now)
Then I want to add current_datetime as a column value to the DataFrame and calculate the difference:
import pyspark.sql.functions as F
productsDF = productsDF\
    .withColumn('current_time', when(col('Quantity') > 1, current_datetime))\
    .withColumn('time_diff',
        (F.unix_timestamp(F.to_timestamp(F.col('current_time')))) -
        (F.unix_timestamp(F.to_timestamp(F.col('Created_datetime')))) / F.lit(3600)
    )
The output however is only null values.
productsDF.select('current_time','Created_datetime','time_diff').show()
+------------+-------------------+---------+
|current_time| Created_datetime|time_diff|
+------------+-------------------+---------+
| null|2019-10-12 17:09:18| null|
| null|2019-12-03 07:02:07| null|
| null|2020-01-16 23:10:08| null|
| null|2020-01-21 15:38:39| null|
| null|2020-01-21 15:14:55| null|
+------------+-------------------+---------+
The new columns are created with type string and double:
|-- current_time: string (nullable = true)
|-- diff: double (nullable = true)
|-- time_diff: double (nullable = true)
I tried creating the column with string and literal values just to test, but the output is always null. What am I missing?

To fill a column with current_datetime, you are missing the lit() function:
from pyspark.sql.functions import lit

current_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
productsDF = productsDF.withColumn("current_time", lit(current_datetime))
For calculating the time difference between the two timestamp columns, you can do:
productsDF.withColumn('time_diff',(F.unix_timestamp('current_time') -
F.unix_timestamp('Created_datetime'))/3600).show()
EDIT:
For time difference in hours, days, months, and years, you can do:
df.withColumn('time_diff_hours', (F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
  .withColumn('time_diff_days', F.datediff(F.col('current_time'), F.col('Created_datetime')))\
  .withColumn('time_diff_months', F.months_between(F.col('current_time'), F.col('Created_datetime')))\
  .withColumn('time_diff_years', F.year(F.col('current_time')) - F.year(F.col('Created_datetime'))).show()
+-------------------+-------------------+------------------+--------------+----------------+---------------+
| Created_datetime| current_time| time_diff_hours|time_diff_days|time_diff_months|time_diff_years|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 369| 12.07743093| 1|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335| 317| 10.38135529| 1|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 273| 8.94031549| 0|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
If you want EXACT time differences, then:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn('time_diff_days',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24))\
.withColumn('time_diff_years',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24*365)).show()
+-------------------+-------------------+------------------+------------------+------------------+
| Created_datetime| current_time| time_diff_hours| time_diff_days| time_diff_years|
+-------------------+-------------------+------------------+------------------+------------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 368.4003587962963|1.0093160514967021|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|316.78034722222225|0.8678913622526636|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 272.1081134259259|0.7455016806189751|
+-------------------+-------------------+------------------+------------------+------------------+

Related

Find max value from different columns in a single row in a Scala DataFrame

I am trying to find the max value from different columns in a single row of a Scala DataFrame.
The data available in the DataFrame is as below.
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
| NUM| SIG1| SIG2| SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]| null |
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
and the schema is
scala> DF.printSchema
root
|-- NUM: string (nullable = true)
|-- SIG1: string (nullable = true)
|-- SIG2: string (nullable = true)
|-- SIG3: string (nullable = true)
The expected output is as below.
+-------+--------------+-------+-------+-------+
|    NUM|          TIME|   SIG1|   SIG2|   SIG3|
+-------+--------------+-------+-------+-------+
|XXXXX01| 1569560531002| 3.7825| 4.7825| 2.7825|
|XXXXX01| 1569560541003| 1.7825| 8.7825| 5.7825|
|XXXXX01| 1569560531009| 3.7825| 3.7825|   null|
|XXXXX02| 1569560531007| 5.7825| 8.7825| 3.7825|
|XXXXX02| 1569560531010| 9.7825| 1.7825| 3.7825|
+-------+--------------+-------+-------+-------+
I need to add a new column with the highest TIME from each row, and the SIG columns reduced to their VALUE only.
Basically, the TIME in each row should be the highest TIME value available in that row, with the TIME and VALUEs extracted into their own columns.
Is there any UDF/function to achieve this?
Thanks in advance.
Use the get_json_object function to extract values from the JSON stored as a string.
Then it's quite straightforward:
DF.withColumn("TIME", greatest(get_json_object('SIG1, "$[0].TIME"),
get_json_object('SIG2, "$[0].TIME"),
get_json_object('SIG3, "$[0].TIME")))
.withColumn("SIG1", get_json_object('SIG1, "$[0].VALUE"))
.withColumn("SIG2", get_json_object('SIG2, "$[0].VALUE"))
.withColumn("SIG3", get_json_object('SIG3, "$[0].VALUE"))
.show
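For reference, a rough PySpark equivalent of the same idea (a sketch, assuming the SIG columns hold single-element JSON arrays stored as strings, exactly as above) could be:
import pyspark.sql.functions as F

result = (DF
    .withColumn("TIME", F.greatest(
        # cast to long so greatest() compares the times numerically
        F.get_json_object("SIG1", "$[0].TIME").cast("long"),
        F.get_json_object("SIG2", "$[0].TIME").cast("long"),
        F.get_json_object("SIG3", "$[0].TIME").cast("long")))
    .withColumn("SIG1", F.get_json_object("SIG1", "$[0].VALUE"))
    .withColumn("SIG2", F.get_json_object("SIG2", "$[0].VALUE"))
    .withColumn("SIG3", F.get_json_object("SIG3", "$[0].VALUE")))

result.show(truncate=False)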

Spark scala - calculating dynamic timestamp interval

I have a DataFrame with a timestamp column (timestamp type) called "maxTmstmp" and another column of hours, represented as integers, called "WindowHours". I would like to dynamically subtract the hours column from the timestamp column to get the lower timestamp.
My data and desired effect ("minTmstmp" column):
+-----------+-------------------+-------------------+
|WindowHours| maxTmstmp| minTmstmp|
| | |(maxTmstmp - Hours)|
+-----------+-------------------+-------------------+
| 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
| 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
| 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
| 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
+-----------+-------------------+-------------------+
root
|-- WindowHours: integer (nullable = true)
|-- maxTmstmp: timestamp (nullable = true)
I have already found a solution using an hours interval expression, but it isn't dynamic. The code below doesn't work as intended:
standards
  .withColumn("minTmstmp", $"maxTmstmp" - expr("INTERVAL 10 HOURS"))
  .show()
I am working with Spark 2.4 and Scala.
One simple way would be to convert maxTmstmp to unix time, subtract the value of WindowHours in seconds from it, and convert the result back to Spark Timestamp, as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, Timestamp.valueOf("2016-01-01 23:00:00")),
  (2, Timestamp.valueOf("2016-03-01 12:00:00")),
  (8, Timestamp.valueOf("2016-03-05 20:00:00")),
  (24, Timestamp.valueOf("2016-04-12 11:00:00"))
).toDF("WindowHours", "maxTmstmp")

df.withColumn("minTmstmp",
  from_unixtime(unix_timestamp($"maxTmstmp") - ($"WindowHours" * 3600))
).show
// +-----------+-------------------+-------------------+
// |WindowHours| maxTmstmp| minTmstmp|
// +-----------+-------------------+-------------------+
// | 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
// | 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
// | 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
// | 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
// +-----------+-------------------+-------------------+
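The same approach carries over almost directly to PySpark, for reference (a sketch, assuming the same column names):
import pyspark.sql.functions as F

# subtract WindowHours (converted to seconds) from the unix time of maxTmstmp and
# convert back; the cast restores a timestamp type, since from_unixtime returns a string
df = df.withColumn(
    "minTmstmp",
    F.from_unixtime(
        F.unix_timestamp("maxTmstmp") - F.col("WindowHours") * 3600
    ).cast("timestamp")
)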

Convert from timestamp to specific date in pyspark

I would like to convert the timestamp in a specific column into a specific date format.
Here is my input :
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to put minutes and seconds to 0 even if it's not 0.
For example, if I have this :
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is:
from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I have NULL.
Update
Inspired by @Tony Pellerin's answer, I realize you can go directly to :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:MM:SS"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:07:00|
#+----------+-------------------+
(In the format pattern above, MM is the month and SS is fractional seconds, which is why the minutes and seconds render as 07:00.) Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}:\d{2}" means match a literal : followed by exactly 2 digits, twice in a row (once for the minutes and once for the seconds).
Maybe you could use the datetime library to convert timestamps to your desired format. You can use a user-defined function (UDF) to apply it to a Spark DataFrame column. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime
# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
Hope this helps.

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
+----------+----------+
| timestamp|   checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
+----------+----------+
Desired result: obtain a new column with the day difference between the date checkIn and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
convert the timestamp to a date (this is where I'm stuck)
subtract one date from the other
use some function to extract the day difference (I have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1521710892, "2018-05-19"),
  (1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")

df.withColumn("tsDate", from_unixtime($"timestamp")).
  withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
  show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
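For reference, a rough PySpark version of the same approach (a sketch, assuming the same column names) would be:
import pyspark.sql.functions as F

df = (df
    .withColumn("tsDate", F.from_unixtime("timestamp"))        # unix seconds -> 'yyyy-MM-dd HH:mm:ss' string
    .withColumn("daysDiff", F.datediff("tsDate", "checkIn")))   # day difference (tsDate - checkIn)

df.show()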

Assign label to categorical data in a table in PySpark

I want to assign labels to the categorical numbers in the dataframe below using PySpark SQL.
In the MARRIAGE column, 1=Married and 2=UnMarried. In the EDUCATION column, 1=Grad and 2=UnderGrad.
Current Dataframe:
+--------+---------+-----+
|MARRIAGE|EDUCATION|Total|
+--------+---------+-----+
| 1| 2| 87|
| 1| 1| 123|
| 2| 2| 3|
| 2| 1| 8|
+--------+---------+-----+
Resulting Dataframe:
+---------+---------+-----+
|MARRIAGE |EDUCATION|Total|
+---------+---------+-----+
|Married |Grad | 87|
|Married |UnderGrad| 123|
|UnMarried|Grad | 3|
|UnMarried|UnderGrad| 8|
+---------+---------+-----+
Is it possible to assign the labels using a single UDF and withColumn()? Is there any way to do it in a single UDF by passing the whole dataframe while keeping the column names as they are?
I can think of a solution that performs the operation on each column using separate UDFs, as below, but I can't figure out whether there's a way to do it all together.
from pyspark.sql import functions as F

def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"

def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"

assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)

df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
    withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)
One UDF can produce only one column, but that column can be a struct, and the UDF can apply labels to both MARRIAGE and EDUCATION. See the code below:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Row

udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])

marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}

def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])

assign_labels_udf = F.udf(assign_labels, udf_result)

df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()
root
|-- MARRIAGE: long (nullable = true)
|-- EDUCATION: long (nullable = true)
|-- Total: long (nullable = true)
|-- labels: struct (nullable = true)
| |-- MARRIAGE: string (nullable = true)
| |-- EDUCATION: string (nullable = true)
But as you can see, it's not replacing the original columns; it's just adding a new one. To replace them, you will need to use withColumn twice (pulling each field out of the struct) and then drop labels, as sketched below.
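A minimal sketch of that last step, reusing the assign_labels_udf defined above:
labeled = (df
    .withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION'))
    .withColumn('MARRIAGE', F.col('labels.MARRIAGE'))
    .withColumn('EDUCATION', F.col('labels.EDUCATION'))
    .drop('labels'))

labeled.show(truncate=False)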