PySpark: using withColumn to add a derived column to a dataframe

I currently have a dataframe with the following schema:
Year: integer (nullable = true)
Month: integer (nullable = true)
Day: integer (nullable = true)
Hour: integer (nullable = true)
Minute: integer (nullable = true)
Second: integer (nullable = true)
I basically want to add an additional column to my dataframe that uses the above date components to construct a datetime-typed column. I am currently attempting this using the following:
df = df.withColumn("DeptDateTime",getDate(df['Year'], df['Month'], df['Day'], df['Hour'], df['Minute'], df['Second']))
I'm struggling with writing the getDate function: I want to check the length of Year (currently an integer) and, if it's 2 digits (e.g. 16), prefix "20" to make "2016", etc. This needs to be done for each of the date components to essentially construct a datetime in the following format: yyyy-mm-dd hh:mm:ss
Any help would be appreciated.

Convert to date
First, let's create a sample dataset.
import pandas as pd

df_pd = pd.DataFrame([[16, 3, 15],
                      [2016, 4, 3]],
                     columns=['Year', 'Month', 'Day'])
df = spark.createDataFrame(df_pd)
Then you can write a UDF to build the date string.
from pyspark.sql import functions as func
from pyspark.sql.types import *
def get_date(year, month, day):
    year = str(year)
    month = str(month)
    day = str(day)
    if len(year) == 2:
        year = '20' + year
    return year + '-' + month + '-' + day
udf_get_date = func.udf(get_date, returnType=StringType())
Now we can apply our UDF to the 3 columns and use .cast(DateType()) so that the new column has the right type:
df = df.withColumn('date', udf_get_date('Year', 'Month', 'Day').cast(DateType()))
Output
+----+-----+---+----------+
|Year|Month|Day| date|
+----+-----+---+----------+
| 16| 3| 15|2016-03-15|
|2016| 4| 3|2016-04-03|
+----+-----+---+----------+
Convert to date-time format
This is very similar; here is a variant that uses the datetime module.
import pandas as pd
import datetime
df_pd = pd.DataFrame([[16, 3, 15, 10, 34, 14],
                      [2016, 4, 3, 23, 8, 12]],
                     columns=['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second'])
df = spark.createDataFrame(df_pd)
def get_date(year, month, day, hour, minute, second):
    year = str(year)
    if len(year) == 2:
        year = '20' + year
    return str(datetime.datetime(int(year), month, day, hour, minute, second))
udf_get_date = func.udf(get_date, returnType=StringType())
df = df.withColumn('date', udf_get_date('Year', 'Month', 'Day', 'Hour', 'Minute', 'Second').cast(TimestampType()))
Output
+----+-----+---+----+------+------+--------------------+
|Year|Month|Day|Hour|Minute|Second| date|
+----+-----+---+----+------+------+--------------------+
| 16| 3| 15| 10| 34| 14|2016-03-15 10:34:...|
|2016| 4| 3| 23| 8| 12|2016-04-03 23:08:...|
+----+-----+---+----+------+------+--------------------+
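If you prefer to avoid a UDF altogether, here is a minimal sketch using only built-in functions (not part of the original answer; it assumes Spark 2.2+ for to_timestamp and the column names from the question's schema):
from pyspark.sql import functions as F
# Normalize a possibly two-digit year, zero-pad the remaining components, then parse.
year_col = F.when(F.col('Year') < 100, F.col('Year') + 2000).otherwise(F.col('Year'))
df = df.withColumn(
    'DeptDateTime',
    F.to_timestamp(
        F.concat_ws(' ',
                    F.concat_ws('-',
                                year_col.cast('string'),
                                F.lpad(F.col('Month').cast('string'), 2, '0'),
                                F.lpad(F.col('Day').cast('string'), 2, '0')),
                    F.concat_ws(':',
                                F.lpad(F.col('Hour').cast('string'), 2, '0'),
                                F.lpad(F.col('Minute').cast('string'), 2, '0'),
                                F.lpad(F.col('Second').cast('string'), 2, '0'))),
        'yyyy-MM-dd HH:mm:ss'))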

You can use the datetime module to create your format.
The following code worked for me,
from datetime import datetime
def getdate(*args):
    dt_str = '-'.join(map(str, args[:3])) + ' ' + ':'.join(map(str, args[3:]))
    yr_len = len(str(args[0]))
    if yr_len == 2:
        yr = 'y'
    else:
        yr = 'Y'
    formatted_date = datetime.strptime(dt_str, "%{}-%m-%d %H:%M:%S".format(yr)).strftime("%Y-%m-%d %H:%M:%S")
    return formatted_date
Test input:
1. getdate(16, 1, 2, 4, 5, 6)
2. getdate(2016, 1, 2, 4, 5, 58)
Output:
1. 2016-01-02 04:05:06
2. 2016-01-02 04:05:58
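To attach this to the original dataframe, a small sketch (the UDF wrapper and the column name DeptDateTime are illustrative, assuming the six component columns from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
# Wrap getdate as a UDF (string return type by default) and cast the result to a timestamp.
udf_getdate = F.udf(getdate)
df = df.withColumn('DeptDateTime',
                   udf_getdate('Year', 'Month', 'Day', 'Hour', 'Minute', 'Second')
                   .cast(TimestampType()))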

Related

Convert yyyyMM to end of month date using PySpark

I have a column in a dataframe in PySpark with dates in integer format, e.g. 202203 (yyyyMM format). I want to convert that to the end-of-month date, e.g. 2022-03-31. How do I achieve this?
First cast the column to string, then use to_date to parse it, and then apply last_day.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show(10)
df.printSchema()
Output:
+------+----------+
| x| date|
+------+----------+
|202203|2022-03-31|
+------+----------+
root
|-- x: long (nullable = true)
|-- date: date (nullable = true)

newly created column shows null values in pyspark dataframe

I want to add a column calculating the difference in time between two timestamp values. In order to do that, I first add a column with the current datetime, which is defined as current_datetime here:
import datetime
#define current datetime
now = datetime.datetime.now()
#Getting Current date and time
current_datetime=now.strftime("%Y-%m-%d %H:%M:%S")
print(now)
Then I want to add current_datetime as a column value to the df and calculate the diff:
import pyspark.sql.functions as F
productsDF = productsDF\
    .withColumn('current_time', when(col('Quantity')>1, current_datetime))\
    .withColumn('time_diff',\
        (F.unix_timestamp(F.to_timestamp(F.col('current_time')))) -
        (F.unix_timestamp(F.to_timestamp(F.col('Created_datetime'))))/F.lit(3600)
    )
The output however is only null values.
productsDF.select('current_time','Created_datetime','time_diff').show()
+------------+-------------------+---------+
|current_time| Created_datetime|time_diff|
+------------+-------------------+---------+
| null|2019-10-12 17:09:18| null|
| null|2019-12-03 07:02:07| null|
| null|2020-01-16 23:10:08| null|
| null|2020-01-21 15:38:39| null|
| null|2020-01-21 15:14:55| null|
the new columns are created with type string and double:
|-- current_time: string (nullable = true)
|-- diff: double (nullable = true)
|-- time_diff: double (nullable = true)
I tried creating the column with string and literal values just to test, but the output is always null. What am I missing?
To fill a column with current_datetime, you are missing the lit() function:
from pyspark.sql.functions import lit

current_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
productsDF = productsDF.withColumn("current_time", lit(current_datetime))
For calculating the time difference between the two timestamp columns, you can do:
productsDF.withColumn('time_diff',(F.unix_timestamp('current_time') -
F.unix_timestamp('Created_datetime'))/3600).show()
EDIT:
For time difference in hours, days, months, and years, you can do:
from pyspark.sql.functions import col, datediff, months_between, year

df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
  .withColumn("time_diff_days", datediff(col("current_time"),col("Created_datetime")))\
  .withColumn("time_diff_months", months_between(col("current_time"),col("Created_datetime")))\
  .withColumn("time_diff_years", year(col("current_time")) - year(col("Created_datetime"))).show()
+-------------------+-------------------+------------------+--------------+----------------+---------------+
| Created_datetime| current_time| time_diff_hours|time_diff_days|time_diff_months|time_diff_years|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 369| 12.07743093| 1|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335| 317| 10.38135529| 1|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 273| 8.94031549| 0|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
If you want EXACT time differences, then:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn('time_diff_days',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24))\
.withColumn('time_diff_years',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24*365)).show()
+-------------------+-------------------+------------------+------------------+------------------+
| Created_datetime| current_time| time_diff_hours| time_diff_days| time_diff_years|
+-------------------+-------------------+------------------+------------------+------------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49| 8841.60861111111| 368.4003587962963|1.0093160514967021|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|316.78034722222225|0.8678913622526636|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 272.1081134259259|0.7455016806189751|
+-------------------+-------------------+------------------+------------------+------------------+
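As an aside (a sketch, not part of the original answer): instead of freezing the driver-side datetime.now() into a string literal, Spark's built-in current_timestamp() can be used directly:
from pyspark.sql import functions as F
# Use Spark's current_timestamp() instead of a Python-side string literal,
# then compute the difference in hours as before.
productsDF = productsDF \
    .withColumn('current_time', F.current_timestamp()) \
    .withColumn('time_diff_hours',
                (F.unix_timestamp('current_time') -
                 F.unix_timestamp('Created_datetime')) / 3600)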

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp| checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: obtain a new column with day difference between date checkIn and timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1)
Thus, I need to:
1. convert timestamp to date (this is where I'm stuck)
2. subtract one date from the other
3. use some function to extract the day difference (I have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
import spark.implicits._
import org.apache.spark.sql.functions.{from_unixtime, datediff}

val df = Seq(
  (1521710892, "2018-05-19"),
  (1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
df.withColumn("tsDate", from_unixtime($"timestamp")).
withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
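For completeness, a sketch of the same approach in PySpark (assuming a SparkSession named spark; data and column names are taken from the question):
from pyspark.sql import functions as F
# from_unixtime converts epoch seconds to a timestamp string;
# datediff counts the calendar days between the two dates.
df = spark.createDataFrame([(1521710892, "2018-05-19"),
                            (1521730800, "2018-01-01")],
                           ["timestamp", "checkIn"])
df = df.withColumn("tsDate", F.from_unixtime("timestamp")) \
       .withColumn("daysDiff", F.datediff("tsDate", "checkIn"))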

Converting pattern of date in spark dataframe

I have a column in a Spark dataframe of String datatype (with dates in the yyyy-MM-dd pattern).
I want to display the column value in the MM/dd/yyyy pattern.
My data is
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in MM/dd/yyyy pattern. All I am able to do is convert the column to DateType from String
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I want to show endDate in the MM/dd/yyyy pattern. The only reference I found is this, which doesn't solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output :
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
Use pyspark.sql.functions.date_format(date, format) (shown below in Scala syntax):
val df2 = df.select(date_format("endDate", "MM/dd/yyyy").alias("endDate"))
Suppose we have a DataFrame/Dataset with a string column holding a date value, and we need to change the date format.
For the question asked, the date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd" hence it can be re-written as
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we change the datatype of this column from string to date.
We also tell to_date that the format in this string column is yyyy-MM-dd, so it reads the column accordingly.
(ii) Next, we apply date_format to get the date format we require, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents month and 'mm' represents minutes.
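A small PySpark sketch of the time-component case (the sample value is illustrative, not from the original answer):
from pyspark.sql import functions as F
# Parse a string that includes a time component with to_timestamp,
# then reformat it; 'MM' is the month, 'mm' would be minutes.
df = spark.createDataFrame([("2000-01-01 13:45:30",)], ["endDate"])
df.select(F.date_format(F.to_timestamp("endDate", "yyyy-MM-dd HH:mm:ss"),
                        "MM/dd/yyyy HH:mm").alias("endDate")).show()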

Extract week day number from string column (datetime stamp) in spark api

I am new to the Spark API. I am trying to extract the weekday number from a column, say col_date (holding a datetime stamp, e.g. '13AUG15:09:40:15'), which is a string, and add another column as weekday (integer). I am not able to do this successfully.
The approach below worked for me, using a 'one line' UDF, similar to but different from the one above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
    [(1, "2018-05-12")
    ,(2, "2018-05-13")
    ,(3, "2018-05-14")
    ,(4, "2018-05-15")
    ,(5, "2018-05-16")
    ,(6, "2018-05-17")
    ,(7, "2018-05-18")
    ,(8, "2018-05-19")
    ,(9, "2018-05-20")
    ], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
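As an aside (a sketch, not from the original answer): with Spark 2.3+ the built-in dayofweek avoids the Python UDF, although its numbering differs from %w:
from pyspark.sql import functions as F
# dayofweek returns 1 = Sunday .. 7 = Saturday (the %w approach above gives 0 = Sunday).
df = df.withColumn('weekDay', F.dayofweek(F.to_date('date', 'yyyy-MM-dd')))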
Well, this is quite simple.
This simple function does all the work and returns the weekday as a number (Sunday = 0, Monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
    # v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
    v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
    return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x)) # apply the function toWeekDay to each element of the RDD
result.take(2) # let's see the results
> ['4', '3']
Please see Python documentation for further details on datetime processing.