Convert from timestamp to specific date in pyspark - pyspark

I would like to convert on a specific column the timestamp in a specific date.
Here is my input :
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to put minutes and seconds to 0 even if it's not 0.
For example, if I have this :
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is :
from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
'timestamp',
unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I have NULL.

Update
Inspired by #Tony Pellerin's answer, I realize you can go directly to the :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:MM:SS"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:07:00|
#+----------+-------------------+
Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}" means match a literal : followed by exactly 2 digits.

Maybe you could use the datetime library to convert timestamps to your wanted format. You should also use user-defined functions to work with spark DF columns. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime
# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')
# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))
# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
Hope this helps.

Related

Change data type column from string to date with custom format

I have a DF with a string column called "data" in the format 02/09/2019 (dd/MM/yyyy). I want to change the data type of the column from STRING to DATE, maintaining the same format. I'm using Spark 2.1.0.
I've tried the statement:
df.select(to_date( unix_timestamp($"data", "dd/MM/yyyy").cast("timestamp")))
It converts the column from STRING to DATE but in yyyy-MM-dd format:
+----------+
| data|
+----------+
|2003-07-22|
|2003-08-01|
+----------+
Using date_format function, I obtain the right format but wrong data type (again STRING):
df.select(date_format(to_date( unix_timestamp($"data", "dd/MM/yyyy").cast("timestamp")), "dd/MM/yyyy") as "data").printSchema()
Thanks a lot.
Date datatype expects the format as yyyy-MM-dd.
If we have format as dd/MM/yyyy and we cannot cast as date datatype (casting will result null value).
Example:
df.show() //sample data
+----------+
| data|
+----------+
|22/07/2003|
|01/08/2003|
+----------+
df.selectExpr("date(data)").show() //casting to date type
+----+
|data|
+----+
|null|
|null|
+----+
How to cast to Datetype?
df.select(to_date(unix_timestamp($"data","dd/MM/yyyy").cast("timestamp")).alias("da")).show()
(or)
df.select(from_unixtime(unix_timestamp($"data","dd/MM/yyyy"),"yyyy-MM-dd").cast("date").alias("da")).show()
+----------+
| da|
+----------+
|2003-07-22|
|2003-08-01|
+----------+
printSchema:
df.select(from_unixtime(unix_timestamp($"data","dd/MM/yyyy"),"yyyy-MM-dd").cast("date").alias("dd")).printSchema
root
|-- dd: date (nullable = true)

Why aggregation function pyspark.sql.functions.collect_list() adds local timezone offset on display?

I run the following code in a pyspark shell session. Running collect_list() after a groupBy, changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset at Greece where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape through using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
from pyspark.sql import functions, types, udf
# configure utc timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ']
time.tzset()
# create DataFrame
date_time = datetime.datetime(year = 2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False), types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id| time|
+---+-------------------+
| 1|2019-01-01 12:00:00|
| 1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time) |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x : [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1.withColumn("unix_timestamp",to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time) |unix_timestamp |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test1.groupBy("id").agg(functions.collect_list("time")).withColumn("test", functions.explode(functions.col("collect_list(time)"))).show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time) |test |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
ps. 1.54634394E9 == 2019-01-01 12:00:00, which is the correct UTC timestamp
For me the code above works, but does not convert the time as in your case.
Maybe check what is your session time zone (and, optionally, set it to some tz):
spark.conf.get('spark.sql.session.timeZone')
In general TimestampType in pyspark is not tz aware like in Pandas rather it passes long ints and displays them according to your machine's local time zone (by default).

Converting string time to day timestamp

I have just started working for Pyspark, and need some help converting a column datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of the day in AM/PM, we must use hhmma. But in SimpleDateFormat, a catches AM or PM, and not A or P. So we need to change our string :
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentionned, you can use date_format :
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

PySpark: String to timestamp transformation

I am working with time data and try to convert the string to timestamp format.
Here is what the 'Time' column looks like
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I find some similar questions and answers and have tried these code, but all end with 'null'
df2 = df.withColumn("Time", test["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))
Well your are doing it the other way around. The sql function unix_timestamp converts a string with the given format to a unix timestamp. When you want to convert a unix timestamp to the datetime format, you have to use the from_unixtime sql function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
| Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+

Extract week day number from string column (datetime stamp) in spark api

I am new to Spark API. I am trying to extract weekday number from a column say col_date (having datetime stamp e.g '13AUG15:09:40:15') which is string and add another column as weekday(integer). I am not able to do successfully.
the approach below worked for me, using a 'one line' udf - similar but different to above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
Well, this is quite simple.
This simple function make all the job and returns weekdays as number (monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see whats in RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = v.map(lambda x: (toWeekDay(x))) # apply functon toWeekDay on each element of RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.