PySpark: String to timestamp transformation

I am working with time data and am trying to convert the string to timestamp format.
Here is what the 'Time' column looks like
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and tried the code below, but all of it ends with 'null':
df2 = df.withColumn("Time", df["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))

Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a Unix timestamp. When you want to convert a Unix timestamp to the datetime format, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
| Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
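If you also want the slash-separated display from the question rather than the default one, you can pass a pattern to from_unixtime. This is a small sketch on top of the dataframe above (the pattern 'yyyy/M/d H:m:s' is my guess at the asker's desired layout; the result is a string column, and the wall-clock values depend on spark.sql.session.timeZone):
# produces e.g. 2013/1/17 style strings instead of TimestampType values
df.withColumn('Time', F.from_unixtime(df.Time, 'yyyy/M/d H:m:s')).show()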

Related

Convert date from "yyyy/mm/dd" format to "M/d/yyyy" format in pyspark dataframe

I am reading a table into a dataframe which has a column "day_dt" in the date format "2022/01/08". I want the format to be "1/8/2022" (M/d/yyyy). Is this possible in pyspark? I have tried using date_format(), but it results in null.
Did you cast the day_dt column to timestamp before using date_format? The code below adds a null-valued column, as you described in your question, because the column is StringType. You can see this with df.printSchema().
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(unix_timestamp(df.value ,
"yyyy/mm/dd").cast("timestamp"),"mm/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+-------+
| value|newDate|
+----------+-------+
|2022/01/08| null|
+----------+-------+
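For reference, df.printSchema() on the input dataframe above confirms that the column is a plain string, which is why date_format alone cannot interpret it (a quick check, assuming the same df as in the snippet):
df.printSchema()
# root
#  |-- value: string (nullable = true)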
After casting the string to timestamp, the date column is formatted properly:
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(unix_timestamp(df.value , "yyyy/mm/dd").cast("timestamp"),"mm/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+----------+
| value| newDate|
+----------+----------+
|2022/01/08|01/08/2022|
+----------+----------+
Hope it helps.
If you mean you have a date as a string in the format "yyyy/mm/dd" and you want to convert it to a string in the format "M/d/yyyy", then:
First, parse the string to Date type using to_date().
Then, convert the Date type to a string using date_format().
from pyspark.sql import functions as F
df = spark.createDataFrame(data=[["2022/01/01",],["2022/12/31",]], schema=["date_str_in"])
df = df.withColumn("date_dt", F.to_date("date_str_in", format="yyyy/MM/dd"))
df = df.withColumn("date_str_out", F.date_format("date_dt", format="M/d/yyyy"))
+-----------+----------+------------+
|date_str_in| date_dt|date_str_out|
+-----------+----------+------------+
| 2022/01/01|2022-01-01| 1/1/2022|
| 2022/12/31|2022-12-31| 12/31/2022|
+-----------+----------+------------+
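As a usage note (my addition, same assumptions as the snippet above), the two steps can be chained into a single expression if you do not need the intermediate date_dt column:
df = df.withColumn("date_str_out", F.date_format(F.to_date("date_str_in", "yyyy/MM/dd"), "M/d/yyyy"))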

Reading partitioned parquet records in pyspark

I have a parquet file partitioned by a date field (YYYY-MM-DD).
How can I efficiently read only the (current date - 1 day) records from the file in Pyspark? Please suggest.
PS: I would not like to read the entire file and then filter the records as the data volume is huge.
There are multiple ways to go about this:
Suppose this is the input data and you write out the dataframe partitioned on the "date" column:
import datetime
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, DateType, StringType
data = [(datetime.date(2022, 6, 12), "Hello"), (datetime.date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()),StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
df.write.mode('overwrite').partitionBy('date').parquet('./test')
You can read the parquet files associated to a given date with this syntax:
spark.read.parquet('./test/date=2022-06-19').show()
# The catch is that the date column is gonna be omitted from your dataframe
+-------+
|message|
+-------+
| World|
+-------+
# You could try adding the date column with lit syntax.
(spark.read.parquet('./test/date=2022-06-19')
.withColumn('date', f.lit('2022-06-19').cast(DateType()))
.show()
)
# Output
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
A more efficient solution is to use Delta tables:
df.write.mode('overwrite').partitionBy('date').format('delta').save('./test')
spark.read.format('delta').load('./test').where(f.col('date') == '2022-06-19').show()
The spark engine uses the _delta_log to optimize your query and only reads the parquet files that are applicable to your query. Also, the output will have all the columns:
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
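If you prefer to stay with plain parquet and still keep the date column, a hedged alternative is to read the base path and filter on the partition column; Spark's partition discovery and filter pushdown should then prune the untouched date folders rather than scanning the whole dataset. A minimal sketch, assuming the same './test' layout written above:
import datetime
# compute yesterday's partition value as a yyyy-MM-dd string
yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
# filtering on the partition column lets Spark prune the other date folders
spark.read.parquet('./test').where(f.col('date') == yesterday).show()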
You can read it by passing a date variable into the read path.
This code is dynamic: you don't need to hardcode the date, just append it to the path.
>>> df.show()
+-----+-----------------+-----------+----------+
|Sr_No| User_Id|Transaction| dt|
+-----+-----------------+-----------+----------+
| 1|paytm 111002203#p| 100D|2022-06-29|
| 2|paytm 111002203#p| 50C|2022-06-27|
| 3|paytm 111002203#p| 20C|2022-06-26|
| 4|paytm 111002203#p| 10C|2022-06-25|
| 5| null| 1C|2022-06-24|
+-----+-----------------+-----------+----------+
>>> df.write.partitionBy("dt").mode("append").parquet("/dir1/dir2/sample.parquet")
>>> from datetime import date
>>> from datetime import timedelta
>>> today = date.today()
# Here I am taking the date from two days back; for one day back use timedelta(days=1)
>>> yesterday = today - timedelta(days = 2)
>>> two_days_back=yesterday.strftime('%Y-%m-%d')
>>> path="/di1/dir2/sample.parquet/dt="+two_days_back
>>> spark.read.parquet(path).show()
+-----+-----------------+-----------+
|Sr_No| User_Id|Transaction|
+-----+-----------------+-----------+
| 2|paytm 111002203#p| 50C|
+-----+-----------------+-----------+

Why does the aggregation function pyspark.sql.functions.collect_list() add a local timezone offset on display?

I run the following code in a pyspark shell session. Running collect_list() after a groupBy changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset in Greece, where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
import time
from pyspark.sql import functions, types
# configure utc timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ'] = 'UTC'
time.tzset()
# create DataFrame
date_time = datetime.datetime(year = 2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False), types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id| time|
+---+-------------------+
| 1|2019-01-01 12:00:00|
| 1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time) |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x : [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1 = df_test1.withColumn("unix_timestamp", to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time) |unix_timestamp |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test1.groupBy("id").agg(functions.collect_list("time")).withColumn("test", functions.explode(functions.col("collect_list(time)"))).show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time) |test |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
ps. 1.54634394E9 == 2019-01-01 12:00:00, which is the correct UTC timestamp
For me the code above works and does not shift the time as in your case.
Maybe check what is your session time zone (and, optionally, set it to some tz):
spark.conf.get('spark.sql.session.timeZone')
In general, TimestampType in pyspark is not tz-aware like it is in Pandas; rather, it stores long ints and displays them according to your machine's local time zone (by default).
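If the session time zone turns out to be the culprit, here is a minimal sketch (my addition, assuming an active `spark` session on a Unix machine) of pinning both the JVM session time zone and the Python driver's time zone to UTC before collecting, so displayed and collected values agree:
import os, time
# JVM side: controls how Spark renders TimestampType values
spark.conf.set("spark.sql.session.timeZone", "UTC")
# Python side: controls datetime objects returned to the driver (tzset is Unix-only)
os.environ['TZ'] = 'UTC'
time.tzset()
spark.conf.get('spark.sql.session.timeZone')  # 'UTC'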

Converting string time to day timestamp

I have just started working with Pyspark and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P, so we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+
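If the raw column can end in either A or P, a hedged variant (my addition; the 'P' case does not appear in the question's sample) is to append the missing 'M' with a regex backreference instead of hardcoding 'AM':
# append 'M' to whichever of A/P ends the string, then parse as before
df5 = df.withColumn('dt', F.regexp_replace('dt', '([AP])$', '$1M'))
df5.withColumn('ts', F.to_timestamp('dt', 'hhmma')).show()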

Convert from timestamp to specific date in pyspark

I would like to convert on a specific column the timestamp in a specific date.
Here is my input :
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to put minutes and seconds to 0 even if it's not 0.
For example, if I have this :
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this :
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is:
from pyspark.sql.functions import unix_timestamp, date_format
table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I get NULL.
Update
Inspired by @Tony Pellerin's answer, I realize you can go directly to the :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:MM:SS"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:07:00|
#+----------+-------------------+
Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}:\d{2}" means: match a literal : followed by exactly 2 digits, twice (once for the minutes and once for the seconds).
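An alternative sketch that avoids string surgery (my addition, not part of the original answer; assumes Spark 2.3+ where date_trunc is available): date_trunc zeroes out everything below the chosen unit while keeping a real timestamp, which you can then format as needed. It yields the same "23/07/2018 18:00:00" string as the regex approach above:
table.withColumn(
    "date",
    # truncate to the hour, then render in the dd/MM/yyyy HH:mm:ss layout
    f.date_format(f.date_trunc("hour", f.from_unixtime("timestamp").cast("timestamp")), "dd/MM/yyyy HH:mm:ss")
).show()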
Maybe you could use the datetime library to convert timestamps to your desired format. You should also use a user-defined function (UDF) to work with Spark DF columns. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime
# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')
# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))
# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
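A small usage note on this UDF (my addition): you can also pass the function directly and declare the return type explicitly instead of wrapping it in a lambda, which documents that the result is a string:
from pyspark.sql.types import StringType
# same behaviour, but with an explicit return type
format_timestamp_udf = udf(format_timestamp, StringType())
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))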
Hope this helps.