PySpark custom TimestampType column conversion

I'm trying to convert a String column, which is essentially a date, into a TimestampType column; however, I'm having problems with how to split the value.
-RECORD 0-------------------------------------
year | 2016
month | 4
arrival_date | 2016-04-30
date_added | 20160430
allowed_date | 10292016
I have 3 columns, all of which are in different formats, so I'm trying to find a way to split the string in a custom way, since the date_added column is yyyymmdd and the allowed_date is mmddyyyy.
I've tried something along the lines of:
df_imigration.withColumn('cc', F.date_format(df_imigration.allowed_date.cast(dataType=t.TimestampType()), "yyyy-mm-dd"))
But with no success, and I'm kind of stuck trying to find the right or best way to solve this.
The t and F aliases are for the following imports:
from pyspark.sql import functions as F
from pyspark.sql import types as t

The problem with your code is that you are casting the date without specifying the date format.
To specify the format, you should use the function to_timestamp().
Here I have created a dataframe with three different formats and it worked.
from pyspark.sql import functions as f

df1 = spark.createDataFrame([("20201231", "12312020", "31122020"), ("20201231", "12312020", "31122020")], ["ID", "Start_date", "End_date"])
# note: in date patterns 'MM' means month and 'mm' means minutes
df1 = df1.withColumn('cc', f.date_format(f.to_timestamp(df1.ID, 'yyyyMMdd'), "yyyy-MM-dd"))
df1 = df1.withColumn('dd', f.date_format(f.to_timestamp(df1.Start_date, 'MMddyyyy'), "yyyy-MM-dd"))
df1.withColumn('ee', f.date_format(f.to_timestamp(df1.End_date, 'ddMMyyyy'), "yyyy-MM-dd")).show()
Output:
+--------+----------+--------+----------+----------+----------+
| ID|Start_date|End_date| cc| dd| ee|
+--------+----------+--------+----------+----------+----------+
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
+--------+----------+--------+----------+----------+----------+
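Applied back to the question's own columns (df_imigration, with date_added in yyyyMMdd and allowed_date in MMddyyyy), a minimal sketch of the same idea might look like this; the *_ts column names are just illustrative:
from pyspark.sql import functions as F

# hypothetical sketch: parse each string column with its own input pattern,
# then render the result as yyyy-MM-dd
df_imigration = (
    df_imigration
    .withColumn('date_added_ts', F.to_timestamp('date_added', 'yyyyMMdd'))
    .withColumn('allowed_date_ts', F.to_timestamp('allowed_date', 'MMddyyyy'))
    .withColumn('cc', F.date_format('allowed_date_ts', 'yyyy-MM-dd'))
)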
Hope it helps!

Related

cast big number in human readable format

I'm working with Databricks in a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values, and casting them as double makes them unreadable.
How can I convert them to a human-readable format while still being able to .sort() them? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])

(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
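If readability for debugging is the concern, one option (my assumption, not part of the original answer) is to keep the DecimalType column for sorting and add a separate display-only column with format_number, which inserts grouping separators:
# display-only column with thousands separators; sorting still uses decimal_column
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .withColumn('readable', F.format_number('decimal_column', 0))
    .sort('decimal_column')
    .show(truncate=False)
)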

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in the format 13/09/2021 20:45 into a timestamp of just the hour, 20:45.
Now I figured that I can do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe, and ts is the column we are speaking about) to convert my input, which is in DD/MM/YYYY HH:MM format, however the values returned are only null. I therefore realised that I need an input in the PySpark timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to convert it to a proper timestamp. Hence my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable to pyspark so that I can convert it to timestamp format?
There are different ways you can do that:
from pyspark.sql import functions as F

df = spark.createDataFrame([('13/09/2021 20:45',)], ['A'])

# use substring (characters 12-16 hold the HH:mm part)
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use a regex
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# use the datetime functions: parse with the real input pattern, then re-format as HH:mm
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
unix_timestamp may be of help for your problem.
Just try the approach from this related question: Convert pyspark string to date format
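For completeness, here is a minimal sketch that gives you an actual TimestampType column (assuming the dataframe and column are called q1 and ts as in the question); to_timestamp with the real input pattern is an alternative to the unix_timestamp route:
from pyspark.sql import functions as F

# parse dd/MM/yyyy HH:mm into a real TimestampType column, then keep only HH:mm
q1 = q1.withColumn('timestamp', F.to_timestamp('ts', 'dd/MM/yyyy HH:mm'))
q1 = q1.withColumn('hour', F.date_format('timestamp', 'HH:mm'))
q1.show()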

Changing date format in Spark returns incorrect result

I am trying to convert a string-type date from a CSV file to a date format first, and then to convert that to a particular expected date format. While doing so, for one row (for the first time) I saw that the date format change was changing the year value.
scala> df1.filter($"pt" === 2720).select("`date`").show()
+----------+
| date|
+----------+
|24/08/2019|
|30/12/2019|
+----------+
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), YYYY-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2020-12-30|
+------------------------------------------------------+
As you can see above, the two rows of data have 24/08/2019 and 30/12/2019 respectively, but after the explicit type cast and date format change, they give 2019-08-24 (which is correct) and 2020-12-30 (incorrect and unexpected).
Why does this problem occur and how can this be best avoided?
I solved this issue by changing the capital YYYY to yyyy in the expected date format parameter.
So, instead of
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
I am now doing
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
This is because, as per Java's SimpleDateFormat, the capital Y is parsed as the week year, whereas the lowercase y is parsed as the year.
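As a quick illustration of the week-year idea in plain Python (ISO week numbering, whose rules differ slightly from SimpleDateFormat's but show the same effect): 30 December 2019 already belongs to week 1 of 2020, while 24 August 2019 stays in 2019.
from datetime import date

# isocalendar() gives (week year, week number, weekday); [0] is the week year
print(date(2019, 12, 30).isocalendar()[0])  # 2020 - the date falls into week 1 of 2020
print(date(2019, 8, 24).isocalendar()[0])   # 2019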
So, now, when I run with a lowercase y in the year field, I get the expected result:
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), yyyy-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2019-12-30|
+------------------------------------------------------+

PySpark SQL: add letters in a date-type value

I have epoch time values in a Spark dataframe, like 1569872588019, and I'm using PySpark SQL in a Jupyter notebook.
I'm using the from_unixtime method to convert it to date.
Here is my code:
SELECT from_unixtime(dataepochvalues/1000,'yyyy-MM-dd%%HH:MM:ss') AS date FROM testdata
The result is like: 2019-04-30%%11:09:11
But what I want is like: 2019-04-30T11:04:48.366Z
I tried to add T and Z instead of %% in the date format but failed.
How can I insert the T and Z letters?
You can specify those letters using single quotes. For your desired output, use the following date and time pattern:
"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
Using your example:
spark.sql(
"""SELECT from_unixtime(1569872588019/1000, "yyyy-MM-dd'T'HH:mm:ss'Z'") AS date"""
).show()
#+--------------------+
#| date|
#+--------------------+
#|2019-09-30T14:43:08Z|
#+--------------------+
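If the milliseconds from the original epoch value matter (the question's desired output ends in .366Z), one way, sketched here as an assumption rather than as part of the original answer, is to cast the divided value to a timestamp so the fractional seconds survive, then format it. Note that the quoted 'Z' is just literal text; the time is still rendered in the session time zone.
from pyspark.sql import functions as F

df = spark.createDataFrame([(1569872588019,)], ['dataepochvalues'])
df.select(
    F.date_format(
        (F.col('dataepochvalues') / 1000).cast('timestamp'),  # keeps the .019 fraction
        "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    ).alias('date')
).show(truncate=False)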

Spark 2.0: How to convert a DF Date/timestamp column to another date format in Scala?

For my learning, I have been using the below sample dataset.
+-------------------+-----+-----+-----+-----+-------+
| MyDate| Open| High| Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format like "YYYY-MON", and wrote it like this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After executing the code, I found the new column "New-Mydate", but I couldn't see the desired output format. Can you please help?
You need date_format instead of to_timestamp:
import org.apache.spark.sql.functions.date_format
val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate", date_format($"MyDate", "yyyy-MMM")).show(5)
Result:
+-------------------+----------+
| MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00| 2006-Jan|
+-------------------+----------+
Note: Three "M"s mean the month as a string (abbreviated name); if you want the month as a number, use only two "M"s.
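For reference, a rough PySpark equivalent of the same idea (dataframe and column names borrowed from the question; the sample row is the one shown above):
from pyspark.sql import functions as F

citiDataDF = spark.createDataFrame([('2006-01-03 00:00:00',)], ['MyDate'])
# date_format accepts a string column that is castable to a timestamp
citiDataDF.withColumn('New-Mydate', F.date_format('MyDate', 'yyyy-MMM')).show(5)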