PySpark custom TimestampType column conversion

I'm trying to convert a String column, which is essentially a date, into a TimestampType column; however, I'm having problems with how to split the value.
-RECORD 0-------------------------------------
year | 2016
month | 4
arrival_date | 2016-04-30
date_added | 20160430
allowed_date | 10292016
I have 3 columns, all of which are in different formats, so I'm trying to find a way to split the string in a custom way, since the date_added column is yyyymmdd and the allowed_date is mmddyyyy.
I've tried something along the lines of:
df_imigration.withColumn('cc', F.date_format(df_imigration.allowed_date.cast(dataType=t.TimestampType()), "yyyy-mm-dd"))
But with no success, and I'm kind of stuck trying to find the right or best way to solve this.
The t and F aliases are for the following imports:
from pyspark.sql import functions as F
from pyspark.sql import types as t

The problem with your code is that you are casting the date without specifying the date format.
To specify the format, you should use the function to_timestamp().
Here I have created a dataframe with three different formats and it worked.
from pyspark.sql import functions as f

df1 = spark.createDataFrame([("20201231", "12312020", "31122020"), ("20201231", "12312020", "31122020")], ["ID", "Start_date", "End_date"])
# note: in date patterns 'MM' means month and 'mm' means minutes
df1 = df1.withColumn('cc', f.date_format(f.to_timestamp(df1.ID, 'yyyyMMdd'), "yyyy-MM-dd"))
df1 = df1.withColumn('dd', f.date_format(f.to_timestamp(df1.Start_date, 'MMddyyyy'), "yyyy-MM-dd"))
df1.withColumn('ee', f.date_format(f.to_timestamp(df1.End_date, 'ddMMyyyy'), "yyyy-MM-dd")).show()
Output:
+--------+----------+--------+----------+----------+----------+
| ID|Start_date|End_date| cc| dd| ee|
+--------+----------+--------+----------+----------+----------+
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
+--------+----------+--------+----------+----------+----------+
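Applied back to the question's own columns (df_imigration, with date_added in yyyyMMdd and allowed_date in MMddyyyy), a minimal sketch of the same idea might look like this; the *_ts column names are just illustrative:
from pyspark.sql import functions as F

# hypothetical sketch: parse each string column with its own input pattern,
# then render the result as yyyy-MM-dd
df_imigration = (
    df_imigration
    .withColumn('date_added_ts', F.to_timestamp('date_added', 'yyyyMMdd'))
    .withColumn('allowed_date_ts', F.to_timestamp('allowed_date', 'MMddyyyy'))
    .withColumn('cc', F.date_format('allowed_date_ts', 'yyyy-MM-dd'))
)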
Hope it helps!

Related

cast big number in human readable format

I'm working with Databricks in a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values, and casting them as double makes them unreadable.
How can I convert them to a human-readable format while still being able to .sort() them? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])

(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
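If readability for debugging is the concern, one option (my assumption, not part of the original answer) is to keep the DecimalType column for sorting and add a separate display-only column with format_number, which inserts grouping separators:
# display-only column with thousands separators; sorting still uses decimal_column
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .withColumn('readable', F.format_number('decimal_column', 0))
    .sort('decimal_column')
    .show(truncate=False)
)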

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in the format 13/09/2021 20:45 into a timestamp of just the hour, 20:45.
Now I figured that I can do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe, and ts is the column we are speaking about) to convert my input, which is in DD/MM/YYYY HH:MM format, however the values returned are only null. I therefore realised that I need an input in the PySpark timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to convert it to a proper timestamp. Hence my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable to pyspark so that I can convert it to timestamp format?
There are different ways you can do that:
from pyspark.sql import functions as F

df = spark.createDataFrame([('13/09/2021 20:45',)], ['A'])

# use substring (characters 12-16 hold the HH:mm part)
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use a regex
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# use the datetime functions: parse with the real input pattern, then re-format as HH:mm
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
unix_timestamp may be of help for your problem.
Just try the approach from this related question: Convert pyspark string to date format
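For completeness, here is a minimal sketch that gives you an actual TimestampType column (assuming the dataframe and column are called q1 and ts as in the question); to_timestamp with the real input pattern is an alternative to the unix_timestamp route:
from pyspark.sql import functions as F

# parse dd/MM/yyyy HH:mm into a real TimestampType column, then keep only HH:mm
q1 = q1.withColumn('timestamp', F.to_timestamp('ts', 'dd/MM/yyyy HH:mm'))
q1 = q1.withColumn('hour', F.date_format('timestamp', 'HH:mm'))
q1.show()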

Changing date format in Spark returns incorrect result

I am trying to convert a string-type date from a CSV file to a date format first, and then to convert that to a particular expected date format. While doing so, for one row (for the first time) I saw that the date format change was changing the year value.
scala> df1.filter($"pt" === 2720).select("`date`").show()
+----------+
| date|
+----------+
|24/08/2019|
|30/12/2019|
+----------+
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), YYYY-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2020-12-30|
+------------------------------------------------------+
As you can see above, the two rows of data have 24/08/2019 and 30/12/2019 respectively, but after the explicit type cast and date format change, they give 2019-08-24 (which is correct) and 2020-12-30 (incorrect and unexpected).
Why does this problem occur and how can this be best avoided?
I solved this issue by changing the capital YYYY to yyyy in the expected date format parameter.
So, instead of
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"YYYY-MM-dd")).show()
I am now doing
df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
This is because, as per Java's SimpleDateFormat, the capital Y is parsed as the week year, whereas the lowercase y is parsed as the year.
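As a quick illustration of the week-year idea in plain Python (ISO week numbering, whose rules differ slightly from SimpleDateFormat's but show the same effect): 30 December 2019 already belongs to week 1 of 2020, while 24 August 2019 stays in 2019.
from datetime import date

# isocalendar() gives (week year, week number, weekday); [0] is the week year
print(date(2019, 12, 30).isocalendar()[0])  # 2020 - the date falls into week 1 of 2020
print(date(2019, 8, 24).isocalendar()[0])   # 2019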
So, now, when I run with a lowercase y in the year field, I get the expected result:
scala> df1.filter($"pt" === 2720).select(date_format(to_date($"`date`","dd/MM/yyyy"),"yyyy-MM-dd")).show()
+------------------------------------------------------+
|date_format(to_date(`date`, 'dd/MM/yyyy'), yyyy-MM-dd)|
+------------------------------------------------------+
| 2019-08-24|
| 2019-12-30|
+------------------------------------------------------+

PySpark SQL: add letters in a date-type value

I have epoch time values in a Spark dataframe, like 1569872588019, and I'm using PySpark SQL in a Jupyter notebook.
I'm using the from_unixtime method to convert it to date.
Here is my code:
SELECT from_unixtime(dataepochvalues/1000,'yyyy-MM-dd%%HH:MM:ss') AS date FROM testdata
The result is like: 2019-04-30%%11:09:11
But what I want is like: 2019-04-30T11:04:48.366Z
I tried to add T and Z instead of %% in the date format but failed.
How can I insert the T and Z letters?
You can specify those letters using single quotes. For your desired output, use the following date and time pattern:
"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
Using your example:
spark.sql(
"""SELECT from_unixtime(1569872588019/1000, "yyyy-MM-dd'T'HH:mm:ss'Z'") AS date"""
).show()
#+--------------------+
#| date|
#+--------------------+
#|2019-09-30T14:43:08Z|
#+--------------------+
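If the milliseconds from the original epoch value matter (the question's desired output ends in .366Z), one way, sketched here as an assumption rather than as part of the original answer, is to cast the divided value to a timestamp so the fractional seconds survive, then format it. Note that the quoted 'Z' is just literal text; the time is still rendered in the session time zone.
from pyspark.sql import functions as F

df = spark.createDataFrame([(1569872588019,)], ['dataepochvalues'])
df.select(
    F.date_format(
        (F.col('dataepochvalues') / 1000).cast('timestamp'),  # keeps the .019 fraction
        "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    ).alias('date')
).show(truncate=False)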

Spark 2.0: How to convert a DF Date/timestamp column to another date format in Scala?

For my learning, I have been using the below sample dataset.
+-------------------+-----+-----+-----+-----+-------+
| MyDate| Open| High| Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format like "YYYY-MON", and wrote it like this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After executing the code, I found the new column "New-Mydate", but I couldn't see the desired output format. Can you please help?
You need date_format instead of to_timestamp:
import org.apache.spark.sql.functions.date_format
val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate", date_format($"MyDate", "yyyy-MMM")).show(5)
Result:
+-------------------+----------+
| MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00| 2006-Jan|
+-------------------+----------+
Note: Three "M"s mean the month as a string (abbreviated name); if you want the month as a number, use only two "M"s.
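For reference, a rough PySpark equivalent of the same idea (dataframe and column names borrowed from the question; the sample row is the one shown above):
from pyspark.sql import functions as F

citiDataDF = spark.createDataFrame([('2006-01-03 00:00:00',)], ['MyDate'])
# date_format accepts a string column that is castable to a timestamp
citiDataDF.withColumn('New-Mydate', F.date_format('MyDate', 'yyyy-MMM')).show(5)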