pyspark: Convert string to date format without hour, minute and second

Hello, I would like to convert a date stored as a string to a date type,
for example from 190424 to 2019-04-24.
I tried this code:
tx_wd_df = tx_wd_df.select(
    'dateTransmission',
    from_unixtime(unix_timestamp('dateTransmission', 'yymmdd')).alias('dateTransmissionDATE')
)
tx_wd_df.show(truncate=False)
But I get this instead: 2019-01-24 00:04:00
I only want the date, i.e. 2019-04-24, without the time part.
Any ideas, please?
Thanks

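You can simply use to_date(). It parses the string with the pattern you pass and returns a date column, so the result has no time part. The pattern in the question uses mm, which means minutes; that is why 190424 came out as 24 January with a time of 00:04:00.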
import pyspark.sql.functions as F
date_column = "dateTransmission"
# MM because mm in Java Simple Date Format is minutes, and MM is months
date_format = "yyMMdd"
df = df.withColumn(date_column, F.to_date(F.col(date_column), date_format))
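As a quick sanity check outside Spark, the same parse can be reproduced with Python's standard library. This is only a sketch for comparing results: parse_yymmdd is a hypothetical helper, and %y%m%d is the strptime counterpart of the Spark pattern yyMMdd, not something Spark itself accepts.

```python
from datetime import date, datetime


def parse_yymmdd(value: str) -> date:
    """Parse a two-digit-year string such as '190424' into a date (no time part)."""
    # %y = two-digit year, %m = month, %d = day of month
    return datetime.strptime(value, "%y%m%d").date()


print(parse_yymmdd("190424"))  # 2019-04-24
```

If the Spark column and this helper disagree for the same input, the pattern letters are the first thing to check.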

Related

Convert using unixtimestamp to Date

I have a dataframe column that holds values such as 1632838270314.
I want to convert it to a date in 'yyyy-MM-dd' format. I have this so far, but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field that date_key is derived from.
The method unix_timestamp() converts a timestamp or date string into the number of seconds since 1970-01-01 (the "epoch"). I understand that you want to do the opposite.
Your example value 1632838270314 appears to be milliseconds since the epoch.
In that case, divide by 1000 to get seconds and cast the result to a timestamp:
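from pyspark.sql import Row, functions as F
# sql_context is assumed to be an existing SQLContext (a SparkSession works the same way)
df = sql_context.createDataFrame([
    Row(unix_in_ms=1632838270314),
])
(
    df
    # milliseconds -> seconds, then cast to a timestamp
    .withColumn('timestamp_type', (F.col('unix_in_ms')/1e3).cast('timestamp'))
    .withColumn('date_type', F.to_date('timestamp_type'))
    .withColumn('string_type', F.col('date_type').cast('string'))
    # round trip: date string back to seconds since the epoch
    .withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
    .show(truncate=False)
)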
# Output (the values depend on the session time zone; this run was at UTC+2)
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms   |timestamp_type         |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000       |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
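The arithmetic can be cross-checked in plain Python. The sketch below is not Spark code: epoch_ms_to_timestamp is a hypothetical helper, and the UTC+2 offset is an assumption inferred from the 16:11:10 shown in the output above.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_timestamp(ms: int, tz: timezone = timezone.utc) -> datetime:
    """Interpret ms as milliseconds since the epoch and return an aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=tz)


utc_value = epoch_ms_to_timestamp(1632838270314)
# Assumed offset of two hours, matching the table in the answer
local_value = epoch_ms_to_timestamp(1632838270314, timezone(timedelta(hours=2)))

print(utc_value.strftime("%Y-%m-%d %H:%M:%S"))    # 2021-09-28 14:11:10
print(local_value.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-09-28 16:11:10
print(utc_value.date().isoformat())               # 2021-09-28
```

For values close to midnight the resulting date can differ between time zones, so it is worth knowing which zone the Spark session uses.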

How to make timestamp date column into preferred format dd/MM/yyyy?

I have a column YDate of timestamp type (yyyy-MM-dd HH:mm:ss) and would like to display it as dd/MM/yyyy.
I tried this:
df = df.withColumn('YDate',F.to_date(F.col('YDate'),'dd/MM/yyyy'))
but I still get yyyy-MM-dd.
How can I do this?
Use date_format instead:
df = df.withColumn('YDate',F.date_format(F.col('YDate'),'dd/MM/yyyy'))
to_date parses a string in the given format into a date, while date_format renders a date or timestamp as a string in the given format.
You can use the date_format function from pyspark.sql.functions.
For more information about the pattern letters, see the Date Format Documentation.
The snippet below covers your use case:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2015-12-28 23:59:59',)], ['YDate'])
df = df.withColumn('YDate', F.date_format('YDate', 'dd/MM/yyyy'))
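The parse-versus-format distinction has a direct analogue in Python's standard library, where strptime plays the role of to_date and strftime the role of date_format. A minimal sketch, using the same sample value as the snippet above (the strptime/strftime directives are Python's, not Spark's pattern letters):

```python
from datetime import datetime

raw = "2015-12-28 23:59:59"

# Parsing: string -> datetime, the pattern describes the INPUT
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Formatting: datetime -> string, the pattern describes the OUTPUT
formatted = parsed.strftime("%d/%m/%Y")

print(formatted)  # 28/12/2015
```

The result of formatting is a string, so any later date arithmetic should be done on the original column rather than on the formatted one.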

pyspark How to filter rows based on HH:mm:ss portion in timestamp column

I have a dataframe in pyspark that has a timestamp string column in the following format:
"11/21/2018 07:21:49 PM"
Note that this is 12-hour time with an AM/PM marker.
I want to filter the rows in the dataframe based on only the time portion of this string timestamp regardless of the date. For example I want to keep all rows that fall between the hours of 2:00pm and 4:00pm inclusive.
I tried the below to extract the HH:mm:ss and use the function between but it is not working.
# Grabbing only time portion from datetime column
import pyspark.sql.functions as F
time_format = "HH:mm:ss"
split_col = F.split(df['datetime'], ' ')
df = df.withColumn('Time', F.concat(split_col.getItem(1),F.lit(' '),split_col.getItem(2)))
df = df.withColumn('Timestamp', from_unixtime(unix_timestamp('Time', format=time_format)))
df.filter(F.col("Timestamp").between('14:00:00','16:00:00')).show()
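Any ideas on how to filter rows based only on the HH:mm:ss portion of a timestamp column, regardless of the actual date, would be much appreciated.
Parse the string with a 12-hour pattern (hh with the AM/PM marker a), reformat it as HH:mm:ss, then filter using between. The comparison is done on strings, which works because HH:mm:ss is zero-padded; write the upper bound as "16:00:00" if a value of exactly 16:00:00 must be included.
Example: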
df=spark.createDataFrame([("11/21/2018 07:21:49 PM",),("11/22/2018 04:21:49 PM",),("11/23/2018 12:21:49 PM",)],["ts"])
from pyspark.sql.functions import *
df.withColumn("tt",from_unixtime(unix_timestamp(col("ts"),"MM/dd/yyyy hh:mm:ss a"),"HH:mm:ss")).\
filter(col("tt").between("12:00","16:00")).\
show()
#+----------------------+--------+
#|ts |tt |
#+----------------------+--------+
#|11/23/2018 12:21:49 PM|12:21:49|
#+----------------------+--------+
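The same filter can be sketched in plain Python to check which rows should survive. This is a comparison aid rather than Spark code: time_of_day and in_window are hypothetical helpers, and the directives %I and %p are Python's equivalents of hh and a.

```python
from datetime import datetime, time


def time_of_day(ts: str) -> time:
    """Parse a 12-hour timestamp string and keep only the time part."""
    return datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p").time()


def in_window(ts: str, start: time, end: time) -> bool:
    """True when the time part of ts lies in [start, end], ignoring the date."""
    return start <= time_of_day(ts) <= end


rows = [
    "11/21/2018 07:21:49 PM",
    "11/22/2018 04:21:49 PM",
    "11/23/2018 12:21:49 PM",
]
kept = [ts for ts in rows if in_window(ts, time(12, 0), time(16, 0))]
print(kept)  # ['11/23/2018 12:21:49 PM']
```

Comparing time objects avoids the string-ordering subtleties of the Spark version, at the cost of leaving the dataframe.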

Pyspark: Output to csv -- Timestamp format is different

I am working with a dataset with the following Timestamp format: yyyy-MM-dd HH:mm:ss
When I output the data to csv the format changes to something like this: 2019-04-29T00:15:00.000Z
Is there any way to get it to the original format like: 2019-04-29 00:15:00
Do I need to convert that column to string and then push it to csv?
I am saving my file to csv like so:
df.coalesce(1).write.format("com.databricks.spark.csv"
).mode('overwrite'
).option("header", "true"
).save("date_fix.csv")
Alternatively, set the output format on the writer.
Spark >= 2.0.0:
Set option("timestampFormat", "yyyy-MM-dd HH:mm:ss") with format("csv"):
df.coalesce(1).write.format("csv"
).mode('overwrite'
).option("header", "true"
).option("timestampFormat", "yyyy-MM-dd HH:mm:ss"
).save("date_fix.csv")
As per the documentation:
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
Spark < 2.0.0:
Set option("dateFormat", "yyyy-MM-dd HH:mm:ss") with format("com.databricks.spark.csv"):
df.coalesce(1).write.format("com.databricks.spark.csv"
).mode('overwrite'
).option("header", "true"
).option("dateFormat", "yyyy-MM-dd HH:mm:ss"
).save("date_fix.csv")
As per the documentation:
dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf()
ref - readme
Yes, converting the column to a string before writing works. The easiest way is pyspark.sql.functions.date_format, for example:
import pyspark.sql.functions as f
df.withColumn(
"date_column_formatted",
f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)
More info about it here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.date_format.
Hope this helps!
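The idea of formatting the timestamp as a string before it reaches the CSV writer can be illustrated without Spark. The sketch below uses Python's csv module; write_rows and the column names are hypothetical, and the sample value is the one from the question.

```python
import csv
import io
from datetime import datetime


def write_rows(rows, fmt="%Y-%m-%d %H:%M:%S"):
    """Write (id, timestamp) pairs as CSV text, rendering timestamps with fmt."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "timestamp"])
    for row_id, ts in rows:
        # Format explicitly so the writer never picks its own representation
        writer.writerow([row_id, ts.strftime(fmt)])
    return buffer.getvalue()


csv_text = write_rows([(1, datetime(2019, 4, 29, 0, 15, 0))])
print(csv_text)
```

Whichever approach is used in Spark, the column arrives in the file as text, so readers of the CSV need to know the format to parse it back.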

Hive date format

Has anybody converted a date from mm/dd/yyyy hh:mm format to
yyyy-mm-dd hh:mm:ss format using a Hive query?
I have a string with the date in the slash-separated format and need to add a duration to it.
Do this:
select
regexp_replace('2015/04/15','(\\d{4})\\/{1}(\\d{2})\\/{1}(\\d{2})','$1-$2-$3') as dt
from x;
INPUT: 2015/04/15
OUTPUT: 2015-04-15
The pattern captures four digits (\d{4}), then two (\d{2}), then two more (\d{2}) from the original string and writes them back in the same order separated by dashes.
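The capture-group substitution can be tried out in Python before running it in Hive. This is a sketch with a hypothetical helper, slashes_to_dashes; Python writes backreferences as \1 where Hive uses $1. The second helper, reformat_us_datetime, is an assumption about what the question is after (mm/dd/yyyy hh:mm in, yyyy-MM-dd HH:mm:ss out) and uses an invented sample value.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"(\d{4})/(\d{2})/(\d{2})")


def slashes_to_dashes(value: str) -> str:
    """Rewrite yyyy/MM/dd as yyyy-MM-dd using three capture groups."""
    return DATE_PATTERN.sub(r"\1-\2-\3", value)


def reformat_us_datetime(value: str) -> str:
    """Parse mm/dd/yyyy hh:mm (24-hour assumed) and render yyyy-MM-dd HH:mm:ss."""
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(slashes_to_dashes("2015/04/15"))           # 2015-04-15
print(reformat_us_datetime("04/15/2015 10:30"))  # 2015-04-15 10:30:00
```

Parsing into a real datetime, as in the second helper, is the safer route when a duration has to be added afterwards, since a regex only rearranges characters.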