Pyspark: Output to csv -- Timestamp format is different

I am working with a dataset with the following Timestamp format: yyyy-MM-dd HH:mm:ss
When I output the data to csv, the format changes to something like this: 2019-04-29T00:15:00.000Z
Is there any way to get it back to the original format, like 2019-04-29 00:15:00?
Do I need to convert that column to string and then push it to csv?
I am saving my file to csv like so:
df.coalesce(1).write.format("com.databricks.spark.csv"
).mode('overwrite'
).option("header", "true"
).save("date_fix.csv")

Alternative
Spark >= 2.0.0
Set option("timestampFormat", "yyyy-MM-dd HH:mm:ss") for format("csv"):
df.coalesce(1).write.format("csv"
).mode('overwrite'
).option("header", "true"
).option("timestampFormat", "yyyy-MM-dd HH:mm:ss"
).save("date_fix.csv")
As per the documentation:
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
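If you later need to read the file back with the column typed as a timestamp rather than a string, the same option can be supplied on the read side. A minimal sketch, assuming the original DataFrame df and its schema are still in scope:
# Hedged sketch: read the CSV back with the same timestampFormat so the
# column is parsed as a timestamp again (reuses df.schema from the writer side).
spark.read.format("csv") \
    .option("header", "true") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .schema(df.schema) \
    .load("date_fix.csv")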
Spark < 2.0.0
Set option("dateFormat", "yyyy-MM-dd HH:mm:ss") for format("com.databricks.spark.csv"):
df.coalesce(1).write.format("com.databricks.spark.csv"
).mode('overwrite'
).option("header", "true"
).option("dateFormat", "yyyy-MM-dd HH:mm:ss"
).save("date_fix.csv")
As per the documentation:
dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf()
Reference: the spark-csv README.

Yes, that's correct. The easiest way to achieve this is using pyspark.sql.functions.date_format such as:
import pyspark.sql.functions as f

df.withColumn(
    "date_column_formatted",
    f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)
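Putting it together with the writer from the question, a minimal sketch (this assumes the timestamp column is literally named "timestamp"; adjust the name to your schema):
# Format the timestamp column as a string, then write the CSV as in the question.
df_out = df.withColumn("timestamp", f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
df_out.coalesce(1).write.format("com.databricks.spark.csv") \
    .mode('overwrite') \
    .option("header", "true") \
    .save("date_fix.csv")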
More info about it here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.date_format.
Hope this helps!

Related

Talend: format yyyy-MM-dd'T'HH:mm:ss.SSSz to yyyy-mm-dd HH:mm:ss

I am trying to change the date format in the tXMLMap component but it's not working.
I want to change the date format from yyyy-MM-dd'T'HH:mm:ss.SSSz to yyyy-mm-dd HH:mm:ss.
Expected output: yyyy-mm-dd HH:mm:ss
You can parse your string to a date using your source pattern and then format that date to a string using your target pattern (note that the month must be MM, since lowercase mm means minutes):
TalendDate.formatDate("yyyy-MM-dd HH:mm:ss", TalendDate.parseDate("yyyy-MM-dd'T'HH:mm:ss.SSSz", myDateString))
In almost all coding languages a format is text, while a date is a number. That means you must first make a date out of the string before setting the new format of that date. But in your case the 'T' is some kind of special marker that needs to be replaced with a blank space. I have no idea what it would look like in Talend, but in VB it would look like this:
' from yyyy-MM-dd'T'HH:mm:ss.SSSz to yyyy-mm-dd HH:mm:ss
DateTxt = "2022-12-01'T'22:45:10"
DateTxt = Replace(DateTxt, "'T'", " ")
MyDate = CDate(DateTxt)
MsgBox Format(MyDate, "yyyy-mm-dd HH:mm:ss")

Logstash _dateparsefailure matching timestamp with date plugin

I have a JSON input with a string field timestamp that I want to parse into the date field @timestamp in Elasticsearch.
The input timestamp field: 2021-06-20 03:37:14.595000+00:00
This is how I've set up the filter in logstash:
date {
  match => ["timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss.SSSSSS+ZZ:ZZ", "yyyy-MM-dd HH:mm:ss.SSSSSS"]
  target => "@timestamp"
}
The input string is in ISO8601 format, so using only "ISO8601" should work. However, I'm getting the _dateparsefailure. Therefore, I've also tried with the patterns "yyyy-MM-dd HH:mm:ss.SSSSSS+ZZ:ZZ" and "yyyy-MM-dd HH:mm:ss.SSSSSS", with no luck.
I've also tried to set the target to something else, like my_timestamp, in case the value of @timestamp is being overwritten, but that didn't work either.
Could you help me understand why this does not work?
ZZ matches an offset with a colon between the hours and minutes (e.g. +00:00), and it already includes the sign, so you should use "yyyy-MM-dd HH:mm:ss.SSSSSSZZ".

How to make timestamp date column into preferred format dd/MM/yyyy?

I have a column YDate in the form yyyy-MM-dd HH:mm:ss (timestamp type) but would like to convert it to dd/MM/yyyy.
I tried this:
df = df.withColumn('YDate', F.to_date(F.col('YDate'), 'dd/MM/yyyy'))
but I get yyyy-MM-dd.
How can I effectively do this?
Use date_format instead:
df = df.withColumn('YDate',F.date_format(F.col('YDate'),'dd/MM/yyyy'))
to_date converts from the given format, while date_format converts into the given format.
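A small sketch to illustrate the difference (the sample value and column name are just for illustration):
from pyspark.sql import functions as F

sample_df = spark.createDataFrame([('2021-03-01 12:34:56',)], ['YDate'])
# to_date parses the string *from* a format and yields a DateType (printed as yyyy-MM-dd)
sample_df.select(F.to_date('YDate').alias('as_date')).show()                       # 2021-03-01
# date_format renders a date/timestamp/string *into* the requested string format
sample_df.select(F.date_format('YDate', 'dd/MM/yyyy').alias('as_string')).show()   # 01/03/2021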
You can use the date_format function from the pyspark.sql.functions module.
For more information about date formats you can refer to the Date Format Documentation.
Below is a code snippet to solve your use case.
from pyspark.sql import functions as F
df = spark.createDataFrame([('2015-12-28 23:59:59',)], ['YDate'])
df = df.withColumn('YDate', F.date_format('YDate', 'dd/MM/yyyy'))
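For reference, showing the result of the snippet above should print something like:
df.show()
#+----------+
#|     YDate|
#+----------+
#|28/12/2015|
#+----------+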

pyspark: How to filter rows based on HH:mm:ss portion in timestamp column

I have a dataframe in pyspark that has a timestamp string column in the following format:
"11/21/2018 07:21:49 PM"
The time is shown with an AM/PM marker, but I want to work with it in 24-hour format.
I want to filter the rows in the dataframe based on only the time portion of this string timestamp regardless of the date. For example I want to keep all rows that fall between the hours of 2:00pm and 4:00pm inclusive.
I tried the code below to extract the HH:mm:ss portion and use the between function, but it is not working.
# Grabbing only time portion from datetime column
import pyspark.sql.functions as F

time_format = "HH:mm:ss"
split_col = F.split(df['datetime'], ' ')
df = df.withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))
df = df.withColumn('Timestamp', F.from_unixtime(F.unix_timestamp('Time', format=time_format)))
df.filter(F.col("Timestamp").between('14:00:00', '16:00:00')).show()
Any ideas on how to filter rows based only on the HH:mm:ss portion of a timestamp column, regardless of the actual date, would be very appreciated.
Format your timestamp to HH:mm:ss, then filter using the between clause.
Example:
df = spark.createDataFrame([("11/21/2018 07:21:49 PM",), ("11/22/2018 04:21:49 PM",), ("11/23/2018 12:21:49 PM",)], ["ts"])

from pyspark.sql.functions import *

df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("12:00", "16:00")).\
    show(truncate=False)

#+----------------------+--------+
#|ts                    |tt      |
#+----------------------+--------+
#|11/23/2018 12:21:49 PM|12:21:49|
#+----------------------+--------+
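The same idea applies to the window from the question, 2:00pm to 4:00pm inclusive; comparing the strings works because zero-padded "HH:mm:ss" values sort lexicographically. Note that with the three sample rows above (19:21:49, 16:21:49 and 12:21:49) this particular filter would return no rows:
# Same approach with the asker's own 14:00-16:00 window, inclusive on both ends.
df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("14:00:00", "16:00:00")).\
    show(truncate=False)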

pyspark: Convert string to date format without minute, second and hour

Hello, I would like to convert a string date to a date format:
for example from 190424 to 2019-01-24
I tried with this code:
tx_wd_df = tx_wd_df.select(
    'dateTransmission',
    from_unixtime(unix_timestamp('dateTransmission', 'yymmdd')).alias('dateTransmissionDATE')
)
But I got this format: 2019-01-24 00:04:00
I would like only 2019-01-24.
Any ideas please?
Thanks
tx_wd_df.show(truncate=False)
You can simply use to_date(). It parses the string according to the given date format string and returns a date, discarding the time portion.
import pyspark.sql.functions as F
date_column = "dateTransmission"
# MM because mm in Java Simple Date Format is minutes, and MM is months
date_format = "yyMMdd"
df = df.withColumn(date_column, F.to_date(F.col(date_column), date_format))
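As a quick sanity check, a minimal sketch on a throwaway DataFrame (the column name simply mirrors the question):
from pyspark.sql import functions as F

# Parse the two-digit-year string with yyMMdd and keep only the date part.
sample_df = spark.createDataFrame([('190424',)], ['dateTransmission'])
sample_df = sample_df.withColumn('dateTransmission', F.to_date(F.col('dateTransmission'), 'yyMMdd'))
sample_df.show()
#+----------------+
#|dateTransmission|
#+----------------+
#|      2019-04-24|
#+----------------+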