AWS Glue job to map string to date and time format while converting from CSV to Parquet - PySpark

While converting from CSV to Parquet using an AWS Glue ETL job, the following fields are read as strings from the CSV and mapped to date and time types.
This is the actual CSV file:
After mapping and converting, the date field is empty and the time field is concatenated with today's date.
How can I convert these to the proper date and time formats?

It uses Presto data types, so the data should be in the correct format:
DATE - Calendar date (year, month, day). Example: DATE '2001-08-22'
TIME - Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone. Example: TIME '01:02:03.456'
TIMESTAMP - Instant in time that includes the date and time of day without a time zone. Values of this type are parsed and rendered in the session time zone. Example: TIMESTAMP '2001-08-22 03:04:05.321'
You may use the following (where col is a string holding the name of the column to convert):
from pyspark.sql.functions import to_timestamp, to_date, date_format
df = df.withColumn(col, to_timestamp(col, 'dd-MM-yyyy HH:mm'))
df = df.withColumn(col, to_date(col, 'dd-MM-yyyy'))
df = df.withColumn(col, date_format(col, 'HH:mm:ss'))
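For context, a minimal end-to-end sketch of how this could look inside a Glue PySpark job; the catalog names, column names, source format patterns, and the S3 output path below are assumptions for illustration, not taken from the question:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import to_date, to_timestamp, date_format

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table holding the CSV data.
dyf = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="my_csv_table")
df = dyf.toDF()

# Parse the string columns with the pattern they actually use in the CSV (assumed here).
df = df.withColumn("date_col", to_date("date_col", "dd-MM-yyyy"))
# Spark has no TIME type, so parse the time and render it back as an HH:mm:ss string.
df = df.withColumn("time_col", date_format(to_timestamp("time_col", "HH:mm"), "HH:mm:ss"))

df.write.parquet("s3://my-bucket/output/")  # hypothetical output location

The key point is that the parse pattern passed to to_date/to_timestamp must match the strings actually present in the CSV.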

Related

Timestamp converted to UTC while loading to parquet

I am loading data to Parquet through Spark.
dataFrame.write.parquet(path)
My data has a timestamp column; while writing to Parquet, Spark converts the timestamp to the UTC timezone and then stores it.
actual time: 2020-10-21 00:00:00.000
UTC time: 2020-10-21T05:30:00.000+05:30
I see the Spark conf spark.sql.session.timeZone is set to UTC.
Is there any way to turn off this conversion?
I want to load the timestamp as-is, without converting it to any other timezone. How do I do that?
See documentation here:
https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html
When writing timestamp values out to non-text data sources like Parquet, the values are just instants (like timestamp in UTC) that have no time zone information. If you write and read a timestamp value with a different session time zone, you may see different values of the hour, minute, and second fields, but they are the same concrete time instant.
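In practical terms, that means there is nothing to switch off: Parquet stores only the instant, and the session time zone decides how it is displayed. You can control the rendering by pinning the session time zone to the same value when writing and when reading. A minimal sketch, assuming the data represents IST wall-clock time (+05:30, as in the question's example); the paths are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Render and parse timestamps in a fixed zone instead of the cluster default.
# Use the same setting in the job that writes and in the job that reads the Parquet files.
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")  # assumption: source data is IST wall-clock

df = spark.createDataFrame([("2020-10-21 00:00:00",)], ["ts"]).withColumn("ts", to_timestamp("ts"))

df.write.mode("overwrite").parquet("/tmp/ts_parquet")        # stored as a zone-less instant
spark.read.parquet("/tmp/ts_parquet").show(truncate=False)   # rendered back as 2020-10-21 00:00:00 in the same session zone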

Copying timestamp format from Avro to Redshift

I am trying to copy an Avro file to Redshift using the COPY command. The file has a column that is of the type:
{'name': 'timestamp',
'type': ['null', {'logicalType': 'timestamp-millis', 'type': 'long'}]}],
Redshift variable type: "timestamp" timestamptz
When I run the following command, the COPY fails:
COPY table_name
from 'fil_path.avro'
iam_role 'the_role'
FORMAT AS avro 'auto'
raw field value: 1581306474335
Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF]
However, if I add the following line, it works:
timeformat 'epochmillisecs'
I tried putting my timestamp in microseconds, which should be the base supported epoch resolution, but that fails as well, and I didn't find an appropriate parameter name (epochmicrosecs didn't seem to do the job).
My question is: why is that?
Furthermore, I have another field that is causing a problem: a date field, which is apparently saved as a number of days in the Avro file (7305), gives the following error:
Redshift variable type: "birthdate" date
avro: 'date_of_birth', 'type': ['null', {'type': 'int', 'logicalType': 'date'}]}
Invalid Date Format - length must be 10 or more
Firstly, about the time format:
As the docs state:
COPY command attempts to implicitly convert the strings in the source data to the data type of the target column. If you need to specify a conversion that is different from the default behavior, or if the default conversion results in errors, you can manage data conversions by specifying the following parameters.
First Solution:
Redshift doesn't recognize epoch time by default, so it can't extract the year, month, day, etc. from the epoch value to build a TIMESTAMP. As stated by the docs:
If your source data is represented as epoch time, that is the number of seconds or milliseconds since January 1, 1970, 00:00:00 UTC, specify 'epochsecs' or 'epochmillisecs'.
These are the formats that Redshift can convert using automatic recognition.
A TIMESTAMP needs the source to look like YYYYMMDD HHMISS (e.g. 19960108 040809) for the fields to be extracted correctly; that's what the error Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF] is telling you. Epoch time, on the other hand, is just seconds or milliseconds since January 1, 1970, and Redshift doesn't know how to extract the fields from it unless you tell it with TIMEFORMAT.
Microseconds are not supported as a TIMEFORMAT parameter in Redshift.
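Putting the question's pieces together, the working COPY would look something like this (table name, file path, and role are the placeholders from the question):

COPY table_name
FROM 'fil_path.avro'
IAM_ROLE 'the_role'
FORMAT AS AVRO 'auto'
TIMEFORMAT 'epochmillisecs';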
Second Solution:
You won't need to pass TIMEFORMAT to the COPY command; instead, load the epoch time into your staging tables as VARCHAR or TEXT.
Then, when inserting the epoch time from your staging tables into the schema tables, convert it like this: TIMESTAMP 'epoch' + epoch_time/1000 * interval '1 second' AS time (see the sketch below).
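A minimal sketch of that staging flow; the table and column names are hypothetical, and the division by 1000 assumes epoch milliseconds as in the question:

-- Staging table keeps the raw epoch value as text.
CREATE TABLE staging_events (event_ts_epoch VARCHAR(20));

-- Convert while loading the final table.
INSERT INTO events (event_ts)
SELECT TIMESTAMP 'epoch' + event_ts_epoch::BIGINT / 1000 * INTERVAL '1 second'
FROM staging_events;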
Secondly, about the date field:
The DATE data type is specified as a calendar date (year, month, day), as stated by the docs. As a result, it can't be a number of days, and it can't be shorter than 10 characters (as in 2021-03-04); that's what the error tells us: Invalid Date Format - length must be 10 or more.
The solution for the date field:
You need a work-around: pass the number of days as VARCHAR or TEXT to your staging tables.
When loading the schema tables from the staging tables, clean the data by converting the number of days to a DATE using TO_CHAR: TO_DATE(TO_CHAR(number_of_days, '9999-99-99'), 'YYYY-MM-DD')
As a result, the value will be a valid DATE in your schema tables.
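A sketch of that staging step, with hypothetical table and column names. Note that the TO_CHAR pattern above formats the value as an already-composed number rather than counting days, so if the field really is days since 1970-01-01 (7305 corresponds to 1990-01-01), a DATEADD from the epoch date is an alternative worth considering; that alternative is my suggestion, not part of the original answer:

-- Staging table keeps the raw day count as text.
CREATE TABLE staging_people (birthdate_days VARCHAR(10));

-- Convert while loading the final table: days since 1970-01-01 -> DATE.
INSERT INTO people (birthdate)
SELECT DATEADD(day, birthdate_days::INT, '1970-01-01'::DATE)::DATE
FROM staging_people;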

Spark Scala - convert Timestamp with milliseconds to Timestamp without milliseconds

I have a column in Timestamp format that includes milliseconds.
I would like to reformat my timestamp column so that it does not include milliseconds. For example, if my Timestamp column has values like 2019-11-20T12:23:13.324+0000, I would like the reformatted Timestamp column to have values of 2019-11-20T12:23:13
Is there a straightforward way to perform this operation in Spark/Scala? I have found lots of posts on converting a string to a timestamp, but not on changing the format of a timestamp.
You can try date_trunc: date_trunc("second", col) drops the milliseconds while keeping the column a timestamp (trunc itself only truncates to coarser units such as month or year).
See more examples: https://sparkbyexamples.com/spark/spark-date-functions-truncate-date-time/
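A minimal Scala sketch of that idea; the sample data and the column name event_ts are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, date_trunc, to_timestamp}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical data with millisecond-precision timestamps.
val df = Seq("2019-11-20 12:23:13.324").toDF("event_ts")
  .withColumn("event_ts", to_timestamp(col("event_ts")))

// Zero out the milliseconds but keep the timestamp type.
val truncated = df.withColumn("event_ts", date_trunc("second", col("event_ts")))

// If a string such as 2019-11-20T12:23:13 is wanted instead, render it explicitly.
val asString = truncated.withColumn("event_ts_str",
  date_format(col("event_ts"), "yyyy-MM-dd'T'HH:mm:ss"))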

Is Presto's from_unixtime function right?

I am running a query that converts a bigint to a timestamp, and the value is '1494257400'.
I am using a Presto query, but Presto does not give the correct result for the from_unixtime() function.
Hive version:
select from_unixtime(1494257400) result: '2017-05-09 00:30:00'
Presto version:
select from_unixtime(1494257400) result : '2017-05-08 08:30:00'
Hive gives the correct result, but Presto does not. How can I solve this?
Presto's from_unixtime returns the date at UTC, whereas Hive's returns the date in your local time zone.
According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF, from_unixtime:
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
Hive's output is not ideal, because ISO-formatted strings should carry the offset whenever they represent anything other than GMT+00.
With Hive, you can use to_utc_timestamp({any primitive type} ts, string timezone) to convert your timestamp to the proper time zone. Take a look at the manual linked above.

Date Column Split in Talend

So I have one big file (13 million rows) with dates formatted as 2009-04-08T01:57:47Z. I would now like to split it into 2 columns, one with just the date as dd-MM-yyyy and the other with the time only as hh:MM.
How do I do it?
You can simply use tMap and parseDate/formatDate to do what you want. It is neither necessary nor recommended to implement your own date parsing logic with regexes.
First of all, parse the timestamp using the format yyyy-MM-dd'T'HH:mm:ss'Z'. Then you can use the parsed Date to output the formatted date and time information you want:
dd-MM-yyyy for the date
HH:mm for the time (Note: you mixed up the case in your question, MM stands for the month)
If you put that logic into a tMap, you will get the following:
Input:
timestamp 2009-04-08T01:57:47Z
Output:
date 08-04-2009
time 01:57
NOTE
Note that when you parse the timestamp with the mentioned format string (yyyy-MM-dd'T'HH:mm:ss'Z'), the time zone information is not parsed ('Z' is treated as a literal). Many applications do not properly set the time zone information anyway and always use 'Z', so this can be safely ignored in most cases.
If you need proper time zone handling and by any chance are able to use Java 7, you may use yyyy-MM-dd'T'HH:mm:ssXXX instead to parse your timestamp.
I'm guessing Talend is falling over on the T and Z parts of your date-time stamp, but this is easily resolved.
As your date-time stamp follows a regular pattern, we can easily extract the date and time from it with a tExtractRegexFields component.
You'll want to use "^([0-9]{4}-[0-9]{2}-[0-9]{2})T([0-9]{2}:[0-9]{2}):[0-9]{2}Z" as your regex, which captures the date in yyyy-MM-dd format and the time as HH:mm (you'll want to replace the date-time field with a date field and a time field in the schema).
Then, to get the date into your required format, use a tMap with TalendDate.formatDate("dd-MM-yyyy", TalendDate.parseDate("yyyy-MM-dd", row7.date)) to return a string in the dd-MM-yyyy format.