How to convert string to date in AWS Glue - pyspark

I have a CSV file in AWS S3. When I run a crawler, the field with dates like 01/01/2016 04.21 is picked up as a string in AWS Glue.
How can I change it to a date type?
I tried the "modify schema" button in AWS Glue, but it ends up producing a blank field.

Convert the DynamicFrame to a PySpark DataFrame and use PySpark for everything; it's easier:
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
df = dyf.toDF()
# columnname holds the name of the string date column; withColumn expects the name as a string
df = df.withColumn(columnname, from_unixtime(unix_timestamp(col(columnname), "dd/MM/yyyy hh.mm")))
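If the rest of the Glue job needs a DynamicFrame again (for example to write through a Glue sink), you can convert back after the cast; a minimal sketch, assuming glueContext is the GlueContext already created in the job script:
from awsglue.dynamicframe import DynamicFrame
# Wrap the corrected DataFrame back into a DynamicFrame for downstream Glue transforms/sinks
dyf = DynamicFrame.fromDF(df, glueContext, "dyf_converted")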

Related

How to reduce a week from Rundate in AWS Glue Pyspark

I have a scenario where a rundate value is passed into an AWS Glue job in 'YYYY-MM-DD' format.
Let's say 2021-04-19.
I am reading this rundate with 'datetime.strptime(rundate, "%y-%m-%d")'.
Now I want to create 2 variables out of it, variable A and variable B, such that:
Variable A = rundate - 2 weeks (should be saved in YYYYMMDD format)
Variable B = rundate - 1 week (should be saved in YYYYMMDD format)
and then use these variables to filter the data in a data frame.
Use the datetime library's timedelta to subtract weeks/days etc. from your rundate.
Example:
Using Python:
import datetime
varA=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(days=14),"%Y-%m-%d")
#'2021-04-05'
varB=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(days=7),"%Y-%m-%d")
#'2021-04-12'
Use "%Y%m%d" as the strftime output format if you need YYYYMMDD without the dashes.
Using pyspark's Spark session:
rundate='2021-04-19'
varA=spark.sql(f"select string(date_sub('{rundate}',14))").collect()[0][0]
#'2021-04-05'
varB=spark.sql(f"select string(date_sub('{rundate}',7))").collect()[0][0]
#'2021-04-12'
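To then use these variables for filtering the DataFrame, something like the sketch below should work; event_date is just a placeholder for whatever date column your data actually has:
from pyspark.sql import functions as F
# keep rows between rundate - 2 weeks (varA) and rundate - 1 week (varB)
filtered_df = df.filter((F.col("event_date") >= F.lit(varA)) & (F.col("event_date") < F.lit(varB)))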

Convert to date in cloud datafusion

How do we convert a string to a date in Cloud Data Fusion?
I have a column with a value like 20191120 (yyyyMMdd format) and I want to load it into a BigQuery table as a date. The table column's datatype is also DATE.
What I have tried so far: I converted the string to a timestamp using "parse-as-simple-date", then tried to convert it with format-date to "yyyy-MM-dd", but that step turns it back into a string and the final load fails. I have even tried explicitly marking the column as date in the output schema, but it fails at runtime.
I also tried keeping it as a timestamp in the pipeline and loading it into the BigQuery DATE column.
The error that came up was that field dt_1 is incompatible with Avro integer. Is Data Fusion internally converting the extract into Avro before loading? Avro does not have a date datatype, so is that what is causing the issue?
Adding answer for posterity:
You can try doing this:
Go to the LocalDateTime column in Wrangler
Open the dropdown and click on "Custom Transform"
Type timestamp.toLocalDate() (timestamp being the column name)
After the last step it should be converted into a LocalDate type, which you can write to BigQuery. Hope this helps.
For this specific date format, the Wrangler Transform directive would be:
parse-as-simple-date date_field_dt yyyyMMdd
set-column date_field_dt date_field_dt.toLocalDate()
The second line is required if the destination is of type Date.
Skip empty values:
set-column date_field_dt empty(date_field_dt) ? date_field_dt : date_field_dt.toLocalDate()
References:
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-simple-date.md
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-date.md
You could try to parse your input data with Data Fusion using Wrangler.
In order to test it out, I have replicated a workflow where a Data Fusion pipeline is fed with data coming from BigQuery. This data is then parsed to the proper type and exported back to BigQuery. Note that the public dataset is "austin_311" and I have used the "311_request" table, as some of its columns are TIMESTAMP type.
The steps I have done are the following:
I have queried a public dataset that contained TIMESTAMP data using:
select * from `bigquery-public-data.austin_311.311_request`
limit 1000;
I have uploaded it to Google Cloud Storage.
I have created a new Data Fusion batch pipeline following this.
I have used Wrangler to parse the CSV data with the custom 'Simple date' format yyyy-MM-dd HH:mm:ss.
I have exported Pipeline results to BigQuery.
This qwiklab has helped me through the steps.
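For reference, the Wrangler parse step above maps onto the same directive shown earlier in this thread, just with a timestamp pattern; created_date here is a placeholder column name:
parse-as-simple-date created_date yyyy-MM-dd HH:mm:ss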
Result:
Following the above procedure I have been able to export Data Fusion data to BigQuery and the DATE fields are exported as TIMESTAMP, as expected.

PySpark - Spark SQL: how to convert timestamp with UTC offset to epoch/unixtime?

How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?
The most similar function I know is unix_timestamp() it doesn't accept the above time format with UTC offset.
Any suggestion on how I could approach that using preferably Spark SQL or PySpark?
Thanks
The Java SimpleDateFormat pattern for an ISO 8601 time zone in this case is XXX.
So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.
SparkSQL
spark.sql(
    """select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX")
       AS epoch"""
).show(truncate=False)
#+----------+
#|epoch |
#+----------+
#|1566543477|
#+----------+
Spark DataFrame
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
    "unixtime",
    unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp |unixtime |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
Note that PySpark is just a wrapper around Spark; generally I've found the Scala/Java docs are more complete than the Python ones. That may be helpful in the future.
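If you want a proper timestamp column rather than epoch seconds, the same pattern string also works with to_timestamp; a quick sketch reusing the df above:
from pyspark.sql.functions import to_timestamp
# Parse the offset-aware string into a TimestampType column (rendered in the session time zone)
df.withColumn("ts", to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")).show(truncate=False)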

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime module:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also split it into year, month, day and hour:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
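To tie that back to the S3 path, you can format those parts into the prefix and read only that hour; a minimal sketch, assuming the layout really is zero-padded year/month/day/hour and that data for the previous clock hour has already landed in S3:
# Build the prefix for the previous hour and read just that partition
path = "s3a://data/metrics/{:%Y/%m/%d/%H}/".format(latest_hour)
myDf = spark.read.parquet(path)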

Create a Spark Dataframe on Time

Quick question that I did not find an answer to on Google.
What is the best way to create a Spark DataFrame with timestamps?
Let's say I have a start point, an end point and a 15-minute interval. What would be the best way to solve this in Spark?
Try the Spark SQL functions, e.g. DF.withColumn("regTime", current_timestamp), which adds or replaces a column; regTime here is a column of DF.
You can also convert a string to a date or timestamp, e.g. DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))).
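If the goal is one row per 15-minute step between a start and an end timestamp, here is a minimal PySpark sketch (Spark 2.4+ for sequence; the start and end values are just examples):
# Generate a timestamp column with one row every 15 minutes between start and end (inclusive)
df = spark.sql("""
    SELECT explode(sequence(
        to_timestamp('2021-01-01 00:00:00'),
        to_timestamp('2021-01-01 06:00:00'),
        interval 15 minutes)) AS ts
""")
df.show(truncate=False)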