Spark - How to get the latest hour in an S3 path? - Scala

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
val myDf = spark.read.parquet("s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: val myDf = spark.read.parquet("s3a://data/metrics/2018/05/20/")
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do val myDf = spark.read.parquet("s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?

Use Python's datetime module with a timedelta:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also access the individual components: year, month, day, and hour:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
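Since the question asks for Scala in a Databricks notebook, here is a minimal Scala sketch of the same idea. It assumes the layout s3a://data/metrics/<year>/<month>/<day>/<hour>/ and that data for the previous hour has already landed in S3; if the latest hour may be missing, you would instead list the prefixes and pick the newest one.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Previous clock hour, formatted to match the year/month/day/hour directory layout
val latestHour = LocalDateTime.now().minusHours(1)
val hourPath = latestHour.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"))
val myDf = spark.read.parquet(s"s3a://data/metrics/$hourPath/")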

Related

How to read only latest 7 days csv files from S3 bucket

I am trying to figure out how we can read only the latest 7 days of files from a folder we have in an S3 bucket, using Spark with Scala.
Directory which we have:
Assume that for today's date (Date_1) we have 2 clients, each with one CSV file:
Source/Date_1/Client_1/sample_1.csv
Source/Date_1/Client_2/sample_1.csv
Tomorrow a new folder will be generated and we will get the following:
Source/Date_2/Client_1/sample_1.csv
Source/Date_2/Client_2/sample_1.csv
Source/Date_2/Client_3/sample_1.csv
Source/Date_2/Client_4/sample_1.csv
NOTE: we expect data for new clients to be added on any date.
Likewise, on the 7th day we can have:
Source/Date_7/Client_1/sample_1.csv
Source/Date_7/Client_2/sample_1.csv
Source/Date_7/Client_3/sample_1.csv
Source/Date_7/Client_4/sample_1.csv
So now, when the 8th day's data arrives, we need to exclude the Date_1 folder from the read.
How can we do this while reading the CSV files from the S3 bucket using Spark with Scala?
I am reading the whole "Source/*" folder so that we do not miss any client that gets added on any day.
There are various ways to do it. One of the ways is mentioned below:
You can extract the date from the file path and then filter based on the last 7 days.
Below is a code snippet in PySpark; the same can be implemented in Spark with Scala (a rough Scala sketch follows after the snippet).
>>> from datetime import datetime, timedelta
>>> from pyspark.sql.functions import *
# Calculate the date 7 days before today
>>> lastDate = datetime.now() + timedelta(days=-7)
>>> lastDate = int(lastDate.strftime('%Y%m%d'))
# Source Path
>>> srcPath = "s3://<bucket-name>/.../Source/"
>>> df1 = spark.read.option("header", "true").csv(srcPath + "*/*").withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""),"/")[0].cast("long"))
>>> df2 = df1.filter(col("Date") >= lit(lastDate))
A few things might change in your final implementation: the index value [0] may differ if the path structure is different, and the >= condition can be > depending on the requirement.
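Since the question asks for Scala, here is a rough Scala equivalent of the PySpark snippet above. It is only a sketch, under the same assumptions: the first path segment after srcPath is a date folder named like yyyyMMdd that can be cast to a long.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, input_file_name, lit, regexp_replace, split}

// Cut-off date 7 days ago as a yyyyMMdd number (same idea as lastDate above)
val lastDate = LocalDate.now().minusDays(7).format(DateTimeFormatter.ofPattern("yyyyMMdd")).toLong
val srcPath = "s3://<bucket-name>/.../Source/"
// Derive the date folder from each file's path, then keep only the last 7 days
val df1 = spark.read.option("header", "true").csv(srcPath + "*/*")
  .withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""), "/")(0).cast("long"))
val df2 = df1.filter(col("Date") >= lit(lastDate))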

How to reduce a week from Rundate in AWS Glue Pyspark

I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format.
Let's say 2021-04-19.
Now, I am reading this rundate as 'datetime.strptime(rundate, "%Y-%m-%d")'.
But now I want to create 2 variables out of it, variable A and variable B, such that:
Variable A = rundate - 2 weeks (saved in YYYYMMDD format)
Variable B = rundate - 1 week (saved in YYYYMMDD format)
and then use these variables to filter the data in a DataFrame.
Use the datetime library's timedelta to subtract weeks/days/etc. from your rundate.
Example:
Using Python:
import datetime
varA=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(weeks=2),"%Y%m%d")
#'20210405'
varB=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(weeks=1),"%Y%m%d")
#'20210412'
Using pyspark's Spark session:
rundate='2021-04-19'
varA=spark.sql(f"select date_format(date_sub('{rundate}',14),'yyyyMMdd')").collect()[0][0]
#'20210405'
varB=spark.sql(f"select date_format(date_sub('{rundate}',7),'yyyyMMdd')").collect()[0][0]
#'20210412'

How to filter Timestamp in Hours instead of Seconds?

(Screenshots in the original post show the Timestamp column before and after applying #pissall's code.)
I have a Timestamp column with a 0.5 Hz frequency, which results in millions of rows. I want to reduce the data size by keeping timestamps at an hourly granularity, i.e. 24 observations for a particular day.
I already reduced the data size by filtering by year, month and day, but as it is still very big I now want to reduce it to an hourly basis.
I am working on Databricks and using PySpark for the same.
I used the following command to reduce my data size from years to a single day.
df = df.filter(df.Timestamp.between('2019-09-03 00:00:00','2019-09-04 00:00:00'))
I would appreciate your help.
Thanks
You can zero out the minute, second, and microsecond parts of your timestamp using a UDF. It might not be the best solution, but here you go:
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0),TimestampType())
df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))
Another reference: How to truncate the time on a DateTime object in Python?
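If you are on Spark 2.3 or later, the built-in date_trunc function does the same hourly truncation without a Python UDF, which is usually cheaper; a minimal Scala sketch (df being the DataFrame from the question):
import org.apache.spark.sql.functions.{col, date_trunc}

// Truncate every timestamp down to the start of its hour
val hourly = df.withColumn("Timestamp", date_trunc("hour", col("Timestamp")))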

Spark avro predicate pushdown

We are using Avro data format and the data is partitioned by year, month, day, hour, min
I see the data stored in HDFS as
/data/year=2018/month=01/day=01/hour=01/min=00/events.avro
And we load the data using
val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema",schema.toString).load("/data")
And then we use predicate pushdown to filter the data:
var x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
val filteredDf = df.filter(x)
Can someone explain what is happening behind the scenes?
I want to specifically understand when and where the filtering of the input files happens.
Interestingly, when I print the schema, the fields year, month, day and hour are automatically added, i.e. the actual data does not contain these columns. Does Avro add these fields?
I want to understand clearly how the files are filtered and how the partitions are created.
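For illustration only, a hedged Scala sketch rather than a confirmed explanation of this pipeline: the year/month/day/hour/min columns come from Spark's partition discovery of the key=value directory names, and plain comparisons against those columns are what Spark can generally use to prune directories at planning time, whereas an opaque UDF such as isInRange usually cannot be pushed down.
import org.apache.spark.sql.functions.col

// Sketch: filtering on the discovered partition columns with literals, so only the
// matching year=/month=/day= directories need to be listed and read
val pruned = df.filter(col("year") === 2018 && col("month") === 1 && col("day") === 1)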

Create a Spark Dataframe on Time

Quick question that I did not find on Google.
What is the best way to create a Spark DataFrame with timestamps?
Let's say I have a start point, an end point, and a 15-minute interval. What would be the best way to resolve this in Spark?
Try the Spark SQL functions, like this: DF.withColumn("regTime", current_timestamp) adds or replaces a column.
You can also convert a String to a date or timestamp with DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))), where regTime is a column of DF.
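If the goal is a DataFrame with one row per 15-minute step between a start and an end timestamp, here is a hedged Scala sketch. It assumes Spark 2.4+, where sequence() accepts timestamps and an interval step; the start and end values below are just placeholders.
// Generate timestamps from start to end, one row every 15 minutes
val timeline = spark.sql(
  "SELECT explode(sequence(to_timestamp('2019-01-01 00:00:00'), " +
  "to_timestamp('2019-01-02 00:00:00'), interval 15 minutes)) AS ts")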