How to read only the latest 7 days of CSV files from an S3 bucket - Scala

I am trying to figure out how to read only the latest 7 days of files from a folder in an S3 bucket using Spark with Scala.
The directory layout is as follows.
Assume that for today's date (Date_1) we have 2 clients with one CSV file each:
Source/Date_1/Client_1/sample_1.csv
Source/Date_1/Client_2/sample_1.csv
Tomorrow a new folder will be created, and we will have:
Source/Date_2/Client_1/sample_1.csv
Source/Date_2/Client_2/sample_1.csv
Source/Date_2/Client_3/sample_1.csv
Source/Date_2/Client_4/sample_1.csv
NOTE: we expect data for new clients to be added on any date.
Likewise, on the 7th day we could have:
Source/Date_7/Client_1/sample_1.csv
Source/Date_7/Client_2/sample_1.csv
Source/Date_7/Client_3/sample_1.csv
Source/Date_7/Client_4/sample_1.csv
So when the 8th day's data arrives, the Date_1 folder should be excluded from the read.
How can we do this when reading the CSV files from the S3 bucket with Spark and Scala?
I am currently reading the whole "source/*" folder so that we do not miss any client that is added on any day.

There are various ways to do it. One of them is described below:
You can extract the date from the file path and filter on the last 7 days.
Below is a code snippet for PySpark; the same can be implemented in Spark with Scala.
>>> from datetime import datetime, timedelta
>>> from pyspark.sql.functions import *
# Calculate the date 7 days before today
>>> lastDate = datetime.now() + timedelta(days=-7)
>>> lastDate = int(lastDate.strftime('%Y%m%d'))
# Source Path
>>> srcPath = "s3://<bucket-name>/.../Source/"
>>> df1 = spark.read.option("header", "true").csv(srcPath + "*/*").withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""),"/")[0].cast("long"))
>>> df2 = df1.filter(col("Date") >= lit(lastDate))
A few things may differ in your final implementation: the index [0] may change if the path structure is different, and the condition >= may need to be > depending on the requirement. The comparison also assumes that the date folders are named in YYYYMMDD form, matching the format used for lastDate.
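As an alternative to filtering after the read, the date folders to load can be computed up front. The following is only a minimal sketch, assuming the date folders are named in YYYYMMDD form (as the cast to long above implies); the helper name last_n_day_globs and the bucket name my-bucket are hypothetical. PySpark's csv reader accepts a list of paths, so the result can be passed to it directly.

```python
from datetime import date, timedelta

def last_n_day_globs(src_path, today, n=7):
    """Build one glob per date folder for the n days ending at `today` (inclusive)."""
    base = src_path.rstrip("/")
    days = [today - timedelta(days=i) for i in range(n)]
    # Oldest first; "*/*" matches every client folder and every file under a date.
    return [f"{base}/{d:%Y%m%d}/*/*" for d in reversed(days)]

# Hypothetical bucket name and run date, for illustration only.
paths = last_n_day_globs("s3://my-bucket/Source/", date(2021, 4, 19))

# With a Spark session available, the list can be read in one call:
# df = spark.read.option("header", "true").csv(paths)
```

Because each glob ends in "*/*", clients added on any day are still picked up. Spark typically fails when a path in the list matches no files, so a missing date folder would need to be handled separately (for example by listing the bucket first); the filter-after-read approach above does not have that problem.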

Related

How to reduce a week from Rundate in AWS Glue Pyspark

I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format.
Let's say 2021-04-19.
I am reading this rundate with 'datetime.strptime(rundate, "%Y-%m-%d")'.
Now I want to create two variables from it, variable A and variable B, such that:
Variable A = rundate - 2 weeks (saved in YYYYMMDD format)
Variable B = rundate - 1 week (saved in YYYYMMDD format)
and then use these variables to filter the data in a DataFrame.
Use timedelta from the datetime library to subtract weeks, days, etc. from your rundate.
Example:
Using Python:
import datetime
rundate='2021-04-19'
varA=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(days=7),"%Y-%m-%d")
#'2021-04-12'
varB=datetime.datetime.strftime(datetime.datetime.strptime(rundate, "%Y-%m-%d")-datetime.timedelta(days=14),"%Y-%m-%d")
#'2021-04-05'
Using pyspark's Spark session:
rundate='2021-04-19'
varA=spark.sql(f"select string(date_sub('{rundate}',7))").collect()[0][0]
#'2021-04-12'
varB=spark.sql(f"select string(date_sub('{rundate}',14))").collect()[0][0]
#'2021-04-05'
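The question asks for variable A as rundate minus two weeks and variable B as rundate minus one week, both in YYYYMMDD format, whereas the snippets above return YYYY-MM-DD strings. A minimal sketch that matches that request, using timedelta(weeks=...) and the "%Y%m%d" format (the names var_a and var_b are illustrative):

```python
from datetime import datetime, timedelta

rundate = "2021-04-19"
run_dt = datetime.strptime(rundate, "%Y-%m-%d")

var_a = (run_dt - timedelta(weeks=2)).strftime("%Y%m%d")  # rundate - 2 weeks
var_b = (run_dt - timedelta(weeks=1)).strftime("%Y%m%d")  # rundate - 1 week

print(var_a, var_b)  # 20210405 20210412
```

The strings can then be used in a DataFrame filter, for example df.filter(col("date_col").between(var_a, var_b)), where date_col is a placeholder for a column stored in the same YYYYMMDD form.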

How to insert number into a file name as a variable in databricks

I have 10 files created daily in my data lake with file names like 0_2020_02_16_10_12_05.avro. The 2020_02_16 part is the date the file was created and is constant for the day, but the time portion (minutes and seconds) is not the same for all the files. I would like to replace 2020_02_16 with a variable in my daily run. I have tried this, but it does not work:
val pfdtm = ZonedDateTime.now(ZoneOffset.UTC)
val fileDate =DateTimeFormatter.ofPattern("yyyy_MM_dd").format(pfdtm)
Output: fileDate=2020_02_16
val df=spark.read.format("com.databricks.spark.avro").load("adl://powerbi.azuredatalakestore.net/SD/eventhubspace/eventhub/0_${fileDate}_*_*_*.avro")
Any help would be appreciated. thanks

How to read the last N days of data from the current date in parquet

I have saved data in a warehouse in parquet format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path as below:
Testpath/filename/dt=2020-02-01
Testpath/filename/dt=2020-02-02
...........
Testpath/filename/dt=2020-02-28
If I read all the data, it is a very large amount of data.
As your dataset is correctly partitioned using the parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name: dt=<value>. This value can be used to filter your dataset, and Spark will optimize the read by pruning every partition directory that does not match your predicate on the dt column.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
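The same idea can be written in PySpark. This is a minimal sketch, not a definitive implementation: the function names cutoff_for_last_n_days and read_last_n_days are hypothetical, and it assumes the partition column is named dt and holds yyyy-MM-dd values as in the listing above. pyspark is imported inside the function so that the date arithmetic can be checked on its own.

```python
from datetime import date, timedelta
from typing import Optional

def cutoff_for_last_n_days(today: date, n: int) -> str:
    """Return the ISO date string for the day n days before `today`."""
    return (today - timedelta(days=n)).isoformat()

def read_last_n_days(spark, base_path: str, n: int, today: Optional[date] = None):
    """Read only partitions with dt later than the cutoff (assumes a dt partition column)."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    cutoff = cutoff_for_last_n_days(today or date.today(), n)
    # Filtering on the partition column lets Spark skip non-matching dt=... directories.
    return spark.read.parquet(base_path).where(F.col("dt") > F.lit(cutoff).cast("date"))

print(cutoff_for_last_n_days(date(2020, 2, 28), 7))  # 2020-02-21
```

Computing the cutoff on the driver rather than with current_date() makes the date explicit and easy to test; the strict > comparison mirrors the Scala snippet above.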

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*"), then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime module:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also split the result into year, month, day and hour:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
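Those components can be combined into the S3 path. A minimal sketch, assuming the hour folder sits directly under the day folder and is zero-padded to two digits (the question does not show its exact name), and assuming the most recent available hour is the previous clock hour; the helper name latest_hour_path is hypothetical:

```python
from datetime import datetime, timedelta

def latest_hour_path(base, now):
    """Return the year/month/day/hour path for the hour before `now`."""
    latest = now - timedelta(hours=1)
    # %H is the zero-padded 24-hour value; subtracting first handles day rollover.
    return f"{base.rstrip('/')}/{latest:%Y/%m/%d/%H}"

print(latest_hour_path("s3a://data/metrics", datetime(2018, 5, 20, 0, 30)))
# s3a://data/metrics/2018/05/19/23
```

Subtracting the hour before formatting means a run shortly after midnight points at hour 23 of the previous day. Whether `now` should be local time or UTC depends on how the folders were written, which the question does not state.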

Pyspark windows on last 30 days on subset of data

I have a working PySpark windowing function (Spark 2.0) that looks back over the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per ID. The dataset I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day lookback, I don't want to apply the function to rows that don't have a full 30 days to look back on. For convenience, I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just apply a filter to the new DataFrame and drop the rows before February, but is there a way to do it without that extra step? It would be nice not to have to perform the calculations for those rows, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowsess = Window.partitionBy("id",'a').orderBy('ts').rangeBetween(-86400*30, Window.currentRow)
df4 = df3.withColumn("2h4_ct",F.count(df3.a).over(windowsess))
Mockup of the current dataset. I didn't want to convert the ts column by hand, so I wrote in a placeholder for it.
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
Instead I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
What if you use :
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
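To make the intended result concrete, here is a plain-Python sketch of the lookback logic from the question, not a Spark implementation: only rows on or after the start date are scored, while earlier rows are still counted inside the lookback range. The function name lookback_counts is hypothetical, and the range is inclusive at both ends, as with rangeBetween(-86400*30, Window.currentRow).

```python
from datetime import datetime, timedelta

def lookback_counts(rows, start, days=30):
    """rows: (id, action, timestamp) tuples. Score only rows at or after `start`."""
    window = timedelta(days=days)
    out = []
    for rid, action, ts in rows:
        if ts < start:
            continue  # earlier rows are counted by later rows but never scored
        n = sum(1 for r2, a2, t2 in rows
                if r2 == rid and a2 == action and ts - window <= t2 <= ts)
        out.append((rid, action, ts, n))
    return out

rows = [
    (1, "soccer", datetime(2018, 1, 1, 10, 41)),
    (1, "soccer", datetime(2018, 1, 13, 10, 40)),
    (1, "soccer", datetime(2018, 1, 23, 10, 39)),
    (1, "soccer", datetime(2018, 2, 1, 10, 38)),
    (1, "soccer", datetime(2018, 2, 3, 10, 37)),
    (1, "leagueoflegends", datetime(2018, 2, 4, 10, 36)),
]
result = lookback_counts(rows, datetime(2018, 2, 1))
print([r[3] for r in result])  # [3, 4, 1]
```

With these exact timestamps the counts are 3, 4 and 1 rather than the 4, 5 and 1 of the made-up mock-up, because 2018-01-01 10:41 falls three minutes outside the 30-day range that ends at 2018-02-01 10:38. The nested loop is quadratic and only illustrates the logic; it says nothing about how Spark would execute an equivalent query.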