I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format.
Let's say 2021-04-19.
Now I am reading this rundate with datetime.strptime(rundate, "%Y-%m-%d").
But now I want to create two variables out of it, variable A and variable B, such that:
Variable A = rundate - 2 weeks (should be saved in YYYYMMDD format)
Variable B = rundate - 1 week (should be saved in YYYYMMDD format)
and then use these variables to filter the data in a DataFrame.
Use the datetime library's timedelta to subtract weeks/days etc. from your rundate.
Example:
Using Python:
import datetime
base = datetime.datetime.strptime(rundate, "%Y-%m-%d")
varA = datetime.datetime.strftime(base - datetime.timedelta(weeks=2), "%Y%m%d")
#'20210405'
varB = datetime.datetime.strftime(base - datetime.timedelta(weeks=1), "%Y%m%d")
#'20210412'
Using PySpark's Spark session:
rundate='2021-04-19'
varA=spark.sql(f"select date_format(date_sub('{rundate}', 14), 'yyyyMMdd')").collect()[0][0]
#'20210405'
varB=spark.sql(f"select date_format(date_sub('{rundate}', 7), 'yyyyMMdd')").collect()[0][0]
#'20210412'
I have 10 files created daily in my data lake, with file names like 0_2020_02_16_10_12_05.avro, where 2020_02_16 is the date the file was created (constant for the day) but the time portion differs between files. I would like to replace the 2020_02_16 with a variable in my daily run. I have tried this, but it does not work:
val pfdtm = ZonedDateTime.now(ZoneOffset.UTC)
val fileDate =DateTimeFormatter.ofPattern("yyyy_MM_dd").format(pfdtm)
Output: fileDate=2020_02_16
val df=spark.read.format("com.databricks.spark.avro").load("adl://powerbi.azuredatalakestore.net/SD/eventhubspace/eventhub/0_${fileDate}_*_*_*.avro")
Any help would be appreciated. Thanks.
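The date formatting above is fine; the load call is most likely failing because, without the s string interpolator, Scala does not substitute ${fileDate} into the path string. A minimal sketch of the same idea, shown here in PySpark for illustration and reusing the path from the question:
from datetime import datetime, timezone

# Build today's UTC date in the same yyyy_MM_dd form used in the file names
file_date = datetime.now(timezone.utc).strftime("%Y_%m_%d")
path = f"adl://powerbi.azuredatalakestore.net/SD/eventhubspace/eventhub/0_{file_date}_*_*_*.avro"
df = spark.read.format("com.databricks.spark.avro").load(path)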
I have saved the data in the warehouse in Parquet format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path as below:
Tespath/filename/dt=2020-02-01
Tespath/filename/dt=2020-02-02
...........
Tespath/filename/dt=2020-02-28
If I read all the data, it is a huge amount of data.
As your dataset is correctly partitioned using the Parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name (dt=<value>). This value can be used to filter your dataset, and Spark will optimize the read by partition pruning: every directory that does not match your predicate on the dt column is skipped.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.parquet("Testpath/filename/")
.where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
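As a quick sanity check, you can read that flag back from the session configuration (the spark.conf call below is a sketch; the same syntax works in both the Scala and Python shells and should return "true"):
spark.conf.get("spark.sql.parquet.filterPushdown")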
I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/). where * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/)
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime module's timedelta function:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also split it into year, month, day and hour:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
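To avoid hardcoding the hour, a minimal sketch that builds the path from latest_hour above; the two-digit zero-padding and the use of "one hour ago" as the most recently available hour are assumptions about the S3 layout:
# Build the S3 prefix for the previous hour, padding month/day/hour to two digits
path = f"s3a://data/metrics/{latest_hour.year}/{latest_hour.month:02d}/{latest_hour.day:02d}/{latest_hour.hour:02d}/"
myDf = spark.read.parquet(path)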
I have a working PySpark windowing function (Spark 2.0) that takes the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per ID. The dataset that I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day look back, I don't want to apply this function to data that doesn't have a full 30 days to look back on. For convenience, I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just throw a filter on the new dataframe and filter out the data before February, but is there a way to do it without that extra step? It would be nice not to have to perform the calculations, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowsess = Window.partitionBy("id",'a').orderBy('ts').rangeBetween(-86400*30, Window.currentRow)
df4 = df3.withColumn("2h4_ct", F.count(F.col("a")).over(windowsess))
Mockup of the current dataset. I didn't want to convert the ts column by hand, so I wrote in a substitute for it:
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
Instead I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
What if you use:
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
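Alternatively, if the extra step is acceptable, the filter the question already mentions can simply be applied after the window; a minimal sketch, assuming the timestamp column is comparable to a date literal:
from pyspark.sql import functions as F
# Drop the January warm-up rows after the window counts have been computed
df5 = df4.filter(F.col("timestamp") >= "2018-02-01")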