databricks partition date by week - pyspark

I'm new to Databricks and trying to create a partitioned table. I have three columns available for partitioning: name, value and date. The requirement is that the date partition should be on a weekly basis.
I've done something like this :
df \
    .write \
    .format("delta") \
    .partitionBy("name", "value", "date") \
    .save(writePath)
I don't know how to partition the date by week. I came across repartitionByRange but I'm not sure how to apply it to my case.

You can create new week and year columns from the date column and use them in partitionBy:
from pyspark.sql import functions as F
df.withColumn("week", F.weekofyear("date")) \
.withColumn("year", F.year("date")) \
.write \
.format("delta") \
.partitionBy("year", "week") \
.save(writePath)
Note that the week number alone is not sufficient as it depends also on the year.
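As an alternative (not part of the original answer), you could derive a single week_start column with Spark's built-in date_trunc function, which encodes both the year and the week in one partition value. A minimal sketch, assuming the same df and writePath as above:
from pyspark.sql import functions as F

# Truncate each date to the Monday that starts its week; one column then
# captures both year and week, avoiding the separate year/week split.
df.withColumn("week_start", F.date_trunc("week", F.col("date")).cast("date")) \
  .write \
  .format("delta") \
  .partitionBy("name", "value", "week_start") \
  .save(writePath)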

Related

need only updated quantity based on the current month using pyspark delta loads in databricks

I am loading Delta tables into an S3 Delta lake. The table schema is product_code, date, quantity, crt_dt.
I get 6 months of forecast data each month; for example, if this month is May 2022, I receive quantity data for May, June, July, Aug, Sept and Oct. The issue I am facing is that the data gets duplicated every month. I want only a single row per product in the Delta table, based on the most recent crt_dt. Can anyone help me with the solution I should implement?
The data is partitioned by crt_dt.
Thanks!
If you want to keep only the row with the most recent crt_dt per product, this code will normally do the trick:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w3 = Window.partitionBy("product_code").orderBy(col("crt_dt").desc())
df.withColumn("row", row_number().over(w3)) \
  .filter(col("row") == 1).drop("row") \
  .show()
for more details check this https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/
You have a dataset that you'd like to filter and then write out to a Delta table.
Another poster told you how to filter the data to meet your requirements. Here's how to filter the data and then write it out.
filtered_df = df.withColumn("row", row_number().over(w3)) \
    .filter(col("row") == 1).drop("row")

filtered_df.write.format("delta").mode("append").save("path/to/delta_lake")
You can also do this with SQL if you aren't using the Python API.
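For reference, here is a rough SQL equivalent run through spark.sql. This is only a sketch, assuming the data is registered as a temporary view named forecast and the Delta path from above:
# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("forecast")

# Keep the row with the latest crt_dt per product_code, then write to Delta.
latest_df = spark.sql("""
    SELECT * FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY product_code ORDER BY crt_dt DESC) AS rn
        FROM forecast
    ) ranked
    WHERE rn = 1
""").drop("rn")

latest_df.write.format("delta").mode("append").save("path/to/delta_lake")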

How to filter Timestamp in Hours instead of Seconds?

I have a Timestamp column with 0.5 Hz frequency, which results in millions of rows. I want to reduce the data size by keeping timestamps at an hourly granularity, i.e. 24 observations per day.
I already reduced the data size by filtering it by year, month and day, but as it is still very big I now want to reduce it to an hourly basis.
I am working on Databricks and using PySpark for this.
I used the following command to reduce my data from years to a single day:
df = df.filter(df.Timestamp.between('2019-09-03 00:00:00', '2019-09-04 00:00:00'))
I would appreciate your help.
Thanks
You can replace the minutes and seconds part of your datetime using a UDF. Might not be the best solution, but here you go:
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0),TimestampType())
df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))
Another reference: How to truncate the time on a DateTime object in Python?
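As a UDF-free alternative (not part of the original answer), Spark's built-in date_trunc can truncate each timestamp to the start of its hour. A minimal sketch:
import pyspark.sql.functions as F

# Truncate every Timestamp to the top of its hour using the built-in
# date_trunc function instead of a Python UDF.
df = df.withColumn("Timestamp", F.date_trunc("hour", F.col("Timestamp")))
From there, if you only want 24 rows per day, you could for example drop duplicates on the truncated column with df.dropDuplicates(["Timestamp"]).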

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")
Now I want to add an hour component right after the day. The idea is to obtain the data in S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime module:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also access the year, month, day and hour separately:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
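Putting it together, a minimal sketch for a Databricks notebook, assuming the path layout shown in the question with zero-padded month/day/hour directories:
from datetime import datetime, timedelta

# Most recently completed hour.
latest_hour = datetime.now() - timedelta(hours=1)

# Build the year/month/day/hour prefix and read only that hour from S3.
path = "s3a://data/metrics/{:04d}/{:02d}/{:02d}/{:02d}/".format(
    latest_hour.year, latest_hour.month, latest_hour.day, latest_hour.hour
)
myDf = spark.read.parquet(path)
Note that this assumes the latest hour present in S3 really is the previous clock hour; if hours can be missing, you would instead need to list the prefixes and pick the greatest one.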

Create a Spark Dataframe on Time

Quick question that I did not find on Google.
What is the best way to create a Spark DataFrame with timestamps?
Let's say I have a start point, an end point and a 15-minute interval. What would be the best way to do this in Spark?
Try the Spark SQL functions, e.g. DF.withColumn("regTime", current_timestamp) adds (or replaces) a regTime column on DF containing the current timestamp.
You can also convert a String to a date or timestamp, e.g. DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))), where regTime is a column of DF.
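To directly address the start/end/15-minute-interval part of the question, one option is SQL's sequence plus explode. A sketch in PySpark, assuming Spark 2.4+ and hypothetical start and end timestamps:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per 15-minute step between the (hypothetical) start and end points.
df = spark.sql("""
    SELECT explode(
        sequence(
            to_timestamp('2019-01-01 00:00:00'),
            to_timestamp('2019-01-02 00:00:00'),
            interval 15 minutes
        )
    ) AS ts
""")
df.show(5, truncate=False)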

Scala: To check whether the current Timestamp is greater than a timestamp column in my dataframe

Suppose I have a dataframe in which a Timestamp column is present.
Timestamp
2016-04-19T17:13:17
2016-04-20T11:31:31
2016-04-20T18:44:31
2016-04-20T14:44:01
I have to check, in Scala, whether the current timestamp is greater than the Timestamp column + 1 (i.e. with 1 day added to it).
Spark SQL provides two current_* functions, one for dates and one for timestamps: current_date() and current_timestamp().
Let's consider a DataFrame df with id and event_date columns.
We can perform the following filter operations :
import sqlContext.implicits._
import org.apache.spark.sql.functions._
// the event_date is before the current timestamp
df.filter('event_date.lt(current_timestamp()))
// the event_date is after the current timestamp
df.filter('event_date.gt(current_timestamp()))
I advise you to read the associated Scala doc for more information here. There is a whole section on date and timestamp operations.
EDIT: As discussed in the comments, in order to add a day to your event_date column, you can use the date_add function:
df.filter(date_add('event_date,1).lt(current_timestamp()))
You can also do it like this:
df.filter(date_add('column_name, 1).lt(current_timestamp()))