How to get year and week number aligned for a date - scala

While trying to get the year and week number of a range of dates spanning multiple years, I am running into some issues at the start/end of the year.
I understand the logic for the week number and for the year when each runs separately. However, when they are combined, in some cases they don't produce consistent results, and I was wondering what the best way is in Spark to make sure those scenarios are handled with a year that is consistent with the given week number.
For example, running:
spark.sql("select year('2017-01-01') as year, weekofyear('2017-01-01') as weeknumber").show(false)
outputs:
+----+----------+
|year|weeknumber|
+----+----------+
|2017|52        |
+----+----------+
But the wanted output would be:
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+
and running:
spark.sql("select year('2018-12-31') as year, weekofyear('2018-12-31') as weeknumber").show(false)
produces:
+----+----------+
|year|weeknumber|
+----+----------+
|2018|1         |
+----+----------+
But what is expected is:
+----+----------+
|year|weeknumber|
+----+----------+
|2019|1         |
+----+----------+
Code is running on Spark 2.4.2.

This Spark behavior is consistent with the ISO 8601 definition, and you cannot change it. However, there is a workaround.
You can first determine the day of the week: if it is less than 4, increase the year by one; if it equals 4, keep the year untouched; otherwise, decrease the year by one.
Example with 2017-01-01
sql("select case when date_format('2017-01-01', 'u') < 4 then year('2017-01-01')+1 when date_format('2017-01-01', 'u') = 4 then year('2017-01-01') else year('2017-01-01')- 1 end as year, weekofyear('2017-01-01') as weeknumber, date_format('2017-01-01', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2016|52        |7        |
+----+----------+---------+
Example with 2018-12-31
sql("select case when date_format('2018-12-31', 'u') < 4 then year('2018-12-31')+1 when date_format('2018-12-31', 'u') = 4 then year('2018-12-31') else year('2018-12-31')- 1 end as year, weekofyear('2018-12-31') as weeknumber, date_format('2018-12-31', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2019|1         |1        |
+----+----------+---------+

val df = Seq(("2017-01-01"), ("2018-12-31")).toDF("dateval")
+----------+
| dateval|
+----------+
|2017-01-01|
|2018-12-31|
+----------+
df.createOrReplaceTempView("date_tab")
val newDF = spark.sql("""select dateval,
  case when weekofyear(dateval) = 1 and month(dateval) = 12 then struct((year(dateval) + 1) as yr, weekofyear(dateval) as wk)
       when weekofyear(dateval) >= 52 and month(dateval) = 1 then struct((year(dateval) - 1) as yr, weekofyear(dateval) as wk)
       else struct(year(dateval) as yr, weekofyear(dateval) as wk) end as week_struct
  from date_tab""")
newDF.select($"dateval", $"week_struct.yr", $"week_struct.wk").show()
+----------+----+---+
| dateval| yr| wk|
+----------+----+---+
|2017-01-01|2016| 52|
|2018-12-31|2019| 1|
+----------+----+---+
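The same boundary rule can also be expressed with the DataFrame API instead of SQL. A minimal sketch over the df defined above (this variant is an illustration, not part of the original answer):
import spark.implicits._
import org.apache.spark.sql.functions._

// Same rule as the SQL case expression: shift the year only at the week/month boundaries
val adjusted = df
  .withColumn("wk", weekofyear($"dateval"))
  .withColumn("yr",
    when($"wk" === 1 && month($"dateval") === 12, year($"dateval") + 1)
      .when($"wk" >= 52 && month($"dateval") === 1, year($"dateval") - 1)
      .otherwise(year($"dateval")))
adjusted.select($"dateval", $"yr", $"wk").show()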

You can also use a UDF to achieve this
import org.apache.spark.sql.types._
import java.time.temporal.IsoFields
def weekYear(date: java.sql.Date): Option[Int] = {
  if (date == null) None
  else Some(date.toLocalDate.get(IsoFields.WEEK_BASED_YEAR))
}
Register this udf as
spark.udf.register("yearOfWeek", weekYear _)
Result:
scala> spark.sql("select yearOfWeek('2017-01-01') as year, WEEKOFYEAR('2017-01-01') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+
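The registered function can also be used through the DataFrame API. A small sketch (the DataFrame df and its dateval column are assumptions, not part of the answer):
import org.apache.spark.sql.functions.{callUDF, col, weekofyear}

// Cast to date so the argument matches the UDF's java.sql.Date parameter
df.select(
  col("dateval"),
  callUDF("yearOfWeek", col("dateval").cast("date")).as("year"),
  weekofyear(col("dateval")).as("weeknumber")
).show(false)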

Related

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
import pyspark.sql.functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
x = '12/09/2021'
'01/01/' + x[-4:]
output: '01/01/2021'
You can achieve this with date_trunc combined with to_date, since date_trunc on its own returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})

sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt', F.to_date(F.date_trunc('year', F.col('Date')), 'yyyy-MM-dd')) \
                 .withColumn('first_day_year_timestamp', F.date_trunc('year', F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+
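The question uses the dd/MM/yyyy format, which the snippets above do not parse. A minimal Scala sketch (assuming a DataFrame df with a string column named dt, neither of which is in the original answers) that parses that format, truncates to the year, and formats the result back:
import org.apache.spark.sql.functions.{col, date_format, to_date, trunc}

// "12/09/2021" -> parse -> truncate to the first day of the year -> "01/01/2021"
val result = df
  .withColumn("parsed", to_date(col("dt"), "dd/MM/yyyy"))
  .withColumn("year_start", date_format(trunc(col("parsed"), "year"), "dd/MM/yyyy"))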

Spark scala dynamically create filter condition value from sequence Pair

I have a Spark dataframe df:
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq((2019,1), (2020,1), (2021,1)).
The sequence year_month gets generated dynamically each time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping rows where ($"year" = pair._1 and $"month" = pair._2) for each value pair in year_month.
You can achieve this by:
1. Creating a dataframe from year_month
2. Performing an inner join of year_month with your original dataframe on month and year
3. Choosing distinct records
The resulting dataframe will contain the matched rows.
Working Example
Setup
import spark.implicits._
val dfData = Seq((1, 2020, 1), (2, 2019, 3), (3, 2020, 1))
val df = dfData.toDF()
  .selectExpr("_1 as id", "_2 as year", "_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
  .selectExpr("_1 as year", "_2 as month")
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
val dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
val dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: If you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
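As an alternative sketch (not part of the answer above), the filter condition can also be built directly from the year_month sequence without a join, by OR-ing one condition per pair:
import org.apache.spark.sql.functions.{col, lit}

// (year = 2019 AND month = 1) OR (year = 2020 AND month = 1) OR (year = 2021 AND month = 1)
val condition = year_month
  .map { case (y, m) => col("year") === lit(y) && col("month") === lit(m) }
  .reduce(_ || _)

val dfFiltered = df.filter(condition)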

Spark Scala Cumulative Unique Count by Date

I have a dataframe that gives a set of id numbers and the date at which they visited a certain location, and I'm trying to find a way in Spark Scala to get the number of unique people ("id") that have visited this location on or before each day, so that one id number won't be counted twice if they visit on 2019-01-01 and then again on 2019-01-07, for example.
df.show(5, false)
+----+----------+
|id  |date      |
+----+----------+
|3424|2019-01-02|
|8683|2019-01-01|
|7690|2019-01-02|
|3424|2019-01-07|
|9002|2019-01-02|
+----+----------+
I want the output to look like this, where I groupBy("date") and get the count of unique ids as a cumulative number. (For example: next to 2019-01-03, it would give the distinct count of ids on any day up to and including 2019-01-03.)
+----------+-------+
|date      |cum_ct |
+----------+-------+
|2019-01-01|xxxxx  |
|2019-01-02|xxxxx  |
|2019-01-03|xxxxx  |
|...       |...    |
|2019-01-08|xxxxx  |
|2019-01-09|xxxxx  |
+----------+-------+
What would be the best way to do this after df.groupBy("date")?
You will have to use the ROW_NUMBER() function in this scenario. I have created a dataframe
val df = Seq((1,"2019-05-03"),(1,"2018-05-03"),(2,"2019-05-03"),(2,"2018-05-03"),(3,"2019-05-03"),(3,"2018-05-03")).toDF("id","date")
df.show
+---+----------+
| id| date|
+---+----------+
| 1|2019-05-03|
| 1|2018-05-03|
| 2|2019-05-03|
| 2|2018-05-03|
| 3|2019-05-03|
| 3|2018-05-03|
+---+----------+
ID represents a person id in your case that can appear against multiple dates.
Here is the count against each date.
df.groupBy("date").count.show
+----------+-----+
| date|count|
+----------+-----+
|2018-05-03| 3|
|2019-05-03| 3|
+----------+-----+
This shows the repeated count of ids against each date. I have used 3 ids in total, and each date has a count of 3, which means every id is counted again on each date.
Now, to my understanding, you want an id to be counted only once against any date (depending on whether you want the latest or the oldest date).
I am going to use the latest date for every id.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newdf = df.withColumn("row_num", row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
The line above assigns a row number to every entry of each id ordered by date, and row number 1 refers to the latest date of each id. Now you take the count against each id where the row number is 1, which results in a single (distinct) count of every id.
Here is the output. I have applied a filter on the row number, and you can see in the output that the dates are the latest ones, i.e. 2019 in my case.
newdf.select("id","date","row_num").where("row_num = 1").show()
+---+----------+-------+
| id| date|row_num|
+---+----------+-------+
| 1|2019-05-03| 1|
| 3|2019-05-03| 1|
| 2|2019-05-03| 1|
+---+----------+-------+
Now I will take the count on newdf with the same filter, which will return a date-wise count.
newdf.groupBy("date","row_num").count().filter("row_num = 1").select("date","count").show
+----------+-----+
| date|count|
+----------+-----+
|2019-05-03| 3|
+----------+-----+
Here the total count is 3, which excludes ids on previous dates; previously it was 6 (because ids were repeated across multiple dates).
I hope this answers your question.
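For the cumulative part of the question, here is a sketch of one alternative (not part of the answer above): count each id only on its first visit date, then take a running sum over ascending dates.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One row per date on which at least one new id appears, with a running distinct count
val firstVisit = df.groupBy("id").agg(min("date").as("date"))
val cumCt = firstVisit
  .groupBy("date").count()
  .withColumn("cum_ct", sum("count").over(Window.orderBy("date")))
  .select("date", "cum_ct")
cumCt.show()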

Pyspark - Create Equivalent of Business current view in pyspark

I need to create an equivalent of a business current view in PySpark. I have a history file and a delta file (containing id and date). I need to create a final dataframe which will have a single record for each id, and that record should be of the latest date.
df1=sql_context.createDataFrame([("3000", "2017-04-19"), ("5000", "2017-04-19"), ("9012", "2017-04-19")], ["id", "date"])
df2=sql_context.createDataFrame([("3000", "2017-04-18"), ("5120", "2017-04-18"), ("1012", "2017-04-18")], ["id", "date"])
df3=df2.union(df1).distinct()
+----+----------+
| id| date|
+----+----------+
|3000|2017-04-19|
|3000|2017-04-18|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+
I tried doing a union and a distinct; it gives me id=3000 for both dates, whereas I need only the record for id=3000 with date=2017-04-19.
Even subtract doesn't work, since it returns all the rows of either of the dfs.
Desired output:
+----+----------+
|  id|      date|
+----+----------+
|3000|2017-04-19|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+
Hope this helps!
from pyspark.sql.functions import unix_timestamp, col, to_date, max

# sample data
df1 = sqlContext.createDataFrame([("3000", "2017-04-19"),
                                  ("5000", "2017-04-19"),
                                  ("9012", "2017-04-19")],
                                 ["id", "date"])
df2 = sqlContext.createDataFrame([("3000", "2017-04-18"),
                                  ("5120", "2017-04-18"),
                                  ("1012", "2017-04-18")],
                                 ["id", "date"])
df = df2.union(df1)
df.show()

# convert 'date' column to date type so that the latest date can be fetched for an id
df = df.\
    withColumn('date_inDateFormat', to_date(unix_timestamp(col('date'), "yyyy-MM-dd").cast("timestamp"))).\
    drop('date')

# get latest date for an id
df = df.groupBy('id').agg(max('date_inDateFormat').alias('date'))
df.show()
Output is:
+----+----------+
| id| date|
+----+----------+
|5000|2017-04-19|
|1012|2017-04-18|
|5120|2017-04-18|
|9012|2017-04-19|
|3000|2017-04-19|
+----+----------+
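If the history rows carry more columns than just id and date, a window keeps the whole latest record per id. A Scala sketch of that idea, reusing the names from the question (an illustration, not part of the answer above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep the full latest row per id instead of aggregating only the date
val w = Window.partitionBy("id").orderBy(col("date").desc)
val latest = df3
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")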
Note: Please don't forget to let SO know if the answer helps you solve your problem.

Select Records Using Relative Dates in Hive / Scala

Is there a good way to solve this problem: Suppose I want to select records that are at least 6 months prior to the previously selected record for a given grouping.
Ie. I have:
Col A Col B Date
1 A 2015-01-01 00:00:00
1 A 2014-10-01 00:00:00
1 A 2014-05-01 00:00:00
1 A 2014-01-01 00:00:00
1 B 2014-01-01 00:00:00
2 A 2015-01-01 00:00:00
2 A 2014-10-01 00:00:00
2 A 2014-01-01 00:00:00
2 A 2013-10-01 00:00:00
I'd like to select only dates that are at least 6 months apart relative to the previously selected one. Ie it will return:
Col A Col B Date
1 A 2015-01-01 00:00:00
1 A 2014-05-01 00:00:00
1 B 2014-01-01 00:00:00
2 A 2015-01-01 00:00:00
2 A 2014-01-01 00:00:00
It is obvious to me how to do this using orderings if you wanted to select relative to the latest ones
(ie:
SELECT b.date, b..., a.latest_date
FROM(
SELECT *, row_number OVER PARTITION BY Col A, Col B ORDER BY Date as row_number
FROM table1) temp
WHERE row_number = 1) a
INNER JOIN TABLE 1 b
ON KEY)
WHERE datediff(date, latestdate)/365 > 0.5
or so
, but I'm a little unclear how you would do this relative to each other. Is there a way to do this recursively in Hive / Scala or something?
There is a windowing concept with lag and lead in both Hive and Spark, so you can achieve this task in either. Here is the code in Spark.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val data = sc.parallelize(List(("1", "A", "2015-01-01 00:00:00"),
  ("1", "A", "2014-10-01 00:00:00"),
  ("1", "A", "2014-01-01 00:00:00"),
  ("1", "B", "2014-01-01 00:00:00"),
  ("2", "A", "2015-01-01 00:00:00"),
  ("2", "A", "2014-10-01 00:00:00"),
  ("2", "A", "2014-01-01 00:00:00"),
  ("2", "A", "2013-10-01 00:00:00")
)).toDF("id", "status", "Date")

val data2 = data.select($"id", $"status", to_date($"Date").alias("date"))

val wSpec3 = Window.partitionBy("id", "status").orderBy(desc("date"))

val data3 = data2.withColumn("diff", datediff(lag(data2("date"), 1).over(wSpec3), $"date"))
  .filter($"diff" > 182.5 || $"diff".isNull)
data3.show
+---+------+----------+----+
| id|status| date|diff|
+---+------+----------+----+
| 1| A|2015-01-01|null|
| 1| A|2014-01-01| 273|
| 1| B|2014-01-01|null|
| 2| A|2015-01-01|null|
| 2| A|2014-01-01| 273|
+---+------+----------+----+
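The lag comparison above looks only at the immediately preceding row in each group. If the requirement is that every kept date be at least 6 months before the previously kept one (a chained condition), one sketch of an alternative (not part of the answer above) is to fold over the sorted dates of each (id, status) group:
import java.time.LocalDate
import org.apache.spark.sql.functions._

// Walk the dates of a group from latest to earliest and keep a date only if it
// is at least 6 months before the last date that was kept.
val keepChained = udf { dates: Seq[String] =>
  val sorted = dates.map(d => LocalDate.parse(d.take(10))).sortWith((a, b) => a.isAfter(b))
  sorted.foldLeft(List.empty[LocalDate]) { (kept, d) =>
    kept match {
      case last :: _ if d.isAfter(last.minusMonths(6)) => kept   // too close to the last kept date
      case _                                           => d :: kept
    }
  }.map(_.toString)
}

val selected = data2
  .groupBy("id", "status")
  .agg(collect_list(col("date").cast("string")).as("dates"))
  .withColumn("date", explode(keepChained(col("dates"))))
  .drop("dates")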