How to do yearly comparison in Spark Scala

I have a dataframe with columns Month, Fruit, and Qty, as shown in the table below:
| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |
I need to sum Qty per year, grouped by Fruit. My output DF should look like the table below:
| Year | Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| ---- | -------- | --------------------- | -------------------------- |
| 2022 | orange | 29384 | 34534 |
| 2021 | orange | 34534 | 93584 |
But there is a catch here; consider the table below.
| current year | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
| ------------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| previous year | jan | feb | | apr | may | jun | jul | aug | | oct | nov | dec |
As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum for the current year, Qty should exclude those missing months, and this should be done for each year.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._ // assumes an existing SparkSession named spark

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

val currentYear = 2022
val priorYear = 2021

val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)
val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")

// Inner join on fruit and month-of-year: months missing in the prior
// year (here 2021-03) drop out of both sums automatically
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
// +------+--------------------+------------------------+
// |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
// +------+--------------------+------------------------+
// |orange|40009               |5246                    |
// +------+--------------------+------------------------+
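
Since the question asks for this comparison for every year, not just a hard-coded 2022 vs. 2021, here is a minimal sketch that generalizes the same idea by self-joining the data to itself shifted by one year (my own variant, checked only against the sample data above):

import org.apache.spark.sql.functions.{col, sum}

// Derive year and month-of-year once, then join each year to year - 1
val monthly = df1
  .withColumn("Year", col("Month").substr(1, 4).cast("int"))
  .withColumn("MonthNum", col("Month").substr(6, 2))

val prior = monthly.select(
  col("Fruit").as("FruitP"),
  col("Year").as("YearP"),
  col("MonthNum").as("MonthNumP"),
  col("Qty").as("QtyP")
)

// Inner join keeps only months present in both years, as before
val resAllYears = monthly
  .join(prior,
    monthly("Fruit") === prior("FruitP") &&
      monthly("MonthNum") === prior("MonthNumP") &&
      monthly("Year") === prior("YearP") + 1)
  .groupBy(col("Year"), col("Fruit"))
  .agg(
    sum(col("Qty")).as("sum_of_qty_This_year"),
    sum(col("QtyP")).as("sum_of_qty_previous_year")
  )
  .orderBy(col("Year").desc)

resAllYears.show(false)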

Related

Creating new column with values of looping through SparkSQL Dataframe

I have a Spark SQL dataframe which contains a unique code, a monthdate, and a turnover amount. I want to loop over each monthdate to get the sum of the turnover over 12 months as turnover_per_year. For example, if the monthdate is January 2022, then the sum runs from January 2021 to January 2022.
For example, if I have data for 2021 and 2022, the calculation of turnover_per_year on January 1st 2022 = turnover Jan 2021 + turnover Feb 2021 + turnover Mar 2021 + ... + turnover Dec 2021 + turnover Jan 2022. But if I want to get the turnover_per_year for January 2021, it will be null because I don't have the data for 2020.
This is the sample of the dataframe
+------+------------+----------+
| code | monthdate  | turnover |
+------+------------+----------+
| AA1  | 2021-01-01 | 10       |
| AA1  | 2021-02-01 | 20       |
| AA1  | 2021-03-01 | 30       |
| AA1  | 2021-04-01 | 40       |
| AA1  | 2021-05-01 | 50       |
| AA1  | 2021-06-01 | 60       |
| AA1  | 2021-07-01 | 70       |
| AA1  | 2021-08-01 | 80       |
| AA1  | 2021-09-01 | 90       |
| AA1  | 2021-10-01 | 100      |
| AA1  | 2021-11-01 | 101      |
| AA1  | 2021-12-01 | 102      |
| AA1  | 2022-01-01 | 103      |
| AA1  | 2022-02-01 | 104      |
| AA1  | 2022-03-01 | 105      |
| AA1  | 2022-04-01 | 106      |
| AA1  | 2022-05-01 | 107      |
| AA1  | 2022-06-01 | 108      |
| AA1  | 2022-07-01 | 109      |
| AA1  | 2022-08-01 | 110      |
| AA1  | 2022-09-01 | 111      |
| AA1  | 2022-10-01 | 112      |
| AA1  | 2022-11-01 | 113      |
| AA1  | 2022-12-01 | 114      |
+------+------------+----------+
I'm very new to Spark and Scala, and it's confusing for me to solve this the Spark Scala way. I have developed the logic but have difficulty translating it to Spark Scala. I'm running in cluster mode. Here's my logic:
listkey = df.select("code").distinct.map(r => r(0)).collect()
listkey.foreach(key =>
  df.select("*").filter(s"code == ${key}").orderBy("monthdate").foreach(row =>
    var monthdate = row.monthdate
    var turnover = row.turnover
    var sum = 0
    sum = sum + turnover
    var n = 1
    var i = 1
    while (n < 12) {
      var monthdate_temp = datetime - i
      var turnover_temp =
        df.select("turnover").filter(s"monthdate = ${monthdate_temp} and code = ${key}").collect()
      sum = sum + turnover_temp
      n = n + 1
      i = i + 1
    }
    row = row.withColumn("turnover_per_year", sum)
  )
)
Any help will be appreciated, thanks in advance
Each row in the original dataframe can be expanded into one row per month going back a year (13 rows, for shifts of 0 to 12 months) with the explode function; the result is then joined back to the original dataframe and grouped:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("AA1", "2021-01-01", 25),
  ("AA1", "2022-01-01", 103)
)
  .toDF("code", "monthdate", "turnover")
  .withColumn("monthdate", to_date($"monthdate", "yyyy-MM-dd"))

// One literal per month shift: 0 (the month itself) back to 12 months ago
val oneYearBackMonths = (0 to 12).map(n => lit(-n))

val explodedWithBackMonths = df
  .withColumn("shift", explode(array(oneYearBackMonths: _*)))
  .withColumn("rangeMonth", expr("add_months(monthdate, shift)"))

val joinCondition = $"exploded.code" === $"original.code" &&
  $"exploded.rangeMonth" === $"original.monthdate"

explodedWithBackMonths.alias("exploded")
  .join(df.alias("original"), joinCondition)
  .groupBy($"exploded.code", $"exploded.monthdate")
  .agg(sum($"original.turnover").alias("oneYearTurnover"))
Result:
+----+----------+---------------+
|code|monthdate |oneYearTurnover|
+----+----------+---------------+
|AA1 |2021-01-01|25 |
|AA1 |2022-01-01|128 |
+----+----------+---------------+
You can use Spark's window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val raw = Seq(
  ("AA1", "2019-01-01", 25),
  ("AA1", "2021-01-01", 25),
  ("AA1", "2021-08-01", 80),
  ("AA1", "2021-09-01", 90),
  ("AA1", "2022-01-01", 103),
  ("AA2", "2022-01-01", 10)
).toDF("code", "monthdate", "turnover")

// Note: the month pattern must be MM; lowercase mm would parse minutes
val df = raw.withColumn("monthdate", to_timestamp($"monthdate", "yyyy-MM-dd"))

// Range window over the previous 365 days (expressed in seconds), per code
val pw = Window.partitionBy($"code")
  .orderBy($"monthdate".cast("long"))
  .rangeBetween(-(86400L * 365), 0)

df.withColumn("sum", sum($"turnover").over(pw)).show()
+----+-------------------+--------+---+
|code|          monthdate|turnover|sum|
+----+-------------------+--------+---+
| AA1|2019-01-01 00:00:00|      25| 25|
| AA1|2021-01-01 00:00:00|      25| 25|
| AA1|2021-08-01 00:00:00|      80|105|
| AA1|2021-09-01 00:00:00|      90|195|
| AA1|2022-01-01 00:00:00|     103|298|
| AA2|2022-01-01 00:00:00|      10| 10|
+----+-------------------+--------+---+
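A 365-day range ignores leap years and month lengths. As a hedged alternative (my own sketch, not from the original answer), ordering on a month index gives an exact window of the current month plus the 12 preceding calendar months, matching the question's Jan 2021 through Jan 2022 example:

import org.apache.spark.sql.functions.{year, month, sum}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// monthIdx counts months linearly, so rangeBetween(-12, 0) spans exactly
// 13 calendar months: the current month plus the 12 before it
val withIdx = df.withColumn("monthIdx", year($"monthdate") * 12 + month($"monthdate"))

val mw = Window.partitionBy($"code")
  .orderBy($"monthIdx")
  .rangeBetween(-12, 0)

withIdx.withColumn("sum", sum($"turnover").over(mw)).show()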
I created 2 window functions for testing; can you please check this and comment on whether it is fine?
val w = Window.partitionBy($"code")
  .orderBy($"rownum")
  .rowsBetween(-11, Window.currentRow)
val w1 = Window.partitionBy($"code")
  .orderBy($"monthdate")

val newDf = initDf.withColumn("rownum", row_number().over(w1))
  .withColumn("csum", sum("turnover").over(w))
We may need to first group by month and year and take the sum of the turnover for that month, then sort by month within each code, as sketched below.
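A minimal sketch of that suggestion (assuming initDf has code, monthdate, and turnover columns; this is my illustration, not tested code from the thread). Pre-aggregating to one row per code and month means rowsBetween(-11, 0) spans 12 rows, which equals 12 calendar months only when no months are missing:

import org.apache.spark.sql.functions.{sum, trunc}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// Collapse to one row per code and month before applying the window
val monthly = initDf
  .groupBy($"code", trunc($"monthdate", "month").as("month"))
  .agg(sum($"turnover").as("monthlyTurnover"))

// 12 rows back = 12 months back only if every month is present
val w = Window.partitionBy($"code")
  .orderBy($"month")
  .rowsBetween(-11, Window.currentRow)

monthly.withColumn("turnover_per_year", sum($"monthlyTurnover").over(w))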

Tableau - How to check if a value equals a value from another row and column

I have the following table:
+------------+--------------+---------+---------+---------+
| Category | Subcategory |FruitName| Date1 | Date2 |
+------------+--------------+---------+---------+---------+
| A | 1 | Foo | 2011 | 2017 |
| | +---------+---------+---------+
| | |Pineapple| 2011 | 2013 |
| | +---------+---------+---------+
| | | Apple | 2017 | 2018 |
| +--------------+---------+---------+---------+
| | 2 | Peach | 2014 | 2015 |
| | +---------+---------+---------+
| | | Orange | 2015 | 2018 |
| | +---------+---------+---------+
| | | Banana | 2009 | 2013 |
+------------+--------------+---------+---------+---------+
I'd like to display the fruit names where Date1 from one row equals Date2 from another row, but only when they match within the same Subcategory. In the table above, the filter should retrieve rows based on those criteria, and the final table would look like this:
+------------+--------------+---------+---------+---------+
| Category | Subcategory |FruitName| Date1 | Date2 |
+------------+--------------+---------+---------+---------+
| A | 1 | Foo | 2011 | 2017 |
| | +---------+---------+---------+
| | | Apple | 2017 | 2018 |
| +--------------+---------+---------+---------+
| | 2 | Peach | 2014 | 2015 |
| | +---------+---------+---------+
| | | Orange | 2015 | 2018 |
+------------+--------------+---------+---------+---------+
How can I possibly achieve this?
The logic you provided does not match the output you provided. If you are after that output, your logic should be:
SELECT f1.* FROM fruits f1 JOIN fruits f2
  ON f1.Subcategory = f2.Subcategory
WHERE f1.Date1 = f2.Date2 OR f1.Date2 = f2.Date1;
If your data source supports custom SQL, you can use the above query directly. If not, you can still achieve it in Tableau with a full outer join and a calculated field (Tableau doesn't support OR conditions in joins):
Create a self full outer join on Subcategory.
Create a boolean calculation called 'FILTER' that checks the Date1 = Date2 match in both directions.
Apply a data source filter to keep only 'FILTER' = True.
Hide the fields from the right-side connection and you will have the required output.

how to compare two data frames in scala

I have two dataframes with identical schemas for a comparison test:
df1
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 12332 | 53255 | 55324 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 14463 | 76543 | 66433 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
df2
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 65333 | 65555 | 125 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 533 | 75 | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want to compare these two dfs on count2 through count4; if the counts don't match, then print out some message saying they mismatch.
Here is my try:
val cols = df1.columns.filter(_ != "year").toList

def mapDiffs(name: String) =
  when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)

val result = df1.as("l").join(df2.as("r"), "year")
  .select($"year" :: cols.map(mapDiffs): _*)
It then compares rows that merely share the same year against each other, so it didn't do what I wanted.
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | no | no | no |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | no | no | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want the result to come out as above, how do I achieve that?
Edit: also, in a different scenario, if I want to compare columns within a single df (one column against other columns), how do I do that? Like:
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
I want to compare the count3 and count4 columns to count2. Obviously count3 and count4 do not match count2, so I want the result to be:
-----------------------------------------------
year | state | count2 | count3 | count4 |
2014 | NJ | 12332 | mismatch | mismatch |
Thank you!
The dataframe join on year won't work for your mapDiffs method. You need a row-identifying column in df1 and df2 for the join.
import org.apache.spark.sql.functions._

val df1 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "55324"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "14463", "76543", "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")

val df2 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "125"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "533", "75", "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")
Skip this if you already have a row-identifying column (say, rowId) in the dataframes for the join:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd1 = df1.rdd.zipWithIndex.map {
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1i = spark.createDataFrame(rdd1,
  StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)

val rdd2 = df2.rdd.zipWithIndex.map {
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2i = spark.createDataFrame(rdd2,
  StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
Now, define mapDiffs and apply it to the selected columns after joining the dataframes by rowId:
def mapDiffs(name: String) =
  when($"l.$name" === $"r.$name", $"l.$name").otherwise("no").as(name)

val cols = df1i.columns.filter(_.startsWith("count")).toList

val result = df1i.as("l").join(df2i.as("r"), "rowId")
  .select($"l.rowId" :: $"l.year" :: cols.map(mapDiffs): _*)
// +-----+----+------+------+------+
// |rowId|year|count2|count3|count4|
// +-----+----+------+------+------+
// | 0|2014| 12332| 54322| 53422|
// | 5|2016| 55325| 76543| 66433|
// | 1|2014| 12332| 53255| no|
// | 3|2015| no| no| 64524|
// | 2|2015| 12332| 53255| 55324|
// | 4|2016| 14463| 76543| 66433|
// +-----+----+------+------+------+
Note that there appear to be more discrepancies between df1 and df2 than just the 3 no-spots in your sample result; I've modified the sample data above so those 3 spots are the only differences.
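As for the edit (comparing columns within a single dataframe against one reference column), a minimal sketch along the same lines, using the df1 above (my illustration, not part of the original answer):

import org.apache.spark.sql.functions._
import spark.implicits._

// Flag count3/count4 cells that differ from count2 in the same row
def colDiff(ref: String)(name: String) =
  when(col(name) === col(ref), col(name)).otherwise(lit("mismatch")).as(name)

val checked = df1.select(
  $"year" :: $"state" :: $"count2" ::
    List("count3", "count4").map(colDiff("count2")): _*
)
checked.show()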

Calculate LAG variable after filtering in Tableau

I have a dataset with 4 columns: ID (unique identifier of user), Year, Country and Level in this format:
+----+------+---------+-------+
| ID | Year | Country | Level |
+----+------+---------+-------+
| 1 | 2015 | USA | 1 |
| 1 | 2016 | China | 2 |
| 2 | 2015 | China | 2 |
| 2 | 2016 | Russia | 2 |
| 3 | 2015 | Russia | 1 |
| 3 | 2016 | China | 2 |
| 4 | 2015 | USA | 2 |
| 4 | 2016 | USA | 3 |
| 5 | 2014 | China | 1 |
| 5 | 2016 | USA | 2 |
| 6 | 2015 | USA | 1 |
| 6 | 2016 | USA | 2 |
| 7 | 2015 | Russia | 2 |
| 7 | 2016 | China | 3 |
+----+------+---------+-------+
The user will be able to filter the dataset by country.
Using the country filter, I want to create a table with a column showing whether a user was, in the previous year, in any of the selected countries, aggregated by the Level variable (alongside other variables affected only by the current country filter).
For example, if I select China and USA:
+----+------+---------+-------+-----------------+
| ID | Year | Country | Level | In selection PY |
+----+------+---------+-------+-----------------+
| 1 | 2015 | USA | 1 | No |
| 1 | 2016 | China | 2 | Yes |
| 2 | 2015 | China | 2 | No |
| 3 | 2016 | China | 2 | No |
| 4 | 2015 | USA | 2 | No |
| 4 | 2016 | USA | 3 | Yes |
| 5 | 2014 | China | 1 | No |
| 5 | 2016 | USA | 2 | No |
| 6 | 2015 | USA | 1 | No |
| 6 | 2016 | USA | 2 | Yes |
| 7 | 2016 | China | 3 | No |
+----+------+---------+-------+-----------------+
The aggregated result will be:
+-------+-------------------+-----------------+
| Level | Number of records | In selection PY |
+-------+-------------------+-----------------+
| 1 | 3 | 0 |
| 2 | 6 | 2 |
| 3 | 2 | 1 |
+-------+-------------------+-----------------+
Do you know any way to calculate this aggregated table efficiently? (this would be done in a dataset with millions of rows, with a variable set of countries to be selected)
I found a solution and will post it in case it is helpful for someone else:
I changed the Country filter to "Add to Context" and created this calculated field:
In Selection PY:
if Year = 2016 then
    {fixed [ID]: min(if Year = 2015 then 1 END)}
elseif Year = 2015 then
    {fixed [ID]: min(if Year = 2014 then 1 END)}
elseif Year = 2014 then
    {fixed [ID]: min(if Year = 2013 then 1 END)}
END
In this way, the In Selection PY variable is dynamically calculated according to the country filter.
It is only necessary to know in advance which years are in the dataset (or add more years to be safe).

Symfony2 Query to find last working date from Holiday Calender

I have a Calender entity in my project which manages the open and close times of every business day of the year.
Below are the records for a specific month:
id | today_date | year | month_of_year | day_of_month | is_business_day
-------+---------------------+------+---------------+-------------+---------------+
10103 | 2016-02-01 00:00:00 | 2016 | 2 | 1 | t
10104 | 2016-02-02 00:00:00 | 2016 | 2 | 2 | t
10105 | 2016-02-03 00:00:00 | 2016 | 2 | 3 | t
10106 | 2016-02-04 00:00:00 | 2016 | 2 | 4 | t
10107 | 2016-02-05 00:00:00 | 2016 | 2 | 5 | t
10108 | 2016-02-06 00:00:00 | 2016 | 2 | 6 | f
10109 | 2016-02-07 00:00:00 | 2016 | 2 | 7 | f
10110 | 2016-02-08 00:00:00 | 2016 | 2 | 8 | t
10111 | 2016-02-09 00:00:00 | 2016 | 2 | 9 | t
10112 | 2016-02-10 00:00:00 | 2016 | 2 | 10 | t
10113 | 2016-02-11 00:00:00 | 2016 | 2 | 11 | t
10114 | 2016-02-12 00:00:00 | 2016 | 2 | 12 | t
10115 | 2016-02-13 00:00:00 | 2016 | 2 | 13 | f
10116 | 2016-02-14 00:00:00 | 2016 | 2 | 14 | f
10117 | 2016-02-15 00:00:00 | 2016 | 2 | 15 | t
10118 | 2016-02-16 00:00:00 | 2016 | 2 | 16 | t
10119 | 2016-02-17 00:00:00 | 2016 | 2 | 17 | t
10120 | 2016-02-18 00:00:00 | 2016 | 2 | 18 | t
I want to get the today_date of the 7th-last working day. Suppose today_date is 2016-02-18; then the 7th-last working date would be 2016-02-09.
You can use row_number() for this, like so (note the subquery needs an alias in most databases):
SELECT * FROM
  (SELECT t.*, row_number() OVER (ORDER BY today_date DESC) AS rnk
   FROM Calender t
   WHERE today_date <= current_date
     AND is_business_day = 't') ranked
WHERE rnk = 7
This will give you the row of the 7th business day back from today's date.
I see that you tagged your question with Doctrine, ORM, and Datetime. Were you after a QueryBuilder solution? Maybe this is closer to what you want:
$qb->select('c.today_date')
   ->from(Calendar::class, 'c')
   ->where('c.today_date <= :today')
   ->andWhere("c.is_business_day = 't'")
   ->orderBy('c.today_date', 'DESC')
   ->setMaxResults(7)
   ->setParameter('today', new \DateTime('now'), \Doctrine\DBAL\Types\Type::DATETIME);