I have two DataFrames with the same schema for a comparison test:
df1
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 12332 | 53255 | 55324 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 14463 | 76543 | 66433 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
df2
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 65333 | 65555 | 125 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 533 | 75 | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want to compare these two dfs on count2 through count4; if the counts don't match, print out some message saying they are mismatching.
Here is my attempt:
val cols = df1.columns.filter(_ != "year").toList
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = df1.as("l").join(df2.as("r"), "year").select($"year" :: cols.map(mapDiffs): _*)
It ends up comparing any rows that share the same year (and state) with each other, so it didn't do what I wanted.
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | no | no | no |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | no | no | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want the result to come out as above; how do I achieve that?
Edit: also, in a different scenario, if I want to compare columns within just one df (one column against other columns), how do I do that?
Like:
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
I want to compare the count3 and count4 cols to count2; obviously count3 and count4 do not match count2, so I want the result to be:
-----------------------------------------------
year | state | count2 | count3 | count4 |
2014 | NJ | 12332 | mismatch | mismatch |
Thank you!
The dataframe join on year won't work for your mapDiffs method. You need a row-identifying column in df1 and df2 for the join.
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $-column syntax
val df1 = Seq(
("2014", "NJ", "12332", "54322", "53422"),
("2014", "NJ", "12332", "53255", "55324"),
("2015", "CO", "12332", "53255", "55324"),
("2015", "MD", "14463", "76543", "64524"),
("2016", "CT", "14463", "76543", "66433"),
("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")
val df2 = Seq(
("2014", "NJ", "12332", "54322", "53422"),
("2014", "NJ", "12332", "53255", "125"),
("2015", "CO", "12332", "53255", "55324"),
("2015", "MD", "533", "75", "64524"),
("2016", "CT", "14463", "76543", "66433"),
("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")
Skip this if you already have a row-identifying column (say, rowId) in the dataframes for the join:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1i = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2i = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
Now, define mapDiffs and apply it to the selected columns after joining the dataframes by rowId:
def mapDiffs(name: String) =
when($"l.$name" === $"r.$name", $"l.$name").otherwise("no").as(name)
val cols = df1i.columns.filter(_.startsWith("count")).toList
val result = df1i.as("l").join(df2i.as("r"), "rowId").
select($"l.rowId" :: $"l.year" :: cols.map(mapDiffs): _*)
// +-----+----+------+------+------+
// |rowId|year|count2|count3|count4|
// +-----+----+------+------+------+
// | 0|2014| 12332| 54322| 53422|
// | 5|2016| 55325| 76543| 66433|
// | 1|2014| 12332| 53255| no|
// | 3|2015| no| no| 64524|
// | 2|2015| 12332| 53255| 55324|
// | 4|2016| 14463| 76543| 66433|
// +-----+----+------+------+------+
Note that there appear to be more discrepancies between df1 and df2 than just the 3 "no" spots in your sample result. I've modified the sample data to make those 3 spots the only differences.
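For the single-dataframe scenario in your edit, no join is needed; a similar when expression can compare count3 and count4 against count2 directly. A minimal sketch, assuming the same column names as in your example (refCol, compareCols and singleDfResult are names I made up):
val refCol = "count2"
val compareCols = Seq("count3", "count4")

val singleDfResult = df1.select(
  $"year" +: $"state" +: col(refCol) +:
    compareCols.map(c => when(col(c) === col(refCol), col(c)).otherwise("mismatch").as(c)): _*
)
This prints "mismatch" in count3/count4 wherever the value differs from count2, which matches the second sample output you described.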
Related
I have a dataframe which contains columns like Month and Qty, as you can see in the table below:
| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |
I need to do a sum of the Qty grouped by the Fruit (for each year). My output DF will be the table below:
| Year | Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| ---- | -------- | --------------------- | -------------------------- |
| 2022 | orange | 29384 | 34534 |
| 2021 | orange | 34534 | 93584 |
But there is a catch here; consider the table below.
| current year  | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
| previous year | jan | feb |     | apr | may | jun | jul | aug |     | oct | nov | dec |
As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum for the current year, Qty should exclude those missing months, and this should be done for each year.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._
val df1 = Seq(
("2021-01", "orange", 5223),
("2021-02", "orange", 23),
("2022-01", "orange", 2342),
("2022-02", "orange", 37667),
("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")
val currentYear = 2022
val priorYear = 2021
val currentYearDF = df1
.filter(col("Month").substr(1, 4) === currentYear)
val priorYearDF = df1
.filter(col("Month").substr(1, 4) === priorYear)
.withColumnRenamed("Month", "MonthP")
.withColumnRenamed("Fruit", "FruitP")
.withColumnRenamed("Qty", "QtyP")
// Inner join on fruit and month-of-year: any month missing from either year
// drops out, so both sums cover exactly the same set of months.
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )
resDF.show(false)
// +------+--------------------+------------------------+
// |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
// +------+--------------------+------------------------+
// |orange|40009 |5246 |
// +------+--------------------+------------------------+
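The snippet above hard-codes the 2022/2021 pair; to cover every year as the question asks, one option (a sketch only, the val and column names like withYear, Mon and YearP are mine) is to derive Year and month columns and self-join on Year - 1, still matching on the month so months missing in the prior year fall out of both sums:
import org.apache.spark.sql.functions.{col, sum}

val withYear = df1
  .withColumn("Year", col("Month").substr(1, 4).cast("int"))
  .withColumn("Mon", col("Month").substr(6, 2))

val prior = withYear
  .select(col("Year").as("YearP"), col("Mon").as("MonP"),
          col("Fruit").as("FruitP"), col("Qty").as("QtyP"))

val allYearsDF = withYear
  .join(prior,
        col("Fruit") === col("FruitP") &&
        col("Mon") === col("MonP") &&
        col("Year") === col("YearP") + 1)   // inner join: months absent in either year drop out
  .groupBy("Year", "Fruit")
  .agg(
    sum("Qty").as("sum_of_qty_This_year"),
    sum("QtyP").as("sum_of_qty_previous_year")
  )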
I am trying to join a DF with another two using a condition. I have the following DFs.
DF1, the DF that I want to join with df_cond1 and df_cond2.
If the DF1 InfoNum column is NBC I want to join with df_cond1, else if it is BBC I want to join with df_cond2, but I don't know how I can do this.
DF1
+-------------+----------+-------------+
| Date | InfoNum | Sport |
+-------------+----------+-------------+
| 31/11/2020 | NBC | football |
| 11/01/2020 | BBC | tennis |
+-------------+----------+-------------+
df_cond1
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Monthly | NBC | DATAquality |
+-------------+---------+-------------+
df_cond2
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Daily | BBC | InfoIndeed |
+-------------+---------+-------------+
final_df
+-------------+----------+-------------+-------------+
| Date | InfoNum | Sport | Description |
+-------------+----------+-------------+-------------+
| 31/11/2020 | NBC | football | DATAquality |
| 11/01/2020 | BBC | tennis | InfoIndeed |
+-------------+----------+-------------+-------------+
I have been searching but haven't found a good solution; can you help me?
Here is how you can do it: since both lookup DataFrames share the same schema, you can union them and join once on InfoNum === Info.
val df = Seq(
("31/11/2020", "NBC", "football"),
("1/01/2020", "BBC", "tennis")
).toDF("Date", "InfoNum", "Sport")
val df_cond1 = Seq(
("Monthly", "NBC", "DATAquality")
).toDF("Periodicity", "Info", "Description")
val df_cond2 = Seq(
("Daily", "BBC", "InfoIndeed")
).toDF("Periodicity", "Info", "Description")
df.join(df_cond1.union(df_cond2), $"InfoNum" === $"Info")
.drop("Info", "Periodicity")
.show(false)
Output:
+----------+-------+--------+-----------+
|Date |InfoNum|Sport |Description|
+----------+-------+--------+-----------+
|31/11/2020|NBC |football|DATAquality|
|1/01/2020 |BBC |tennis |InfoIndeed |
+----------+-------+--------+-----------+
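If the two lookup dataframes ever stop sharing a schema, the union trick no longer applies. An alternative sketch (same sample data; the Info1/Desc1/Info2/Desc2 aliases are just for illustration) uses two left joins and coalesce to pick whichever Description matched:
import org.apache.spark.sql.functions.coalesce

df.join(df_cond1.select($"Info".as("Info1"), $"Description".as("Desc1")),
        $"InfoNum" === $"Info1", "left")
  .join(df_cond2.select($"Info".as("Info2"), $"Description".as("Desc2")),
        $"InfoNum" === $"Info2", "left")
  .withColumn("Description", coalesce($"Desc1", $"Desc2")) // NBC rows match df_cond1, BBC rows match df_cond2
  .drop("Info1", "Desc1", "Info2", "Desc2")
  .show(false)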
How can I check the dates from adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key and dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the dates of these rows overlap (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need to set the lowest begin date and the highest end date on all such rows.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
1. Create a new column group_id that is null if begin_dt falls within the date range of the previous row, otherwise a unique integer
2. Backfill the nulls in group_id with the last non-null value
3. Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(10, "ABC", "2018-01-01", "2018-01-08"),
(10, "BAC", "2018-01-03", "2018-01-15"),
(10, "CAS", "2018-01-03", "2018-01-21"),
(20, "AAA", "2017-11-12", "2018-01-03"),
(20, "DAS", "2018-01-01", "2018-01-12"),
(20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")
val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")
df.
withColumn("group_id", when(
$"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
).otherwise(monotonically_increasing_id)
).
withColumn("group_id", last($"group_id", ignoreNulls=true).
over(win1.rowsBetween(Window.unboundedPreceding, 0))
).
withColumn("begin_dt2", min($"begin_dt").over(win2)).
withColumn("end_dt2", max($"end_dt").over(win2)).
orderBy("key", "begin_dt", "end_dt").
show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+
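To end up with exactly the final_Df shape from the question, a short follow-up (assuming the chained result above, before the show, is stored in a val, say grouped) drops the helper columns and renames the aggregated dates:
val final_Df = grouped
  .select($"key", $"code", $"begin_dt2".as("begin_dt"), $"end_dt2".as("end_dt"))
  .orderBy("key", "begin_dt", "end_dt")

final_Df.show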
I have two Spark DataFrames, dfA and dfB.
I want to filter dfA by each row of dfB, which means if dfB has 10000 rows, I need to filter dfA 10000 times with 10000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is:
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solution is:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get the range array of this id
  // initialize a resultDF to store each query's results
  val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
  val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
  val s = "value1 between " + min_value1 + " and " + max_value1
  var resultDF = dfa.filter(s).groupBy("id")
    .agg(collect_list("value1").as("results"),
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))
  for (i <- 1 to rangArray.length - 1) {
    val temp_min_value1 = rangArray(i).get(0).asInstanceOf[Int]
    val temp_max_value1 = rangArray(i).get(1).asInstanceOf[Int]
    val query = "value1 between " + temp_min_value1 + " and " + temp_max_value1
    val tempResultDF = dfa.filter(query).groupBy("id")
      .agg(collect_list("value1").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  resultDF
}
def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA)
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()
  // initialize the result
  var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int], dfA, dfB)
  // traverse all ids
  for (i <- 1 to idArrays.length - 1) {
    val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int], dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  resultDF
}
Maybe you don't want to read my brute-force code; its idea is:
finalResult = null
for each id in dfB:
  for each query condition of this id:
    tempResult = query dfA
    union tempResult into finalResult
I've tried my algorithm; it took almost 50 hours.
Does anybody have a more efficient way? Many thanks.
Assuming that your dfB is a small dataset, I am suggesting the solution below.
Try using a broadcast join like this:
import org.apache.spark.sql.functions.{broadcast, col, collect_list}

dfA.as("dfA").join(broadcast(dfB.as("dfB")),
    col("dfA.id") === col("dfB.id") && col("dfA.value1") >= col("dfB.min_value1") && col("dfA.value1") <= col("dfB.max_value1"))
  .groupBy(col("dfB.id"), col("dfB.min_value1"), col("dfB.max_value1"))
  .agg(collect_list(col("dfA.value2")).as("results"))
A broadcast join is like a map-side join. It materializes the smaller dataset on all the executors, which improves performance by omitting the sort-and-shuffle phase that a reduce-side join would require.
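As a side note, whether Spark broadcasts automatically is governed by spark.sql.autoBroadcastJoinThreshold (about 10 MB by default); the explicit broadcast() hint above forces it regardless, but if dfB comfortably fits in executor memory you can also raise the threshold, for example:
// raise the auto-broadcast threshold to ~100 MB so small tables like dfB
// are planned as broadcast joins even without an explicit hint
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)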
Some things I would like you to avoid:
Never use collect(). When a collect operation is issued on an RDD, the dataset is copied to the driver.
If your data is too big you might get an out-of-memory exception.
Try using take() or takeSample() instead.
It is obvious that when two dataframes/datasets are involved in a calculation, a join has to be performed. So a join is a necessary step for you; the important question is when you should join.
I would suggest aggregating and reducing the rows in the dataframes as much as possible before joining, as that would reduce shuffling.
In your case you can reduce only dfA, since you need dfB exactly as it is, just with a column added from the dfA rows meeting the condition.
So you can groupBy id and aggregate dfA so that you get one row per id, then perform the join. After that you can use a udf function for your calculation logic.
Comments are provided in the code for clarity and explanation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

//udf function to keep only the collected value2 whose value1 lies within the range [min_value1, max_value1]
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list
    .filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))
dfA.groupBy("id") //grouping by id
.agg(collect_list(struct("value1", "value2")).as("collection")) //collecting all the value1 and value2 as structs
.join(dfB, Seq("id"), "right") //joining both dataframes with id
.select(col("id"), col("min_value1"), col("max_value1"), selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) //calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
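If you are on Spark 2.4 or later, the same per-row filtering can also be expressed with the built-in higher-order functions filter and transform instead of a udf (a sketch over the same collected struct column):
import org.apache.spark.sql.functions.{col, collect_list, struct, expr}

dfA.groupBy("id")
  .agg(collect_list(struct("value1", "value2")).as("collection"))
  .join(dfB, Seq("id"), "right")
  .select(
    col("id"), col("min_value1"), col("max_value1"),
    expr("transform(filter(collection, x -> x.value1 between min_value1 and max_value1), x -> x.value2)")
      .as("results"))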
I hope the answer is helpful
I need to "extract" some data contained in an Iterable[MyObject] (it was an RDD[MyObject] before a groupBy).
My initial RDD[MyObject]:
|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 1 | 1 |
| | |----|-----|
| | | 2 | 1 |
| | |----|-----|
| | | 3 | 50 |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 5 | 40 |
| | |----|-----|
| | | 6 | 41 |
| | |----|-----|
| | | 7 | 2 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 9 | 15 |
| | |----|-----|
| | | 10| 16 |
| | |----|-----|
| | | 11| 46 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 13| 7 |
| | |----|-----|
| | | 14| 9 |
| | |----|-----|
| | | 15| 60 |
|-----------|---------|----|-----|
| Barcelona | London | ID | Age |
| | |----|-----|
| | | 17| 66 |
| | |----|-----|
| | | 18| 53 |
| | |----|-----|
| | | 19| 11 |
|-----------|---------|----|-----|
I need to count them by age range, grouped by startCity - endCity.
The final result should be:
|-----------|---------|-------------|
| startCity | endCity | Customer |
|-----------|---------|-------------|
| Paris | London | Range| Count|
| | |------|------|
| | |0-2 | 3 |
| | |------|------|
| | |3-18 | 0 |
| | |------|------|
| | |19-99 | 3 |
|-----------|---------|-------------|
| New-York | Paris | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 3 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
| Barcelona | London | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 1 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
At the moment I'm doing this by counting the same data 3 times (first with the 0-2 range, then 10-20, then 21-99).
Like:
// given ite: Iterable[MyObject]
ite.count(x => x.age match {
  case Some(age) => age >= 0 && age < 2
  case None => false
})
It works and gives me an Integer, but I think it is not efficient at all, since I have to count many times. What's the best way to do this, please?
Thanks
EDIT : The Customer object is a case class
def computeRange(age : Int) =
if(age<=2)
"0-2"
else if(age<=10)
"2-10"
// etc, you get the idea
Then, with an RDD of case class MyObject(id : String, age : Int)
rdd
.map(x=> computeRange(x.age) -> 1)
.reduceByKey(_+_)
Edit:
If you need to group by some columns, you can do it this way, provided that you have an RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" with its number of occurrences.
def computeMapOfOccurances(list : Iterable[MyObject]) : Map[String, Int] =
list
.map(_.age)
.map(computeRange)
.groupBy(x=>x)
.mapValues(_.size)
val result1 = rdd
.mapValues( computeMapOfOccurances(_))
And if you need to flatten your data, you can write:
val result2 = result1
.flatMapValues(_.toSeq)
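Note that with this approach, a range that has no customers for a given (startCity, endCity) simply won't appear, whereas your expected output lists it with a count of 0. A small sketch to overlay the computed counts onto a fixed set of labels (assuming computeRange emits the three labels from your table; allRanges and result3 are names I made up):
val allRanges = Seq("0-2", "3-18", "19-99")

val result3 = result1.mapValues { counts =>
  allRanges.map(r => r -> counts.getOrElse(r, 0))
}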
Assuming that you have the Customer object as a case class, as below,
case class Customer(ID: Int, Age: Int)
and your RDD[MyObject] is an RDD of the case class below,
case class MyObject(startCity: String, endCity: String, customer: List[Customer])
then, using the above case classes, the input you have in table format would look like this:
MyObject(Paris,London,List(Customer(1,1), Customer(2,1), Customer(3,50)))
MyObject(Paris,London,List(Customer(5,40), Customer(6,41), Customer(7,2)))
MyObject(New-York,Paris,List(Customer(9,15), Customer(10,16), Customer(11,46)))
MyObject(New-York,Paris,List(Customer(13,7), Customer(14,9), Customer(15,60)))
MyObject(Barcelona,London,List(Customer(17,66), Customer(18,53), Customer(19,11)))
And you've also mentioned that after grouping you have an Iterable[MyObject], which is equivalent to the step below:
val groupedRDD = rdd.groupBy(myobject => (myobject.startCity, myobject.endCity)) //groupedRDD: org.apache.spark.rdd.RDD[((String, String), Iterable[MyObject])] = ShuffledRDD[2] at groupBy at worksheetTest.sc:23
So the next step is to use mapValues to iterate through the Iterable[MyObject], count the ages belonging to each range, and finally convert to the output you require, as below:
val finalResult = groupedRDD.mapValues(x => {
val rangeAge = Map("0-2" -> 0, "3-18" -> 0, "19-99" -> 0)
val list = x.flatMap(y => y.customer.map(z => z.Age)).toList
updateCounts(list, rangeAge).map(x => CustomerOut(x._1, x._2)).toList
})
where updateCounts is a recursive function
def updateCounts(ageList: List[Int], map: Map[String, Int]) : Map[String, Int] = ageList match{
case head :: tail => if(head >= 0 && head < 3) {
updateCounts(tail, map ++ Map("0-2" -> (map("0-2")+1)))
} else if(head >= 3 && head < 19) {
updateCounts(tail, map ++ Map("3-18" -> (map("3-18")+1)))
} else updateCounts(tail, map ++ Map("19-99" -> (map("19-99")+1)))
case Nil => map
}
and CustomerOut is another case class
case class CustomerOut(Range: String, Count: Int)
So finalResult is as below:
((Barcelona,London),List(CustomerOut(0-2,0), CustomerOut(3-18,1), CustomerOut(19-99,2)))
((New-York,Paris),List(CustomerOut(0-2,0), CustomerOut(3-18,4), CustomerOut(19-99,2)))
((Paris,London),List(CustomerOut(0-2,3), CustomerOut(3-18,0), CustomerOut(19-99,3)))
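For completeness, if flattening to a DataFrame is an option (one row per customer), the same counts, zero rows included, can be sketched with a when-based range column, a groupBy, and a cross join against the fixed list of ranges. This is only a sketch: myObjectRDD stands in for your RDD[MyObject], and flatDF, ranges, counted and finalDF are made-up names.
import org.apache.spark.sql.functions.{col, count, lit, when, coalesce}

// hypothetical flattened input: one row per customer
val flatDF = myObjectRDD
  .flatMap(m => m.customer.map(c => (m.startCity, m.endCity, c.Age)))
  .toDF("startCity", "endCity", "Age")

val ranges = Seq("0-2", "3-18", "19-99").toDF("Range")

val counted = flatDF
  .withColumn("Range",
    when(col("Age") <= 2, "0-2")
      .when(col("Age") <= 18, "3-18")
      .otherwise("19-99"))
  .groupBy("startCity", "endCity", "Range")
  .agg(count(lit(1)).as("Count"))

// cross join with every range so 0-count ranges still show up
val finalDF = flatDF.select("startCity", "endCity").distinct
  .crossJoin(ranges)
  .join(counted, Seq("startCity", "endCity", "Range"), "left")
  .withColumn("Count", coalesce(col("Count"), lit(0)))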