Spark: aggregate values by a list of given years - Scala

I'm new to Scala. Say I have a dataset:
ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+
and I have a list years = List(1, 3, 8), where each value x means x years after system_year.
The goal is to calculate the total number of products sold for each of those years after system_year.
In other words, I have to calculate the total products sold up to the years 2013, 2015 and 2020 (cumulative from system_year, as the expected output below shows).
The output dataset should be like this:
+----+------------------+
|year|total_product_sold|
+----+------------------+
|   1|                 6|   <- 2012 - 2013: 6 products sold
|   3|                 9|   <- 2012 - 2015: 9 products sold
|   8|                13|   <- 2012 - 2020: 13 products sold
+----+------------------+
How can I do this in Scala? Should I use groupBy() in this case?

You could have used a single groupBy with case/when if the year ranges didn't overlap, but here you'll need to do a groupBy for each year and then union the 3 grouped DataFrames:
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

val years = List(1, 3, 8)

val result = years.map { y =>
  df.filter($"year".between($"system_year", $"system_year" + y))
    .groupBy(lit(y).as("year"))
    .agg(sum($"nb_product_sold").as("total_product_sold"))
}.reduce(_ union _)
result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//| 1| 6|
//| 3| 9|
//| 8| 13|
//+----+------------------+
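If you want to avoid scanning the DataFrame once per offset, a single-pass variant is also possible. The sketch below is my own addition (not part of the original answer) and assumes the same df, spark.implicits._ and org.apache.spark.sql.functions._ are in scope:
// Cross-join the offsets in, then filter and aggregate in one pass
val offsets = Seq(1, 3, 8).toDF("offset")

val resultSinglePass = df
  .crossJoin(offsets)
  .filter($"year".between($"system_year", $"system_year" + $"offset"))
  .groupBy($"offset".as("year"))
  .agg(sum($"nb_product_sold").as("total_product_sold"))
This should produce the same three rows as result above.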

There might be multiple ways of doing this, some more efficient than what I am showing you, but it works for your use case.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data
val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012))
  .toDF("year","nb_product_sold","system_year")

// Taking the difference of the years from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

// Getting the total for each year present in the dataframe by partitioning on year
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year","system_year","Difference")

// Creating the years list
val years = List(1, 3, 8)

// Creating a dataframe with the total count for each year, unioning them all and removing duplicates
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year)
    .withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

// Again partitioning by system date and summing across the selected years as per the requirement
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1))
  .select("yearlist","total_product_sold")
You can see the output below.
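On the sample data, finaldf should give the following (reconstructed output, since it was cut off in the original post; the values match the expected result from the question):
finaldf.show
// +--------+------------------+
// |yearlist|total_product_sold|
// +--------+------------------+
// |       1|                 6|
// |       3|                 9|
// |       8|                13|
// +--------+------------------+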

Related

Spark scala dynamically create filter condition value from sequence Pair

I have a Spark dataframe df:
+---+----+-----+
| id|year|month|
+---+----+-----+
|  1|2020|   01|
|  2|2019|   03|
|  3|2020|   01|
+---+----+-----+
I have a sequence year_month = Seq((2019,1),(2020,1),(2021,1)).
The year_month sequence gets generated dynamically every time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping the rows where ($"year", $"month") matches one of the (year, month) pairs in year_month.
You can achieve this by:
1. Creating a dataframe from year_month
2. Performing an inner join of year_month with your original dataframe on month and year
3. Choosing distinct records
The resulting dataframe will be the matched rows.
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
  .selectExpr("_1 as id", "_2 as year", "_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
  .selectExpr("_1 as year", "_2 as month")
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
var dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
var dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: if you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
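As an alternative that matches the question's title more literally (building the filter condition directly from the sequence, without a temp view or join), here is a minimal sketch; it is my own addition and assumes the same df and year_month as above:
import org.apache.spark.sql.functions._
import spark.implicits._

// Build one equality condition per (year, month) pair and OR them together
val cond = year_month
  .map { case (y, m) => $"year" === y && $"month" === m }
  .reduce(_ || _)

val dfFiltered = df.filter(cond).distinct()
Both approaches give the same matched rows; the join tends to scale better when year_month is very large.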

Transpose in Spark scala logic

I have the below dataframe in Spark Scala:
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  6|  7|  8|
|  9| 10| 11| 12|
| 13| 14| 15| 16|
+---+---+---+---+
My code turns every row into a map; the code I tried is:
df.select(map(df.columns.flatMap(c => Seq(lit(c),col(c))):_*).as("map"))
This gives a Map(String -> String) column with only 4 records:
Map(a->1,b->2,c->3,d->4)
Map(a->5,b->6,c->7,d->8)
Map(a->9,b->10,c->11,d->12)
Map(a->13,b->14,c->15,d->16)
But I want to change it to look like below:
a->1
b->2
c->3
d->4
a->5
b->6
c->7
d->8
a->9
b->10
c->11
d->12
a->13
b->14
c->15
d->16
Any suggestion to change/add code to get the desired result? I think it needs some kind of transpose logic; I am rather new to Scala.
Use explode to explode the map data. Try the code below.
df.select(map(df.columns.flatMap(c => Seq(lit(c), col(c))): _*).as("map"))
  .select(explode($"map"))
  .show(false)
Without map, use array:
val colExpr = array(
  df.columns
    .flatMap(c => Seq(struct(lit(c).as("key"), col(c).as("value")).as("map"))): _*
).as("map")

df.select(colExpr)
  .select(explode($"map").as("map"))
  .select($"map.*")
  .show(false)

How can I rank items in a RDD to build a streak?

I have an RDD containing data like this: (downloadId: String, date: LocalDate, downloadCount: Int). The date and download-id are unique and the download-count is for the date.
What I've been trying to accomplish is to get the number of consecutive days (going backwards from the current date) that a download-id was in the top 100 of all download-ids. So if a given download was in the top 100 today, yesterday and the day before, then its streak would be 3.
In SQL, I guess this could be solved using window functions. I've seen similar questions, like this one: How to add a running count to rows in a 'streak' of consecutive days
(I'm rather new to Spark and wasn't sure how to map-reduce an RDD to even begin solving a problem like this.)
Some more information: the dates are the last 30 days, and there are approximately 4M unique download-ids per day.
I suggest you work with DataFrames, as they are much easier to use than RDDs. Leo's answer is shorter, but I couldn't find where it was filtering for the top 100 downloads, so I decided to post my answer as well. It does not depend on window functions, but it is bounded by the number of days in the past you want to streak over. Since you said you only use the last 30 days' data, that should not be a problem.
As a first step, I wrote some code to generate a DF similar to what you described. You don't need to run this first block (if you do, reduce the number of rows unless you have a cluster to try it on; it's heavy on memory). You can see how to transform the RDD (theData) into a DF (baseData). You should define a schema for it, like I did.
import java.time.LocalDate
import scala.util.Random

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import sparkSession.implicits._

val maxId = 10000
val numRows = 15000000
val lastDate = LocalDate.of(2017, 12, 31)

// Generates the data. As a convenience for working with DataFrames, I converted the dates to epoch days.
val theData = sc.parallelize(1.to(numRows).map {
  _ => {
    val id = Random.nextInt(maxId)
    val nDownloads = Random.nextInt(id / 1000 + 1)
    Row(id, lastDate.minusDays(Random.nextInt(30)).toEpochDay, nDownloads)
  }
})

// Working with DataFrames is much simpler, so I'll generate a DF named baseData from the RDD
val schema = StructType(
  StructField("downloadId", IntegerType, false) ::
  StructField("date", LongType, false) ::
  StructField("downloadCount", IntegerType, false) :: Nil)

val baseData = sparkSession.sqlContext.createDataFrame(theData, schema)
  .groupBy($"downloadId", $"date")
  .agg(sum($"downloadCount").as("downloadCount"))
  .cache()
Now you have the data you want in a DF called baseData. The next step is to restrict it to the top 100 for each day - you should discard the data you don't need before doing any additional heavy transformations.
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
def filterOnlyTopN(data: DataFrame, n: Int = 100): DataFrame = {
  // For each day in the data, let's find the cutoff # of downloads to make it into the top N
  val getTopNCutoff = udf((downloads: Seq[Long]) => {
    val reverseSortedDownloads = downloads.sortBy(- _)
    if (reverseSortedDownloads.length >= n)
      reverseSortedDownloads.drop(n - 1).head
    else
      reverseSortedDownloads.last
  })
  val topNLimitsByDate = data.groupBy($"date").agg(collect_set($"downloadCount").as("downloads"))
    .select($"date", getTopNCutoff($"downloads").as("cutoff"))

  // And then, let's throw away the records below the top 100
  data.join(topNLimitsByDate, Seq("date"))
    .filter($"downloadCount" >= $"cutoff")
    .drop("cutoff", "downloadCount")
}
val relevantData = filterOnlyTopN(baseData)
Now that you have the relevantData DF with only the data you need, you can calculate the streak for them. I have left the ids with no streaks as streak 0; you can filter those out by using streaks.filter($"streak" > lit(0)).
def getStreak(df: DataFrame, fromDate: Long): DataFrame = {
  val calcStreak = udf((dateList: Seq[Long]) => {
    if (!dateList.contains(fromDate))
      0
    else {
      val relevantDates = dateList.sortBy(- _)  // Order the dates descending
        .dropWhile(_ != fromDate)               // And drop everything until we find the starting day we are interested in
      if (relevantDates.length == 1)            // If there's only one day left, it's a one day streak
        1
      else                                      // Otherwise, let's count the streak length (this works if no dates are left, too - but not with only 1 day)
        relevantDates.sliding(2)                // Take days by pairs
          .takeWhile { twoDays => twoDays(1) == twoDays(0) - 1 } // While the pair is of consecutive days
          .length + 1                           // And the streak will be the number of consecutive pairs + 1 (the initial day of the streak)
    }
  })
  df.groupBy($"downloadId")
    .agg(collect_list($"date").as("dates"))
    .select($"downloadId", calcStreak($"dates").as("streak"))
}
val streaks = getStreak(relevantData, lastDate.toEpochDay)
streaks.show()
+------------+--------+
| downloadId | streak |
+------------+--------+
| 8086 | 0 |
| 9852 | 0 |
| 7253 | 0 |
| 9376 | 0 |
| 7833 | 0 |
| 9465 | 1 |
| 7880 | 0 |
| 9900 | 1 |
| 7993 | 0 |
| 9427 | 1 |
| 8389 | 1 |
| 8638 | 1 |
| 8592 | 1 |
| 6397 | 0 |
| 7754 | 1 |
| 7982 | 0 |
| 7554 | 0 |
| 6357 | 1 |
| 7340 | 0 |
| 6336 | 0 |
+------------+--------+
And there you have the streaks DF with the data you need.
Using an approach similar to the one in the listed PostgreSQL link, you can apply Window functions in Spark as well. Spark's DataFrame API doesn't have encoders for java.time.LocalDate, so you'll need to convert it to, say, java.sql.Date.
Here are the steps: first, transform the RDD to a DataFrame with a supported date format; next, create a UDF to compute the baseDate, which takes a date and a per-id chronological row-number (generated using a Window function) as parameters. Another Window function is applied to calculate the per-id, per-baseDate row-number, which is the wanted streak value:
import java.time.LocalDate
import org.apache.spark.sql.functions._
import spark.implicits._
val rdd = sc.parallelize(Seq(
(1, LocalDate.parse("2017-12-13"), 2),
(1, LocalDate.parse("2017-12-16"), 1),
(1, LocalDate.parse("2017-12-17"), 1),
(1, LocalDate.parse("2017-12-18"), 2),
(1, LocalDate.parse("2017-12-20"), 1),
(1, LocalDate.parse("2017-12-21"), 3),
(2, LocalDate.parse("2017-12-15"), 2),
(2, LocalDate.parse("2017-12-16"), 1),
(2, LocalDate.parse("2017-12-19"), 1),
(2, LocalDate.parse("2017-12-20"), 1),
(2, LocalDate.parse("2017-12-21"), 2),
(2, LocalDate.parse("2017-12-23"), 1)
))
val df = rdd.map{ case (id, date, count) => (id, java.sql.Date.valueOf(date), count) }.
toDF("downloadId", "date", "downloadCount")
def baseDate = udf( (d: java.sql.Date, n: Long) =>
new java.sql.Date(new java.util.Date(d.getTime).getTime - n * 24 * 60 * 60 * 1000)
)
import org.apache.spark.sql.expressions.Window
val dfStreak = df.withColumn("rowNum", row_number.over(
Window.partitionBy($"downloadId").orderBy($"date")
)
).withColumn(
"baseDate", baseDate($"date", $"rowNum")
).select(
$"downloadId", $"date", $"downloadCount", row_number.over(
Window.partitionBy($"downloadId", $"baseDate").orderBy($"date")
).as("streak")
).orderBy($"downloadId", $"date")
dfStreak.show
+----------+----------+-------------+------+
|downloadId| date|downloadCount|streak|
+----------+----------+-------------+------+
| 1|2017-12-13| 2| 1|
| 1|2017-12-16| 1| 1|
| 1|2017-12-17| 1| 2|
| 1|2017-12-18| 2| 3|
| 1|2017-12-20| 1| 1|
| 1|2017-12-21| 3| 2|
| 2|2017-12-15| 2| 1|
| 2|2017-12-16| 1| 2|
| 2|2017-12-19| 1| 1|
| 2|2017-12-20| 1| 2|
| 2|2017-12-21| 2| 3|
| 2|2017-12-23| 1| 1|
+----------+----------+-------------+------+
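To get the single number the original question asks for (the streak counted backwards from the current date), one extra step is needed. A minimal sketch, assuming dfStreak from above and treating the last date in the sample data as the "current" date (both assumptions are mine, not part of the original answer):
// For each downloadId, keep only its row on the current date;
// its streak value is then the number of consecutive days ending on that date.
val currentDate = java.sql.Date.valueOf("2017-12-23")  // hypothetical current date

val streakAsOfToday = dfStreak
  .filter($"date" === currentDate)
  .select($"downloadId", $"streak")
Ids with no row on the current date simply drop out, which corresponds to a streak of 0.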

Select Records Using Relative Dates in Hive / Scala

Is there a good way to solve this problem: Suppose I want to select records that are at least 6 months prior to the previously selected record for a given grouping.
I.e. I have:
Col A  Col B  Date
1      A      2015-01-01 00:00:00
1      A      2014-10-01 00:00:00
1      A      2014-05-01 00:00:00
1      A      2014-01-01 00:00:00
1      B      2014-01-01 00:00:00
2      A      2015-01-01 00:00:00
2      A      2014-10-01 00:00:00
2      A      2014-01-01 00:00:00
2      A      2013-10-01 00:00:00
I'd like to select only dates that are at least 6 months apart relative to the previously selected one, i.e. it will return:
Col A  Col B  Date
1      A      2015-01-01 00:00:00
1      A      2014-05-01 00:00:00
1      B      2014-01-01 00:00:00
2      A      2015-01-01 00:00:00
2      A      2014-01-01 00:00:00
It is obvious to me how to do this using orderings if you wanted to select relative to the latest one, i.e. something like:
SELECT b.date, b..., a.latest_date
FROM (
  SELECT *, row_number() OVER (PARTITION BY Col A, Col B ORDER BY Date) AS row_number
  FROM table1
) a
INNER JOIN table1 b
  ON KEY
WHERE a.row_number = 1
  AND datediff(b.date, a.latest_date) / 365 > 0.5
or so, but I'm a little unclear on how you would do this relative to each other. Is there a way to do this recursively in Hive / Scala or something?
There is a windowing concept of lag and lead in both Hive and Spark; you can achieve this task with either. Here is the code in Spark:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val data = sc.parallelize(List(
  ("1", "A", "2015-01-01 00:00:00"),
  ("1", "A", "2014-10-01 00:00:00"),
  ("1", "A", "2014-01-01 00:00:00"),
  ("1", "B", "2014-01-01 00:00:00"),
  ("2", "A", "2015-01-01 00:00:00"),
  ("2", "A", "2014-10-01 00:00:00"),
  ("2", "A", "2014-01-01 00:00:00"),
  ("2", "A", "2013-10-01 00:00:00")
)).toDF("id", "status", "Date")

val data2 = data.select($"id", $"status", to_date($"Date").alias("date"))

val wSpec3 = Window.partitionBy("id", "status").orderBy(desc("date"))

val data3 = data2
  .withColumn("diff", datediff(lag(data2("date"), 1).over(wSpec3), $"date"))
  .filter($"diff" > 182.5 || $"diff".isNull)
data3.show
+---+------+----------+----+
| id|status| date|diff|
+---+------+----------+----+
| 1| A|2015-01-01|null|
| 1| A|2014-01-01| 273|
| 1| B|2014-01-01|null|
| 2| A|2015-01-01|null|
| 2| A|2014-01-01| 273|
+---+------+----------+----+
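If you prefer to express the six-month gap directly rather than as a day count, a hedged variant of the same lag approach using months_between (my addition, not part of the original answer) would be:
// Same window and lag, but comparing whole months instead of days
val data4 = data2
  .withColumn("monthsGap", months_between(lag(data2("date"), 1).over(wSpec3), $"date"))
  .filter($"monthsGap" >= 6 || $"monthsGap".isNull)

data4.show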

Scala/ Spark- Multiply an Integer with each value in a Dataframe Column

I have a sample dataframe
df_that_I_have
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 50 | 1 |
+---------+---------+-------+
| Japan | 20 | 3 |
+---------+---------+-------+
| India | 20 | 1 |
+---------+---------+-------+
| Japan | 10 | 3 |
+---------+---------+-------+
and I want a dataframe that looks like this
df_that_I_want
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 70 | 10 | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan | 30 | 30 | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+
The second dataframe has the sum of members, and the sum of some multiplied by 5, for each country.
This is what I'm doing to achieve this
val df_that_I_want = df_that_I_have
.select(df_that_I_have("country"),
df_that_I_have.groupBy("country").sum("members"),
5 * df_that_I_have.groupBy("country").sum("some")) //Problem here
But the compiler does not allow me to do this because apparently I can't multiply 5 with a column.
How can I multiply an Integer value with the sum of some for each country?
You can try the lit function.
scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]
scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]
scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]
scala> df_that_I_want.show
+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
| India| 70| 10|
| Japan| 30| 30|
+-------+-------+----+
Please try this:
df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
The lit function is used for creating a column with a literal value, which is 5 here.
Since you are not able to multiply by 5 directly, it creates a column containing 5 and multiplies with that column.
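For completeness, here is a minimal sketch of the same approach with the aggregates aliased up front, so the auto-generated column names never need to be referenced; the aliases are my choice, not from the original answers:
import org.apache.spark.sql.functions._
import spark.implicits._

val df_that_I_want = df_that_I_have
  .groupBy($"country")
  .agg(
    sum($"members").as("members"),        // total members per country
    (sum($"some") * lit(5)).as("some")    // 5 * sum of "some" per country
  )

df_that_I_want.show()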