Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates - scala

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of the grouping, the count of non-null Number values, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean, since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.

As far as I know, there's no simple way to compute the mode: you have to count the occurrences of each value and then join that result with the maximum (per key) of that result. The rest of the computations are rather straightforward:
// imports for the aggregate functions and the $-column syntax used below (spark is the active SparkSession)
import org.apache.spark.sql.functions._
import spark.implicits._

// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base")
  .join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
    $"base.Color" === $"max.Color" and
    $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
    count("Color") as "Size",         // counting a key column -> gives the group size, including rows with null Number
    count("Number") as "CountNumber", // does not include nulls
    mean("Number") as "MeanNumber"
  ).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine, this might be slow, as it has four groupBys and two joins, all requiring shuffles...
As for the Letter column statistics, I'm afraid you'll have to repeat this for that column separately and add another join; a rough sketch of that follows.
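A minimal sketch of that repetition for Letter, reusing the df and result values from above (same pattern as for Number, so treat it as a sketch rather than tested code):

// count occurrences of each Letter in its category and color
val letterCounts = df.groupBy("Category", "Color", "Letter").count().cache()

// mode of Letter per (Category, Color), analogous to modeNumbers above
val modeLetters = letterCounts.as("base")
  .join(letterCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
    $"base.Color" === $"max.Color" and
    $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Letter", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Letter", ignoreNulls = true) as "ModeLetter", first("_max") as "ModeCountLetter")
  .where($"ModeLetter".isNotNull)

// add the non-null Letter count and the Letter mode columns to the previous result
val fullResult = result
  .join(df.groupBy("Category", "Color").agg(count("Letter") as "CountLetter"),
    Seq("Category", "Color"), "left")
  .join(modeLetters, Seq("Category", "Color"), "left")

fullResult.show()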

Related

How to filter IDs which meet two conditions over another column in pyspark?

I have a table looking like this:
id         country  count  count_1
A36992434  MX       1      2
A36992434  ES       1      2
A00749707  ES       1      2
A00749707  MX       1      2
A10352704  PE       1      2
A10352704  ES       1      2
I would like to keep the IDs whose country column takes both the values ES and MX. So, in this case I would like to get an output showing the following:
id         country  count  count_1
A36992434  MX       1      2
A36992434  ES       1      2
A00749707  ES       1      2
A00749707  MX       1      2
Thank you very much!
You can create a countryAgg dataframe that contains flags for both MX and ES by aggregating at the id level and marking each id with arrays_overlap checks against both countries.
Then use filter to keep only the ids that contain both ES and MX, as below -
Data Preparation
from io import StringIO
import pandas as pd
import pyspark.sql.functions as F

# `sql` below is the active SparkSession (or SQLContext)
s = StringIO("""
id country count count_1
A36992434 MX 1 2
A36992434 ES 1 2
A00749707 ES 1 2
A00749707 MX 1 2
A10352704 PE 1 2
A10352704 ES 1 2
""")

df = pd.read_csv(s, sep=r'\s+')  # whitespace-separated sample data
sparkDF = sql.createDataFrame(df)
sparkDF.show()
sparkDF.show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
|A10352704| PE| 1| 2|
|A10352704| ES| 1| 2|
+---------+-------+-----+-------+
Array Overlap Marking
countryAgg = sparkDF.groupBy(F.col('id')).agg(F.collect_set(F.col('country')).alias('country_set'))

countryAgg = countryAgg.withColumn('country_check_mx', F.array(F.lit('MX')))\
                       .withColumn('country_check_es', F.array(F.lit('ES')))\
                       .withColumn('overlap_flag_mx',
                                   F.arrays_overlap(F.col('country_set'), F.col('country_check_mx')))\
                       .withColumn('overlap_flag_es',
                                   F.arrays_overlap(F.col('country_set'), F.col('country_check_es')))
countryAgg.show()
+---------+-----------+----------------+----------------+---------------+---------------+
| id|country_set|country_check_mx|country_check_es|overlap_flag_mx|overlap_flag_es|
+---------+-----------+----------------+----------------+---------------+---------------+
|A36992434| [MX, ES]| [MX]| [ES]| true| true|
|A00749707| [ES, MX]| [MX]| [ES]| true| true|
|A10352704| [ES, PE]| [MX]| [ES]| false| true|
+---------+-----------+----------------+----------------+---------------+---------------+
Joining
countryAgg = countryAgg.filter(F.col('overlap_flag_mx') & F.col('overlap_flag_es'))

sparkDF.join(countryAgg,
             sparkDF['id'] == countryAgg['id'],
             'inner')\
       .select(sparkDF['*'])\
       .show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
+---------+-------+-----+-------+
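For comparison, here is a minimal Scala sketch of the same idea (the answer above is PySpark; the names events and spark below are illustrative assumptions). It collects the set of countries per id and keeps the ids whose set contains both MX and ES, using array_contains instead of arrays_overlap:

import org.apache.spark.sql.functions._
import spark.implicits._

val events = Seq(
  ("A36992434", "MX", 1, 2), ("A36992434", "ES", 1, 2),
  ("A00749707", "ES", 1, 2), ("A00749707", "MX", 1, 2),
  ("A10352704", "PE", 1, 2), ("A10352704", "ES", 1, 2)
).toDF("id", "country", "count", "count_1")

// ids whose collected country set contains both MX and ES
val bothCountries = events
  .groupBy("id")
  .agg(collect_set("country") as "country_set")
  .where(array_contains($"country_set", "MX") && array_contains($"country_set", "ES"))
  .select("id")

// keep only the original rows for those ids
events.join(bothCountries, Seq("id"), "inner").show()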

Accumulator gives a different result than applying the function directly

While trying to combine two result sets, I ran into different behavior when joining two keyed tables:
q)show t:([a:1 1 2]b:011b)
a| b
-| -
1| 0
1| 1
2| 1
q)t,t
a| b
-| -
1| 1
1| 1
2| 1
q)(,/)(t;t)
a| b
-| -
1| 1
2| 1
Why does the accumulator ,/ remove the duplicated keys, and why does its result differ from a direct table join with ,?
I suspect that join over (aka ,/ aka raze) has special handling under the covers that isn't exposed to the end user.
The interpreter recognises the ,/ and behaves a certain way depending on the inputs. This likely applies to dictionaries and keyed tables:
q)raze(`a`a`b!1 2 3;`a`b!9 9)
a| 9
b| 9
q)
q)(`a`a`b!1 2 3),`a`b!9 9
a| 9
a| 2
b| 9
q)
q)({x,y}/)(`a`a`b!1 2 3;`a`b!9 9)
a| 9
a| 2
b| 9

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to compute the cumulative sum of C1 to C100, grouped by ID and ordered by date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but there it was done manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'symbol column syntax

val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
// with an orderBy and no explicit frame, the window defaults to unbounded preceding .. current row, i.e. a running sum
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp, column) => tmp.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
// this assumes the original DataFrame df with the ID, Date and C1..C100 columns
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr: _*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

Calculating the rolling sums in pyspark

I have a dataframe that contains information on daily sales and daily clicks. Before I run my analysis, I want to aggregate the data. To make this clearer, I will explain it with an example dataframe:
item_id date Price Sale Click Discount_code
2 01.03.2019 10 1 10 NULL
2 01.03.2019 8 1 10 Yes
2 02.03.2019 10 0 4 NULL
2 03.03.2019 10 0 6 NULL
2 04.03.2019 6 0 15 NULL
2 05.03.2019 6 0 14 NULL
2 06.03.2019 5 0 7 NULL
2 07.03.2019 5 1 11 NULL
2 07.03.2019 5 1 11 NULL
2 08.03.2019 5 0 9 NULL
If there are two sales for the given day, I have two observations for that particular day. I want to convert my dataframe to the following one by collapsing observations by item_id and price:
item_id Price CSale Discount_code Cclicks firstdate lastdate
2 10 1 No 20 01.03.2019 03.03.2019
2 8 1 Yes 10 01.03.2019 01.03.2019
2 6 0 NULL 29 04.03.2019 05.03.2019
2 5 2 NULL 38 06.03.2019 08.03.2019
Here CSale corresponds to the cumulative sales for the given price and given item_id, Cclicks corresponds to the cumulative clicks for the given price and given item_id, firstdate is the first date on which the given item was available for the given price, and lastdate is the last date on which the given item was available for the given price.
According to the problem, OP wants to aggregate the DataFrame on the basis of item_id and Price.
# Creating the DataFrames
from pyspark.sql.functions import col, to_date, sum, min, max, first
df = sqlContext.createDataFrame([(2,'01.03.2019',10,1,10,None),(2,'01.03.2019',8,1,10,'Yes'),
                                 (2,'02.03.2019',10,0,4,None),(2,'03.03.2019',10,0,6,None),
                                 (2,'04.03.2019',6,0,15,None),(2,'05.03.2019',6,0,14,None),
                                 (2,'06.03.2019',5,0,7,None),(2,'07.03.2019',5,1,11,None),
                                 (2,'07.03.2019',5,1,11,None),(2,'08.03.2019',5,0,9,None)],
                                ('item_id','date','Price','Sale','Click','Discount_code'))
# Converting string column date to proper date
df = df.withColumn('date',to_date(col('date'),'dd.MM.yyyy'))
df.show()
+-------+----------+-----+----+-----+-------------+
|item_id| date|Price|Sale|Click|Discount_code|
+-------+----------+-----+----+-----+-------------+
| 2|2019-03-01| 10| 1| 10| null|
| 2|2019-03-01| 8| 1| 10| Yes|
| 2|2019-03-02| 10| 0| 4| null|
| 2|2019-03-03| 10| 0| 6| null|
| 2|2019-03-04| 6| 0| 15| null|
| 2|2019-03-05| 6| 0| 14| null|
| 2|2019-03-06| 5| 0| 7| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-08| 5| 0| 9| null|
+-------+----------+-----+----+-----+-------------+
As can be seen in the printSchema output below, the dataframe's date column is now a proper date type.
df.printSchema()
root
|-- item_id: long (nullable = true)
|-- date: date (nullable = true)
|-- Price: long (nullable = true)
|-- Sale: long (nullable = true)
|-- Click: long (nullable = true)
|-- Discount_code: string (nullable = true)
Finally, aggregate the columns with agg(). One caveat: since Discount_code is a string column and we need to aggregate it as well, we take the first non-null value while grouping.
df = df.groupBy('item_id','Price').agg(sum('Sale').alias('CSale'),
                                       first('Discount_code', ignorenulls=True).alias('Discount_code'),
                                       sum('Click').alias('Cclicks'),
                                       min('date').alias('firstdate'),
                                       max('date').alias('lastdate'))
df.show()
+-------+-----+-----+-------------+-------+----------+----------+
|item_id|Price|CSale|Discount_code|Cclicks| firstdate| lastdate|
+-------+-----+-----+-------------+-------+----------+----------+
| 2| 6| 0| null| 29|2019-03-04|2019-03-05|
| 2| 5| 2| null| 38|2019-03-06|2019-03-08|
| 2| 8| 1| Yes| 10|2019-03-01|2019-03-01|
| 2| 10| 1| null| 20|2019-03-01|2019-03-03|
+-------+-----+-----+-------------+-------+----------+----------+

Spark Dataframe sliding window over pair of rows

I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
val eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0), (20160102, 3, 1), (20160201, 4, 1), (20160202, 2, 0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
+---------+-------+------+
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired end result would be:
+---------+-------+------+-----------+
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using the window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For a true sliding window (like a sliding average) you can use the rowsBetween or rangeBetween clauses of the window definition, but that is not really required here. Nevertheless, example usage could look like this:
val w2 = Window.partitionBy("userId")
  .orderBy(asc("timestamp"))
  .rowsBetween(-1, 0)
avg($"foo").over(w2)
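Putting it together, a self-contained sketch of the lead approach (assuming a spark-shell style session where a SparkSession named spark and its implicits are available):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, lead}
import spark.implicits._

val eventlog = Seq(
  (20160101, 1, 0), (20160102, 3, 1), (20160201, 4, 1), (20160202, 2, 0)
).toDF("timestamp", "eventId", "userId")

val w = Window.partitionBy("userId").orderBy(asc("timestamp"))

// lead looks one row ahead within each userId partition; the last row per user gets null
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w)).show()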