Multiple counts in the same SQL request in Spark Scala

I am a new Spark Scala developer and I have a simple table like this:
City Dept employee_id employee_salary
NY   FI   10          10000
NY   FI   null        20000
WDC  IT   30          100000
LA   IT   40          500
What I want to do is to calculate:
the number of employees by city & department (a simple count(*))
the number of employees with a non-null id
the number of employees with a small (< 500), medium (< 1000), or high (> 1000) salary => 3 additional counts!
So, as an output, I should have something like this:
City Dept total_emp total_emp_non_null_id total_emp_small total_emp_medium total_emp_high
NY   FI   100       90                    10              70               10
WDC  IT   200       100
LA   IT   10        10
The challenge here is that I want to optimize those counts, because as you can see there are many counts to do. Without any optimization, the brute-force approach for me is to do one count per request, generate a new result DF after each count, and at the end do a left join on the fixed columns (city & dept) to add the new columns. But that would be too heavy, since my table contains a lot of data.
I think that window functions could simplify this, but I am not sure.
Can you help me with this, at least for just 2 cases (id != null and salary < 500)?
Thank you in advance

scala> df.show
+----+----+-----------+---------------+
|City|Dept|employee_id|employee_salary|
+----+----+-----------+---------------+
| NY| FI| 10| 10000|
| NY| FI| null| 20000|
| WDC| IT| 30| 100000|
| LA| IT| 40| 500|
| LA| IT| 40| 600|
| LA| IT| null| 200|
+----+----+-----------+---------------+
scala> import org.apache.spark.sql.functions._

scala> // flag non-null ids with 1/0 and label each salary bucket
scala> val df1 = df.withColumn("NonNullEmpID", when(col("employee_id").isNotNull, lit(1)).otherwise(lit(0)))
         .withColumn("Salary_Cat",
           when(col("employee_salary") < 500, lit("SmallSalary")).
           when(col("employee_salary") < 1000, lit("MediumSalary")).
           when(col("employee_salary") >= 1000, lit("HighSalary")))
scala> // collect the global counts per salary category to the driver as a Map(category -> count)
scala> val SalaryMap = df1.groupBy("Salary_Cat")
         .agg(count(lit(1)).alias("count"))
         .collect()
         .map(row => row.getString(0) -> row.getLong(1).toString)
         .toMap
//the number of employees by city & department (a simple count(*))
//the number of employees with a non null id
//the number of employees with a small (< 500) or medium (< 1000) or high salary (> 1000) => 3 additional counts !
scala> df1.groupBy("City", "Dept")
.agg(count(lit(1)).alias("#Emp_per_City_Dept"), sum(col("NonNullEmpID")).alias("NonNullEmpID"))
.withColumn("total_emp_small", lit(SalaryMap("SmallSalary")))
.withColumn("total_emp_medium", lit(SalaryMap("MediumSalary")))
.withColumn("total_emp_high", lit(SalaryMap("HighSalary")))
.show()
+----+----+------------------+------------+---------------+----------------+--------------+
|City|Dept|#Emp_per_City_Dept|NonNullEmpID|total_emp_small|total_emp_medium|total_emp_high|
+----+----+------------------+------------+---------------+----------------+--------------+
| NY| FI| 2| 1| 1| 2| 3|
| LA| IT| 3| 2| 1| 2| 3|
| WDC| IT| 1| 1| 1| 2| 3|
+----+----+------------------+------------+---------------+----------------+--------------+
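Note that SalaryMap is computed over the whole table, so the three salary counts above are the same global totals repeated on every City/Dept row. If the salary buckets should also be counted per City and Dept, a single-pass sketch using conditional aggregation (one groupBy, no extra joins; it assumes the same df and the same thresholds as above, so adjust the boundaries if needed) could look like this:

scala> df.groupBy("City", "Dept")
         .agg(
           count(lit(1)).alias("total_emp"),                          // count(*) per city & dept
           count(col("employee_id")).alias("total_emp_non_null_id"),  // count(col) skips nulls
           sum(when(col("employee_salary") < 500, 1).otherwise(0)).alias("total_emp_small"),
           sum(when(col("employee_salary") >= 500 && col("employee_salary") < 1000, 1).otherwise(0)).alias("total_emp_medium"),
           sum(when(col("employee_salary") >= 1000, 1).otherwise(0)).alias("total_emp_high"))
         .show()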

Related

How to filter IDs which meet two conditions over another column in pyspark?

I have a table looking like this:
id        country count count_1
A36992434 MX      1     2
A36992434 ES      1     2
A00749707 ES      1     2
A00749707 MX      1     2
A10352704 PE      1     2
A10352704 ES      1     2
I would like to keep the IDs whose column country takes the values ES and MX. So, in this case I would like to get an output showing the following:
id        country count count_1
A36992434 MX      1     2
A36992434 ES      1     2
A00749707 ES      1     2
A00749707 MX      1     2
Thank you very much!
You can create a countryAgg dataframe that contains flags for both MX and ES by aggregating at the id level and then using arrays_overlap to check against each of the two countries.
Then use filter to keep only the ids that contain both ES and MX, as below.
Data Preparation
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

# `sql` below is the SparkSession used to create the Spark dataframe
sql = SparkSession.builder.getOrCreate()

s = StringIO("""
id country count count_1
A36992434 MX 1 2
A36992434 ES 1 2
A00749707 ES 1 2
A00749707 MX 1 2
A10352704 PE 1 2
A10352704 ES 1 2
""")
df = pd.read_csv(s,delimiter='\t')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
|A10352704| PE| 1| 2|
|A10352704| ES| 1| 2|
+---------+-------+-----+-------+
Array Overlap Marking
countryAgg = sparkDF.groupBy(F.col('id')).agg(F.collect_set(F.col('country')).alias('country_set'))
countryAgg = countryAgg.withColumn('country_check_mx', F.array(F.lit('MX')))\
    .withColumn('country_check_es', F.array(F.lit('ES')))\
    .withColumn('overlap_flag_mx',
                F.arrays_overlap(F.col('country_set'), F.col('country_check_mx')))\
    .withColumn('overlap_flag_es',
                F.arrays_overlap(F.col('country_set'), F.col('country_check_es')))
countryAgg.show()
+---------+-----------+----------------+----------------+---------------+---------------+
| id|country_set|country_check_mx|country_check_es|overlap_flag_mx|overlap_flag_es|
+---------+-----------+----------------+----------------+---------------+---------------+
|A36992434| [MX, ES]| [MX]| [ES]| true| true|
|A00749707| [ES, MX]| [MX]| [ES]| true| true|
|A10352704| [ES, PE]| [MX]| [ES]| false| true|
+---------+-----------+----------------+----------------+---------------+---------------+
Joining
countryAgg = countryAgg.filter(F.col('overlap_flag_mx') & F.col('overlap_flag_es'))

sparkDF.join(countryAgg,
             sparkDF['id'] == countryAgg['id'],
             'inner')\
       .select(sparkDF['*'])\
       .show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
+---------+-------+-----+-------+

How to compute cumulative sum on multiple float columns?

I have a Dataframe with 100 float columns whose rows are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to compute the cumulative sum of C1 to C100, partitioned by ID and ordered by Date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
Window.partitionBy("ID")
.orderBy(col("date").asc)))
I found a similar question here, but there it was done manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = spark.range(1000)
.withColumn("c1", 'id + 3)
.withColumn("c2", 'id % 2 + 1)
.withColumn("date", monotonically_increasing_id)
.withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of the grouping, the count of non-null Number values, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.
As far as I know, there's no simple way to compute the mode: you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straightforward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()
// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base")
  .join(
    numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
      $"base.Color" === $"max.Color" and
      $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)
// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
count("Color") as "Size", // counting a key column -> includes nulls
count("Number") as "CountNumber", // does not include nulls
mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")
result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.
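A minimal sketch of that repetition for the Letter column, assuming the same df and the result dataframe from above (the names letterCounts, modeLetters and fullResult are just illustrative, and the code simply mirrors the Number logic):
// count occurrences of each letter in its category and color
val letterCounts = df.groupBy("Category", "Color", "Letter").count().cache()
// compute modes for Letter - joining counts with the maximum count per category and color:
val modeLetters = letterCounts.as("base")
  .join(
    letterCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
      $"base.Color" === $"max.Color" and
      $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Letter", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Letter", ignoreNulls = true) as "ModeLetter", first("_max") as "ModeCountLetter")
  .where($"ModeLetter".isNotNull)
// add the Letter statistics to the previous result with one more aggregation and join:
val fullResult = result
  .join(df.groupBy("Category", "Color").agg(count("Letter") as "CountLetter"), Seq("Category", "Color"), "left")
  .join(modeLetters, Seq("Category", "Color"), "left")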

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
In Reasons I have only 9 records and in the IDs dataframe I have 40k records; I want to assign a reason randomly to each and every id.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you just have a few reasons (as the OP asks), you can probably broadcast them or embed them in a udf and easily solve this problem (a sketch of that shortcut follows after the example output at the end of this answer).
The general idea is to create a sequential index for the reasons, add random values from 0 to N (where N is the number of reasons) on the IDs dataset, and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
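For the few-reasons shortcut mentioned at the top of this answer, here is a minimal udf sketch (treat it as an illustration only: ids and reasons are hypothetical DataFrames with the OP's columns, and this is not the indexed-join method shown above):
import org.apache.spark.sql.functions.udf
import scala.util.Random

// collect the handful of reasons to the driver and pick one at random for every row
val reasonList = reasons.select("reasons").collect().map(_.getString(0))
val randomReason = udf(() => reasonList(Random.nextInt(reasonList.length))).asNondeterministic()
val idsWithReason = ids.withColumn("reason", randomReason())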
pyspark random join itself
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
dfB = dataB.withColumn(
"index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over sorted window with high degree of parallelism. F.rand() is another highly parallel function, so adding index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: it can be very expensive to convert the dataframe to an RDD, call zipWithIndex, and then convert back to a dataframe.
monotonically_increasing_id: it needs to be combined with row_number, which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

Spark Dataframe sliding window over pair of rows

I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
+---------+-------+------+
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired end result would be:
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using the window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For a true sliding window (like a sliding average) you can use the rowsBetween or rangeBetween clauses of the window definition, but that is not really required here. Nevertheless, example usage could look like this:
val w2 = Window.partitionBy("userId")
.orderBy(asc("timestamp"))
.rowsBetween(-1, 0)
avg($"foo").over(w2)