Spark [Scala]: Checking if all the Rows of a smaller DataFrame exist in the bigger DataFrame

I have two DataFrames with the same schema (with 100+ columns):
Small size: 1000 rows
Bigger size: 90000 rows
How can I check that every Row in 1 exists in 2? What is the "Spark way" of doing this? Should I use map and then deal with it at the Row level, or should I use join and then some sort of comparison with the small DataFrame?

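You can use except, which returns the rows of the first dataset that are not present in the second (duplicates are removed, as with EXCEPT DISTINCT in SQL):
smaller.except(bigger).isEmpty // isEmpty takes no parentheses in Scala; available on Dataset since Spark 2.4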

You can inner join the DataFrames and compare counts to check whether there is a difference.
import org.apache.spark.sql.DataFrame
def isIncluded(smallDf: DataFrame, biggerDf: DataFrame): Boolean = {
  val keys = smallDf.columns.toSeq
  val joinedDf = smallDf.join(biggerDf, keys) // You might want to broadcast smallDf for performance
  joinedDf.count == smallDf.count // compare the two row counts
}
However, I think the except method is clearer. I am not sure about the performance (it might just be a join underneath).
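One difference between the two checks above is worth seeing on a small example: except works on distinct rows, while the count comparison is sensitive to duplicate rows in the bigger DataFrame. The sketch below models rows as tuples in plain Scala collections (no Spark involved); the data and the names `bigger`, `smaller`, `includedBySetDiff` and `includedByCount` are made up for illustration.

```scala
// Rows modelled as tuples; plain Scala collections stand in for DataFrames.
val bigger  = Seq(("A", 1), ("B", 2), ("B", 2), ("C", 3)) // note the duplicated ("B", 2)
val smaller = Seq(("A", 1), ("B", 2))

// Set-difference check, in the spirit of except: duplicates are ignored.
val missing = smaller.toSet.diff(bigger.toSet)
val includedBySetDiff = missing.isEmpty

// Count-based check, in the spirit of the inner join on all columns:
// every matching pair produces one joined row.
val joinedCount = (for (s <- smaller; b <- bigger if s == b) yield s).size
val includedByCount = joinedCount == smaller.size
```

Here the set-difference check reports that every small row is present, while the count comparison does not, because the duplicated row is matched twice. If the bigger DataFrame can contain duplicates, deduplicating it before the join is one way to keep the count comparison meaningful.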

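I would probably do it with a join.
A left outer join keeps every row of the small data frame; rows that are missing from the large data frame come back with nulls in the large frame's columns. Filter on those nulls (or use a "left_anti" join, which returns only the unmatched rows) and check whether the result is empty.
Code: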
val seq1 = Seq(
("A", "abc", 0.1, 0.0, 0),
("B", "def", 0.15, 0.5, 0),
("C", "ghi", 0.2, 0.2, 1),
("D", "jkl", 1.1, 0.1, 0),
("E", "mno", 0.1, 0.1, 0)
)
val seq2 = Seq(
("A", "abc", "a", "b", "?"),
("C", "ghi", "a", "c", "?")
)
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")
df2.join(df1, df1("cA") === df2("cA"), "leftOuter").show
Output:
+---+---+---+---+---+---+---+---+---+---+
| cA| cB| cH| cI| cJ| cA| cB| cC| cD| cE|
+---+---+---+---+---+---+---+---+---+---+
| C|ghi| a| c| ?| C|ghi|0.2|0.2| 1|
| A|abc| a| b| ?| A|abc|0.1|0.0| 0|
+---+---+---+---+---+---+---+---+---+---+

Related

How to find change occurrence points in a Spark dataframe

I have a simple dataframe as an example:
val someDF = Seq(
(1, "A"),
(2, "A"),
(3, "A"),
(4, "B"),
(5, "B"),
(6, "A"),
(7, "A"),
(8, "A")
).toDF("t", "state")
// this part is half pseudocode
someDF.aggregate((acc, cur) => {
if (acc.last.state != cur.state) {
acc.add(cur)
}
}, List()).show(truncate=false)
"t" column represents points in time and "state" column represents the state at that point in time.
What I wish to find is the first time where each change occurs plus the first row, as in:
(1, "A")
(4, "B")
(6, "A")
I looked at SQL solutions too, but they involve complex self-joins and window functions which I don't completely understand. An SQL solution is OK too, though.
There are numerous functions in Spark (fold, aggregate, reduce, ...) that I feel could do this, but I couldn't grasp the differences since I'm new to Spark concepts like partitioning, and it's a bit tricky if the partitioning could affect the results.
You can use the window function lag for comparing with the previous row, and row_number for checking whether it's the first row:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val result = someDF.withColumn(
"change",
lag("state", 1).over(Window.orderBy("t")) =!= col("state") ||
row_number().over(Window.orderBy("t")) === 1
).filter("change").drop("change")
result.show
+---+-----+
| t|state|
+---+-----+
| 1| A|
| 4| B|
| 6| A|
+---+-----+
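To make the lag comparison above concrete, here is the same idea on a plain Scala collection: keep the first row, then keep every row whose state differs from the row before it. This is only a local sketch that assumes the rows are already sorted by t and fit in memory; the names `rows` and `changePoints` are illustrative.

```scala
// (t, state) pairs, already ordered by t.
val rows = Seq((1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "A"), (7, "A"), (8, "A"))

// sliding(2) pairs each row with its predecessor, which is what lag(state, 1) provides.
val changePoints: Seq[(Int, String)] =
  rows.headOption.toSeq ++                       // the first row is always kept
    rows.sliding(2).collect {
      case Seq(prev, cur) if prev._2 != cur._2 => cur // state changed relative to the previous row
    }.toSeq
```

The first row needs separate handling here for the same reason the Spark version adds the row_number check: it has no predecessor to compare against.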
For an SQL solution:
someDF.createOrReplaceTempView("mytable")
val result = spark.sql("""
select t, state
from (
select
t, state,
lag(state) over (order by t) != state or row_number() over (order by t) = 1 as change
from mytable
)
where change
""")

Calculate value based on value from same column of the previous row in spark

I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row.
I am unable to figure it out using withColumn API.
I need to calculate a new column, using the formula:
MovingRate = MonthlyRate + (0.7 * MovingRatePrevious)
... where the MovingRatePrevious is the MovingRate of the prior row.
For month 1, I have the value so I do not need to re-calculate that but I need that value to be able to calculate the subsequent rows. I need to partition by Type.
This is my original dataset:
Desired results in MovingRate column:
Although it's possible to do this with Window functions (see @Leo C's answer), I bet it's more performant to aggregate once per Type using a groupBy. Then, explode the results of the UDF to get all rows back:
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None)
)
.toDF("Month", "Type", "MonthlyRate", "MovingRate")
import org.apache.spark.sql.Row
// this udf produces a Seq of Tuple3 (Month, MonthlyRate, MovingRate)
val calcMovingRate = udf((startRate: Double, rates: Seq[Row]) => {
  val sorted = rates.sortBy(_.getInt(0)) // collect_list does not guarantee ordering by Month
  sorted.tail.scanLeft((sorted.head.getInt(0), sorted.head.getDouble(1), startRate))(
    // MovingRate = MonthlyRate + 0.7 * MovingRatePrevious
    (acc, curr) => (curr.getInt(0), curr.getDouble(1), curr.getDouble(1) + 0.7 * acc._3))
})
df
.groupBy($"Type")
.agg(
first($"MovingRate",ignoreNulls=true).as("startRate"),
collect_list(struct($"Month",$"MonthlyRate")).as("rates")
)
.select($"Type",explode(calcMovingRate($"startRate",$"rates")).as("movingRates"))
.select($"Type",$"movingRates._1".as("Month"),$"movingRates._2".as("MonthlyRate"),$"movingRates._3".as("MovingRate"))
.show()
gives:
+----+-----+-----------+------------------+
|Type|Month|MonthlyRate| MovingRate|
+----+-----+-----------+------------------+
|blue| 1| 0.4| 0.33|
|blue| 2| 0.3|0.5309999999999999|
|blue| 3| 0.7|1.0716999999999999|
|blue| 4| 0.9|1.6501899999999998|
+----+-----+-----------+------------------+
Given the nature of the requirement that each moving rate is recursively computed from the previous rate, the column-oriented DataFrame API won't shine especially if the dataset is huge.
That said, if the dataset isn't large, one approach would be to make Spark recalculate the moving rates row-wise via a UDF, with a Window-partitioned rate list as its input:
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None),
(1, "red", 0.5, Some(0.2)),
(2, "red", 0.6, None),
(3, "red", 0.8, None)
).toDF("Month", "Type", "MonthlyRate", "MovingRate")
val win = Window.partitionBy("Type").orderBy("Month").
rowsBetween(Window.unboundedPreceding, 0)
def movingRate(factor: Double) = udf( (initRate: Double, monthlyRates: Seq[Double]) =>
monthlyRates.tail.foldLeft(initRate)( _ * factor + _ )
)
df.
withColumn("MovingRate", when($"Month" === 1, $"MovingRate").otherwise(
movingRate(0.7)(last($"MovingRate", ignoreNulls=true).over(win), collect_list($"MonthlyRate").over(win))
)).
show
// +-----+----+-----------+------------------+
// |Month|Type|MonthlyRate| MovingRate|
// +-----+----+-----------+------------------+
// | 1| red| 0.5| 0.2|
// | 2| red| 0.6| 0.74|
// | 3| red| 0.8| 1.318|
// | 1|blue| 0.4| 0.33|
// | 2|blue| 0.3|0.5309999999999999|
// | 3|blue| 0.7|1.0716999999999999|
// | 4|blue| 0.9|1.6501899999999998|
// +-----+----+-----------+------------------+
What you are trying to do is compute a recursive formula that looks like:
x[i] = y[i] + 0.7 * x[i-1]
where x[i] is your MovingRate at row i and y[i] your MonthlyRate at row i.
The problem is that this is a purely sequential formula. Each row needs the result of the previous one which in turn needs the result of the one before. Spark is a parallel computation engine and it is going to be hard to use it to speed up a calculation that cannot really be parallelized.
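The sequential nature of the recurrence is easy to see outside Spark. The sketch below evaluates x[i] = y[i] + 0.7 * x[i-1] with scanLeft on a plain Scala sequence, using the "blue" rates from the examples above; the names `monthlyRates`, `initialMovingRate`, `factor` and `movingRates` are illustrative only.

```scala
// MonthlyRate values for months 1..4, and the known MovingRate for month 1.
val monthlyRates = Seq(0.4, 0.3, 0.7, 0.9)
val initialMovingRate = 0.33
val factor = 0.7

// scanLeft threads the previous result through: each step needs the one before it.
val movingRates: Seq[Double] =
  monthlyRates.tail.scanLeft(initialMovingRate) { (previous, monthly) =>
    monthly + factor * previous
  }
```

Each element depends on the one before it, so within one Type the work is inherently a single pass. What can still run in parallel is the set of independent Types, which is what partitioning or grouping by Type achieves in the answers above.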

Problem in converting MS-SQL Query to spark SQL

I want to convert this basic SQL query to Spark:
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using windowing functions in spark like this
val windowSpec = Window.rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df1.select(
$"Arrest"
).groupBy($"Arrest").agg(sum(count("*")) over windowSpec,count("*")).show()
+------+------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+------------------------------------------------------------------------------+--------+
| true| 665517| 184964|
| false| 665517| 480553|
+------+------------------------------------------------------------------------------+--------+
But when I try dividing count(*) by it, it throws an error:
df1.select(
$"Arrest"
).groupBy($"Arrest").agg(count("*")/sum(count("*")) over
windowSpec,count("*")).show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My question: in the first query I am already using count() inside sum() and I get no error about using an aggregate function inside another aggregate function, so why do I get the error in the second one?
An example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val df2 = df
.groupBy("c1")
.agg(sum("Val1").alias("sum"))
.withColumn("fraction", col("sum") / sum("sum").over())
df2.show
You will need to tailor this to your own situation, e.g. use count instead of sum, as follows:
val df2 = df
.groupBy("c1")
.agg(count("*"))
.withColumn("fraction", col("count(1)") / sum("count(1)").over())
returning:
+---+--------+-------------------+
| c1|count(1)| fraction|
+---+--------+-------------------+
| E| 1|0.16666666666666666|
| B| 1|0.16666666666666666|
| D| 1|0.16666666666666666|
| C| 1|0.16666666666666666|
| A| 2| 0.3333333333333333|
+---+--------+-------------------+
You can multiply by 100 to get a percentage. The alias did not seem to work for count the way it does for sum, so I worked around it by using the generated column name count(1) and left the sum version above for comparison. Again, you will need to tailor this to your specifics; it is adapted from my general research modules.
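The shape of this answer is two separate steps: first aggregate per group, then divide each group's value by the total over the already aggregated rows. The same two steps on a plain Scala collection look like the sketch below; the `grades` data and the names `counts`, `total` and `percentages` are invented for illustration and are not from the question's dataset.

```scala
// One entry per student; the values are made-up sample data.
val grades = Seq("A", "B", "A", "C", "A", "B")

// Step 1: aggregate per group (the groupBy + count).
val counts: Map[String, Int] = grades.groupBy(identity).map { case (g, xs) => g -> xs.size }

// Step 2: a total over the aggregated rows (the sum(...) over ()), then the ratio.
val total = counts.values.sum
val percentages: Map[String, Double] = counts.map { case (g, n) => g -> n * 100.0 / total }
```

Keeping the per-group aggregate and the grand total as distinct steps mirrors what the error message asks for, which is to compute the inner aggregate first and only then apply the outer one.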

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier")
.agg(
  count("*").alias("count"),
  min($"bandwidth").alias("bandwidth"),
  first($"network").alias("network"),
  concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"))
.withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
.withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried collect_set, but it gives me an error saying "grouping expressions sequence is empty". Is it possible to find the number of distinct values in each row's array? For the same example, the new column would look like this:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
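The body of that UDF is ordinary Scala, so it can be checked on its own before wrapping it in udf(...). A minimal sketch, with the hypothetical names `countDistinctCodes`, `carrierCodes` and `carrierCounts`:

```scala
// The per-row logic the UDF wraps: distinct elements of one array.
def countDistinctCodes(codes: Seq[String]): Int = codes.toSet.size

// The three arrays from the question.
val carrierCodes = Seq(Seq("12", "2", "12"), Seq("5", "2", "8"), Seq("1", "1", "3"))
val carrierCounts = carrierCodes.map(countDistinctCodes)
```

On Spark 2.4 or later, the built-in functions size and array_distinct can likely be combined as size(array_distinct($"carrier_code")) to get the same count without a UDF; treat that as a suggestion to verify against your Spark version.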
For posterity, here is a version without a UDF that converts to an RDD and back to a DataFrame:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; as stated, this one is more for posterity.
You can use a UDF for this, like so:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
// Strip the surrounding brackets before splitting, otherwise "[12" and "12]" count as different values.
val countUDF=udf{(str:String)=>val strArr=str.split(":"); strArr(0)+":"+strArr(1).stripPrefix("[").stripSuffix("]").split(",").distinct.length.toString}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:2|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:2|
+-----------+-------------+

Row aggregations in Scala

I am looking for a way to get a new column in a data frame in Scala that calculates the min/max of the values in col1, col2, ..., col10 for each row.
I know I can do it with a UDF but maybe there is an easier way.
Thanks!
Porting this Python answer by user6910411 to Scala:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 3, 0, 9, "a", "b", "c")
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "Col7")
val cols = Seq("col1", "col2", "col3", "col4")
val rowMax = greatest(
cols map col: _*
).alias("max")
val rowMin = least(
cols map col: _*
).alias("min")
df.select($"*", rowMin, rowMax).show
// +----+----+----+----+----+----+----+---+---+
// |col1|col2|col3|col4|col5|col6|Col7|min|max|
// +----+----+----+----+----+----+----+---+---+
// | 1| 3| 0| 9| a| b| c| 0| 9|
// +----+----+----+----+----+----+----+---+---+