How to use countDistinct using a window function in Spark/Scala? - scala

I need to use a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, with the result added as a 4th column. I can do a plain count without any issues, but using a distinct count throws an exception:
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported:
Is there any workaround for this?

A previous answer suggested two possible techniques: approximate counting and size(collect_set(...)). Both have problems.
If you need an exact count, which is the main reason to use COUNT(DISTINCT ...) in big data, approximate counting will not do. Also, approximate counting actual error rates can vary quite significantly for small data.
size(collect_set(...)) may cause a substantial slowdown in processing of big data because it uses a mutable Scala HashSet, which is a pretty slow data structure. In addition, you may occasionally get strange results, e.g., if you run the query over an empty dataframe, because size(null) produces the counterintuitive -1. Spark's native distinct counting runs faster for a number of reasons, the main one being that it doesn't have to produce all the counted data in an array.
The typical approach to solving this problem is with a self-join. You group by whatever columns you need, compute the distinct count or any other aggregate function that cannot be used as a window function, and then join back to your original data.

Use approx_count_distinct, or collect_set combined with size, over a window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//| i| j| k|
//+---+---+---+
//| 1| a| c|
//| 2| b| d|
//| 1| a| c|
//| 2| b| e|
//+---+---+---+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i","j")
df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//| i| j| k|cnt|
//+---+---+---+---+
//| 2| b| d| 2|
//| 2| b| e| 2|
//| 1| a| c| 1| //as c value repeated for 1,a partition
//| 1| a| c| 1|
//+---+---+---+---+

Trying to improve Sim's answer: if you want to do this:
//val newColName: String = ...
//val colToCount: Column = ...
//val aggregatingCols: Seq[Column] = ...
df.withColumn(newColName, countDistinct(colToCount).over(Window.partitionBy(aggregatingCols:_*)))
You must instead do this:
//val aggregatingCols: Seq[String] = ...
df.groupBy(aggregatingCols.head, aggregatingCols.tail:_*)
.agg(countDistinct(colToCount).as(newColName))
.select(newColName, aggregatingCols:_*)
.join(df, usingColumns = aggregatingCols)
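To make the "group, count distinct, join back" idea concrete without needing a Spark session, here is a minimal plain-Python sketch of the same logic. The function name add_distinct_count and the dict-based rows are just illustrative assumptions, not part of any Spark API; the sample rows mirror the i/j/k example above.

```python
from collections import defaultdict

def add_distinct_count(rows, key_cols, count_col, new_col):
    """Attach to every row the number of distinct count_col values in its group."""
    # Step 1: "group by" the key columns and collect the distinct values per group.
    distinct_per_group = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        distinct_per_group[key].add(row[count_col])
    # Step 2: "join back" so each original row carries its group's distinct count.
    return [
        {**row, new_col: len(distinct_per_group[tuple(row[c] for c in key_cols)])}
        for row in rows
    ]

rows = [
    {"i": 1, "j": "a", "k": "c"},
    {"i": 2, "j": "b", "k": "d"},
    {"i": 1, "j": "a", "k": "c"},
    {"i": 2, "j": "b", "k": "e"},
]
result = add_distinct_count(rows, ["i", "j"], "k", "cnt")
for row in result:
    print(row)
```

The original row order is preserved here; in Spark the join gives no such ordering guarantee.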

This will return the number of distinct elements in the partition, using dense_rank() function. When we sum ascending and descending rank, we always get the total number of distinct elements + 1 :
dense_rank().over(Window.partitionBy("i").orderBy(c.asc)) + dense_rank().over(Window.partitionBy("i").orderBy(c.desc)) - 1
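If the identity is not obvious, the following plain-Python sketch checks it on one made-up partition. The dense_rank helper here is a hand-written stand-in for the SQL function, not a Spark call.

```python
def dense_rank(values, reverse=False):
    """Map each value to its dense rank (1-based, ties share a rank, no gaps)."""
    ordered = sorted(set(values), reverse=reverse)
    rank_of = {v: r for r, v in enumerate(ordered, start=1)}
    return [rank_of[v] for v in values]

# One partition with three distinct values (hypothetical data).
partition = ["d", "e", "d", "f", "e"]
asc = dense_rank(partition)
desc = dense_rank(partition, reverse=True)

# For a value at ascending position p among m distinct values, the descending
# position is m - p + 1, so asc + desc - 1 == m on every row.
distinct_counts = [a + d - 1 for a, d in zip(asc, desc)]
print(distinct_counts)
```

One caveat worth checking on your own data: count(DISTINCT ...) skips nulls, whereas a rank-based count gives nulls a rank of their own, so the two can differ by one when the column contains nulls.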

Related

Spark - group and aggregate only several smallest items

In short
I have the cartesian product (cross join) of two dataframes and a function which gives a score for a given element of this product. I now want to get a few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want to retain only a few (say 3) of the "matches", those having the best score (say, the lowest score).
The question is
How to get the "matches" sorted and reduced to top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by inner field.
Is there a way to ensure optimization in case of large input DFs - e.g. choosing minimums directly while aggregating. I know it could be done easily if I wrote the code without spark - keeping small array or priority queue for every id1 and adding element where it should be, possibly dropping out some previously added.
E.g. it's ok that cross-join is costly operation, but I want to avoid wasting memory on the results most of which I'm going to drop in the next step. My real use case deals with DFs with less than 1 mln entries so cross-join is yet viable but as we want to select only 10-20 top matches for each id1 it seems to be quite desirable not to keep unnecessary data between steps.
To start, we need to take only the first n rows per group. To do this we partition the DF by 'id1' and sort each partition by the score column 'rez'. We use that to add a row-number column to the DF, so that we can use the where function to keep the first n rows. Then you can continue with the same code you wrote: group by 'id1' and collect the list. Only now you already have just the top rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val n = 3
// desc keeps the highest scores; use $"rez".asc if the lowest score is the best
val w = Window.partitionBy($"id1").orderBy($"rez".desc)
val res = r.withColumn("rn", row_number().over(w)).where($"rn" <= n).groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
import org.apache.spark.sql.Row
val sortTakeUDF = udf { (xs: Seq[Row], n: Int) => xs.sortBy(_.getAs[Int]("rez")).reverse.take(n).map { case Row(x: String, y: Int) => (x, y) } }
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "rez")), lit(n)).as("matches"))
Here we create a udf that takes the array column and an integer value n. The udf sorts the array by 'rez' in descending order and returns only the first n elements.
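The question also mentions keeping a small priority queue per id1 so that discarded matches are never held in memory. Here is a rough plain-Python sketch of that idea, keeping the n lowest scores per key. The helper name top_n_per_key is made up for illustration; the scoring function and sample values are taken from the question.

```python
import heapq
from collections import defaultdict

def f(a, b):
    # Scoring function from the question.
    return (a * a + b * b * b) % 17

def top_n_per_key(scored_pairs, n):
    """scored_pairs yields (id1, id2, score); keep the n lowest scores per id1."""
    heaps = defaultdict(list)  # id1 -> heap of (-score, id2), size <= n
    for id1, id2, score in scored_pairs:
        heap = heaps[id1]
        if len(heap) < n:
            heapq.heappush(heap, (-score, id2))
        elif -score > heap[0][0]:
            # heap[0] holds the worst (highest) kept score; replace it if we beat it.
            heapq.heapreplace(heap, (-score, id2))
    return {k: sorted((-neg, id2) for neg, id2 in h) for k, h in heaps.items()}

left = [("a", 31), ("b", 41)]
right = [("z", 16), ("y", 18), ("x", 3), ("w", 39), ("v", 98), ("u", 88)]
# Generator: the full cross product is never materialised.
pairs = ((i1, i2, f(v1, v2)) for i1, v1 in left for i2, v2 in right)
result = top_n_per_key(pairs, 3)
print(result)
```

Each key holds at most n entries at any time, which is the memory behaviour the question asks for. On ties at the cut-off, this sketch keeps whichever element arrived first.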

Apply UDF function to Spark window where the input parameter is a list of all column values in range

I would like to build a moving average on each row in a window. Let's say -10 rows. BUT if there are fewer than 10 rows available I would like to insert a 0 in the resulting row -> new column.
So what I am trying to achieve is using a UDF in an aggregate window with an input parameter List() (or whatever superclass) which has the values of all rows available.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List(1,2,3,4) if no more rows are available, taken as the input parameter for the udf function. The udf function should return a calculated value, or 0 if fewer than 10 rows are available.
The above code terminates with: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition without the need of UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
toDF("id", "amount")
val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)
df.
withColumn("slidingCount", count($"amount").over(winSpec)).
withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
otherwise(avg($"amount").over(winSpec))
).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// | 1| 10.0| 1| 0.0|
// | 1| 11.0| 2| 0.0|
// | 1| 12.0| 3| 0.0|
// | 1| 13.0| 4| 11.5|
// | 1| 14.0| 4| 12.5|
// | 1| 15.0| 4| 13.5|
// | 1| 16.0| 4| 14.5|
// | 1| 17.0| 4| 15.5|
// | 2| 20.0| 1| 0.0|
// | 2| 21.0| 2| 0.0|
// | 2| 22.0| 3| 0.0|
// | 2| 23.0| 4| 21.5|
// | 2| 24.0| 4| 22.5|
// | 2| 25.0| 4| 23.5|
// | 2| 26.0| 4| 24.5|
// | 2| 27.0| 4| 25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf{ (ls: Seq[Double]) =>
val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)){
case ((a, i), next) => (a + (next-a)/i, i + 1)
}
if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}
// To apply the UDF:
df.
withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large to be passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for what's needed for your specific requirement, consider passing them to the UDF instead.
In general, especially if the data at hand is already in DataFrame format, it'd perform and scale better by using built-in DataFrame API to take advantage of Spark's execution engine optimization than using user-defined UDF/UDAF. You might be interested in reading this article re: advantages of DataFrame/Dataset API over UDF/UDAF.
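For anyone who wants to sanity-check the expected numbers outside Spark, the rule used above (average over the last slidingWin rows, 0.0 when fewer rows are available) can be sketched in plain Python. The function name sliding_avg_or_zero is only illustrative; the input reproduces the id = 1 partition from the example.

```python
def sliding_avg_or_zero(values, window):
    """Average of the current row and the window - 1 rows before it, or 0.0 if the frame is short."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / window if len(frame) == window else 0.0)
    return out

amounts = [10.0 + j for j in range(8)]  # the id = 1 partition from the example
result = sliding_avg_or_zero(amounts, 4)
print(result)
```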

How to standardize a column in PySpark without using StandardScaler?

Seems like this should work, but I'm getting errors:
mu = mean(df[input])
sigma = stddev(df[input])
dft = df.withColumn(output, (df[input]-mu)/sigma)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is
empty, and '`user`' is not an aggregate function. Wrap
'(((CAST(`sum(response)` AS DOUBLE) - avg(`sum(response)`)) /
stddev_samp(CAST(`sum(response)` AS DOUBLE))) AS `scaled`)' in
windowing function(s) or wrap '`user`' in first() (or first_value) if
you don't care which value you get.;;\nAggregate [user#0,
sum(response)#26L, ((cast(sum(response)#26L as double) -
avg(sum(response)#26L)) / stddev_samp(cast(sum(response)#26L as
double))) AS scaled#46]\n+- AnalysisBarrier\n +- Aggregate
[user#0], [user#0, sum(cast(response#3 as bigint)) AS
sum(response)#26L]\n +- Filter item_id#1 IN
(129,130,131,132,133,134,135,136,137,138)\n +-
Relation[user#0,item_id#1,response_value#2,response#3,trait#4,response_timestamp#5]
csv\n"
I'm not sure what's going on with this error message.
Using collect() is not a good solution in general and you will see that this will not scale as your data grows.
If you don't want to use StandardScaler, a better way is to use a Window to compute the mean and standard deviation.
Borrowing the same example from StandardScaler in Spark not working as expected:
import numpy as np
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql import Window
df = spark.createDataFrame(
np.array(range(1,10,1)).reshape(3,3).tolist(),
["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#| 1| 2| 3|
#| 4| 5| 6|
#| 7| 8| 9|
#+----+----+----+
Suppose you wanted to standardize the column int2:
input_col = "int2"
output_col = "int2_scaled"
w = Window.partitionBy()
mu = mean(input_col).over(w)
sigma = stddev(input_col).over(w)
df.withColumn(output_col, (col(input_col) - mu)/(sigma)).show()
#+----+----+----+-----------+
#|int1|int2|int3|int2_scaled|
#+----+----+----+-----------+
#| 1| 2| 3| -1.0|
#| 7| 8| 9| 1.0|
#| 4| 5| 6| 0.0|
#+----+----+----+-----------+
If you wanted to use the population standard deviation as in the other example, replace pyspark.sql.functions.stddev with pyspark.sql.functions.stddev_pop().
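To see the difference between the two choices on the int2 column, here is a small plain-Python check using only the standard library. The helper name standardize is an assumption for illustration; stdev is the sample standard deviation and pstdev the population one.

```python
from statistics import mean, stdev, pstdev

def standardize(values, population=False):
    """Z-score each value using the sample (default) or population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) if population else stdev(values)
    return [(v - mu) / sigma for v in values]

int2 = [2, 5, 8]
sample_scaled = standardize(int2)
population_scaled = standardize(int2, population=True)
print(sample_scaled)
print(population_scaled)
```

With the sample standard deviation the values come out as -1, 0 and 1, matching the table above; with the population version the outer values are slightly larger in magnitude.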
Fortunately, I was able to find code that works:
summary = df.select([mean(input).alias('mu'), stddev(input).alias('sigma')])\
.collect().pop()
dft = df.withColumn(output, (df[input]-summary.mu)/summary.sigma)

How to add a new column with maximum value?

I have a Dataframe with 2 columns tag and value.
I want to add a new column that contains the max of value column. (It will be the same value for every row).
I tried to do something as follows, but it didn't work.
val df2 = df.withColumn("max",max($"value"))
How to add the max column to the dataset?
There are 3 ways to do it (one you already know from the other answer). I avoid collect since it's not really needed.
Here is the dataset. The maximum value is 4, and the value 3 appears twice.
val tags = Seq(
("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1| 1|
|tg2| 2|
|tg1| 3|
|tg4| 4| <-- maximum value
|tg3| 3|
+---+-----+
Cartesian Join With "Max" Dataset
I'm going to use a cartesian join of the tags and a single-row dataset with the maximum value.
val maxDF = tags.select(max("value") as "max")
scala> maxDF.show
+---+
|max|
+---+
| 4|
+---+
val solution = tags.crossJoin(maxDF)
scala> solution.show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+
I'm not worried about the cartesian join here since it's just a single-row dataset.
Windowed Aggregation
My favorite windowed aggregation fits this problem so nicely. On the other hand, I don't really think that'd be the most effective approach due to the number of partitions in use, i.e. just 1, which gives the worst possible parallelism.
The trick is to use the aggregation function max over an empty window specification that informs Spark SQL to use all rows in any order.
val solution = tags.withColumn("max", max("value") over ())
scala> solution.show
18/05/31 21:59:40 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+
Please note the warning that says it all.
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I would not use this approach given the other solutions and am leaving it here for educational purposes.
If you want the maximum value of a column for all rows, you are going to need to compare all the rows in some form. That means doing an aggregation. withColumn only operates on a single row, so you have no way to get the DataFrame max value from it alone.
The easiest way to do this is like below:
val data = Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4))
val df = sc.parallelize(data).toDF("name", "value")
// first is an action, so this will execute spark stages to compute the value
val maxValue = df.groupBy().agg(max($"value")).first.getInt(0)
// Now you can add it to your original DF
val updatedDF = df.withColumn("max", lit(maxValue))
updatedDF.show
There is also an alternative to this that might be a little faster. If you don't need the max value until the end of your processing (after you have already run a Spark action), you can compute it by writing your own Spark Accumulator that gathers the value while doing whatever other Spark action work you have requested.
Max column value as additional column by window function
val tags = Seq(
("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1| 1|
|tg2| 2|
|tg1| 3|
|tg4| 4|
|tg3| 3|
+---+-----+
scala> tags.withColumn("max", max("value").over(Window.partitionBy(lit("1")))).show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+

Scala: Any better way to join two DataFrames by the relationship from the third one

I have two DataFrames and want to outer join them, but the join mapping is in another DataFrame.
Right now I am using the approach below. It works, but I hope there is a more efficient way, since I have >1,000,000 rows.
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
scala> ta.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
scala> tb.show
+---+---+
| C| D|
+---+---+
| 2| 1|
+---+---+
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
scala> tc.show
+---+---+---+
| D| E| F|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
scala> val tmp = ta.join(tb, Seq("C"), "left_outer")
tmp: org.apache.spark.sql.DataFrame = [C: int, A: int, B: int, D: int]
scala> tmp.show
+---+---+---+----+
| C| A| B| D|
+---+---+---+----+
| 1| 1| 1|null|
| 2| 2| 2| 1|
+---+---+---+----+
scala> tmp.join(tc, Seq("D"), "outer").show
+----+----+----+----+----+----+
| D| C| A| B| E| F|
+----+----+----+----+----+----+
|null| 1| 1| 1|null|null|
| 1| 2| 2| 2| 1| 1|
| 2|null|null|null| 2| 2|
+----+----+----+----+----+----+
As Umberto noted, a good reference on how to improve performance of your joins is Holden Karau and Rachel Warren's High Performance Spark > Chapter 4. Joins (SQL & Core).
From the standpoint of your code, running it as you noted or the SQL equivalent (as noted below) should result in about the same performance.
// Create initial tables
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
// _.createOrReplaceTempView
ta.createOrReplaceTempView("ta")
tb.createOrReplaceTempView("tb")
tc.createOrReplaceTempView("tc")
// SQL Query
// Triple quotes are needed for a multi-line string literal in Scala
spark.sql("""
select tc.D, ta.A, ta.B, ta.C, tc.E, tc.F
from ta
left outer join tb
on tb.C = ta.C
full outer join tc
on tc.D = tb.D
""")
The reason is that the Spark SQL Catalyst Optimizer takes the DataFrame query and builds up an optimized logical plan. A number of physical plans are then developed, and the Spark SQL engine's cost optimizer chooses the best physical plan and generates the code to produce the RDDs.
That said, the key concern is that when you're working with a lot of rows that use up a lot of memory, you have to take the partitioning into account. For example, if you can ensure that the mapping DataFrame (tc) has the same or a similar partitioning scheme as the other DataFrames (ta, tb), you can have a co-located join (this is Figure 4-3 within High Performance Spark > Chapter 4. Joins).
If the partitions for your three DataFrames (ta, tb, tc) all have different partitioning, this means the keys for your DataFrames will not have a 1-to-1 matching between the partitions. That is, this will result in a shuffle join (this is Figure 4-2 within High Performance Spark > Chapter 4. Join) which potentially could be more costly.
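As a rough illustration of why matching partitioning matters, here is a plain-Python toy model (no Spark involved). It assumes both sides are partitioned by the same function of the join key, so each partition can be joined locally with its counterpart and no rows have to move. The helper names, the modulo "partitioner", and the extra sample rows are all made up for the sketch, and it does an inner join only.

```python
def partition_by_key(rows, key_index, num_partitions):
    """Partition rows by an integer key; modulo stands in for a hash partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row[key_index] % num_partitions].append(row)
    return parts

def colocated_join(left_parts, right_parts, left_key, right_key):
    """Inner-join partition i of the left side with partition i of the right side only."""
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        for l in lpart:
            for r in rpart:
                if l[left_key] == r[right_key]:
                    out.append(l + r)
    return out

ta = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]  # columns A, B, C
tb = [(2, 1), (3, 2)]                   # columns C, D

# Both sides use the same partitioner on C, so matching keys share a partition index.
ta_parts = partition_by_key(ta, key_index=2, num_partitions=2)
tb_parts = partition_by_key(tb, key_index=0, num_partitions=2)
local = colocated_join(ta_parts, tb_parts, left_key=2, right_key=0)

# Reference: a join that compares every row with every row.
naive = [l + r for l in ta for r in tb if l[2] == r[0]]
print(sorted(local))
```

If the two sides used different partitioners, partition i on one side would no longer line up with partition i on the other, and rows would have to be redistributed first, which is the shuffle described above.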
Basically, from the standpoint of your query, the concern is less about the query itself and more about the partitioning schemes for your DataFrames. But before experimenting too much with the partitioning schemes of your DataFrames, experiment with your queries to see if the default Spark SQL / DataFrame queries are able to take care of the partitioning by themselves.