Stateful aggregation function in PySpark - scala

From PySpark, I am trying to define a custom aggregator that accumulates state. Is this possible in Spark 2.3?
AFAIK, it has been possible to define a custom UDAF in PySpark since Spark 2.3 (cf. How to define and use a User-Defined Aggregate Function in Spark SQL?) by calling pandas_udf with PandasUDFType.GROUPED_AGG. However, given that it just takes a function as a parameter, I don't think it is possible to carry state around during the aggregation.
From Scala, I see it is possible to have stateful aggregation by extending either UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there something similar I can do on the Python side only?

You could use an accumulator.
You could also leverage Spark Streaming's built-in state management.
Here is a simple accumulator example for use in SQL:
from pyspark.sql.types import IntegerType
# have some data
df = spark.range(10).toDF("num")
# have a table
df.createOrReplaceTempView("num_table")
# have an accumulator
accSum = sc.accumulator(0)
# have a function that accumulates
def add_acc(int_val):
    accSum.add(int_val)
    return int_val
# register function as udf
spark.udf.register("reg_addacc", add_acc, IntegerType())
# use in sql
spark.sql("SELECT sum(reg_addacc(num)) FROM num_table").show()
# get value from accumulator
print(accSum.value)
45
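One caveat with the accumulator approach: Spark only guarantees exactly-once accumulator updates inside actions, so an update made inside a UDF may be applied more than once if a task is re-executed. As an alternative, here is a minimal pure-Python sketch of state carried explicitly through an aggregation, shaped like the three arguments of RDD.aggregate (zero value, per-element function, merge function). The names ZERO, seq_op, comb_op, finish and the local aggregate helper are illustrative assumptions, and the partitions are simulated with plain lists rather than a real cluster.

```python
from functools import reduce

# State carried through the aggregation: (count, running total).
ZERO = (0, 0.0)

def seq_op(state, value):
    """Fold one input value into the running state."""
    count, total = state
    return (count + 1, total + value)

def comb_op(left, right):
    """Merge the partial states produced by two partitions."""
    return (left[0] + right[0], left[1] + right[1])

def finish(state):
    """Turn the final state into the result (here, the mean)."""
    count, total = state
    return total / count if count else None

def aggregate(partitions):
    """Local stand-in for rdd.aggregate(ZERO, seq_op, comb_op)."""
    partials = [reduce(seq_op, part, ZERO) for part in partitions]
    return reduce(comb_op, partials, ZERO)

# Hypothetical data: the numbers 0..9 split across three partitions.
partitions = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
state = aggregate(partitions)
mean = finish(state)
```

With a SparkContext available, the same pieces could presumably be used as df.rdd.map(lambda r: r.num).aggregate(ZERO, seq_op, comb_op), followed by finish on the driver. This runs per-row Python code, so it trades speed for the ability to keep arbitrary state.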

Related

pyspark - difference between select and agg

What is the difference between the following two -
df.select(min("salary")).show()
and
df.agg({'salary':'min'}).show()
Also, what is the difference in these two -
df.groupBy("department").agg({'salary':'min'}).show()
and
df.groupBy("Company").min('salary').show()
In Spark, there are many different ways to write the same thing. It mostly depends on whether you prefer a SQL style or a Python style.
df.select(min("salary")) is the equivalent of the SQL query:
select min(salary) from df
This query computes the min of the column salary without any group by clause.
It is equivalent to
from pyspark.sql import functions as F
df.groupBy().agg(F.min("salary"))
# OR
df.groupBy().agg({'salary':'min'})
As you can see, the groupBy is empty, so you do not group by anything. agg also accepts the dict {'salary':'min'}, which is equivalent to the function F.min("salary").
The method agg depends on the object. Applied to a DataFrame, it is equivalent to df.groupBy().agg. agg is also a method of the GroupedData object, which is created when you call df.groupBy(). The official documentation shows the difference between the two methods.
When writing df.groupBy("department").agg({'salary':'min'}), you can specify several different aggregations in the agg method. When using just min, you are limited to that one aggregate function, and because it returns a DataFrame you cannot chain another aggregate onto it. For example, you can do this:
from pyspark.sql import functions as F
df.groupBy("department").agg(F.min("salary"), F.max("age"))
# OR
df.groupBy("department").agg({'salary':'min', 'age':'max'})
# But you cannot do
df.groupBy("department").min("salary").max("age")
>> AttributeError: 'DataFrame' object has no attribute 'max'

Using min/max operations in groupByKey on a spark dataset

I am trying to achieve min and max inside agg of a groupByKey operation. The code looks like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
  count => typedCount,
  sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
  .agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
    typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
1. Instead of sum, I want to take the min/max on a few columns. When I try to use the org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
2. The agg function lets us specify at most 4 columns, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately, it seems that:
min/max are not yet supported (see the "todos" in typed.scala)
the agg function indeed only supports up to 4 columns (see KeyValueGroupedDataset.scala)
In your case, a reasonable thing to do might be to define your own specialized aggregator that aggregates InputFlowStatistics objects, so that you only pass a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.
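To make the single-aggregator idea concrete, here is a rough sketch written in Python purely for illustration. It mirrors the four pieces a Scala Aggregator defines (zero, reduce, merge, finish) and keeps min, max, and sum in one buffer. FlowStats, its three fields, and FlowStatsAggregator are hypothetical stand-ins for the real record type, not part of any Spark API.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class FlowStats:
    """Hypothetical, trimmed-down stand-in for the flow statistics record."""
    min_flow_time: int
    max_flow_time: int
    flow_started_count: int

class FlowStatsAggregator:
    """Mirrors the zero / reduce / merge / finish shape of a Scala Aggregator."""

    def zero(self):
        # None marks an empty buffer, so min/max need no sentinel values.
        return None

    def reduce(self, buffer, record):
        # Folding in one record is the same as merging a one-record buffer.
        return self.merge(buffer, record)

    def merge(self, a, b):
        if a is None:
            return b
        if b is None:
            return a
        return FlowStats(
            min(a.min_flow_time, b.min_flow_time),
            max(a.max_flow_time, b.max_flow_time),
            a.flow_started_count + b.flow_started_count,
        )

    def finish(self, buffer):
        return buffer

agg = FlowStatsAggregator()
records = [FlowStats(5, 9, 1), FlowStats(2, 7, 3), FlowStats(4, 12, 0)]
result = agg.finish(reduce(agg.reduce, records, agg.zero()))
```

Because every field is combined inside one buffer, the number of aggregated fields is no longer tied to the number of arguments agg accepts.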

Spark UDAF with distinct value

I'm trying to implement a custom UDAF in Spark that needs distinct values from the input.
Does Spark provide an interface that I can use to declare that I need the input as distinct values?
If so, how can this be implemented?

Spark Scala - Apply ML/Complex functions on a GroupBy DataFrame

I have a large DataFrame (Spark 1.6 Scala) which looks like this:
Type,Value1,Value2,Value3,...
--------------------------
A,11.4,2,3
A,82.0,1,2
A,53.8,3,4
B,31.0,4,5
B,22.6,5,6
B,43.1,6,7
B,11.0,7,8
C,22.1,8,9
C,3.2,9,1
C,13.1,2,3
From this I want to group by Type and apply machine learning algorithms and/or perform complex functions on each group.
My objective is to perform complex functions on each group in parallel.
I have tried the following approaches:
Approach 1) Convert the DataFrame to a Dataset and then use the ds.mapGroups() API. But this gives me an Iterator over each group's values.
If I want to call RandomForestClassificationModel.transform(dataset: DataFrame), I need a DataFrame containing only a particular group's values.
I was not sure whether converting an Iterator to a DataFrame within mapGroups is a good idea.
Approach 2) Take the distinct values of Type, map over them, and filter for each Type within the map loop:
val types = df.select("Type").distinct()
val ff = types.map(row => {
  val type = row.getString(0)
  val thisGroupDF = df.filter(col("Type") == type)
  // Apply complex functions on thisGroupDF
  (type, predictedValue)
})
For some reason, the above is never completing (seems to be getting into some kind of infinite loop)
Approach 3) Exploring Window functions, but did not find a method which can provide dataframe of particular group values.
Please help.

Calling other methods/variables inside a UDF method in Spark SQL DataFrame

I have a Spark SQL DataFrame in which I am trying to call a UDF that I created using the Spark SQL udf function.
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DataFrame, stored as a val, inside somemethodName, but the DataFrame comes through as null.
This happens only when I use a where clause on newDF.
Am I missing something? Is it not possible to use another variable / method inside a UDF method?
Or do I have to do something with broadcast? Currently I am running this locally, not on a cluster.
"Is it not possible to use another variable / method inside UDF method"
It is possible if and only if that variable / method can be serialized: a UDF is a closure that must be serialized and distributed to the executors.
A DataFrame cannot be serialized (it is a pointer to other distributed data, so there is no logical way to serialize it without collecting it into driver memory), and therefore it appears as null when you try to use it inside the UDF.
You're probably going to need to join the two dataframes on some key, and then use a UDF (or a standard transformation) that takes columns from the joined Dataframe.
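The serialization constraint can be illustrated without Spark. The following is only an analogy in plain Python, using the standard pickle module rather than Spark's own closure serializer: DriverOnlyHandle is a hypothetical stand-in for an object tied to driver-side resources, and can_be_shipped is an illustrative helper, not a Spark function.

```python
import pickle
import threading

class DriverOnlyHandle:
    """Hypothetical stand-in for an object bound to driver-side resources."""
    def __init__(self):
        # Locks cannot be pickled, so neither can an object holding one.
        self._lock = threading.Lock()

def can_be_shipped(obj):
    """Return True if obj survives serialization, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

# Plain data (for example, rows collected into a dict) serializes fine.
lookup = {"a": 1, "b": 2}
handle = DriverOnlyHandle()

lookup_ok = can_be_shipped(lookup)
handle_ok = can_be_shipped(handle)
```

Here lookup_ok is True and handle_ok is False. By the same reasoning, a small lookup table collected into a plain dict could be captured by a UDF, while anything too large to collect is better handled with the join described above.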