Spark Structured Streaming - custom aggregation with event-time windowing - Scala

I am trying to do a custom aggregation on a structured stream with event-time windowing.
First I tried to use the Aggregator interface (typed UDAF) with the .agg function, something like:
val aggregatedDataset = streamDataset
  .select($"id", $"time", $"eventType", $"eventValue")
  .groupBy(window($"time", "1 hour"), $"id")
  .agg(CustomAggregator("time", "eventType", "eventValue").toColumn.as("aggregation"))
Yet this aggregation (in the reduce function) only works on the new input element, not on the whole group.
So I am trying to use the GroupState functions (mapGroupsWithState, flatMapGroupsWithState), or even just the mapGroups function (without state), to perform my aggregation.
But my groupBy operation returns a RelationalGroupedDataset, and I need a KeyValueGroupedDataset to use the map functions; groupByKey does not work with windowing.
How can I do a custom aggregation with Structured Streaming and event-time windows?
Thanks!

The GroupState functions (mapGroupsWithState, flatMapGroupsWithState), or mapGroups without state, are only needed to perform an aggregation when you have to operate in Update output mode.
If you are using Complete output mode, you do not need the GroupState functions at all.
So, if you change the output mode of the aggregatedDataset query to Complete, it will work as expected.
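A hedged sketch of what that looks like, reusing streamDataset, CustomAggregator and the column names from the question; the console sink and the writeStream setup are assumptions for illustration:
import org.apache.spark.sql.functions.window

// Assumes a SparkSession `spark` with spark.implicits._ in scope for the $ syntax,
// as in the question.
val aggregatedDataset = streamDataset
  .select($"id", $"time", $"eventType", $"eventValue")
  .groupBy(window($"time", "1 hour"), $"id")
  .agg(CustomAggregator("time", "eventType", "eventValue").toColumn.as("aggregation"))

aggregatedDataset.writeStream
  .outputMode("complete") // Complete mode re-emits the full aggregated result on every trigger
  .format("console")      // assumption: console sink, just for illustration
  .start()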
I hope it helps!

Related

What is the right way to get Spark to execute a UDF?

As far as I know, Spark uses lazy evaluation, meaning that if no action is called, nothing ever happens.
One way I know of is to call the collect method to make Spark do the work; however, when I read the article it says:
Usually, collect() is used to retrieve the action output when you have very small result set and calling collect() on an RDD/DataFrame with a bigger result set causes out of memory as it returns the entire dataset (from all workers) to the driver hence we should avoid calling collect() on a larger dataset.
And I actually have a UDF that returns NullType():
from pyspark.sql.functions import udf
from pyspark.sql.types import NullType

@udf(returnType=NullType())
def write_something():
    # write something to dir
    ...
So I do not want to use collect(), since it might cause OOM as mentioned above.
What is the best way to do this in my case? Thanks!
You can use DataFrame.foreach:
df.foreach(lambda x: None)
The foreach action will trigger the execution of the whole DAG of df while keeping all data on their respective executors.
The pattern foreach(lambda x: None) is mainly used for debugging purposes. An option might be to remove the udf and put its logic into the function that is called by foreach.

Why can't Flink support RichFunction on Reduce/Fold/Aggregate?

I've searched for RichAggregateFunction in the GitHub repository and only found the following:
.aggregate() does not support [[RichAggregateFunction]], since the reduce function is used internally in a [[org.apache.flink.api.common.state.AggregatingState]].
Does that mean Flink can't merge elements' state in a group window?
Depending on what you mean by "merge" here, you would generally do that work in the ProcessWindowFunction when you call stream.aggregate. The ProcessWindowFunction is the second parameter; it receives the aggregated result, and you can perform additional operations on it.
If you need to combine the aggregated elements in some other way, you can take the stream that comes out of the aggregate and apply further operations to it (such as a ProcessFunction).
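To make the first part concrete, here is a minimal sketch (not the asker's code) of an AggregateFunction paired with a ProcessWindowFunction via Flink's Scala DataStream API; the Event type, the average aggregate, and the tumbling window in the usage comment are assumptions for illustration:
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical event type for the sketch
case class Event(key: String, value: Long)

// Pre-aggregation: incrementally computes an average per key/window
class AverageAggregate extends AggregateFunction[Event, (Long, Long), Double] {
  override def createAccumulator(): (Long, Long) = (0L, 0L)
  override def add(in: Event, acc: (Long, Long)): (Long, Long) = (acc._1 + in.value, acc._2 + 1)
  override def getResult(acc: (Long, Long)): Double = acc._1.toDouble / acc._2
  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) = (a._1 + b._1, a._2 + b._2)
}

// Second parameter to aggregate(): receives the single pre-aggregated value for
// the key/window and can perform further operations on it
class EmitWithKey extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {
  override def process(key: String, context: Context, averages: Iterable[Double],
                       out: Collector[(String, Double)]): Unit =
    out.collect((key, averages.head))
}

// Usage, assuming events: DataStream[Event] and an event-time window assigner:
// events.keyBy(_.key)
//   .window(TumblingEventTimeWindows.of(Time.hours(1)))
//   .aggregate(new AverageAggregate, new EmitWithKey)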

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the Shakespeare file's data to Pub/Sub, and it does get fetched properly, but none of the transformations after .groupByKey seem to work.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping it by key requires defining aggregation triggers; otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
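For the Scala/Scio pipeline in the question, one option is to pass trigger settings through the window options. A hedged sketch follows; the trigger choice and the WindowOptions parameter names are assumptions and may need adjusting to the Scio version in use:
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

// Sketch only: fires a pane per window without waiting for the end of the
// unbounded Pub/Sub input. psSubscription and windowSize come from the question.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      trigger = Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
// ... remaining transforms as in the question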
With regard to Franz's comment above (I would reply to his comment specifically if Stack Overflow would let me!), I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware, the API is unfinished as this is not "release-ready" code. But it is available.

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see in the Scaladoc, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.
The initial value is defined with the initialize method, seqOp with the update method, and combOp with the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator, which uses standard Scala types with Encoders and takes records as input.
The initial value is defined with the zero method, seqOp with the reduce method, and combOp with the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide an additional finalization method (evaluate and finish respectively), which is used to generate the final result, and both can be used for global and by-key aggregations.
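As a rough illustration of the Aggregator route (a minimal sketch, not tied to the linked answers), the pieces map onto the RDD aggregate parameters like this:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// zero plays the role of the RDD aggregate() zero value, reduce is the seqOp,
// merge is the combOp, and finish is the finalization step.
object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                              // initial value
  def reduce(acc: Long, x: Long): Long = acc + x   // fold one record into the accumulator
  def merge(a: Long, b: Long): Long = a + b        // combine partial accumulators
  def finish(acc: Long): Long = acc                // produce the final result
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage, assuming ds: Dataset[Long]:
// ds.select(SumAgg.toColumn)                      // global aggregation
// ds.groupByKey(identity).agg(SumAgg.toColumn)    // by-key aggregation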

Error : RDD transformations and actions can only be invoked by the driver, not inside of other transformations

I am trying to do a nested operation inside an RDD transformation, and it is throwing an error.
Error : RDD transformations and actions can only be invoked by the driver, not inside of other transformations
I am using IndexedRDD (since it is updatable) to apply updates from another RDD, but I am not able to. Here is the code. How can I achieve this?
for ((key, value) <- mapped1)
  indexed = indexed.put(key, value)
As stated in the error, only the driver is able to invoke transformations and actions on RDDs. Your code is likely executing on a worker node (e.g. inside another transformation or action). It looks like you are trying to construct a union of the two RDDs, so I'd recommend looking at the union operator and seeing whether it meets your needs.
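As a hedged sketch of that union-based approach (the element types and the merge rule are assumptions for illustration):
import org.apache.spark.rdd.RDD

// Instead of calling indexed.put from inside a loop driven by another RDD's contents,
// combine both pair RDDs with a single driver-side transformation and resolve
// duplicate keys with an explicit rule.
def applyUpdates(indexed: RDD[(Long, String)], mapped1: RDD[(Long, String)]): RDD[(Long, String)] =
  indexed
    .union(mapped1)                           // driver-side union of the two RDDs
    .reduceByKey((current, update) => update) // assumption: keep one value per key; reduceByKey
                                              // gives no ordering guarantee, so pick a merge
                                              // rule that suits your data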