Does GroupByKey functionality work in Beam's Spark runner? - apache-beam

We need to use GroupByKey to achieve output of the form
KV<ID, <List of objects>>, like
KV<1, <obj1, obj2>>
Could you please tell me if GroupByKey works in the Spark runner? According to the capability matrix it is not supported, but according to the following link it is supported:
https://issues.apache.org/jira/browse/BEAM-799
Thanks

If you press "expand details" on the capability matrix, you can see that GroupByKey is supported in the Spark runner with the note:
Partially: fully supported in batch mode, using Spark's groupByKey. GroupByKey with multiple trigger firings in streaming mode is a work in progress.
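For reference, here is a minimal sketch of that grouping shape, written with Scio (Beam's Scala API) just to keep the examples short; the Java SDK's GroupByKey.create() produces the same KV<ID, Iterable<obj>> result, and it runs on the Spark runner in batch mode. The Obj case class, the output path, and sc.run() (recent Scio versions) are assumptions for illustration.
import com.spotify.scio.ScioContext

object GroupByKeyExample {
  final case class Obj(value: String) // placeholder element type

  def main(args: Array[String]): Unit = {
    val sc = ScioContext()
    sc.parallelize(Seq(1 -> Obj("obj1"), 1 -> Obj("obj2"), 2 -> Obj("obj3")))
      .groupByKey // yields (Int, Iterable[Obj]), i.e. KV<ID, Iterable<obj>>
      .map { case (id, objs) => s"$id -> ${objs.mkString(", ")}" }
      .saveAsTextFile("grouped-output") // assumed output path
    sc.run()
  }
}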

Apache Beam API to Runner instruction translation

Input: A sentence
Expected output: the string representation of the array generated by line.split(' ')
Transformation defined:
.apply(
    MapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+")).toString()))
Question:
Does Beam translate the above instruction, in which I use toString(), into runner-based implementations of toString()? I want to avoid inadvertently defining a UDF that might cause subpar performance (I come from a Spark and Pig background). I'm a little hazy on how the translation happens between the Beam API and runner instructions; I'd appreciate any resources that shed light on it.
No, Beam runners do not rewrite that function. MapElements is implemented using a Beam ParDo that executes your function on incoming data as-is. Runners may fuse multiple steps into a single fused step, and performance can also depend on the JVM the runner uses.
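To make this concrete, here is a sketch in Scala using Scio (Beam's Scala API, chosen only to keep all the examples in this post in one language); its map, like MapElements in Java, is a thin wrapper around a Beam ParDo, so the closure below is serialized and executed verbatim on each element, and the toString call is ordinary JVM code inside the DoFn rather than something the runner rewrites into a native expression. The input and output paths are made up.
import com.spotify.scio.ContextAndArgs

object ToStringExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)
    sc.textFile("input.txt") // made-up input path
      .map(line => line.split("[^\\p{L}]+").toList.toString) // runs as user code inside a ParDo
      .saveAsTextFile("tokenized-output") // made-up output path
    sc.run()
  }
}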

Scala API for the Delta Lake OPTIMIZE command

The Databricks docs say that you can change the Z-ordering of a Delta table by doing:
spark.read.table(connRandom)
.write.format("delta").saveAsTable(connZorder)
sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")
The problem with this is the switching between the Scala and SQL APIs, which is gross. What I want to be able to do is:
spark.read.table(connRandom)
.write.format("delta").saveAsTable(connZorder)
.optimize.zorderBy("src_ip", "src_port", "dst_ip", "dst_port")
but I can't find any documentation that says this is possible.
Is there a Scala API for Delta Lake optimization commands? If so, how do I replicate the aforementioned logic in Scala?
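Not a definitive answer, but here is a sketch of what this could look like assuming Delta Lake 2.0+ (or a comparably recent Databricks runtime), where OPTIMIZE and Z-ordering are exposed through the DeltaTable builder; connZorder is the table name from the question, and the exact method names should be checked against your Delta version.
import io.delta.tables.DeltaTable

DeltaTable
  .forName(spark, connZorder) // table name from the question
  .optimize()
  .executeZOrderBy("src_ip", "src_port", "dst_ip", "dst_port") // returns a DataFrame of OPTIMIZE metrics
On older versions, wrapping the SQL in spark.sql(s"OPTIMIZE $connZorder ZORDER BY (...)") at least keeps everything inside the Scala program.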

Spark Structured Streaming - Custom aggregation with event-time windows

I am trying to do a custom aggregation on a structured stream with event-time windowing.
First I tried the Aggregator interface (typed UDAF) with the .agg function, something like:
val aggregatedDataset = streamDataset
  .select($"id", $"time", $"eventType", $"eventValue")
  .groupBy(window($"time", "1 hour"), $"id")
  .agg(CustomAggregator("time", "eventType", "eventValue").toColumn.as("aggregation"))
Yet this aggregation (in the reduce function) only works on the new input element, not on the whole group.
So I am trying to use the GroupState functions (mapGroupsWithState, flatMapGroupsWithState), or even just the mapGroups function (without state), to perform my aggregation.
But my groupBy operation returns a RelationalGroupedDataset, and I need a KeyValueGroupedDataset to use the map functions; groupByKey does not work with windowing.
How can I do a custom aggregation with Structured Streaming and event-time windows?
Thanks!
The GroupState functions - mapGroupsWithState, flatMapGroupsWithState, or mapGroups (without state) - are only needed to perform an aggregation when you want to operate in Update output mode.
But if you are using Complete output mode, then you do not need the GroupState functions.
So if you change the output mode of the aggregatedDataset query to Complete, it will work as expected.
I hope it helps!
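For illustration, a minimal sketch of starting the aggregatedDataset query from the question in Complete output mode; the console sink and the one-minute trigger are made up for the example.
import org.apache.spark.sql.streaming.Trigger

val query = aggregatedDataset // the windowed aggregation from the question
  .writeStream
  .outputMode("complete") // emit the full updated result table on every trigger
  .format("console") // assumed sink, just for illustration
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()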

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub and it does get fetched properly, but none of the transformations after .groupByKey seem to work.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping that by key requires defining aggregation triggers, otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
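For the Scio pipeline in the question (which is Scala, not Python), here is a sketch of what supplying an explicit trigger could look like through Scio's WindowOptions; the trigger choice, the one-minute delay, and the exact parameter names are assumptions to verify against your Scio version, while psSubscription and windowSize come from the question.
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // fire repeatedly, one minute after the first element of each pane (assumed choice)
      trigger = Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  // ... rest of the pipeline (flatMap, countByValue, groupByKey, ...) unchanged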
With regard to Franz's comment above (I would reply to his comment specifically if Stack Overflow would let me!), I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware, the API is unfinished as this is not "release-ready" code. But it is available.

Spark: difference of semantics between reduce and reduceByKey

Spark's documentation says that the RDD method reduce requires an associative AND commutative binary function.
However, the method reduceByKey ONLY requires an associative binary function.
sc.textFile("file4kB", 4)
I did some tests, and that is apparently the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not?
For example, if I load some (small) text with 4 partitions (minimum):
val r = sc.textFile("file4k", 4)
then:
r.reduce(_ + _)
returns a string where parts are not always in the same order, whereas:
r.map(x => (1,x)).reduceByKey(_ + _).first
always returns the same string (where everything is in the same order as in the original file).
(I checked with r.glom and the file content is indeed spread over 4 partitions; there are no empty partitions.)
As far as I am concerned, this is an error in the documentation, and the results you see are simply incidental. Practice, other resources, and a simple analysis of the code show that the function passed to reduceByKey should be not only associative but commutative as well.
Practice - while it looks like the order is preserved in local mode, this is no longer true when you run Spark on a cluster, including standalone mode.
Other resources - to quote Data Exploration Using Spark from AMP Camp 3:
There is a convenient method called reduceByKey in Spark for exactly this pattern. Note that the second argument to reduceByKey determines the number of reducers to use. By default, Spark assumes that the reduce function is commutative and associative and applies combiners on the mapper side.
Code - reduceByKey is implemented using combineByKeyWithClassTag and creates a ShuffledRDD. Since Spark doesn't guarantee order after shuffling, the only way to restore it would be to attach some metadata to the partially reduced records. As far as I can tell, nothing like this takes place.
On a side note, reduce as implemented in PySpark will work just fine with a function that is only commutative. That is of course just an implementation detail and not part of the contract.
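To make the associative-but-not-commutative distinction concrete for the string concatenation used in the question:
val a = "foo"; val b = "bar"; val c = "baz"

// String concatenation is associative: grouping does not matter ...
assert(((a + b) + c) == (a + (b + c)))

// ... but it is not commutative: operand order does matter, which is why a
// shuffle-dependent order of partial results can change what reduce returns.
assert((a + b) != (b + a))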
According to the code documentation, which was recently updated/corrected (thanks @zero323):
reduceByKey merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
So it was in fact a documentation error, as @zero323 pointed out in his answer.
You can check the following links to the code to make sure:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L304
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1560