Iterable output type of Apache Beam GroupByKey.create() on FlinkRunner

The output of the Apache Beam GroupByKey.create() transformation is PCollection<KV<K, Iterable<V>>>.
When I run the code using the FlinkRunner (batch mode), I see that the Iterable<V> is an ArrayList.
Does that mean the grouped elements for each key have to fit into memory?

Yes, I guess so. The GroupByKey translation uses a Combiner to combine all values with the same key, and an ArrayList is used as the internal container for that. So it could turn into a potential OOM issue with hot keys.
See details of implementation: one and two
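For illustration, a minimal sketch of the transform's shape, written against the Beam Java SDK from Scala (the pipeline, input data and coders below are made up for the example; pass the usual --runner=FlinkRunner options on the command line to run it on Flink):

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.{KvCoder, StringUtf8Coder, VarIntCoder}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{Create, GroupByKey}
import org.apache.beam.sdk.values.{KV, PCollection}

object GroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    // Keyed input; the coder is given explicitly so Create does not have to infer it.
    val keyed: PCollection[KV[String, java.lang.Integer]] = p.apply(
      Create
        .of(KV.of("a", Int.box(1)), KV.of("a", Int.box(2)), KV.of("b", Int.box(3)))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())))

    // One output element per key. On the Flink batch runner the Iterable is a
    // materialized collection (an ArrayList), so all values of a single key
    // are held in memory together; a hot key can therefore exhaust memory.
    val grouped: PCollection[KV[String, java.lang.Iterable[java.lang.Integer]]] =
      keyed.apply(GroupByKey.create[String, java.lang.Integer]())

    p.run().waitUntilFinish()
  }
}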

Related

How can I apply different windows to one PCollection at once?

So my use case is that the elements in my PCollection should be put into windows of different lengths (which are specified in the Row itself), but the subsequent operations like the GroupBy are the same, so I don't want to split up the PCollection at this point.
So what I'm trying to do is basically this:
windowed_items = (
    items
    | 'windowing' >> beam.WindowInto(window.SlidingWindows(lambda row: int(row.WINDOW_LENGTH), 60))
)
However, when building the pipeline I get the error TypeError: '<=' not supported between instances of 'function' and 'int'.
An alternative to applying different windows to one PCollection would be to split/branch the PCollection based on the defined window into multiple PCollections and apply the respective window to each. However, this would mean hardcoding the windowing for every allowed value, and in my case this is possibly a huge number, which is why I want to avoid it.
From the error I'm getting (though I could not find this stated explicitly in the docs), I understand that the SlidingWindows parameters have to be provided when building the pipeline and cannot be determined at runtime. Is this correct? Is there some workaround that lets me apply different windows to one PCollection at once, or is it simply not possible? If so, are there any other alternative approaches to the one I outlined above?
I believe that custom session windowing is what you are looking for. However, it's not supported in the Python SDK yet.

Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are - from what I understand - attributes of each element in a PCollection:
class AnalyzeElement(beam.DoFn):
    def process(
            self,
            elem,
            timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam):
        yield [...]
So my question is: is it possible to create such attributes (e.g. TablenameParam) for the elements in a PCollection, and if not, is there some kind of workaround to achieve that?
What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method, aggregate, that allows you to accumulate with some zero value and combine partial results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scaladoc, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.
The initial value is defined with the initialize method, seqOp corresponds to the update method, and combOp to the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator, which uses standard Scala types with Encoders and takes records as input.
The initial value is defined with the zero method, seqOp corresponds to the reduce method, and combOp to the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide an additional finalization method (evaluate and finish, respectively) which is used to generate the final results, and both can be used for global as well as by-key aggregations.
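For illustration, a minimal sketch of the Aggregator variant (assuming Spark 2.x; the SumCount buffer and MeanAgg names are invented for the example):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Buffer type holding the partial aggregation state.
final case class SumCount(sum: Double, count: Long)

// Aggregator[IN, BUF, OUT]: zero plays the role of aggregate()'s zero value,
// reduce of seqOp, merge of combOp, and finish is the finalization step.
object MeanAgg extends Aggregator[Double, SumCount, Double] {
  def zero: SumCount = SumCount(0.0, 0L)
  def reduce(b: SumCount, a: Double): SumCount = SumCount(b.sum + a, b.count + 1)
  def merge(b1: SumCount, b2: SumCount): SumCount = SumCount(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: SumCount): Double = if (b.count == 0) 0.0 else b.sum / b.count
  def bufferEncoder: Encoder[SumCount] = Encoders.product[SumCount]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregator-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Global aggregation, analogous to rdd.aggregate(zero)(seqOp, combOp).
    Seq(1.0, 2.0, 3.0, 4.0).toDS().select(MeanAgg.toColumn.name("mean")).show()

    // By-key aggregation on a Dataset of (key, value) pairs.
    Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)).toDS()
      .groupByKey(_._1)
      .mapValues(_._2)
      .agg(MeanAgg.toColumn.name("mean"))
      .show()

    spark.stop()
  }
}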

Spark: Distributed removal/addition of elements in a set?

I am trying to convert an ML algorithm to Spark Scala to take advantage of my cluster's power. The relevant bits of pseudo-code are the following:
initialize set of elements
while (set not empty) {
    while (...) { remove a given element from the set }
    while (...) { add a given element to the set }
}
Is there any way to parallelize such a thing?
I would intuitively say that this is not implementable in a distributed fashion (the number of iterations being unknown), but I have been reading that Spark allows implementation of iterative ML algorithms.
Here is what I tried so far:
Originally I used a mutable Set and removed/added elements during the loops in plain Scala. It runs correctly, but I feel like the whole code will just be executed on the driver, which limits the benefit of using Spark?
Made the set an RDD, and replaced the var at every iteration with a new RDD with the element subtracted/added (which I suppose is super heavy?). No error appears, but the variable doesn't actually get updated.
mySetRDD = mySetRDD.subtract(sc.parallelize(Seq(element)))
Looked up Accumulators for a way to keep a set of elements updated on its content (presence/absence of elements) across multiple executors, but they do not seem to allow anything other than simple updates of numerical values.
Create a PairRDD and then partition it by key (e.g. partitionBy with a HashPartitioner) into, say, x partitions.
After that you can use
pairRdd1.zipPartitions() to get iterators over the corresponding partitions of two RDDs. You can then write a function that works over the two iterators to produce a third, output iterator.
Since you have partitioned the RDD by key, you do not need to keep track of removals across partitions.
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#zipPartitions(org.apache.spark.rdd.RDD, boolean, scala.Function2, scala.reflect.ClassTag, scala.reflect.ClassTag)
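For illustration, a minimal sketch of that idea (the element types, sample data and removal logic are invented for the example; the essential part is co-partitioning by key and then zipPartitions):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ZipPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zip-partitions-sketch").setMaster("local[4]"))
    val numPartitions = 4

    // Current set and the elements to remove, both keyed and partitioned with
    // the same partitioner so that matching keys land in the same partition.
    val currentSet = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .partitionBy(new HashPartitioner(numPartitions))
    val toRemove = sc.parallelize(Seq("b" -> 2))
      .partitionBy(new HashPartitioner(numPartitions))

    // zipPartitions walks the two co-partitioned RDDs partition by partition;
    // because partitioning is by key, no removal ever crosses a partition.
    val updated = currentSet.zipPartitions(toRemove) { (setIter, removeIter) =>
      val removals = removeIter.toSet
      setIter.filterNot(removals.contains)
    }

    updated.collect().foreach(println)
    sc.stop()
  }
}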

Spark: difference of semantics between reduce and reduceByKey

In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function.
However, the method reduceByKey ONLY requires an associative binary function.
sc.textFile("file4kB", 4)
I did some tests, and apparently this is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not?
For example, if I load some (small) text with 4 partitions (minimum):
val r = sc.textFile("file4k", 4)
then:
r.reduce(_ + _)
returns a string where parts are not always in the same order, whereas:
r.map(x => (1,x)).reduceByKey(_ + _).first
always returns the same string (where everything is in the same order as in the original file).
(I checked with r.glom that the file content is indeed spread over 4 partitions; there is no empty partition.)
As far as I am concerned this is an error in the documentation, and the results you see are simply incidental. Practice, other resources, and a simple analysis of the code show that the function passed to reduceByKey should be not only associative but commutative as well.
Practice - while it looks like the order is preserved in local mode, this is no longer true when you run Spark on a cluster, including standalone mode.
Other resources - to quote Data Exploration Using Spark from AmpCamp 3:
There is a convenient method called reduceByKey in Spark for exactly this pattern. Note that the second argument to reduceByKey determines the number of reducers to use. By default, Spark assumes that the reduce function is commutative and associative and applies combiners on the mapper side.
Code - reduceByKey is implemented using combineByKeyWithClassTag and creates a ShuffledRDD. Since Spark doesn't guarantee the order after shuffling, the only way to restore it would be to attach some metadata to the partially reduced records. As far as I can tell, nothing like this takes place.
On a side note, reduce as it is implemented in PySpark will work just fine with a function which is only associative, because partition results are collected and folded in partition order. That is of course just an implementation detail and not part of the contract.
According to the code documentation, which was recently updated/corrected (thanks @zero323):
reduceByKey merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
So it was in fact a documentation error, as @zero323 pointed out in his answer.
You can check the following links to the code to make sure:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L304
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1560
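To make the contrast concrete, a small spark-shell style sketch (sc is the usual SparkContext; the outputs shown in the comments for the non-commutative case are examples only, since the merge order on a cluster is not guaranteed):

// Addition is both associative and commutative: safe for reduce.
sc.parallelize(1 to 100, 4).reduce(_ + _)                 // always 5050

// String concatenation is associative but NOT commutative: the partial
// results of the 4 partitions may be merged in task-completion order, so
// the pieces can come back rearranged when running on a cluster.
val r = sc.parallelize(Seq("a", "b", "c", "d"), 4)
r.reduce(_ + _)                                           // e.g. "abcd" or "cdab"

// reduceByKey shuffles first and carries the same caveat: the order in which
// the values for a key are combined is not part of the contract.
r.map(x => (1, x)).reduceByKey(_ + _).first()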