How does a PCollection get created by a runner - apache-beam

Code similar to the snippet below gets called internally from a Read or GroupBy transform during expand. In terms of Beam code, this results in the construction of a PCollection instance. It is not apparent what is actually being constructed by looking at the code, since it is limited to just a new operation. From the runner's perspective, what does calling new PCollection(...) mean?
PCollection.createPrimitiveOutputInternal(
    input.getPipeline(),
    WindowingStrategy.globalDefault(),
    IsBounded.BOUNDED,
    ByteArrayCoder.of())

From the Apache Beam programming guide:
A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
PCollection implements PValue; from its documentation:
Dataflow users should not construct PValue objects directly in their pipelines.
Think of it this way: when you use the SDK to build a pipeline, you are constructing a directed acyclic graph whose nodes are PTransforms and whose edges are PCollections. In the DAG, a PCollection instance is abstract and represents an input/output of one or more PTransforms. When the DAG is executed on a runner, the data of each PCollection can reside on multiple machines/VMs/workers. You cannot view the data until you materialize it through some IO transform.
If, internally in the SDK, you see new PCollection(...), it builds the edge/node with the necessary information that will later make sense to the runner when it executes the DAG. A PCollection itself is not a data structure that holds data in memory.
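As a rough illustration of that view, here is a minimal Java sketch (the bucket path and the particular transforms are arbitrary examples, not taken from the question): each apply() only records a node in the graph and returns a PCollection handle, and nothing is read or held in memory until a runner executes the DAG.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class DagSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each apply() adds a PTransform node to the DAG and returns a new
    // PCollection, i.e. an edge/handle. No data is read or held here.
    PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input.txt"));
    PCollection<KV<String, Long>> counts = lines.apply(Count.perElement());

    // Only run() hands the DAG to the runner, which decides where the
    // elements of each PCollection physically live.
    p.run().waitUntilFinish();
  }
}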

Related

Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are - from what I understand - attributes of each element in a PCollection:
class AnalyzeElement(beam.DoFn):
    def process(
        self,
        elem,
        timestamp=beam.DoFn.TimestampParam,
        window=beam.DoFn.WindowParam):
        yield [...]
So my question is, if it is possible to create such attributes (e.g. TablenameParam) for the elements in a PCollection and if not, if there is some kind of workaround to achieve that?
What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.
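As a minimal sketch of that suggestion (shown with the Java SDK for illustration, since the idea is the same in Python; the element type and field names are hypothetical): the tablename simply travels as a field of the element, while timestamp and window remain model-level parameters.
import java.io.Serializable;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Hypothetical element type: the "attribute" is just a regular field.
class MyElement implements Serializable {
  String tablename;
  String payload;
}

class AnalyzeElementFn extends DoFn<MyElement, String> {
  @ProcessElement
  public void process(
      @Element MyElement elem,
      @Timestamp Instant timestamp,   // implicit, part of the Beam model
      BoundedWindow window,           // implicit, part of the Beam model
      OutputReceiver<String> out) {
    // tablename is carried by the element itself, not by the Beam model.
    out.output(elem.tablename + " @ " + timestamp + " in " + window);
  }
}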

Dynamic "Fan-In" for artifact outputs in Argo?

I have an Argo workflow with dynamic fan-out tasks that do some map operation (in the MapReduce sense). I want to create a reducer that aggregates their results. This is possible when the outputs of each mapper are small and can be passed as an output parameter. See this SO question-answer for a description of how to do it.
But how to aggregate output artifacts with Argo without writing custom logic of writing them to some storage in each mapper and read from it in reducer?
Artifacts are more difficult to aggregate than parameters.
Parameters are always text and are generally small. This makes it easy for Argo Workflows to aggregate them into a single JSON object which can then be consumed by a "reduce" step.
Artifacts, on the other hand, may be any type or size. So Argo Workflows is limited in how much it can help with aggregation.
The main relevant feature it provides is declarative repository write/read operations. You can specify, for example, an S3 prefix to write each mapper's artifact to. Then, in the reduce step, you can load everything from that prefix and perform your aggregation logic.
Argo Workflows provides a generic map/reduce example. But besides artifact writing/reading, you pretty much have to do the aggregation logic yourself.
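To make "perform your aggregation logic" concrete, here is a minimal Java sketch of what a reduce container might do, assuming the workflow's input artifact downloads everything under the shared prefix to a local directory (the paths and the concatenation logic are illustrative, not part of the Argo API):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReduceArtifacts {
  public static void main(String[] args) throws IOException {
    // Assumed: Argo placed all mapper artifacts under this directory.
    Path inputDir = Paths.get("/tmp/mapped");
    // Assumed: this file is declared as the reduce step's output artifact.
    Path outputFile = Paths.get("/tmp/reduced.txt");

    try (Stream<Path> files = Files.list(inputDir)) {
      List<Path> sorted = files.sorted().collect(Collectors.toList());
      for (Path file : sorted) {
        // Trivial "aggregation": concatenate every mapper's output.
        Files.write(outputFile, Files.readAllBytes(file),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }
    }
  }
}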

Apache Beam API to Runner instruction translation

Input: A sentence
Expected Output: String representation of an array generated by line.split(' ')
Transformation defined:
.apply(
    MapElements.into(TypeDescriptors.strings())
        .via((String line) -> Collections.singletonList(line.split("[^\\p{L}]+")).toString()))
Question:
Does Beam translate the above instruction, where I'm using toString(), into runner-specific implementations of toString? I want to avoid inadvertently defining a UDF that might cause subpar performance (I come from a Spark/Pig background). I'm a little hazy on how the translation happens between the Beam API and runner instructions; I'd appreciate any resources that shed light on that translation.
No, Beam runners should not alter that function. MapElements is implemented using a Beam ParDo that executes your function on incoming data as-is. Beam runners may fuse multiple steps into a single fused step, and performance might also depend on the JVM used by the Beam runner.
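As a rough sketch of that point (this is not the actual internal class Beam generates, just an equivalent hand-written DoFn): MapElements behaves like a ParDo whose DoFn calls your lambda on each element, so your toString() runs unchanged on whichever worker JVM processes the bundle.
import java.util.Collections;
import org.apache.beam.sdk.transforms.DoFn;

// Equivalent of the MapElements lambda above, written as an explicit DoFn.
class SplitToStringFn extends DoFn<String, String> {
  @ProcessElement
  public void process(@Element String line, OutputReceiver<String> out) {
    // Exactly the user code from the lambda, executed as-is by the runner.
    out.output(Collections.singletonList(line.split("[^\\p{L}]+")).toString());
  }
}

// Usage: lines.apply(ParDo.of(new SplitToStringFn()))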

Iterable output type of Apache Beam GroupByKey.create() on FlinkRunner

The output of the Apache Beam GroupByKey.create() transformation is PCollection<KV<K, Iterable<V>>>.
When I run the code using FlinkRunner (batch mode), I see that the Iterable<V> is an ArrayList.
Does it mean that the grouped elements per key have to fit into memory?
Yes, I guess so. The GroupByKey translation uses a Combiner to combine all values with the same key, and an ArrayList is used as the internal container for that. So it could be a potential OOM issue with hot keys.
See details of implementation: one and two
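For illustration, a minimal Java sketch of consuming such a grouped collection (the element types and the sum logic are made up for the example): on the Flink batch translation, the Iterable handed to the DoFn is backed by an in-memory ArrayList, so a very hot key means all of its values sit on one worker at once.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Downstream of GroupByKey.<String, Long>create(): the whole Iterable for a
// key is materialized before this DoFn sees it (an ArrayList on Flink batch).
class SumPerKeyFn extends DoFn<KV<String, Iterable<Long>>, KV<String, Long>> {
  @ProcessElement
  public void process(@Element KV<String, Iterable<Long>> kv,
                      OutputReceiver<KV<String, Long>> out) {
    long sum = 0;
    for (Long v : kv.getValue()) {
      sum += v;
    }
    out.output(KV.of(kv.getKey(), sum));
  }
}

// Usage: grouped.apply(ParDo.of(new SumPerKeyFn()))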

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub, and it did get fetched properly, but none of the transformations after .groupByKey seem to work.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping that by key requires defining aggregation triggers, otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
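For the Java SDK, here is a minimal sketch of what the quoted documentation asks for (the fixed-window size and the particular trigger are arbitrary choices for illustration): apply a non-global windowing strategy with a trigger before the GroupByKey, so each window can actually fire instead of waiting forever.
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class UnboundedGroupByKeySketch {
  // counts: an unbounded PCollection of per-word counts read from Pub/Sub
  // upstream (assumed to exist; not shown here).
  static PCollection<KV<String, Iterable<Long>>> group(
      PCollection<KV<String, Long>> counts) {
    return counts
        // Non-global windowing plus an explicit trigger lets the runner close
        // each window and emit the GroupByKey result for it.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(GroupByKey.<String, Long>create());
  }
}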
With regard to Franz's comment above (I would reply to his comment directly if StackOverflow would let me!), I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware: the API is unfinished, as this is not "release-ready" code. But it is available.