Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are, from what I understand, attributes of each element in a PCollection:
class AnalyzeElement(beam.DoFn):
    def process(
            self,
            elem,
            timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam):
        yield [...]
So my question is: is it possible to create such attributes (e.g. a TablenameParam) for the elements in a PCollection, and if not, is there some kind of workaround to achieve that?

What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.
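To illustrate, here is a minimal sketch in the Python SDK of carrying such an "attribute" as an ordinary field of the element type; the TaggedRow type and its tablename field are invented for this example, not part of Beam:

    import typing

    import apache_beam as beam


    # Hypothetical element type: the "tablename" is simply a field of the
    # element, not a Beam-level parameter like TimestampParam or WindowParam.
    class TaggedRow(typing.NamedTuple):
        tablename: str
        payload: str


    class UseTablename(beam.DoFn):
        def process(self, elem):
            # The attribute travels with the element itself.
            yield f'{elem.tablename}: {elem.payload}'


    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([TaggedRow('users', 'alice'), TaggedRow('orders', '42')])
            | beam.ParDo(UseTablename())
            | beam.Map(print))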

Related

How can an unbounded PCollection be immutable?

I am getting started in Dataflow/Apache Beam, and I'm struggling to understand a concept. According to the documentation:
A PCollection is an immutable collection of values of type T. A PCollection can contain either a bounded or unbounded number of elements.
It is easy to understand that bounded PCollections are immutable. You get a file, you put it in a PCollection, you can't change it: immutable.
What about unbounded PCollections? They are, by definition, without a limit on the number of elements, so elements keep getting added to them indefinitely. How can something grow perpetually and also be immutable?
An explanation would be great.
That's a good question! I believe the Programming Guide explains PCollection's immutability better than the JavaDoc. The immutability has to do with individual elements:
A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), but it does not consume or modify the original input collection.
Note: Beam SDKs avoid unnecessary copying of elements, so PCollection contents are logically immutable, not physically immutable. Changes to input elements may be visible to other DoFns executing within the same bundle, and may cause correctness issues. As a rule, it’s not safe to modify values provided to a DoFn.
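As a loose illustration of that note, here is a sketch in the Python SDK; the dict-shaped elements and the 'seen' field are made up for this example:

    import apache_beam as beam


    class UnsafeMutate(beam.DoFn):
        def process(self, elem):
            # Unsafe: Beam avoids copying elements, so this in-place change
            # may be visible to other DoFns running in the same bundle.
            elem['seen'] = True
            yield elem


    class SafeDerive(beam.DoFn):
        def process(self, elem):
            # Safe: derive a new value instead of modifying the input element.
            yield {**elem, 'seen': True}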
Another way to look at it is that the set is logically immutable; it's just your view into it that's changing over time (due to the inability to see into the future). For example, ReadFromPubSub returns the (immutable, unbounded) set of all messages that will ever be published to this topic. From the Beam API you can't modify this set as a PCollection, but you can create other immutable, unbounded PCollections that are derived from it.
This is similar to the lazy, infinite structures that exist in functional languages like Haskell: you can only ever observe a portion of one, but that doesn't mean the whole thing doesn't exist as an immutable object.
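A rough analogy in Python (standing in for the Haskell example): an infinite, lazily evaluated stream that is never mutated, only observed and derived from:

    import itertools

    # An infinite, lazy stand-in for "all elements that will ever arrive".
    naturals = itertools.count(0)

    # A derived stream; the original is never modified.
    evens = (n * 2 for n in naturals)

    # We can only ever observe a finite window of it.
    print(list(itertools.islice(evens, 5)))  # [0, 2, 4, 6, 8]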

How can I apply different windows to one PCollection at once?

So my use case is that the elements in my PCollection should be put into windows of different lengths (which are specified in the Row itself), but the following operations, like the GroupBy, are the same, so I don't want to split up the PCollection at this point.
So what I'm trying to do is basically this:
windowed_items = (
    items
    | 'windowing' >> beam.WindowInto(
        window.SlidingWindows(lambda row: int(row.WINDOW_LENGTH), 60))
)
However, when building the pipeline I get the error TypeError: '<=' not supported between instances of 'function' and 'int'.
An alternative to applying different windows to one PCollection would be to split/branch the PCollection based on the defined window length into multiple PCollections and apply the respective window to each. However, this would mean hardcoding the windowing for every allowed value, and in my case the number of allowed values is possibly huge, which is why I want to avoid it.
So from the error I'm getting (though I cannot find this stated explicitly in the docs) I understand that the SlidingWindows parameters have to be provided when building the pipeline and cannot be determined at runtime. Is this correct? Is there some workaround to apply different windows to one PCollection at once, or is it simply not possible? If so, are there any other alternative approaches to the one I outlined above?
I believe that custom session windowing is what you are looking for. However, it's not supported in the Python SDK yet.
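For completeness, when the set of allowed lengths is small, the branching workaround described in the question might look roughly like this sketch in the Python SDK; ALLOWED_LENGTHS and the shape of the toy rows are assumptions for illustration, not a Beam API:

    import apache_beam as beam
    from apache_beam import window

    # Hypothetical fixed set of allowed window lengths, in seconds.
    ALLOWED_LENGTHS = [60, 300]

    with beam.Pipeline() as p:
        # Toy rows of (window_length, value), given an event timestamp.
        items = (
            p
            | beam.Create([(60, 'a'), (300, 'b')])
            | beam.Map(lambda row: window.TimestampedValue(row, 0)))

        # Branch into one partition per allowed length...
        parts = items | beam.Partition(
            lambda row, n: ALLOWED_LENGTHS.index(row[0]),
            len(ALLOWED_LENGTHS))

        # ...then window each branch with its own statically known length and
        # repeat the shared downstream transforms (e.g. the GroupBy) per branch.
        for part, length in zip(parts, ALLOWED_LENGTHS):
            _ = part | f'window_{length}s' >> beam.WindowInto(
                window.SlidingWindows(size=length, period=60))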

Iterable output type of Apache Beam GroupByKey.create() on FlinkRunner

The output of the Apache Beam GroupByKey.create() transformation is PCollection<KV<K, Iterable<V>>>.
When I run the code using FlinkRunner (batch mode), I see that the Iterable<V> is an ArrayList.
Does it mean that the grouped elements per key have to fit into memory?
Yes, I guess so. The GroupByKey translation uses a Combiner to combine all values with the same key, and an ArrayList is used as the internal container for that. So it could be a potential OOM issue with hot keys.
See details of implementation: one and two
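As an aside, when only a per-key aggregate is needed, a combiner avoids materializing the full per-key Iterable at all, since combiners can be applied incrementally. A minimal sketch (in the Python SDK rather than Java, but the same pattern exists there):

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([('a', 1), ('a', 2), ('b', 3)])
            | beam.CombinePerKey(sum)  # aggregates incrementally, no full list per key
            | beam.Map(print))         # ('a', 3), ('b', 3)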

Reusing the PCollection output of one Transform in another Transform at a later stage of the pipeline

In Java, or any other programming language, we can save the state of a variable and refer to its value later as needed. This seems not to be possible with Apache Beam; can someone confirm? If it is possible, please point me to some samples or documentation.
I am trying to solve the problem below, which needs the output of a previous transform as context.
I am new to Apache Beam, so I am finding it hard to understand how to solve this.
Approach#1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now process using both 'records' and 'users'. How can this be done with Beam?
Approach#2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// The above 'FetchUsersAndProcessRecordsFn' needs 'config' so it can fetch Users, but there seems to be no possible way?
If I understand correctly, you want to use elements from the two collections records and users in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?
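Both patterns, sketched in the Python SDK (the question is in Java, but the same transforms exist there); the keys and values are invented for illustration:

    import apache_beam as beam

    with beam.Pipeline() as p:
        records = p | 'records' >> beam.Create([('u1', 'rec-a'), ('u2', 'rec-b')])
        users = p | 'users' >> beam.Create([('u1', 'Alice'), ('u2', 'Bob')])

        # Pattern 1: join related records and users by key.
        joined = ({'records': records, 'users': users}
                  | beam.CoGroupByKey()
                  | 'print joined' >> beam.Map(print))

        # Pattern 2: if 'users' is small, broadcast it as a side input.
        enriched = (records
                    | beam.Map(
                        lambda kv, user_map: (kv[0], kv[1], user_map[kv[0]]),
                        user_map=beam.pvalue.AsDict(users))
                    | 'print enriched' >> beam.Map(print))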

scala queue sort method

I am comparing a number of different methods for organizing the nodes at the "frontier" in Dijkstra's single-source shortest path algorithm. One of the implementations I am playing around with uses q: scala.collection.mutable.Queue.
Essentially, each time I add a node to q, I sort q. This method, as expected, takes significantly longer than using scala.collection.mutable.PriorityQueue or the MinHeap that I implemented. My question is: what kind of sort does Queue use when I call q.sorted? I am specifically interested in the time complexity of the sorted implementation.
I have tried looking at the API (http://www.scala-lang.org/api/2.10.2/index.html#scala.collection.mutable.Queue) and code (https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/mutable/Queue.scala#L1) but haven't been able to track this down.
Thank you in advance for your help.
Queue inherits the sorted method from SeqLike. As you can see, it creates a new array of the same elements, sorts that array via java.util.Arrays.sort, and then builds a new structure of the original type. Since java.util.Arrays.sort on objects is a merge-sort variant (TimSort), each call to sorted is O(n log n), plus the O(n) cost of copying into and out of the array.